Commit Graph

930 Commits

Author SHA1 Message Date
Rado Vrbovsky 4da7c39b53 Merge: io_uring: Update to upstream v6.10 + fixes 2025-01-13 18:58:47 +00:00
Rado Vrbovsky ac45be9b4d Merge: CNB96: net/sched: update TC core to upstream v6.12
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5417

JIRA: https://issues.redhat.com/browse/RHEL-57768
Depends: !5270
Depends: !5305
Depends: !4753
Depends: !5235
Depends: !5318
Depends: !5539

Commits:
```
4fdb6b6063f0 ("net: count drops due to missing qdisc as dev->tx_drops")
bab4923132fe ("tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()")
3abbd7ed8b76 ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")
b07593edd2fa ("net/sched: act_skbmod: convert comma to semicolon")
e2d0fadd703c ("sched: act_ct: avoid -Wflex-array-member-not-at-end warning")
216203bdc228 ("UAPI: net/sched: Use __struct_group() in flex struct tc_u32_sel")
a0c9fe5eecc9 ("tc-testing: don't access non-existent variable on exception")
a7a45f02a093 ("net: sched: Correct spelling in headers")
938863727076 ("tc: adjust network header after 2nd vlan push")
59c330eccee8 ("selftests: tc_actions: test ingress 2nd vlan push")
2da44703a544 ("selftests: tc_actions: test egress 2nd vlan push")
bc21000e99f9 ("net_sched: sch_fq: fix incorrect behavior for small weights")
d5c4546062fd ("net: sched: consistently use rcu_replace_pointer() in taprio_change()")
c48994baefdc ("sch_cake: constify inverse square root cache")
95ecba62e2fd ("net: fix races in netdev_tx_sent_queue()/dev_watchdog()")
34d35b4edbbe ("net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-12-05 16:12:46 +00:00
Jeff Moyer 289efb74a9 net: always try to set ubuf in skb_zerocopy_iter_stream
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 9e2db9d3993e270b24fbc4ce1ca7e09756e8df25
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Jun 27 13:59:41 2024 +0100

    net: always try to set ubuf in skb_zerocopy_iter_stream
    
    skb_zcopy_set() does nothing if there is already a ubuf_info associated
    with an skb, and since ->link_skb should have set it several lines
    above, the check here essentially does nothing and can be removed. It's
    also safer this way, because even if the callback is faulty we'll still
    have it set.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:21:27 -05:00
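A minimal sketch of the resulting flow, assuming the surrounding skb_zerocopy_iter_stream() context; skb_zcopy_set() is the in-tree helper, the wrapper name is illustrative only:
```
#include <linux/skbuff.h>

/* After the change, skb_zcopy_set() is reached unconditionally: if
 * ->link_skb already attached a ubuf_info, skb_zcopy_set() is a no-op,
 * so the removed "is it already set?" check was dead code. */
static void zc_set_tail(struct sk_buff *skb, struct ubuf_info *uarg)
{
        skb_zcopy_set(skb, uarg, NULL); /* no-op when a ubuf is present */
}
```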
Jeff Moyer 7bc353cc45 net: add callback for setting a ubuf_info to skb
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 65bada80dec1f2108a751644773b2120bd789934
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Apr 19 12:08:40 2024 +0100

    net: add callback for setting a ubuf_info to skb
    
    At the moment an skb can only have one ubuf_info associated with it,
    which might be a performance problem for zerocopy sends in cases like
    TCP via io_uring. Add a callback for assigning a ubuf_info to an skb;
    this way we can implement smarter assignment later, like linking
    ubuf_info structures together.
    
    Note, it's an optional callback, which should be compatible with
    skb_zcopy_set(), that's because the net stack might potentially decide
    to clone an skb and take another reference to ubuf_info whenever it
    wishes. Also, a correct implementation should always be able to bind to
    an skb without prior ubuf_info, otherwise we could end up in a situation
    when the send would not be able to progress.
    
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:39:44 -05:00
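A minimal sketch of the dispatch this optional callback enables, assuming the ops layout from the related commit below this one; zc_attach() is an illustrative name, skb_zcopy_set() is the in-tree fallback:
```
#include <linux/skbuff.h>

/* ->link_skb binds uarg to skb (possibly chaining ubuf_infos); per the
 * message above it must be able to bind to an skb with no prior ubuf. */
static int zc_attach(struct sk_buff *skb, struct ubuf_info *uarg)
{
        if (uarg->ops->link_skb)
                return uarg->ops->link_skb(skb, uarg);

        /* Fallback stays compatible with skb_zcopy_set(): the stack may
         * still clone the skb and take extra ubuf_info references. */
        skb_zcopy_set(skb, uarg, NULL);
        return 0;
}
```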
Jeff Moyer 97718db85e net: extend ubuf_info callback to ops structure
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: The conflicts here existed upstream, and were resolved by
merge commit 3830fff39941 ("Merge branch 'for-uring-ubufops' of
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into
for-6.10/io_uring").  This patch incorporates the fixes from that
merge commit.

commit 7ab4f16f9e2440e797eae88812f800458e5879d2
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Apr 19 12:08:39 2024 +0100

    net: extend ubuf_info callback to ops structure
    
    We'll need to associate additional callbacks with ubuf_info, so
    introduce a structure holding ubuf_info callbacks. Apart from the
    smarter io_uring notification management introduced in later patches,
    it can be used to generalise msg_zerocopy_put_abort() and also to store
    ->sg_from_iter, which is currently passed in struct msghdr.
    
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:38:44 -05:00
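A minimal sketch of the conversion, assuming only the completion callback existed at this point; the struct names carry a _sketch suffix to mark them as illustrative:
```
#include <linux/refcount.h>

struct sk_buff;
struct ubuf_info_sketch;

/* A shared, const ops table replaces the per-instance callback pointer,
 * leaving room for more hooks (->link_skb arrives in a later commit). */
struct ubuf_info_ops_sketch {
        void (*complete)(struct sk_buff *skb, struct ubuf_info_sketch *uarg,
                         bool zerocopy_success);
};

struct ubuf_info_sketch {
        const struct ubuf_info_ops_sketch *ops; /* was: void (*callback)() */
        refcount_t refcnt;
};
```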
Jeff Moyer d84f719557 skbuff: use mempool KASAN hooks
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 74e831af165acc968418a4d9fde8c2e099f3e8bf
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Dec 19 23:29:04 2023 +0100

    skbuff: use mempool KASAN hooks
    
    Instead of using slab-internal KASAN hooks for poisoning and unpoisoning
    cached objects, use the proper mempool KASAN hooks.
    
    Also check the return value of kasan_mempool_poison_object to prevent
    double-free and invalid-free bugs.
    
    Link: https://lkml.kernel.org/r/a3482c41395c69baa80eb59dbb06beef213d2a14.1703024586.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Breno Leitao <leitao@debian.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:23:44 -05:00
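A minimal sketch of the check described above, assuming the per-cpu skb cache free path in skbuff.c; nc mirrors the in-tree napi_alloc_cache, the function name is illustrative:
```
#include <linux/kasan.h>
#include <linux/skbuff.h>

/* kasan_mempool_poison_object() returns false when it detects a
 * double-free or invalid-free; KASAN has then already reported the bug
 * and the object must not be put back into the cache. */
static void skb_cache_put(struct napi_alloc_cache *nc, struct sk_buff *skb)
{
        if (!kasan_mempool_poison_object(skb))
                return;

        nc->skb_cache[nc->skb_count++] = skb;
}
```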
Jeff Moyer 34644cd1d2 kasan: rename and document kasan_(un)poison_object_data
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 1ce9a0523938f87dd8505233cc3445f8e2d8dcee
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Tue Dec 19 23:29:03 2023 +0100

    kasan: rename and document kasan_(un)poison_object_data
    
    Rename kasan_unpoison_object_data to kasan_unpoison_new_object and add a
    documentation comment.  Do the same for kasan_poison_object_data.
    
    The new names and the comments should suggest to users that these
    hooks are intended for internal use by the slab allocator.
    
    The following patch will remove non-slab-internal uses of these hooks.
    
    No functional changes.
    
    [andreyknvl@google.com: update references to renamed functions in comments]
      Link: https://lkml.kernel.org/r/20231221180637.105098-1-andrey.konovalov@linux.dev
    Link: https://lkml.kernel.org/r/eab156ebbd635f9635ef67d1a4271f716994e628.1703024586.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Breno Leitao <leitao@debian.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:22:44 -05:00
Jeff Moyer 63317c47dc net: skbuff: drop the word head from skb cache
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 025a785ff083729819dc82ac81baf190cb4aee5c
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Feb 8 22:06:42 2023 -0800

    net: skbuff: drop the word head from skb cache
    
    skbuff_head_cache is misnamed (perhaps for historical reasons?)
    because it does not hold heads. Head is the buffer which skb->data
    points to, and also where shinfo lives. struct sk_buff is a metadata
    structure, not the head.
    
    Eric recently added skb_small_head_cache (which allocates actual
    head buffers), let that serve as an excuse to finally clean this up :)
    
    Leave the user-space visible name intact, it could possibly be uAPI.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:03:44 -05:00
Ivan Vecera ad35cce099 tc: adjust network header after 2nd vlan push
JIRA: https://issues.redhat.com/browse/RHEL-57768

commit 938863727076f684abb39d1d0f9dce1924e9028e
Author: Boris Sukholitko <boris.sukholitko@broadcom.com>
Date:   Thu Aug 22 13:35:08 2024 +0300

    tc: adjust network header after 2nd vlan push

    <tldr>
    skb network header of the single-tagged vlan packet continues to point to
    the vlan payload (e.g. IP) after a second vlan tag is pushed by tc
    act_vlan. This causes a problem at the dissector, which expects the
    double-tagged packet network header to point to the inner vlan.

    The fix is to adjust network header in tcf_act_vlan.c but requires
    refactoring of skb_vlan_push function.
    </tldr>

    Consider the following shell script snippet configuring TC rules on the
    veth interface:

    ip link add veth0 type veth peer veth1
    ip link set veth0 up
    ip link set veth1 up

    tc qdisc add dev veth0 clsact

    tc filter add dev veth0 ingress pref 10 chain 0 flower \
            num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
    tc filter add dev veth0 ingress pref 20 chain 0 flower \
            num_of_vlans 1 action vlan push id 100 \
            protocol 0x8100 action goto chain 5
    tc filter add dev veth0 ingress pref 30 chain 5 flower \
            num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"

    Sending double-tagged vlan packet with the IP payload inside:

    cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
    0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64   ..........."...d
    0010  81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11   ......E..&......
    0020  18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12   ................
    0030  e1 c7 00 00 00 00 00 00 00 00 00 00               ............
    ENDS

    will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
    the dmesg.

    OTOH, sending single-tagged vlan packet:

    cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
    0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14   ..........."....
    0010  08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00   ..E..*..........
    0020  00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00   ................
    0030  00 00 00 00 00 00 00 00 00 00 00 00               ............
    ENDS

    will match rule 20, will push the second vlan tag but will *not* match
    rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
    pushed by the kernel.

    Let's look at __skb_flow_dissect working on the double-tagged vlan packet.
    Here is the relevant code from around net/core/flow_dissector.c:1277
    copy-pasted here for convenience:

            if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
                skb && skb_vlan_tag_present(skb)) {
                    proto = skb->protocol;
            } else {
                    vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
                                                data, hlen, &_vlan);
                    if (!vlan) {
                            fdret = FLOW_DISSECT_RET_OUT_BAD;
                            break;
                    }

                    proto = vlan->h_vlan_encapsulated_proto;
                    nhoff += sizeof(*vlan);
            }

    The "else" clause above gets the protocol of the encapsulated packet from
    the skb data at the network header location. printk debugging has shown
    that in the good double-tagged packet case proto is
    htons(0x800 == ETH_P_IP) as expected. However, in the single-tagged packet
    case proto is garbage, leading to the failure to match tc filter 30.

    proto is being set from the skb header pointed by nhoff parameter which is
    defined at the beginning of __skb_flow_dissect
    (net/core/flow_dissector.c:1055 in the current version):

                    nhoff = skb_network_offset(skb);

    Therefore the culprit seems to be that the skb network offset is different
    between a double-tagged packet received from the interface and a
    single-tagged packet having its vlan tag pushed by TC.

    Let's look at the interesting points of the lifetime of the single/double
    tagged packets as they traverse our packet flow.

    Both of them will start at __netif_receive_skb_core where the first vlan
    tag will be stripped:

            if (eth_type_vlan(skb->protocol)) {
                    skb = skb_vlan_untag(skb);
                    if (unlikely(!skb))
                            goto out;
            }

    At this stage in double-tagged case skb->data points to the second vlan tag
    while in single-tagged case skb->data points to the network (eg. IP)
    header.

    Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
    code at tcf_vlan_act (interesting points are in square brackets):

            if (skb_at_tc_ingress(skb))
    [1]             skb_push_rcsum(skb, skb->mac_len);

            ....

            case TCA_VLAN_ACT_PUSH:
                    err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
                                        (p->tcfv_push_prio << VLAN_PRIO_SHIFT),
                                        0);
                    if (err)
                            goto drop;
                    break;

            ....

    out:
            if (skb_at_tc_ingress(skb))
    [3]             skb_pull_rcsum(skb, skb->mac_len);

    And skb_vlan_push (net/core/skbuff.c:6204) function does:

                    err = __vlan_insert_tag(skb, skb->vlan_proto,
                                            skb_vlan_tag_get(skb));
                    if (err)
                            return err;

                    skb->protocol = skb->vlan_proto;
    [2]             skb->mac_len += VLAN_HLEN;

    in the case of pushing the second tag. Let's look at what happens with
    skb->data of the single-tagged packet at each of the above points:

    1. As a result of the skb_push_rcsum, skb->data is moved back to the start
       of the packet.

    2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is
       incremented, skb->data still points to the start of the packet.

    3. As a result of the skb_pull_rcsum, skb->data is moved forward by the
       modified skb->mac_len, thus pointing to the network header again.

    Then __skb_flow_dissect will get confused by having a double-tagged vlan
    packet with the skb->data at the network header.

    The solution for the bug is to preserve "skb->data at second vlan header"
    semantics in the skb_vlan_push function. We do this by manipulating
    skb->network_header rather than skb->mac_len. skb_vlan_push callers are
    updated to do skb_reset_mac_len.

    Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-11-22 11:07:15 +01:00
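A minimal sketch of the fix direction stated in the closing paragraph above; it shows only the two lines the message singles out, with the exact offset bookkeeping elided rather than guessed:
```
        /* Inside skb_vlan_push(), when a tag is already present: */
        err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb));
        if (err)
                return err;

        skb->protocol = skb->vlan_proto;
        /* Fix: instead of "skb->mac_len += VLAN_HLEN;", account for the
         * freshly inserted tag in skb->network_header, preserving the
         * "skb->data at the second vlan header" invariant. */

        /* Callers such as tcf_vlan_act() then refresh mac_len themselves: */
        skb_reset_mac_len(skb);
```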
Felix Maurer 8643c21aa1 page_pool: check for PP direct cache locality later
JIRA: https://issues.redhat.com/browse/RHEL-57765
Conflicts:
- Context differences (missing skb_cow_data_for_xdp) due to missing
  e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic
  mode")
- net/core/skbuff.c: context difference (condition moved to function) due
  to missing 8cfa2dee325f ("skbuff: Add a function to check if a page
  belongs to page_pool") with no functional changes
- net/core/skbuff.c: context difference (missing skb_kfree_head) due to
  missing bf9f1baa279f ("net: add dedicated kmem_cache for typical/small
  skb->head"); this can appear in revumatic as if skb_free_head was moved
  but that isn't true, the hunks are just reordered (check the line nums)

commit 4a96a4e807c390a9d91b450ebe04eeb2e0ecc076
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Fri Mar 29 17:55:06 2024 +0100

    page_pool: check for PP direct cache locality later

    Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check
    whether it's safe to use direct recycling, we can use both globally for
    each page instead of relying solely on the @allow_direct argument.
    Let's assume that @allow_direct means "I'm sure it's local, don't waste
    time rechecking this" and when it's false, try the mentioned params to
    still recycle the page directly. If neither is true, we'll lose some
    CPU cycles, but then it surely won't be a hotpath. On the other hand,
    paths where it's possible to use the direct cache, but not possible to
    safely set @allow_direct, will benefit from this move.
    The whole propagation of @napi_safe through a dozen of skb freeing
    functions can now go away, which saves us some stack space.

    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-11-06 18:18:24 +01:00
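A minimal sketch of the relocated locality check, assuming page_pool internals as described in the message; the function name is illustrative:
```
#include <net/page_pool/types.h>

/* Reached only when the caller passed allow_direct == false: re-derive
 * locality from the pool itself before falling back to the ptr_ring. */
static bool pp_napi_local(const struct page_pool *pool)
{
        unsigned int cpuid;

        if (!in_softirq())
                return false;           /* direct cache is softirq-only */

        cpuid = smp_processor_id();
        if (READ_ONCE(pool->cpuid) == (int)cpuid)
                return true;            /* percpu pool owned by this CPU */

        /* ...or this pool's NAPI instance is running right here */
        return pool->p.napi &&
               READ_ONCE(pool->p.napi->list_owner) == (int)cpuid;
}
```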
Felix Maurer edee1c1e12 page_pool: disable direct recycling based on pool->cpuid on destroy
JIRA: https://issues.redhat.com/browse/RHEL-57765

commit 56ef27e3abe6d6453b1f4f6127041f3a65d7cbc9
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Thu Feb 15 12:39:05 2024 +0100

    page_pool: disable direct recycling based on pool->cpuid on destroy
    
    Now that direct recycling is performed based on pool->cpuid when set,
    memory leaks are possible:
    
    1. A pool is destroyed.
    2. Alloc cache is emptied (it's done only once).
    3. pool->cpuid is still set.
    4. napi_pp_put_page() does direct recycling based on pool->cpuid.
    5. Now alloc cache is not empty, but it won't ever be freed.
    
    In order to avoid that, rewrite pool->cpuid to -1 when unlinking NAPI to
    make sure no direct recycling will be possible after emptying the cache.
    This involves a bit of overhead as pool->cpuid now must be accessed
    via READ_ONCE() to avoid partial reads.
    Rename page_pool_unlink_napi() -> page_pool_disable_direct_recycling()
    to reflect what it actually does and unexport it.
    
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20240215113905.96817-1-aleksander.lobakin@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-10-21 16:37:42 +02:00
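A minimal sketch of the destroy-path fix, simplified from the behaviour the message describes; the WRITE_ONCE() pairs with the READ_ONCE() the message mentions on the recycling side:
```
/* previously page_pool_unlink_napi(); now also kills cpuid-based
 * direct recycling, so an emptied alloc cache can't be refilled. */
static void page_pool_disable_direct_recycling(struct page_pool *pool)
{
        WRITE_ONCE(pool->cpuid, -1);

        if (!pool->p.napi)
                return;
        WRITE_ONCE(pool->p.napi, NULL);
}
```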
Wander Lairson Costa a69829f6ec net: add debugging checks in skb_attempt_defer_free()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit e8e1ce8454c9cc8ad2e4422bef346428e52455e3
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 21 09:43:53 2023 +0000

    net: add debugging checks in skb_attempt_defer_free()

    Make sure skbs that are stored in softnet_data.defer_list
    do not have a dst attached.

    Also make sure the skb was orphaned.

    Link: https://lore.kernel.org/netdev/CANn89iJuEVe72bPmEftyEJHLzzN=QNR2yueFjTxYXCEpS5S8HQ@mail.gmail.com/T/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:30 -03:00
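A minimal sketch of the two checks, assuming they sit at the top of skb_attempt_defer_free() and use the CONFIG_DEBUG_NET-only warning helper:
```
        /* skbs parked on softnet_data.defer_list must carry no dst... */
        DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
        /* ...and must already be orphaned (no destructor/socket ref). */
        DEBUG_NET_WARN_ON_ONCE(skb->destructor);
```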
Wander Lairson Costa 7f4cde038e net: do not provide hard irq safety for sd->defer_lock
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 931e93bdf8ca71cef1f8759c43bc2c5385392b8b
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 21 09:43:54 2023 +0000

    net: do not provide hard irq safety for sd->defer_lock

    kfree_skb() can be called from hard irq handlers,
    but skb_attempt_defer_free() is meant to be used
    from process or BH contexts, and skb_defer_free_flush()
    is meant to be called from BH contexts.

    Not having to mask hard irq can save some cycles.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:29 -03:00
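A minimal sketch of what the relaxation means at a defer_list call site; the function is illustrative, the lock and fields are the ones named in the message:
```
#include <linux/skbuff.h>

static void defer_list_add(struct softnet_data *sd, struct sk_buff *skb)
{
        /* A BH-disabling lock is enough: defer_lock is never taken from
         * hard irq context, so irqsave/irqrestore was wasted cycles. */
        spin_lock_bh(&sd->defer_lock);
        skb->next = sd->defer_list;
        sd->defer_list = skb;
        sd->defer_count++;
        spin_unlock_bh(&sd->defer_lock);
}
```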
Wander Lairson Costa 9b810452bc net: avoid strange behavior with skb_defer_max == 1
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit c09b0cd2cc6c3f91988a20d45fa45c889f72c56c
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed May 18 11:55:22 2022 -0700

    net: avoid strange behavior with skb_defer_max == 1

    When the user sets skb_defer_max to 1, the kick threshold is 0
    (half of 1). If we increment queue length before the check
    the kick will never happen, and the skb may get stranded.
    This is likely harmless but can be avoided by moving the
    increment after the check. This way skb_defer_max == 1
    will always kick. Still a silly config to have, but
    somehow that feels more correct.

    While at it drop a comment which seems to be outdated
    or confusing, and wrap the defer_count write with
    a WRITE_ONCE() since it's read on the fast path
    that avoids taking the lock.

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220518185522.2038683-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:28 -03:00
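A minimal sketch of the reordered enqueue inside skb_attempt_defer_free(), assuming the surrounding variables; with skb_defer_max == 1 the kick threshold is 0, so the equality test must run before the increment:
```
        spin_lock(&sd->defer_lock);
        /* Send an IPI every time the queue reaches half capacity;
         * checking *before* incrementing means defer_count == 0 still
         * matches kick == 0 when skb_defer_max == 1. */
        kick = sd->defer_count == (defer_max >> 1);
        /* Paired with a READ_ONCE() on the lockless fast path. */
        WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
        skb->next = sd->defer_list;
        sd->defer_list = skb;
        spin_unlock(&sd->defer_lock);
```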
Wander Lairson Costa 0a68918319 net: add skb_defer_max sysctl
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 39564c3fdc6684c6726b63e131d2a9f3809811cb
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun May 15 21:24:55 2022 -0700

    net: add skb_defer_max sysctl

    commit 68822bdf76f1 ("net: generalize skb freeing
    deferral to per-cpu lists") added another per-cpu
    cache of skbs. It was expected to be small,
    and an IPI was forced whenever the list reached 128
    skbs.

    We might need to be able to control more precisely
    queue capacity and added latency.

    An IPI is generated whenever queue reaches half capacity.

    Default value of the new limit is 64.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:28 -03:00
Wander Lairson Costa ca510a6c82 net: Use backlog-NAPI to clean up the defer_list.
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 80d2eefcb4c84aa9018b2a997ab3a4c567bc821a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Mar 25 08:40:30 2024 +0100

    net: Use backlog-NAPI to clean up the defer_list.

    The defer_list is a per-CPU list which is used to free skbs outside of
    the socket lock and on the CPU on which they have been allocated.
    The list is processed during NAPI callbacks so ideally the list is
    cleaned up.
    Should the amount of skbs on the list exceed a certain water mark then
    the softirq is triggered remotely on the target CPU by invoking a remote
    function call. The raise of the softirqs via a remote function call
    leads to waking the ksoftirqd on PREEMPT_RT which is undesired.
    The backlog-NAPI threads already provide the infrastructure which can be
    utilized to perform the cleanup of the defer_list.

    The NAPI state is updated with the input_pkt_queue.lock acquired. In
    order not to break the state, the backlog-NAPI thread also needs to be
    woken with the lock held. This requires taking the lock in
    rps_lock_irq*() if the backlog-NAPI threads are used, even with RPS
    disabled.

    Move the logic of remotely starting softirqs to clean up the defer_list
    into kick_defer_list_purge(). Make sure a lock is held in
    rps_lock_irq*() if backlog-NAPI threads are used. Schedule backlog-NAPI
    for defer_list cleanup if backlog-NAPI is available.

    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:27 -03:00
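A minimal sketch of the resulting kick path, assuming the helper name from the message and the rps_lock_irq*() pairing it describes; use_backlog_threads() is the gate the backlog-NAPI infrastructure provides:
```
static void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu)
{
        unsigned long flags;

        if (use_backlog_threads()) {
                /* Wake backlog-NAPI with input_pkt_queue.lock held so the
                 * NAPI state stays consistent (PREEMPT_RT friendly). */
                rps_lock_irqsave(sd, &flags);
                if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
                        __napi_schedule_irqoff(&sd->backlog);
                rps_unlock_irq_restore(sd, &flags);
        } else if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1)) {
                /* Legacy path: raise NET_RX_SOFTIRQ remotely via IPI. */
                smp_call_function_single_async(cpu, &sd->defer_csd);
        }
}
```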
Wander Lairson Costa de7a3b7b85 net: add generic percpu page_pool allocator
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts: we already have 490a79faf95e ("net: introduce include/net/rps.h")

commit 2b0cfa6e49566c8fa6759734cf821aa6e8271a9e
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date:   Mon Feb 12 10:50:54 2024 +0100

    net: add generic percpu page_pool allocator

    Introduce a generic percpu page_pool allocator.
    Moreover, add page_pool_create_percpu() and a cpuid field in the
    page_pool struct in order to recycle the page in the page_pool "hot"
    cache if napi_pp_put_page() is running on the same cpu.
    This is a preliminary patch to add xdp multi-buff support for xdp running
    in generic mode.

    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Link: https://lore.kernel.org/r/80bc4285228b6f4220cd03de1999d86e46e3fcbd.1707729884.git.lorenzo@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:27 -03:00
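A minimal sketch of creating such a pool, assuming the page_pool_create_percpu() signature implied above; the parameter values are illustrative:
```
#include <net/page_pool/types.h>

static struct page_pool *create_pool_for_cpu(int cpuid)
{
        struct page_pool_params params = {
                .order          = 0,
                .pool_size      = 256,
                .nid            = NUMA_NO_NODE,
        };

        /* cpuid is stored in the pool; napi_pp_put_page() can then
         * recycle into the "hot" cache when running on that CPU. */
        return page_pool_create_percpu(&params, cpuid);
}
```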
Wander Lairson Costa 072de1240f net: fix possible race in skb_attempt_defer_free()
JIRA: https://issues.redhat.com/browse/RHEL-9145

commit 97e719a82b43c6c2bb5eebdb3c5d479a332ac2ac
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun May 15 21:24:53 2022 -0700

    net: fix possible race in skb_attempt_defer_free()

    A cpu can observe sd->defer_count reaching 128,
    and call smp_call_function_single_async().

    Problem is that the remote CPU can clear sd->defer_count
    before the IPI is run/acknowledged.

    Other cpus can queue more packets and also decide
    to call smp_call_function_single_async() while the pending
    IPI was not yet delivered.

    This is a common issue with smp_call_function_single_async().
    Callers must ensure correct synchronization and serialization.

    I triggered this issue while experimenting with a smaller threshold.
    Performing the call to smp_call_function_single_async()
    under sd->defer_lock protection did not solve the problem.

    Commit 5a18ceca63 ("smp: Allow smp_call_function_single_async()
    to insert locked csd") replaced an informative WARN_ON_ONCE()
    with a return of -EBUSY, which is often ignored.
    Test of CSD_FLAG_LOCK presence is racy anyway.

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:26 -03:00
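A minimal sketch of the serialization the fix adds, assuming a per-softnet_data flag cleared by the IPI handler (field and helper names follow upstream, slightly simplified):
```
/* Sender side: only one IPI in flight per target CPU. */
static void maybe_kick_remote(struct softnet_data *sd, unsigned int cpu)
{
        if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
                smp_call_function_single_async(cpu, &sd->defer_csd);
}

/* Target side, runs from the IPI: raise the softirq, then allow the
 * next kick with release semantics. */
static void trigger_rx_softirq(void *data)
{
        struct softnet_data *sd = data;

        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
        smp_store_release(&sd->defer_ipi_scheduled, 0);
}
```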
Wander Lairson Costa e5f704d611 net: generalize skb freeing deferral to per-cpu lists
JIRA: https://issues.redhat.com/browse/RHEL-9145

Conflicts:
inet/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()

commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 22 13:12:37 2022 -0700

    net: generalize skb freeing deferral to per-cpu lists

    Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
    lock is released") helped bulk TCP flows to move the cost of skbs
    frees outside of critical section where socket lock was held.

    But for RPC traffic, or hosts with RFS enabled, the solution is far from
    being ideal.

    For RPC traffic, recvmsg() has to return to user space right after
    the skb payload has been consumed, meaning that the BH handler has no
    chance to pick up the skb before the recvmsg() thread. This issue is
    more visible with BIG TCP, as more RPCs fit in one skb.

    For RFS, even if BH handler picks the skbs, they are still picked
    from the cpu on which user thread is running.

    Ideally, it is better to free the skbs (and associated page frags)
    on the cpu that originally allocated them.

    This patch removes the per socket anchor (sk->defer_list) and
    instead uses a per-cpu list, which will hold more skbs per round.

    This new per-cpu list is drained at the end of net_rx_action(),
    after incoming packets have been processed, to lower latencies.

    In normal conditions, skbs are added to the per-cpu list with
    no further action. In the (unlikely) cases where the cpu does not
    run the net_rx_action() handler fast enough, we use an IPI to raise
    NET_RX_SOFTIRQ on the remote cpu.

    Also, we do not bother draining the per-cpu list from dev_cpu_dead().
    This is because skbs in this list have no requirement on how fast
    they should be freed.

    Note that we can add in the future a small per-cpu cache
    if we see any contention on sd->defer_lock.

    Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
    and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
    page recycling strategy used by NIC driver (its page pool capacity
    being too small compared to number of skbs/pages held in sockets
    receive queues)

    Note that this tuning was only done to demonstrate worse
    conditions for skb freeing for this particular test.
    These conditions can happen in more general production workload.

    10 runs of one TCP_STREAM flow

    Before:
    Average throughput: 49685 Mbit.

    Kernel profiles on cpu running user thread recvmsg() show high cost for
    skb freeing related functions (*)

        57.81%  [kernel]       [k] copy_user_enhanced_fast_string
    (*) 12.87%  [kernel]       [k] skb_release_data
    (*)  4.25%  [kernel]       [k] __free_one_page
    (*)  3.57%  [kernel]       [k] __list_del_entry_valid
         1.85%  [kernel]       [k] __netif_receive_skb_core
         1.60%  [kernel]       [k] __skb_datagram_iter
    (*)  1.59%  [kernel]       [k] free_unref_page_commit
    (*)  1.16%  [kernel]       [k] __slab_free
         1.16%  [kernel]       [k] _copy_to_iter
    (*)  1.01%  [kernel]       [k] kfree
    (*)  0.88%  [kernel]       [k] free_unref_page
         0.57%  [kernel]       [k] ip6_rcv_core
         0.55%  [kernel]       [k] ip6t_do_table
         0.54%  [kernel]       [k] flush_smp_call_function_queue
    (*)  0.54%  [kernel]       [k] free_pcppages_bulk
         0.51%  [kernel]       [k] llist_reverse_order
         0.38%  [kernel]       [k] process_backlog
    (*)  0.38%  [kernel]       [k] free_pcp_prepare
         0.37%  [kernel]       [k] tcp_recvmsg_locked
    (*)  0.37%  [kernel]       [k] __list_add_valid
         0.34%  [kernel]       [k] sock_rfree
         0.34%  [kernel]       [k] _raw_spin_lock_irq
    (*)  0.33%  [kernel]       [k] __page_cache_release
         0.33%  [kernel]       [k] tcp_v6_rcv
    (*)  0.33%  [kernel]       [k] __put_page
    (*)  0.29%  [kernel]       [k] __mod_zone_page_state
         0.27%  [kernel]       [k] _raw_spin_lock

    After patch:
    Average throughput: 73076 Mbit.

    Kernel profiles on cpu running user thread recvmsg() looks better:

        81.35%  [kernel]       [k] copy_user_enhanced_fast_string
         1.95%  [kernel]       [k] _copy_to_iter
         1.95%  [kernel]       [k] __skb_datagram_iter
         1.27%  [kernel]       [k] __netif_receive_skb_core
         1.03%  [kernel]       [k] ip6t_do_table
         0.60%  [kernel]       [k] sock_rfree
         0.50%  [kernel]       [k] tcp_v6_rcv
         0.47%  [kernel]       [k] ip6_rcv_core
         0.45%  [kernel]       [k] read_tsc
         0.44%  [kernel]       [k] _raw_spin_lock_irqsave
         0.37%  [kernel]       [k] _raw_spin_lock
         0.37%  [kernel]       [k] native_irq_return_iret
         0.33%  [kernel]       [k] __inet6_lookup_established
         0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
         0.29%  [kernel]       [k] tcp_rcv_established
         0.29%  [kernel]       [k] llist_reverse_order

    v2: kdoc issue (kernel bots)
        do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
        replace the sk_buff_head with a single-linked list (Jakub)
        add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-16 16:04:25 -03:00
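A minimal sketch of the deferral entry point this commit introduces, assuming the per-cpu softnet_data fields described above (the irqsave locking shown here was later relaxed by the defer_lock commit earlier in this log; the 128 threshold is the one the sysctl commit cites):
```
#include <linux/skbuff.h>

void skb_attempt_defer_free(struct sk_buff *skb)
{
        int cpu = skb->alloc_cpu;
        struct softnet_data *sd;
        unsigned long flags;
        bool kick;

        /* Nothing to gain: already on the allocating CPU (or it's gone). */
        if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                __kfree_skb(skb);
                return;
        }

        sd = &per_cpu(softnet_data, cpu);
        spin_lock_irqsave(&sd->defer_lock, flags);
        skb->next = sd->defer_list;
        sd->defer_list = skb;
        WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
        kick = sd->defer_count == 128;
        spin_unlock_irqrestore(&sd->defer_lock, flags);

        /* Unlikely slow path: the remote CPU isn't running net_rx_action()
         * fast enough, so raise NET_RX_SOFTIRQ there via an IPI. */
        if (kick)
                smp_call_function_single_async(cpu, &sd->defer_csd);
}
```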
Antoine Tenart a8e32bb8a9 net: introduce sk_skb_reason_drop function
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git
Conflicts:
- Context difference due to missing upstream commit 1cface552a5b ("net:
  add skb_data_unref() helper") in c9s.

commit ba8de796baf4bdc03530774fb284fe3c97875566
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:09 2024 -0700

    net: introduce sk_skb_reason_drop function

    The long-used destructors kfree_skb and kfree_skb_reason do not pass the
    receiving socket to the packet drop tracepoint trace_kfree_skb.
    This makes it hard to track packet drops of a certain netns (container)
    or a socket (user application).

    The naming of these destructors is also not consistent with most sk/skb
    operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
    Introduce a new function, sk_skb_reason_drop, as a drop-in replacement
    for kfree_skb_reason on the local receiving path. Callers can now pass
    receiving
    sockets to the tracepoints.

    kfree_skb and kfree_skb_reason are still usable but they are now just
    inline helpers that call sk_skb_reason_drop.

    Note it is not feasible to do the same to consume_skb. Packets not
    dropped can flow through multiple receive handlers, and have multiple
    receiving sockets. Leave it untouched for now.

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
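A minimal sketch of the resulting helper family; the signatures follow the message, and SKB_DROP_REASON_NOT_SPECIFIED is the stack's usual default reason:
```
#include <linux/skbuff.h>

void sk_skb_reason_drop(struct sock *sk, struct sk_buff *skb,
                        enum skb_drop_reason reason);

/* The old destructors become thin wrappers with no receiving socket: */
static inline void kfree_skb_reason(struct sk_buff *skb,
                                    enum skb_drop_reason reason)
{
        sk_skb_reason_drop(NULL, skb, reason);
}

static inline void kfree_skb(struct sk_buff *skb)
{
        sk_skb_reason_drop(NULL, skb, SKB_DROP_REASON_NOT_SPECIFIED);
}
```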
Antoine Tenart 283c6d9f79 net: add rx_sk to trace_kfree_skb
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit c53795d48ee8f385c6a9e394651e7ee914baaeba
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:04 2024 -0700

    net: add rx_sk to trace_kfree_skb

    skb does not include enough information to find out receiving
    sockets/services and netns/containers on packet drops. In theory
    skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
    stack for OOO packet lookup. Similarly, skb->sk often identifies a local
    sender, and tells nothing about a receiver.

    Allow passing an extra receiving socket to the tracepoint to improve
    the visibility on receiving drops.

    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Xin Long 4c164dafae net: core: reject skb_copy(_expand) for fraglist GSO skbs
JIRA: https://issues.redhat.com/browse/RHEL-39781
CVE: CVE-2024-36929
Tested: compile only

commit d091e579b864fa790dd6a0cd537a22c383126681
Author: Felix Fietkau <nbd@nbd.name>
Date:   Sat Apr 27 20:24:19 2024 +0200

    net: core: reject skb_copy(_expand) for fraglist GSO skbs

    SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become
    invalid. Return NULL if such an skb is passed to skb_copy or
    skb_copy_expand, in order to prevent a crash on a potential later
    call to skb_gso_segment.

    Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
    Signed-off-by: Felix Fietkau <nbd@nbd.name>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2024-06-07 14:39:53 -04:00
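A minimal sketch of the guard, assuming it sits at the top of skb_copy()/skb_copy_expand(); the predicate name is illustrative:
```
#include <linux/skbuff.h>

static bool skb_copy_would_linearize_fraglist(const struct sk_buff *skb)
{
        /* A SKB_GSO_FRAGLIST skb must keep its frag list intact; a flat
         * copy becomes invalid and crashes in a later skb_gso_segment(). */
        return skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST;
}

/* In skb_copy()/skb_copy_expand():
 *      if (skb_copy_would_linearize_fraglist(skb))
 *              return NULL;
 */
```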
Petr Oros 6c5988280c page_pool: remove PP_FLAG_PAGE_FRAG
JIRA: https://issues.redhat.com/browse/RHEL-31941

Conflicts:
- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c: chunk skipped due to
  missing 93188e9642c3ce ("net: hns3: support skb's frag page recycling
  based on page pool")
- drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c chunk skipped
  due to missing b2e3406a38f0f4 ("octeontx2-pf: Add support for page
  pool")

Upstream commit(s):
commit 09d96ee5674a0eaa800c664353756ecc45c4a87f
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date:   Fri Oct 20 17:59:49 2023 +0800

    page_pool: remove PP_FLAG_PAGE_FRAG

    PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
    handling is unified and page_pool_alloc_frag() is supported
    in 32-bit arch with 64-bit DMA, so remove it.

    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 19:27:54 +02:00
Petr Oros a9e7bc19d6 net: skbuff: always try to recycle PP pages directly when in softirq
JIRA: https://issues.redhat.com/browse/RHEL-31941

Upstream commit(s):
commit 4a36d0180c452c3482792e0ff14e2bcf536a9284
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Fri Aug 4 20:05:29 2023 +0200

    net: skbuff: always try to recycle PP pages directly when in softirq

    Commit 8c48eea3adf3 ("page_pool: allow caching from safely localized
    NAPI") allowed direct recycling of skb pages to their PP for some cases,
    but unfortunately missed a couple of other major ones.
    For example, %XDP_DROP in skb mode. The netstack just calls kfree_skb(),
    which unconditionally passes `false` as @napi_safe. Thus, all pages go
    through ptr_ring and locks, although most of the time we're actually
    inside the NAPI polling this PP is linked with, so that it would be
    perfectly safe to recycle pages directly.
    Let's address this. If @napi_safe is true, we're fine, don't change
    anything for this path. But if it's false, check whether we are in
    softirq context. It will most likely be so, and then if ->list_owner
    is our current CPU, we're good to use direct recycling, even though
    @napi_safe is false -- concurrent access is excluded. The in_softirq()
    protection is needed mostly because we can hit this place in process
    context (not hardirq though).
    For the mentioned xdp-drop-skb-mode case, the improvement I got is
    3-4% in Mpps. As for page_pool stats, recycle_ring is now 0 and the
    alloc_slow counter doesn't change most of the time, which means the
    MM layer is not even called to allocate any new pages.

    Suggested-by: Jakub Kicinski <kuba@kernel.org> # in_softirq()
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/20230804180529.2483231-7-aleksander.lobakin@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 19:27:54 +02:00
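A minimal sketch of the widened condition inside the recycling path, assuming napi and pp are already resolved as in the message:
```
        /* Direct recycling is allowed when the caller asserted NAPI
         * safety, or when we're in softirq on the CPU that owns this
         * pool's NAPI instance -- concurrent access is then excluded. */
        allow_direct = napi_safe ||
                       (in_softirq() && napi &&
                        READ_ONCE(napi->list_owner) == smp_processor_id());
```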
Petr Oros af8cacc642 net: skbuff: avoid accessing page_pool if !napi_safe when returning page
JIRA: https://issues.redhat.com/browse/RHEL-31941

Upstream commit(s):
commit 5b899c33b3b852b9559b724cfee67801324a0886
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Fri Aug 4 20:05:27 2023 +0200

    net: skbuff: avoid accessing page_pool if !napi_safe when returning page

    Currently, pp->p.napi is always read, but the actual variable it gets
    assigned to is read only when @napi_safe is true. For the !napi_safe
    cases, which are still plenty, it's an unneeded operation.
    Moreover, it can lead to premature or even redundant page_pool
    cacheline access. For example, when page_pool_is_last_frag() returns
    false (with the recent frag improvements).
    Thus, read it only when @napi_safe is true. This also allows moving
    @napi inside the condition block itself. Constify it while we are
    here, because why not.

    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/20230804180529.2483231-5-aleksander.lobakin@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 19:27:54 +02:00
Patrick Talbert bed353f1ff Merge: CNB95: net: remove gfp_mask from napi_alloc_skb()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4100

JIRA: https://issues.redhat.com/browse/RHEL-32108
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-05-03 12:43:32 +02:00
Lucas Zampieri bce8aa3053 Merge: xfrm: backports from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4013

JIRA: https://issues.redhat.com/browse/RHEL-31751

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:34:25 +00:00
Izabela Bakollari d18d335178 net: remove gfp_mask from napi_alloc_skb()
JIRA: https://issues.redhat.com/browse/RHEL-32108

__napi_alloc_skb() is napi_alloc_skb() with the added flexibility
of choosing gfp_mask. This is a NAPI function, so GFP_ATOMIC is
implied. The only practical choice the caller has is whether to
set __GFP_NOWARN. But that's a false choice, too: allocation failures
in atomic context will happen, and printing warnings in logs,
effectively for a packet drop, is both too much and very likely
non-actionable.

This leads me to a conclusion that most uses of napi_alloc_skb()
are simply misguided, and should use __GFP_NOWARN in the first
place. We also have a "standard" way of reporting allocation
failures via the queue stat API (qstats::rx-alloc-fail).

The direct motivation for this patch is that one of the drivers
used at Meta calls napi_alloc_skb() (so prior to this patch without
__GFP_NOWARN), and the resulting OOM warning is the top networking
warning in our fleet.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240327040213.3153864-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 6e9b01909a811555ff3326cf80a5847169c57806)
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
2024-04-23 08:33:11 +02:00
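A minimal sketch of what changes at a driver call site; the old two-argument form is the one the message describes, the new one drops the mask:
```
        /* before: gfp chosen by the caller, warnings on failure */
        skb = __napi_alloc_skb(napi, len, GFP_ATOMIC);

        /* after: GFP_ATOMIC | __GFP_NOWARN implied by the API */
        skb = napi_alloc_skb(napi, len);
```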
Sabrina Dubroca 06fe287412 net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>
JIRA: https://issues.redhat.com/browse/RHEL-31751

Conflicts: context around #include in net/core/skbuff.c

commit 75eaf63ea7afeafd026ffef03bdc69e31f10829b
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Fri Aug 4 20:05:25 2023 +0200

    net: skbuff: don't include <net/page_pool/types.h> to <linux/skbuff.h>

    Currently, touching <net/page_pool/types.h> triggers a rebuild of more
    than half of the kernel. That's because it's included in
    <linux/skbuff.h>. And each new include in page_pool/types.h adds more
    [useless] data for the toolchain to process for each source file from
    that pile.

    In commit 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB
    recycling"), Matteo included it to be able to call a couple of functions
    defined there. Then, in commit 57f05bc2ab24 ("page_pool: keep pp info as
    long as page pool owns the page") one of the calls was removed, so only
    one was left. It's the call to page_pool_return_skb_page() in
    napi_frag_unref(). The function is external and doesn't have any
    dependencies. Having very niche page_pool_types.h included only for that
    looks like an overkill.

    As %PP_SIGNATURE is not local to page_pool.c (was only in the
    early submissions), nothing holds this function there. Teleport
    page_pool_return_skb_page() to skbuff.c, just next to the main consumer,
    skb_pp_recycle(), and rename it to napi_pp_put_page(), as it doesn't
    work with skbs at all and the former name tells nothing. The #if guards
    here are only to not compile and have it in the vmlinux when not needed
    -- both call sites are already guarded.
    Now, touching page_pool_types.h only triggers rebuilding of the drivers
    using it and a couple of core networking files.

    Suggested-by: Jakub Kicinski <kuba@kernel.org> # make skbuff.h less heavy
    Suggested-by: Alexander Duyck <alexanderduyck@fb.com> # move to skbuff.c
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/20230804180529.2483231-3-aleksander.lobakin@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-04-11 10:04:27 +02:00
Antoine Tenart b757199955 net: deal with integer overflows in kmalloc_reserve()
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git
Conflicts:
- Context difference due to missing upstream commit bf9f1baa279f ("net:
  add dedicated kmem_cache for typical/small skb->head") and follow-ups
  in c9s.

commit 915d975b2ffa58a14bfcf16fafe00c41315949ff
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Aug 31 18:37:50 2023 +0000

    net: deal with integer overflows in kmalloc_reserve()

    Blamed commit changed:
        ptr = kmalloc(size);
        if (ptr)
          size = ksize(ptr);

    to:
        size = kmalloc_size_roundup(size);
        ptr = kmalloc(size);

    This allowed various crashes, as reported by syzbot [1]
    and by Kyle Zeng.

    Problem is that if @size is bigger than 0x80000001,
    kmalloc_size_roundup(size) returns 2^32.

    kmalloc_reserve() uses a 32bit variable (obj_size),
    so 2^32 is truncated to 0.

    kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
    skb allocations.

    Following trace can be triggered if a netdev->mtu is set
    close to 0x7fffffff

    We might in the future limit netdev->mtu to more sensible
    limit (like KMALLOC_MAX_SIZE).

    This patch is based on a syzbot report, and also a report
    and tentative fix from Kyle Zeng.

    [1]
    BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
    BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
    Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554

    CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
    Call trace:
    dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
    show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
    print_report+0xe4/0x4b4 mm/kasan/report.c:398
    kasan_report+0x150/0x1ac mm/kasan/report.c:495
    kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
    memset+0x40/0x70 mm/kasan/shadow.c:44
    __build_skb_around net/core/skbuff.c:294 [inline]
    __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
    alloc_skb include/linux/skbuff.h:1316 [inline]
    igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
    add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
    igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
    igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
    call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
    expire_timers kernel/time/timer.c:1519 [inline]
    __run_timers+0x54c/0x710 kernel/time/timer.c:1790
    run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
    _stext+0x380/0xfbc
    ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
    call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
    do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
    invoke_softirq kernel/softirq.c:437 [inline]
    __irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
    irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
    el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
    __el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
    el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
    el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584

    Fixes: 12d6c1d3a2ad ("skbuff: Proactively round up to kmalloc bucket size")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
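A minimal illustration of the truncation described above (the bug, not the fix); the variable names mirror the message:
```
        size_t size = 0x80000001;       /* e.g. from a near-INT_MAX mtu */
        unsigned int obj_size;          /* 32-bit, as before the fix */

        size = kmalloc_size_roundup(size);      /* returns 2^32 here */
        obj_size = size;                        /* truncated to 0 */
        /* kmalloc(0) yields ZERO_SIZE_PTR, which the skb setup code
         * does not expect -- see the __build_skb_around() trace above. */
```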
Antoine Tenart 1f8dbc1cbd net: factorize code in kmalloc_reserve()
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git

commit 5c0e820cbbbe2d1c4cea5cd2bfc1302c123436df
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 6 17:31:02 2023 +0000

    net: factorize code in kmalloc_reserve()

    All kmalloc_reserve() callers have to make the same computation,
    we can factorize them, to prepare following patch in the series.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
Antoine Tenart 4acc1cb6c2 net: remove osize variable in __alloc_skb()
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git

commit 65998d2bf857b9ae5acc1f3b70892bd1b429ccab
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 6 17:31:01 2023 +0000

    net: remove osize variable in __alloc_skb()

    This is a cleanup patch, to prepare following change.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
Antoine Tenart d7c131c88f net: add SKB_HEAD_ALIGN() helper
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git

commit 115f1a5c42bdad9a9ea356fc0b4a39ec7537947f
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 6 17:31:00 2023 +0000

    net: add SKB_HEAD_ALIGN() helper

    We have many places using this expression:

     SKB_DATA_ALIGN(sizeof(struct skb_shared_info))

    Use of SKB_HEAD_ALIGN() will allow us to clean them up.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
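A minimal sketch of the helper, assuming the obvious composition of the two aligned terms the message names:
```
/* Head size aligned, plus aligned room for the shared info tail. */
#define SKB_HEAD_ALIGN(X) (SKB_DATA_ALIGN(X) + \
                           SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

/* e.g. in __alloc_skb():  size = SKB_HEAD_ALIGN(size); */
```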
Antoine Tenart 6529d7d8b2 skbuff: Proactively round up to kmalloc bucket size
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git

commit 12d6c1d3a2ad0c199ec57c201cdc71e8e157a232
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Oct 25 15:39:35 2022 -0700

    skbuff: Proactively round up to kmalloc bucket size

    Instead of discovering the kmalloc bucket size _after_ allocation, round
    up proactively so the allocation is explicitly made for the full size,
    allowing the compiler to correctly reason about the resulting size of
    the buffer through the existing __alloc_size() hint.

    This will allow for kernels built with CONFIG_UBSAN_BOUNDS or the
    coming dynamic bounds checking under CONFIG_FORTIFY_SOURCE to gain
    back the __alloc_size() hints that were temporarily reverted in commit
    93dd04ab0b2b ("slab: remove __alloc_size attribute from __kmalloc_track_caller")

    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Paolo Abeni <pabeni@redhat.com>
    Cc: netdev@vger.kernel.org
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221021234713.you.031-kees@kernel.org/
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20221025223811.up.360-kees@kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
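A minimal before/after sketch of the pattern the message describes:
```
        /* before: the real bucket size is only learned afterwards,
         * hiding it from the __alloc_size() hint */
        ptr = kmalloc(size);
        if (ptr)
                size = ksize(ptr);

        /* after: round up first, allocate exactly that much */
        size = kmalloc_size_roundup(size);
        ptr = kmalloc(size);
```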
Antoine Tenart e5433daab7 skbuff: pass the result of data ksize to __build_skb_around
JIRA: https://issues.redhat.com/browse/RHEL-28786
Upstream Status: linux.git

commit a5df6333f1a08380c3b94a02105482263711ed3a
Author: Li RongQing <lirongqing@baidu.com>
Date:   Wed Sep 22 14:17:19 2021 +0800

    skbuff: pass the result of data ksize to __build_skb_around

    Avoid calling ksize again in __build_skb_around by passing
    the result of ksize on data to __build_skb_around.

    An nginx stress test shows this change can reduce ksize cpu usage,
    and give a little performance boost.

    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-04-08 15:21:32 +02:00
Scott Weaver 68fc749fdd Merge: add kabi reserved fields and excludes to networking
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3584

JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: mostly RHEL-only patches

This series adds reserved fields to networking structs, and excludes
some areas of networking from the kABI guarantee. These reserved
fields are only needed during backports to z-stream.

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-22 12:01:19 -05:00
Paolo Abeni 3649e472b9 net: prevent mss overflow in skb_segment()
JIRA: https://issues.redhat.com/browse/RHEL-21447
Tested: LNST, Tier1
Conflicts: rhel lacks the upstream commit 867046cc7027 ("minmax: relax
  check to allow comparison between unsigned arguments and signed
  constant"), use the min_t() macro instead of plain min() to avoid a
  compile warning.

Upstream commit:
commit 23d05d563b7e7b0314e65c8e882bc27eac2da8e7
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Dec 12 16:46:21 2023 +0000

    net: prevent mss overflow in skb_segment()

    Once again syzbot is able to crash the kernel in skb_segment() [1]

    GSO_BY_FRAGS is a forbidden value, but unfortunately the following
    computation in skb_segment() can reach it quite easily:

            mss = mss * partial_segs;

    65535 = 3 * 5 * 17 * 257, so many initial values of mss can lead to
    a bad final result.

    Make sure to limit segmentation so that the new mss value is smaller
    than GSO_BY_FRAGS.
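
    A hedged sketch of the clamp as adapted here (per the Conflicts note
    above, min_t() stands in for upstream's plain min()):

```
        /* cap partial_segs so the recomputed mss can never reach the
         * reserved GSO_BY_FRAGS value (65535 = 3 * 5 * 17 * 257)
         */
        partial_segs = min_t(unsigned int, len, GSO_BY_FRAGS - 1) / mss;
        if (partial_segs > 1)
                mss *= partial_segs;    /* stays below GSO_BY_FRAGS */
        else
                partial_segs = 0;
```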

    [1]

    general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
    CPU: 1 PID: 5079 Comm: syz-executor993 Not tainted 6.7.0-rc4-syzkaller-00141-g1ae4cd3cbdd0 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
    RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
    Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
    RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
    RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
    RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
    R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
    R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
    FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    udp6_ufo_fragment+0xa0e/0xd00 net/ipv6/udp_offload.c:109
    ipv6_gso_segment+0x534/0x17e0 net/ipv6/ip6_offload.c:120
    skb_mac_gso_segment+0x290/0x610 net/core/gso.c:53
    __skb_gso_segment+0x339/0x710 net/core/gso.c:124
    skb_gso_segment include/net/gso.h:83 [inline]
    validate_xmit_skb+0x36c/0xeb0 net/core/dev.c:3626
    __dev_queue_xmit+0x6f3/0x3d60 net/core/dev.c:4338
    dev_queue_xmit include/linux/netdevice.h:3134 [inline]
    packet_xmit+0x257/0x380 net/packet/af_packet.c:276
    packet_snd net/packet/af_packet.c:3087 [inline]
    packet_sendmsg+0x24c6/0x5220 net/packet/af_packet.c:3119
    sock_sendmsg_nosec net/socket.c:730 [inline]
    __sock_sendmsg+0xd5/0x180 net/socket.c:745
    __sys_sendto+0x255/0x340 net/socket.c:2190
    __do_sys_sendto net/socket.c:2202 [inline]
    __se_sys_sendto net/socket.c:2198 [inline]
    __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0x40/0x110 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x63/0x6b
    RIP: 0033:0x7f8692032aa9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 d1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fff8d685418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8692032aa9
    RDX: 0000000000010048 RSI: 00000000200000c0 RDI: 0000000000000003
    RBP: 00000000000f4240 R08: 0000000020000540 R09: 0000000000000014
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff8d685480
    R13: 0000000000000001 R14: 00007fff8d685480 R15: 0000000000000003
    </TASK>
    Modules linked in:
    ---[ end trace 0000000000000000 ]---
    RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
    Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
    RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
    RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
    RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
    R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
    R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
    FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 3953c46c3a ("sk_buff: allow segmenting based on frag sizes")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/20231212164621.4131800-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 17:13:08 +01:00
Sabrina Dubroca 986c223fcf net: add reserved fields to sk_buff using custom code
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: RHEL-only

Add 16 bytes of reserved space to sk_buff, using the same mechanism as
in RHEL8, to make future use of that space less error-prone.

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-01-12 14:27:38 +01:00
Petr Oros 8333fb8ac8 page_pool: split types and declarations from page_pool.h
JIRA: https://issues.redhat.com/browse/RHEL-16983

Conflicts:
- net/core/skbuff.c:
   adjusted context conflict due to missing 78476d315e1905 ("mctp: Add flow
   extension to skb")
- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h:
   adjusted context conflict due to missing 87a9b2fd9288c5 ("net: hns3: add
   support for TX push mode")
- drivers/net/ethernet/mediatek/mtk_eth_soc.[c|h]
   Chunks omitted due to lack of page_pool support in the driver. Missing
   upstream commit 23233e577ef973 ("net: ethernet: mtk_eth_soc: rely on
   page_pool for single page buffers")
- drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
   adjusted context conflict due to missing 67f245c2ec0af1 ("mlx5:
   bpf_xdp_metadata_rx_hash add xdp rss hash type")
- drivers/net/ethernet/microsoft/mana/mana_en.c
   adjusted context conflict due to missing 92272ec4107ef4 ("eth: add
   missing xdp.h includes in drivers")
- drivers/net/veth.c
   Chunks omitted due to missing 0ebab78cbcbfd6 ("net: veth: add page_pool
   for page recycling")
- Unmerged paths (missing in RHEL):
   drivers/net/ethernet/engleder/tsnep_main.c,
   drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c,
   drivers/net/ethernet/microchip/lan966x/lan966x_main.h,
   drivers/net/ethernet/wangxun/libwx/wx_lib.c

Upstream commit(s):
commit a9ca9f9ceff382b58b488248f0c0da9e157f5d06
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date:   Fri Aug 4 20:05:24 2023 +0200

    page_pool: split types and declarations from page_pool.h

    Split types and pure function declarations from page_pool.h
    and add them in page_pool/types.h, so that C sources can
    include page_pool.h and headers should generally only include
    page_pool/types.h, as suggested by Jakub.
    Rename page_pool.h to page_pool/helpers.h to have both in
    one place.
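
    The resulting include convention, as a short sketch:

```
/* in headers: only the types and declarations are needed */
#include <net/page_pool/types.h>

/* in C sources: the full helper set (the old <net/page_pool.h>) */
#include <net/page_pool/helpers.h>
```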

    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com
    [Jakub: change microsoft/mana, fix kdoc paths in Documentation]
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-11-30 19:11:24 +01:00
Jan Stancek 9eea5b8c8f Merge: net: backport drop reason related patches
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3265

JIRA: https://issues.redhat.com/browse/RHEL-14554
Depends: !3196

Skb drop reason related patches, plus a few extra ones to make future
backports easier.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:49:35 +01:00
Jan Stancek 25a74f04f2 Merge: net-core: stable backports for 9.4 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3251

net-core: stable backports for 9.4 phase 1

JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1

A bunch of fixes for the networking core, including a few
serious ones.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:49:02 +01:00
Antoine Tenart 8ad0fa4384 net: skb_queue_purge_reason() optimizations
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: net-next.git

commit d86e5fbd4c965fdda72f99ccd54a1031ea4df51d
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Oct 3 18:19:20 2023 +0000

    net: skb_queue_purge_reason() optimizations

    1) Exit early if the list is empty.

    2) Splice the list into a local list,
       so that we block hard irqs only once.
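
    A hedged sketch of the optimized shape covering both points above
    (close to, though not necessarily identical to, the upstream body):

```
void skb_queue_purge_reason(struct sk_buff_head *list,
                            enum skb_drop_reason reason)
{
        struct sk_buff_head tmp;
        unsigned long flags;

        if (skb_queue_empty_lockless(list))     /* (1) exit early */
                return;

        __skb_queue_head_init(&tmp);

        spin_lock_irqsave(&list->lock, flags);  /* (2) one irq-off window */
        skb_queue_splice_init(list, &tmp);
        spin_unlock_irqrestore(&list->lock, flags);

        __skb_queue_purge_reason(&tmp, reason); /* free outside the lock */
}
```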

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Link: https://lore.kernel.org/r/20231003181920.3280453-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:30 +01:00
Antoine Tenart 005a9126b4 net: add skb_queue_purge_reason and __skb_queue_purge_reason
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit 4025d3e73abde4f65f4b04d4b1d8449b00e31473
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 18 09:40:39 2023 +0000

    net: add skb_queue_purge_reason and __skb_queue_purge_reason

    skb_queue_purge() and __skb_queue_purge() become wrappers
    around the new generic functions.

    A new SKB_DROP_REASON_QUEUE_PURGE drop reason is added as the default,
    but users can start adding more specific reasons.
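
    A hedged sketch of the wrappers described above (shapes as expected
    in include/linux/skbuff.h):

```
static inline void skb_queue_purge(struct sk_buff_head *list)
{
        skb_queue_purge_reason(list, SKB_DROP_REASON_QUEUE_PURGE);
}

static inline void __skb_queue_purge(struct sk_buff_head *list)
{
        __skb_queue_purge_reason(list, SKB_DROP_REASON_QUEUE_PURGE);
}
```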

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:30 +01:00
Antoine Tenart b2c4833a40 net: skbuff: update and rename __kfree_skb_defer()
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 19 19:00:05 2023 -0700

    net: skbuff: update and rename __kfree_skb_defer()

    __kfree_skb_defer() uses the old naming where "defer" meant
    slab bulk free/alloc APIs. In the meantime we also made
    __kfree_skb_defer() feed the per-NAPI skb cache, which
    implies bulk APIs. So take away the 'defer' and add 'napi'.

    While at it add a drop reason. This only matters on the
    tx_action path, if the skb has a frag_list. But getting
    rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
    benefit so why not.
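
    As a hedged sketch, the post-rename helper is assumed here to be
    __napi_kfree_skb(), following the "drop 'defer', add 'napi'"
    rationale above, and now takes an explicit drop reason:

```
/* old: void __kfree_skb_defer(struct sk_buff *skb);
 * new (assumed name, with a drop reason instead of the implicit
 * SKB_DROP_REASON_NOT_SPECIFIED):
 */
void __napi_kfree_skb(struct sk_buff *skb, enum skb_drop_reason reason);
```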

    Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:29 +01:00
Scott Weaver 6cf5659031 Merge: CNB94: page_pool: allow caching from safely localized NAPI
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196

JIRA: https://issues.redhat.com/browse/RHEL-12613
Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644]

Commits:
```
4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c")
eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list")
9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list")
b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths")
8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI")
dd64b232deb8 ("page_pool: unlink from napi during destroy")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-09 07:22:35 -05:00
Ivan Vecera d80ce17d20 page_pool: allow caching from safely localized NAPI
JIRA: https://issues.redhat.com/browse/RHEL-12613

Conflicts:
- simple context conflict in net/core/dev.c due to absence of commit
  8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra
  NET_RX_SOFTIRQ") that is out of scope of this series

commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 12 21:26:04 2023 -0700

    page_pool: allow caching from safely localized NAPI

    Recent patches to mlx5 mentioned a regression when moving from
    driver local page pool to only using the generic page pool code.
    Page pool has two recycling paths: (1) a direct one, which runs in
    safe NAPI context (basically consumer context, so producing
    can be lockless); and (2) one via a ptr_ring, which takes a spin
    lock because the freeing can happen from any CPU; producer
    and consumer may run concurrently.

    Since the page pool code was added, Eric introduced a revised version
    of deferred skb freeing. TCP skbs are now usually returned to the CPU
    which allocated them, and freed in softirq context. This places the
    freeing (producing of pages back to the pool) enticingly close to
    the allocation (consumer).

    If we can prove that we're freeing in the same softirq context in which
    the consumer NAPI will run - lockless use of the cache is perfectly fine,
    no need for the lock.

    Let drivers link the page pool to a NAPI instance. If the NAPI instance
    is scheduled on the same CPU on which we're freeing - place the pages
    in the direct cache.
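
    A hedged sketch of a driver opting in ('rxr' and its members are
    hypothetical; the 'napi' field is the one this change adds to
    struct page_pool_params):

```
struct page_pool_params pp_params = {
        .order          = 0,
        .pool_size      = 1024,
        .nid            = NUMA_NO_NODE,
        .dev            = rxr->dev,
        .napi           = &rxr->napi,   /* direct-cache recycling when this
                                         * NAPI runs on the freeing CPU */
};
struct page_pool *pool = page_pool_create(&pp_params);
```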

    With that and patched bnxt (XDP enabled to engage the page pool, sigh,
    bnxt really needs page pool work :() I see a 2.6% perf boost with
    a TCP stream test (app on a different physical core than softirq).

    The CPU use of relevant functions decreases as expected:

      page_pool_refill_alloc_cache   1.17% -> 0%
      _raw_spin_lock                 2.41% -> 0.98%

    Only consider the lockless path to be safe when NAPI is scheduled
    - in practice this should cover the majority, if not all, of steady
    state workloads. It's usually the NAPI kicking in that causes the
    skb flush.

    The main case we'll miss out on is when application runs on the same
    CPU as NAPI. In that case we don't use the deferred skb free path.

    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-31 15:09:26 +01:00
Ivan Vecera 8d205ac34b net: skb: plumb napi state thru skb freeing paths
JIRA: https://issues.redhat.com/browse/RHEL-12613

Conflicts:
- adjusted due to lack of commit bf9f1baa279f ("net: add dedicated
  kmem_cache for typical/small skb->head")

commit b07a2d97ba5ef154fe736aa510e43a3299eee5f8
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 12 21:26:03 2023 -0700

    net: skb: plumb napi state thru skb freeing paths

    We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
    Going forward we will also try to cache head and data pages.
    Plumb the "are we in a normal NAPI context" information deeper
    into the freeing path, down to skb_release_data() and
    skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
    case comes from netpoll, which passes a budget of 0 to reap
    Tx completions without performing any Rx.

    Use "bool napi_safe" rather than bare "int budget",
    the further we get from NAPI the more confusing the budget
    argument may seem (particularly whether 0 or MAX is the
    correct value to pass in when not in NAPI).
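
    A hedged sketch of the budget-to-napi_safe translation (signatures
    simplified; skb_release_all() gaining the extra flag is the assumed
    shape of the plumbing):

```
void napi_consume_skb(struct sk_buff *skb, int budget)
{
        /* zero budget == netpoll Tx reaping: not a normal NAPI context */
        if (unlikely(!budget)) {
                dev_consume_skb_any(skb);
                return;
        }
        /* ... */
        skb_release_all(skb, SKB_CONSUMED, !!budget /* napi_safe */);
        /* ... */
}
```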

    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-31 15:09:08 +01:00
Paolo Abeni ed60ca7371 skbuff: skb_segment, Call zero copy functions before using skbuff frags
JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1

Upstream commit:
commit 2ea35288c83b3d501a88bc17f2df8f176b5cc96f
Author: Mohamed Khalfella <mkhalfella@purestorage.com>
Date:   Thu Aug 31 02:17:02 2023 -0600

    skbuff: skb_segment, Call zero copy functions before using skbuff frags

    Commit bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions
    once per nskb") added the call to zero copy functions in skb_segment().
    The change introduced a bug in skb_segment() because skb_orphan_frags()
    may change the number of fragments or allocate new fragments
    altogether, leaving nrfrags and frag pointing at stale values. This can
    cause a panic with a stack trace like the one below.

    [  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
    [  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
    [  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
    [  194.021892] Call Trace:
    [  194.027422]  <TASK>
    [  194.072861]  tcp_gso_segment+0x107/0x540
    [  194.082031]  inet_gso_segment+0x15c/0x3d0
    [  194.090783]  skb_mac_gso_segment+0x9f/0x110
    [  194.095016]  __skb_gso_segment+0xc1/0x190
    [  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
    [  194.107071]  dev_qdisc_enqueue+0x16/0x70
    [  194.110884]  __dev_queue_xmit+0x63b/0xb30
    [  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
    [  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
    [  194.131787]  __dev_queue_xmit+0x8a0/0xb30
    [  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
    [  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
    [  194.144622]  sch_direct_xmit+0xe3/0x280
    [  194.147748]  __dev_queue_xmit+0x54a/0xb30
    [  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
    [  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
    [  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
    [  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
    [  194.176959]  vhost_worker+0x76/0xb0 [vhost]
    [  194.183667]  kthread+0x118/0x140
    [  194.190358]  ret_from_fork+0x1f/0x30
    [  194.193670]  </TASK>

    In this case, calling skb_orphan_frags() updated nr_frags, leaving the
    nrfrags local variable in skb_segment() stale. This resulted in the
    code hitting i >= nrfrags prematurely and trying to move to the next
    frag_skb using the list_skb pointer, which was NULL, and caused a
    kernel panic. Move the call to the zero copy functions before using
    frags and nr_frags, as in the sketch below.
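
    A hedged sketch of the reordering (variable names approximate the
    upstream function):

```
        /* orphan/clone zerocopy state first: this may reallocate the
         * frag array
         */
        if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
            skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
                goto err;

        /* only now cache the (possibly updated) frag state */
        nfrags = skb_shinfo(frag_skb)->nr_frags;
        frag = skb_shinfo(frag_skb)->frags;
```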

    Fixes: bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
    Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
    Reported-by: Amit Goyal <agoyal@purestorage.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 13:49:01 +02:00
Paolo Abeni 68e7745c12 net: prevent skb corruption on frag list segmentation
JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1

Upstream commit:
commit c329b261afe71197d9da83c1f18eb45a7e97e089
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Jul 7 10:11:10 2023 +0200

    net: prevent skb corruption on frag list segmentation

    Ian reported several skb corruptions triggered by rx-gro-list,
    collecting different oops alike:

    [   62.624003] BUG: kernel NULL pointer dereference, address: 00000000000000c0
    [   62.631083] #PF: supervisor read access in kernel mode
    [   62.636312] #PF: error_code(0x0000) - not-present page
    [   62.641541] PGD 0 P4D 0
    [   62.644174] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [   62.648629] CPU: 1 PID: 913 Comm: napi/eno2-79 Not tainted 6.4.0 #364
    [   62.655162] Hardware name: Supermicro Super Server/A2SDi-12C-HLN4F, BIOS 1.7a 10/13/2022
    [   62.663344] RIP: 0010:__udp_gso_segment (./include/linux/skbuff.h:2858
    ./include/linux/udp.h:23 net/ipv4/udp_offload.c:228 net/ipv4/udp_offload.c:261
    net/ipv4/udp_offload.c:277)
    [   62.687193] RSP: 0018:ffffbd3a83b4f868 EFLAGS: 00010246
    [   62.692515] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000
    [   62.699743] RDX: ffffa124def8a000 RSI: 0000000000000079 RDI: ffffa125952a14d4
    [   62.706970] RBP: ffffa124def8a000 R08: 0000000000000022 R09: 00002000001558c9
    [   62.714199] R10: 0000000000000000 R11: 00000000be554639 R12: 00000000000000e2
    [   62.721426] R13: ffffa125952a1400 R14: ffffa125952a1400 R15: 00002000001558c9
    [   62.728654] FS:  0000000000000000(0000) GS:ffffa127efa40000(0000)
    knlGS:0000000000000000
    [   62.736852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   62.742702] CR2: 00000000000000c0 CR3: 00000001034b0000 CR4: 00000000003526e0
    [   62.749948] Call Trace:
    [   62.752498]  <TASK>
    [   62.779267] inet_gso_segment (net/ipv4/af_inet.c:1398)
    [   62.787605] skb_mac_gso_segment (net/core/gro.c:141)
    [   62.791906] __skb_gso_segment (net/core/dev.c:3403 (discriminator 2))
    [   62.800492] validate_xmit_skb (./include/linux/netdevice.h:4862
    net/core/dev.c:3659)
    [   62.804695] validate_xmit_skb_list (net/core/dev.c:3710)
    [   62.809158] sch_direct_xmit (net/sched/sch_generic.c:330)
    [   62.813198] __dev_queue_xmit (net/core/dev.c:3805 net/core/dev.c:4210)
    net/netfilter/core.c:626)
    [   62.821093] br_dev_queue_push_xmit (net/bridge/br_forward.c:55)
    [   62.825652] maybe_deliver (net/bridge/br_forward.c:193)
    [   62.829420] br_flood (net/bridge/br_forward.c:233)
    [   62.832758] br_handle_frame_finish (net/bridge/br_input.c:215)
    [   62.837403] br_handle_frame (net/bridge/br_input.c:298
    net/bridge/br_input.c:416)
    [   62.851417] __netif_receive_skb_core.constprop.0 (net/core/dev.c:5387)
    [   62.866114] __netif_receive_skb_list_core (net/core/dev.c:5570)
    [   62.871367] netif_receive_skb_list_internal (net/core/dev.c:5638
    net/core/dev.c:5727)
    [   62.876795] napi_complete_done (./include/linux/list.h:37
    ./include/net/gro.h:434 ./include/net/gro.h:429 net/core/dev.c:6067)
    [   62.881004] ixgbe_poll (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3191)
    [   62.893534] __napi_poll (net/core/dev.c:6498)
    [   62.897133] napi_threaded_poll (./include/linux/netpoll.h:89
    net/core/dev.c:6640)
    [   62.905276] kthread (kernel/kthread.c:379)
    [   62.913435] ret_from_fork (arch/x86/entry/entry_64.S:314)
    [   62.917119]  </TASK>

    In the critical scenario, rx-gro-list GRO-ed packets are fed, via a
    bridge, both to the local input path and to an egress device (tun).

    The segmentation of such packets unsafely writes to the cloned skbs
    with shared heads.

    This change addresses the issue by uncloning the to-be-segmented
    skbs as needed.
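
    A hedged sketch of the fix in skb_segment_list() (the error label is
    assumed from context):

```
        /* make sure the head we are about to write to is not shared */
        err = skb_unclone(skb, GFP_ATOMIC);
        if (err)
                goto err_linearize;
```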

    Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
    Tested-by: Ian Kumlien <ian.kumlien@gmail.com>
    Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-20 13:43:45 +02:00
Scott Weaver 03206d751a Merge: CNB94: net: move gso declarations and functions to their own files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3198

JIRA: https://issues.redhat.com/browse/RHEL-12679
Tested: Just built... no functional change

Commits:
```
d457a0e329b0 ("net: move gso declarations and functions to their own files")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-19 10:36:22 -04:00