Commit Graph

930 Commits

Author SHA1 Message Date
Ivan Vecera 497f645693 net: move gso declarations and functions to their own files
JIRA: https://issues.redhat.com/browse/RHEL-12679

commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 8 19:17:37 2023 +0000

    net: move gso declarations and functions to their own files

    Move declarations into include/net/gso.h and code into net/core/gso.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 13:35:27 +02:00
Ivan Vecera b4aa21f5ad net: introduce and use skb_frag_fill_page_desc()
JIRA: https://issues.redhat.com/browse/RHEL-12625

Conflicts:
* drivers/net/ethernet/freescale/enetc/enetc.c
  - context due to missing 8feb020f92a5 ("net: ethernet: enetc: unlock
    XDP_REDIRECT for XDP non-linear buffers")
* drivers/net/ethernet/fungible/funeth/funeth_rx.c
  - removed hunk for non-existing file
* drivers/net/ethernet/marvell/mvneta.c
  - context due to missing 76a676947b56 ("net: mvneta: update frags bit
    before passing the xdp buffer to eBPF layer")
* drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
  - adjusted due to missing 27602319e328 ("net/mlx5e: RX, Take shared
    info fragment addition into a function")

commit b51f4113ebb02011f0ca86abc3134b28d2071b6a
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date:   Thu May 11 09:12:12 2023 +0800

    net: introduce and use skb_frag_fill_page_desc()

    Most users use __skb_frag_set_page()/skb_frag_off_set()/
    skb_frag_size_set() to fill the page desc for a skb frag.

    Introduce skb_frag_fill_page_desc() to do that.

    net/bpf/test_run.c does not call skb_frag_off_set() to
    set the offset. "copy_from_user(page_address(page), ...)"
    and 'shinfo' being part of the 'data' kzalloc'ed in
    bpf_test_init() suggest that it assumes the offset is
    initialized to zero, so call skb_frag_fill_page_desc()
    with an offset of zero for this case.

    Also, skb_frag_set_page() is not used anymore, so remove
    it.

    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 12:38:04 +02:00
Ivan Vecera c756370130 net: introduce skb_poison_list and use in kfree_skb_list
JIRA: https://issues.redhat.com/browse/RHEL-12613

commit 9dde0cd3b10f63bc4100ebadc7e32275baabfa68
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Feb 3 13:59:29 2023 +0100

    net: introduce skb_poison_list and use in kfree_skb_list

    The first user of skb_poison_list is kfree_skb_list_reason, to catch bugs
    like the one introduced in commit eedade12f4cb ("net: kfree_skb_list use
    kmem_cache_free_bulk") earlier. For completeness, that bug has been fixed in
    commit f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list").

    With a bug like the one in that commit, we would have seen an OOPS with:
     general protection fault, probably for non-canonical address 0xdead000000000870
    and the content of one of the registers, e.g. R13: dead000000000800

    In this case skb->len is at offset 112 bytes (0x70), which is why the fault
    happens at 0x800+0x70 = 0x870

    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 12:16:14 +02:00
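
The fault address quoted in the skb_poison_list entry above is simply the poisoned pointer plus the field offset. A minimal Python sketch of that arithmetic, using only the values given in the commit message. The helper names are hypothetical, and the canonical-address check assumes x86_64 with 48-bit virtual addresses, which the message does not state:

```python
# Values taken from the commit message above.
POISONED_NEXT = 0xDEAD000000000800  # register content reported in the OOPS (R13)
SKB_LEN_OFFSET = 112                # offset of skb->len in bytes (0x70)


def fault_address(poisoned_ptr: int, field_offset: int) -> int:
    """Address touched when a field is read through a poisoned pointer."""
    return poisoned_ptr + field_offset


def is_canonical_x86_64(addr: int) -> bool:
    """Assumption: 48-bit virtual addresses, so bits 63..47 must all be equal."""
    upper = addr >> 47
    return upper == 0 or upper == (1 << 17) - 1


if __name__ == "__main__":
    addr = fault_address(POISONED_NEXT, SKB_LEN_OFFSET)
    print(hex(addr))                  # address reported by the fault
    print(is_canonical_x86_64(addr))  # whether a dereference could succeed
```

Because the poisoned value is non-canonical under that assumption, any dereference through it faults immediately instead of silently reading stale memory.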
Ivan Vecera 1c444c6edb net: fix kfree_skb_list use of skb_mark_not_on_list
JIRA: https://issues.redhat.com/browse/RHEL-12613

commit f72ff8b81ebc6a0a25e41b7e6c1dc42e3aa33e7e
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Jan 20 11:34:44 2023 +0100

    net: fix kfree_skb_list use of skb_mark_not_on_list

    A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use
    kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via
    invoking skb_mark_not_on_list().

    In this patch we choose to remove the skb_mark_not_on_list() call as it
    isn't necessary. It would be possible and correct to call
    skb_mark_not_on_list() only when __kfree_skb_reason() returns true,
    meaning the SKB is ready to be freed, as it calls and checks skb_unref().

    This fix is needed as kfree_skb_list() is also invoked on skb_shared_info
    frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can
    have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(),
    which takes a reference on all SKBs in the list. This implies the
    invariant that all SKBs in the list must have the same refcnt, when using
    kfree_skb_list().

    Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
    Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 12:16:03 +02:00
Ivan Vecera bb18a44e29 net: kfree_skb_list use kmem_cache_free_bulk
JIRA: https://issues.redhat.com/browse/RHEL-12613

commit eedade12f4cb7284555c4c0314485e9575c70ab7
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Jan 13 14:52:04 2023 +0100

    net: kfree_skb_list use kmem_cache_free_bulk

    The kfree_skb_list function walks SKB (via skb->next) and frees them
    individually to the SLUB/SLAB allocator (kmem_cache). It is more
    efficient to bulk free them via the kmem_cache_free_bulk API.

    This patch creates a stack-local array of SKBs to bulk free while
    walking the list. The bulk array size is limited to 16 SKBs to trade off
    stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
    uses an objsize of 256 bytes, usually in an order-1 page of 8192 bytes,
    that is 32 objects per slab (can vary on archs and due to SLUB sharing).
    Thus, for SLUB the optimal bulk free case is 32 objects belonging to the
    same slab, but at runtime this isn't likely to occur.

    The expected gain from using the kmem_cache bulk alloc and free API
    has been assessed via a microbenchmark kernel module[1].

    The module 'slab_bulk_test01' results at bulk 16 element:
     kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
     kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)

    A more detailed description of the benchmarks is available in [2].

    [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
    [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org

    V2: rename function to kfree_skb_add_bulk.

    Reviewed-by: Saeed Mahameed <saeed@kernel.org>
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 12:15:45 +02:00
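
The batching idea in the kfree_skb_list entry above can be modelled in a few lines of Python. This is a rough user-space sketch, not the kernel code: the function names are hypothetical, every list element is treated as ready to free (the kernel only batches SKBs whose last reference was dropped), and the numbers are the ones quoted in the commit message:

```python
from typing import Callable, Iterable, List

BULK_SIZE = 16  # bulk array size chosen in the commit message


def objects_per_slab(slab_bytes: int, object_bytes: int) -> int:
    """Number of objects of a given size that fit in one slab."""
    return slab_bytes // object_bytes


def free_list_in_bulk(items: Iterable[object],
                      bulk_free: Callable[[List[object]], None],
                      bulk_size: int = BULK_SIZE) -> None:
    """Walk the list, collect items in a bounded array, flush it when full."""
    batch: List[object] = []
    for item in items:
        batch.append(item)
        if len(batch) == bulk_size:
            bulk_free(batch)
            batch = []
    if batch:  # flush the partially filled array at the end of the list
        bulk_free(batch)


def per_element_saving(loop_cycles: int, bulk_cycles: int) -> float:
    """Fraction of cycles saved per element by the bulk path."""
    return (loop_cycles - bulk_cycles) / loop_cycles


if __name__ == "__main__":
    flushed: List[int] = []
    free_list_in_bulk(range(40), lambda batch: flushed.append(len(batch)))
    print(flushed)                          # sizes of each bulk-free call
    print(objects_per_slab(8192, 256))      # order-1 slab, 256-byte objects
    print(per_element_saving(109, 64))      # benchmark figures at bulk 16
```

Capping the array at 16 entries bounds the stack footprint while still amortizing the allocator call over many objects.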
Ivan Vecera 858a781232 net: skb: move skb_pp_recycle() to skbuff.c
JIRA: https://issues.redhat.com/browse/RHEL-12613

commit 4727bab4e9bbeafeff6acdfcb077a7a548cbde30
Author: Yunsheng Lin <linyunsheng@huawei.com>
Date:   Fri Oct 21 10:58:22 2022 +0800

    net: skb: move skb_pp_recycle() to skbuff.c

    skb_pp_recycle() is only used by skb_free_head() in
    skbuff.c, so move it to skbuff.c.

    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 12:15:39 +02:00
Jan Stancek 8b67bdc2b8 Merge: CNB: net: extend drop reasons for multiple subsystems
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2703

Bugzilla: https://bugzilla.redhat.com/2215988

commit 071c0fc6fb919dcf29c676a842dda08a674877d7
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Wed Apr 19 14:52:53 2023 +0200

    net: extend drop reasons for multiple subsystems

    Extend drop reasons to make them usable by subsystems
    other than core by reserving the high 16 bits for a
    new subsystem ID, of which 0 of course is used for the
    existing reasons immediately.

    To still be able to have string reasons, restructure
    that code a bit to do the lookup under RCU. The only
    user of this (right now) is drop_monitor.

    Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: John B. Wyatt IV <jwyatt@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-08-14 14:00:39 +02:00
Paolo Abeni b7607ad33f net: fix skb leak in __skb_tstamp_tx()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

Upstream commit:
commit 8a02fb71d7192ff1a9a47c9d937624966c6e09af
Author: Pratyush Yadav <ptyadav@amazon.de>
Date:   Mon May 22 17:30:20 2023 +0200

    net: fix skb leak in __skb_tstamp_tx()

    Commit 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with
    TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with
    zerocopy skbs. But it ended up adding a leak of its own. When
    skb_orphan_frags_rx() fails, the function just returns, leaking the skb
    it just cloned. Free it before returning.

    This bug was discovered and resolved using Coverity Static Analysis
    Security Testing (SAST) by Synopsys, Inc.

    Fixes: 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.")
    Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/20230522153020.32422-1-ptyadav@amazon.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 16:58:50 +02:00
Paolo Abeni bfc3e077cb tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

Upstream commit:
commit 50749f2dd6854a41830996ad302aef2ffaf011d8
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Apr 24 15:20:22 2023 -0700

    tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.

    syzkaller reported [0] memory leaks of a UDP socket and ZEROCOPY
    skbs.  We can reproduce the problem with this sequence:

      sk = socket(AF_INET, SOCK_DGRAM, 0)
      sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
      sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
      sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
      sk.close()

    sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
    skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
    ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
    skb is sent, __skb_tstamp_tx() clones it and puts the clone into
    the socket's error queue with the TX timestamp.

    When the original skb is received locally, skb_copy_ubufs() calls
    skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
    This additional count is decremented while freeing the skb, but struct
    ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
    not called.

    The last refcnt is not released unless we retrieve the TX timestamped
    skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
    after the socket's refcnt reaches 0, there is a circular dependency.
    If we close() the socket holding such skbs, we never call sock_put()
    and leak the count, sk, and skb.

    TCP has the same problem, and commit e0c8bccd40fc ("net: stream:
    purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
    by calling skb_queue_purge() during close().  However, there is a
    small chance that skb queued in a qdisc or device could be put
    into the error queue after the skb_queue_purge() call.

    In __skb_tstamp_tx(), the cloned skb should not have a reference
    to the ubuf to remove the circular dependency, but skb_clone() does
    not call skb_copy_ubufs() for zerocopy skb.  So, we need to call
    skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().

    [0]:
    BUG: memory leak
    unreferenced object 0xffff88800c6d2d00 (size 1152):
      comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
        02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
      backtrace:
        [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
        [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
        [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
        [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
        [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
        [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
        [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
        [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
        [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
        [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
        [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
        [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
        [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
        [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    BUG: memory leak
    unreferenced object 0xffff888017633a00 (size 240):
      comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
      backtrace:
        [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
        [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
        [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
        [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
        [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
        [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
        [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
        [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
        [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
        [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
        [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
        [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
        [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
        [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
        [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
        [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
        [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
    Fixes: b5947e5d1e ("udp: msg_zerocopy")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 16:58:41 +02:00
Paolo Abeni a7c60d11db skbuff: Fix a race between coalescing and releasing SKBs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

Upstream commit:
commit 0646dc31ca886693274df5749cd0c8c1eaaeb5ca
Author: Liang Chen <liangchen.linux@gmail.com>
Date:   Thu Apr 13 17:03:53 2023 +0800

    skbuff: Fix a race between coalescing and releasing SKBs

    Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
    recycling") allowed coalescing to proceed with non page pool page and page
    pool page when @from is cloned, i.e.

    to->pp_recycle    --> false
    from->pp_recycle  --> true
    skb_cloned(from)  --> true

    However, it actually requires skb_cloned(@from) to hold true until
    coalescing finishes in this situation. If the other cloned SKB is
    released while the merging is in process, from_shinfo->nr_frags will be
    set to 0 toward the end of the function, causing the increment of frag
    page _refcount to be unexpectedly skipped resulting in inconsistent
    reference counts. Later when SKB(@to) is released, it frees the page
    directly even though the page pool page is still in use, leading to
    use-after-free or double-free errors. So it should be prohibited.

    The double-free error message below prompted us to investigate:
    BUG: Bad page state in process swapper/1  pfn:0e0d1
    page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
    index:0x2 pfn:0xe0d1
    flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
    raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
    raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
    page dumped because: nonzero _refcount

    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
    Call Trace:
     <IRQ>
    dump_stack_lvl+0x32/0x50
    bad_page+0x69/0xf0
    free_pcp_prepare+0x260/0x2f0
    free_unref_page+0x20/0x1c0
    skb_release_data+0x10b/0x1a0
    napi_consume_skb+0x56/0x150
    net_rx_action+0xf0/0x350
    ? __napi_schedule+0x79/0x90
    __do_softirq+0xc8/0x2b1
    __irq_exit_rcu+0xb9/0xf0
    common_interrupt+0x82/0xa0
    </IRQ>
    <TASK>
    asm_common_interrupt+0x22/0x40
    RIP: 0010:default_idle+0xb/0x20

    Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
    Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 16:57:58 +02:00
Íñigo Huguet c9a53b31b9 net: extend drop reasons for multiple subsystems
Bugzilla: https://bugzilla.redhat.com/2215988

Conflicts: context conflict due to missing 78476d315e19 ("mctp: Add flow
extension to skb")

commit 071c0fc6fb919dcf29c676a842dda08a674877d7
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Wed Apr 19 14:52:53 2023 +0200

    net: extend drop reasons for multiple subsystems
    
    Extend drop reasons to make them usable by subsystems
    other than core by reserving the high 16 bits for a
    new subsystem ID, of which 0 of course is used for the
    existing reasons immediately.
    
    To still be able to have string reasons, restructure
    that code a bit to do the lookup under RCU. The only
    user of this (right now) is drop_monitor.
    
    Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-06-20 09:18:16 +02:00
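
The encoding described in the drop-reason entry above, with the high 16 bits reserved for a subsystem ID and 0 used for the existing core reasons, can be sketched as follows. The constant and function names are hypothetical, and the sketch assumes a 32-bit value whose low 16 bits carry the per-subsystem reason:

```python
from typing import Tuple

SUBSYS_SHIFT = 16                      # high 16 bits carry the subsystem ID
REASON_MASK = (1 << SUBSYS_SHIFT) - 1  # assumption: low 16 bits carry the reason
CORE_SUBSYS = 0                        # existing (core) reasons use subsystem 0


def make_drop_reason(subsys: int, reason: int) -> int:
    """Pack a subsystem ID and a per-subsystem reason into one value."""
    if not 0 <= subsys <= 0xFFFF or not 0 <= reason <= REASON_MASK:
        raise ValueError("subsystem and reason must each fit in 16 bits")
    return (subsys << SUBSYS_SHIFT) | reason


def split_drop_reason(value: int) -> Tuple[int, int]:
    """Unpack a value into (subsystem ID, per-subsystem reason)."""
    return value >> SUBSYS_SHIFT, value & REASON_MASK


if __name__ == "__main__":
    print(hex(make_drop_reason(CORE_SUBSYS, 2)))  # core reasons keep their value
    print(hex(make_drop_reason(1, 5)))
    print(split_drop_reason(0x10005))
```

Using subsystem 0 for the core means every pre-existing reason value stays numerically unchanged.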
Antoine Tenart d48044618a net: add location to trace_consume_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit dd1b527831a3ed659afa01b672d8e1f7e6ca95a5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 16 15:47:18 2023 +0000

    net: add location to trace_consume_skb()

    kfree_skb() includes the location, it makes sense
    to add it to consume_skb() as well.

    After patch:

     taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
             swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
          discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
             swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
             borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
             swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
                wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:26 +02:00
Antoine Tenart a49af01c77 net: fix call location in kfree_skb_list_reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git
Conflicts:
- DEBUG_NET_WARN_ON_ONCE wasn't used in the removed chunk because of a
  missing dependency in c9s when that chunk was first applied, but now
  DEBUG_NET_WARN_ON_ONCE is available so we can use it instead.

commit a4650da2a2d6150a8ff1ea36fde9f6a26cf5fda3
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Jan 13 14:51:59 2023 +0100

    net: fix call location in kfree_skb_list_reason

    The SKB drop reason uses __builtin_return_address(0) to give the call
    "location" to trace_kfree_skb() tracepoint skb:kfree_skb.

    To keep this stable for compilers kfree_skb_reason() is annotated with
    __fix_address (noinline __noclone) as fixed in commit c205cc7534a9
    ("net: skb: prevent the split of kfree_skb_reason() by gcc").

    The function kfree_skb_list_reason() invokes kfree_skb_reason(), which
    causes the __builtin_return_address(0) "location" to report the
    unexpected address of kfree_skb_list_reason.

    Example output from 'perf script':
     kpktgend_0  1337 [000]    81.002597: skb:kfree_skb: skbaddr=0xffff888144824700 protocol=2048 location=kfree_skb_list_reason+0x1e reason: QDISC_DROP

    This patch creates an __always_inline __kfree_skb_reason() helper that
    is called from both kfree_skb_list() and kfree_skb_list_reason().
    Suggestions for solutions that share code better are welcome.

    As preparation for the next patch, move the __kfree_skb() invocation out
    of this helper function.

    Reviewed-by: Saeed Mahameed <saeed@kernel.org>
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-02 14:52:02 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve.  Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jan Stancek fa72082f2d Merge: net: core: stable backports for 9.3 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2408

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Depends: !2404

A bunch of fixes from upstream, affecting the core networking
implementation.

This also includes a couple of fixes for tun/tap, strictly tied to
commit "net: add sock_init_data_uid()"

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:41 +02:00
Jan Stancek 04554d1843 Merge: bpf, xdp: update to 6.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2317

Rebase bpf and xdp to 6.2.

Bugzilla: https://bugzilla.redhat.com/2177177

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-11 12:12:10 +02:00
Jeff Moyer 5fbf8901c6 net: shrink struct ubuf_info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit e7d2b510165fff6bedc9cca88c071ad846850c74
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Sep 23 17:39:04 2022 +0100

    net: shrink struct ubuf_info
    
    We can benefit from a smaller struct ubuf_info, so leave only mandatory
    fields and let users decide how they want to extend it. Convert
    MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
    This reduces the size from 48 bytes to just 16.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:25:02 -04:00
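
The 48-to-16-byte reduction in the ubuf_info entry above can be reproduced with a rough layout model. The field lists below are assumptions about the structure before and after the patch and are not taken from the commit message; pointers and unsigned long are modelled as 8-byte integers, so the result applies to 64-bit builds only:

```python
import ctypes


class UbufInfoBefore(ctypes.Structure):
    """Assumed pre-patch layout, with the MSG_ZEROCOPY fields embedded."""
    _fields_ = [
        ("callback", ctypes.c_uint64),    # completion callback pointer
        ("desc", ctypes.c_uint64),        # zerocopy bookkeeping (largest union member)
        ("ctx", ctypes.c_uint64),
        ("refcnt", ctypes.c_uint32),
        ("flags", ctypes.c_uint8),
        ("mmp_user", ctypes.c_uint64),    # pinned-memory accounting
        ("mmp_num_pg", ctypes.c_uint32),
    ]


class UbufInfoAfter(ctypes.Structure):
    """Assumed post-patch layout: only the fields every user needs."""
    _fields_ = [
        ("callback", ctypes.c_uint64),
        ("refcnt", ctypes.c_uint32),
        ("flags", ctypes.c_uint8),
    ]


if __name__ == "__main__":
    print(ctypes.sizeof(UbufInfoBefore))  # size including alignment padding
    print(ctypes.sizeof(UbufInfoAfter))
```

Under these assumptions the removed fields are the ones that only MSG_ZEROCOPY needs, which is why they can move into a separate ubuf_info_msgzc wrapper.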
Paolo Abeni 4b042c9aa8 net: fix NULL pointer in skb_segment_list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit 876e8ca8366735a604bac86ff7e2732fc9d85d2d
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jan 30 12:51:48 2023 -0800

    net: fix NULL pointer in skb_segment_list

    Commit 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
    introduced UDP listified GRO. The segmentation relies on frag_list being
    untouched when passing through the network stack. This assumption can
    sometimes be broken when frag_list itself gets pulled into the linear
    area, leaving frag_list NULL. When this happens it can trigger the
    following NULL pointer dereference and panic the kernel. Reversing the
    test condition should fix it.

    [19185.577801][    C1] BUG: kernel NULL pointer dereference, address:
    ...
    [19185.663775][    C1] RIP: 0010:skb_segment_list+0x1cc/0x390
    ...
    [19185.834644][    C1] Call Trace:
    [19185.841730][    C1]  <TASK>
    [19185.848563][    C1]  __udp_gso_segment+0x33e/0x510
    [19185.857370][    C1]  inet_gso_segment+0x15b/0x3e0
    [19185.866059][    C1]  skb_mac_gso_segment+0x97/0x110
    [19185.874939][    C1]  __skb_gso_segment+0xb2/0x160
    [19185.883646][    C1]  udp_queue_rcv_skb+0xc3/0x1d0
    [19185.892319][    C1]  udp_unicast_rcv_skb+0x75/0x90
    [19185.900979][    C1]  ip_protocol_deliver_rcu+0xd2/0x200
    [19185.910003][    C1]  ip_local_deliver_finish+0x44/0x60
    [19185.918757][    C1]  __netif_receive_skb_one_core+0x8b/0xa0
    [19185.927834][    C1]  process_backlog+0x88/0x130
    [19185.935840][    C1]  __napi_poll+0x27/0x150
    [19185.943447][    C1]  net_rx_action+0x27e/0x5f0
    [19185.951331][    C1]  ? mlx5_cq_tasklet_cb+0x70/0x160 [mlx5_core]
    [19185.960848][    C1]  __do_softirq+0xbc/0x25d
    [19185.968607][    C1]  irq_exit_rcu+0x83/0xb0
    [19185.976247][    C1]  common_interrupt+0x43/0xa0
    [19185.984235][    C1]  asm_common_interrupt+0x22/0x40
    ...
    [19186.094106][    C1]  </TASK>

    Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
    Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/Y9gt5EUizK1UImEP@debian
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
Xin Long c57185f9a4 tcp: fix skb_copy_ubufs() vs BIG TCP
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: compile only

commit 7e692df3933628d974acb9f5b334d2b3e885e2a6
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 28 04:32:31 2023 +0000

    tcp: fix skb_copy_ubufs() vs BIG TCP

    David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
    using hugepages, and skb length bigger than ~68 KB.

    skb_copy_ubufs() assumed it could copy all payload using up to
    MAX_SKB_FRAGS order-0 pages.

    This assumption broke when BIG TCP was able to put up to 512 KB per skb.

    We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
    and limit gso_max_size to 180000.

    A solution is to use higher order pages if needed.

    v2: add missing __GFP_COMP, or we leak memory.

    Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536")
    Reported-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Xin Long <lucien.xin@gmail.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-05-02 10:36:11 -04:00
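
The size limits in the skb_copy_ubufs() entry above can be checked with simple arithmetic. This sketch assumes 4 KiB base pages and a default MAX_SKB_FRAGS of 17, neither of which is stated in the commit message, although together they match its "~68 KB" figure. The function names are hypothetical and the loop only illustrates "use higher order pages if needed"; it is not the kernel's code:

```python
PAGE_SIZE = 4096            # assumption: 4 KiB base pages
DEFAULT_MAX_SKB_FRAGS = 17  # assumption: consistent with the "~68 KB" figure


def max_copy_bytes(max_frags: int, order: int = 0,
                   page_size: int = PAGE_SIZE) -> int:
    """Payload that fits in max_frags pages of the given order."""
    return max_frags * (page_size << order)


def min_page_order(payload_bytes: int, max_frags: int,
                   page_size: int = PAGE_SIZE) -> int:
    """Smallest page order at which max_frags pages hold the payload."""
    order = 0
    while max_copy_bytes(max_frags, order, page_size) < payload_bytes:
        order += 1
    return order


if __name__ == "__main__":
    print(max_copy_bytes(DEFAULT_MAX_SKB_FRAGS))              # order-0 limit
    print(min_page_order(512 * 1024, DEFAULT_MAX_SKB_FRAGS))  # BIG TCP skb
    print(min_page_order(180000, 45))                         # Google's setup
```

With 45 frags and gso_max_size capped at 180000, order-0 pages already suffice, which is consistent with the message's remark that the bug was not hit in that configuration.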
Jeff Moyer 9f4bd88ef7 net: introduce managed frags infrastructure
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 753f1ca4e1e50248a1b760c9774d6d6b354562cc
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:31 2022 +0100

    net: introduce managed frags infrastructure
    
    Some users like io_uring can do page pinning more efficiently, so we
    want a way to delegate referencing to other subsystems. For that add
    a new flag called SKBFL_MANAGED_FRAG_REFS. When set, skb doesn't hold
    page references and upper layers are responsible for managing page
    lifetime.
    
    It's allowed to convert skbs from managed to normal by calling
    skb_zcopy_downgrade_managed(). The function will take all needed
    page references and clear the flag. It's needed, for instance,
    to avoid mixing managed modes.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:07:02 -04:00
Jeff Moyer 1a5bb38f72 net: Allow custom iter handler in msghdr
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit ebe73a284f4de8c5d401adeccd9b8fe3183b6e95
Author: David Ahern <dsahern@kernel.org>
Date:   Tue Jul 12 21:52:30 2022 +0100

    net: Allow custom iter handler in msghdr
    
    Add support for custom iov_iter handling to msghdr. The idea is that
    in-kernel subsystems want control over how an SG is split.
    
    Signed-off-by: David Ahern <dsahern@kernel.org>
    [pavel: move callback into msghdr]
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:06:02 -04:00
Jeff Moyer f82d792280 skbuff: add SKBFL_DONT_ORPHAN flag
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 2e07a521e1e424787af3bfc59615de4220856c35
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:28 2022 +0100

    skbuff: add SKBFL_DONT_ORPHAN flag
    
    We don't want to list every single ubuf_info callback in
    skb_orphan_frags(), add a flag controlling the behaviour.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:04:02 -04:00
Jeff Moyer d19688b83d net: avoid double accounting for pure zerocopy skbs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 9b65b17db72313b7a4fe9bc9502928c88be57986
Author: Talal Ahmad <talalahmad@google.com>
Date:   Tue Nov 2 22:58:44 2021 -0400

    net: avoid double accounting for pure zerocopy skbs
    
    Track skbs containing only zerocopy data and avoid charging them to
    kernel memory to correctly account the memory utilization for
    msg_zerocopy. All of the data in such skbs is held in user pages which
    are already accounted to user. Before this change, they are charged
    again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
    excessive because data is not being copied into skb frags. This
    excessive charging can lead to kernel going into memory pressure
    state which impacts all sockets in the system adversely. Mark pure
    zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
    charge/uncharge for data in such skbs.
    
    Initially, an skb is marked pure zerocopy when it is empty and in
    zerocopy path. skb can then change from a pure zerocopy skb to mixed
    data skb (zerocopy and copy data) if it is at tail of write queue and
    there is room available in it and non-zerocopy data is being sent in
    the next sendmsg call. At this time sk_mem_charge is done for the pure
    zerocopied data and the pure zerocopy flag is unmarked. We found that
    this happens very rarely on workloads that pass MSG_ZEROCOPY.
    
    A pure zerocopy skb can later be coalesced into normal skb if they are
    next to each other in queue but this patch prevents coalescing from
    happening. This avoids complexity of charging when skb downgrades from
    pure zerocopy to mixed. This is also rare.
    
    In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
    for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
    tcp_skb_entail for an skb without data.
    
    Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
    with zerocopy showed that before this patch the 'sock' variable in
    memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
    sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
    change it is 0. This is due to no charge to sk_forward_alloc for
    zerocopy data and shows memory utilization for kernel is lowered.
    
    With this commit we don't see the warning we saw in previous commit
    which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194.
    
    Signed-off-by: Talal Ahmad <talalahmad@google.com>
    Acked-by: Arjun Roy <arjunroy@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:03:02 -04:00
Jeff Moyer b09042f567 skbuff: don't mix ubuf_info from different sources
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 1b4b2b09d4fb451029b112f17d34792e0277aeb2
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:27 2022 +0100

    skbuff: don't mix ubuf_info from different sources
    
    We should not append MSG_ZEROCOPY requests to an skbuff with a non
    MSG_ZEROCOPY ubuf_info; they might not be compatible.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:01:02 -04:00
Jeff Moyer 3d8947f865 net: inline skb_zerocopy_iter_dgram
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 657dd5f97b2ed16cdaa6339f42f9130240af1c04
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Apr 28 11:58:45 2022 +0100

    net: inline skb_zerocopy_iter_dgram
    
    skb_zerocopy_iter_dgram() is a small proxy function, inline it. For
    that, move __zerocopy_sg_from_iter into linux/skbuff.h
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:55:02 -04:00
Jiri Benc f9d22d44f4 net: skb: remove old comments about frag_size for build_skb()
Bugzilla: https://bugzilla.redhat.com/2177177

commit 12c1604ae1a39bef87ac099f106594b4cb433b75
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Fri Jan 6 18:29:04 2023 -0800

    net: skb: remove old comments about frag_size for build_skb()
    
    Since commit ce098da1497c ("skbuff: Introduce slab_build_skb()")
    drivers trying to build skb around slab-backed buffers should
    go via slab_build_skb() rather than passing frag_size = 0 to
    the main build_skb().
    
    Remove the copy'n'pasted comments about 0 meaning slab.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2023-04-28 11:43:23 +02:00
Jiri Benc 4c362dc0ad skbuff: Introduce slab_build_skb()
Bugzilla: https://bugzilla.redhat.com/2177177

Conflicts:
- Only the networking core changes backported. Drivers will be updated
  at their own pace.
- Removed WARN_ONCE on old API usage to allow gradual change of drivers.

Omitted-fix: 8c495270845d ("bnx2x: use the right build_skb() helper")

commit ce098da1497c6dee9589fce2c61d1910f4fcf0e7
Author: Kees Cook <keescook@chromium.org>
Date:   Wed Dec 7 22:02:59 2022 -0800

    skbuff: Introduce slab_build_skb()

    syzkaller reported:

      BUG: KASAN: slab-out-of-bounds in __build_skb_around+0x235/0x340 net/core/skbuff.c:294
      Write of size 32 at addr ffff88802aa172c0 by task syz-executor413/5295

    For bpf_prog_test_run_skb(), which uses a kmalloc()ed buffer passed to
    build_skb().

    When build_skb() is passed a frag_size of 0, it means the buffer came
    from kmalloc. In these cases, ksize() is used to find its actual size,
    but since the allocation may not have been made to that size, actually
    perform the krealloc() call so that all the associated buffer size
    checking will be correctly notified (and use the "new" pointer so that
    compiler hinting works correctly). Split this logic out into a new
    interface, slab_build_skb(), but leave the original 0 checking for now
    to catch any stragglers.

    Reported-by: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com
    Link: https://groups.google.com/g/syzkaller-bugs/c/UnIKxTtU5-0/m/-wbXinkgAQAJ
    Fixes: 38931d8989b5 ("mm: Make ksize() a reporting-only function")
    Cc: Pavel Begunkov <asml.silence@gmail.com>
    Cc: pepsipu <soopthegoop@gmail.com>
    Cc: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: kasan-dev <kasan-dev@googlegroups.com>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: ast@kernel.org
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Hao Luo <haoluo@google.com>
    Cc: Jesper Dangaard Brouer <hawk@kernel.org>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: jolsa@kernel.org
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: martin.lau@linux.dev
    Cc: Stanislav Fomichev <sdf@google.com>
    Cc: song@kernel.org
    Cc: Yonghong Song <yhs@fb.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20221208060256.give.994-kees@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2023-04-28 11:43:23 +02:00
Marc Dionne 5c216693ef rxrpc: Save last ACK's SACK table rather than marking txbufs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170099
JIRA: https://issues.redhat.com/browse/RHELPLAN-148774

commit d57a3a151660902091491ac2633134e1be92557f
Author: David Howells <dhowells@redhat.com>
Date:   Sat May 7 10:06:13 2022 +0100

    Improve the tracking of which packets need to be transmitted by saving the
    last ACK packet that we receive that has a populated soft-ACK table rather
    than marking packets.  Then we can step through the soft-ACK table and look
    at the packets we've transmitted beyond that to determine which packets we
    might want to retransmit.

    We also look at the highest serial number that has been acked to try and
    guess which packets we've transmitted the peer is likely to have seen.  If
    necessary, we send a ping to retrieve that number.

    One downside that might be a problem is that we can't then compare the
    previous acked/unacked state so easily in rxrpc_input_soft_acks() - which
    is a potential problem for the slow-start algorithm.

    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Marc Dionne <marc.dionne@auristor.com>
    cc: linux-afs@lists.infradead.org

Signed-off-by: Marc Dionne <mdionne@redhat.com>
2023-03-15 13:18:37 -03:00
Herton R. Krzesinski 0c207b7728 Merge: Attend warnings with gcc 11&12 when building kernel and modules
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1852

Bugzilla: https://bugzilla.redhat.com/2159468

Address the warnings encountered when building the CentOS Stream 9 kernel and
modules for x86_64 and arm64 using GCC11 (cs9) and GCC12 (f37).

Some warnings end up being disabled upstream (-Wdangling-pointer,
-Warray-bounds for GCC12 and -Wdeprecated for the sign-file.c), so backport
these changes to align with this behavior.

A few configurations were introduced to deal with -Werror and specific warnings
depending on the toolchain and target architecture. This merge-request tries to
bring these relevant patches without breaking compilation (e.g, CONFIG_WERROR
is introduced, but not set).

https://bugzilla.redhat.com/2142659 was already opened separately to
address the -Wstringop-overread in net/core/dev.c.

Signed-off-by: Eric Chanudet <echanude@redhat.com>

Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Lenny Szubowicz <lszubowi@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jonathan Toppins <jtoppins@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-02-15 18:54:57 +00:00
Jiri Benc ffcea318b6 net: gso: fix panic on frag_list with mixed head alloc types
Bugzilla: https://bugzilla.redhat.com/2166641

commit 9e4b7a99a03aefd37ba7bb1f022c8efab5019165
Author: Jiri Benc <jbenc@redhat.com>
Date:   Wed Nov 2 17:53:25 2022 +0100

    net: gso: fix panic on frag_list with mixed head alloc types

    Since commit 3dcbdb134f ("net: gso: Fix skb_segment splat when
    splitting gso_size mangled skb having linear-headed frag_list"), it is
    allowed to change gso_size of a GRO packet. However, that commit assumes
    that "checking the first list_skb member suffices; i.e if either of the
    list_skb members have non head_frag head, then the first one has too".

    It turns out this assumption does not hold. We've seen BUG_ON being hit
    in skb_segment when skbs on the frag_list had differing head_frag with
    the vmxnet3 driver. This happens because __netdev_alloc_skb and
    __napi_alloc_skb can return a skb that is page backed or kmalloced
    depending on the requested size. As the result, the last small skb in
    the GRO packet can be kmalloced.

    There are three different locations where this can be fixed:

    (1) We could check head_frag in GRO and not allow GROing skbs with
        different head_frag. However, that would lead to performance
        regression on normal forward paths with unmodified gso_size, where
        !head_frag in the last packet is not a problem.

    (2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating
        that NETIF_F_SG is undesirable. That would need to eat a bit in
        sk_buff. Furthermore, that flag can be unset when all skbs on the
        frag_list are page backed. To retain good performance,
        bpf_skb_net_grow/shrink would have to walk the frag_list.

    (3) Walk the frag_list in skb_segment when determining whether
        NETIF_F_SG should be cleared. This of course slows things down.

    This patch implements (3). To limit the performance impact in
    skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set
    that have gso_size changed. Normal paths thus will not hit it.

    We could check only the last skb but since we need to walk the whole
    list anyway, let's stay on the safe side.

    Fixes: 3dcbdb134f ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
    Signed-off-by: Jiri Benc <jbenc@redhat.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/e04426a6a91baf4d1081e1b478c82b5de25fdf21.1667407944.git.jbenc@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2023-02-02 14:56:30 +01:00
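
Option (3) from the entry above, walking the frag_list to decide whether scatter-gather must be disabled, can be sketched as follows. This is a simplified userspace model under stated assumptions, not the skb_segment() code: `struct sketch_skb` and `sketch_needs_linear_copy()` are hypothetical names, and only the two properties the commit message discusses (linear head length and whether the head is page backed) are modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields discussed above. */
struct sketch_skb {
	struct sketch_skb *next; /* next skb on the frag_list */
	bool head_frag;          /* true: page-backed head, false: kmalloced */
	unsigned int headlen;    /* bytes in the linear head */
};

/*
 * Walk the whole frag_list and report whether any member has a
 * non-empty, kmalloced head. Checking only the first member is the
 * assumption that turned out not to hold.
 */
static bool sketch_needs_linear_copy(const struct sketch_skb *frag_list)
{
	const struct sketch_skb *iter;

	for (iter = frag_list; iter; iter = iter->next) {
		if (iter->headlen && !iter->head_frag)
			return true;
	}
	return false;
}
```

A caller would clear its scatter-gather capability when this returns true. As the message notes, the real walk is only done for skbs with SKB_GSO_DODGY set whose gso_size was changed, so normal paths do not pay for it.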
Herton R. Krzesinski a63de8eac1 Merge: net: skb free reason sync part 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1814

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181

Add one extra series of skb free reasons to c9s. Those patches will be nice to
have as one is reordering the free reasons enum we backported earlier in this
cycle (adding one special reason at the start) and we'll avoid changing the
free reason enum values in between versions.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-01-13 14:29:22 +00:00
Eric Chanudet fe0f041ef7 skbuff: Switch structure bounds to struct_group()
Bugzilla: https://bugzilla.redhat.com/2159468

commit 03f61041c17914355dde7261be9ccdc821ddd454
Author: Kees Cook <keescook@chromium.org>
Date:   Sat Nov 20 16:31:49 2021 -0800

    skbuff: Switch structure bounds to struct_group()

    In preparation for FORTIFY_SOURCE performing compile-time and run-time
    field bounds checking for memcpy(), memmove(), and memset(), avoid
    intentionally writing across neighboring fields.

    Replace the existing empty member position markers "headers_start" and
    "headers_end" with a struct_group(). This will allow memcpy() and sizeof()
    to more easily reason about sizes, and improve readability.

    "pahole" shows no size nor member offset changes to struct sk_buff.
    "objdump -d" shows no object code changes (outside of WARNs affected by
    source line number changes).

    Signed-off-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/*
    Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Eric Chanudet <echanude@redhat.com>
2023-01-09 13:32:41 -05:00
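
The idea behind struct_group() in the entry above can be illustrated with a reduced macro. This is a sketch under assumptions, not the kernel definition: `SKETCH_STRUCT_GROUP`, `struct sketch_pkt` and `sketch_copy_headers()` are hypothetical, and the macro omits the attribute and tag handling of the real helper. It relies on C11 anonymous structs and unions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Declare the same members twice inside an anonymous union: once
 * anonymously (so pkt->protocol keeps working) and once under NAME
 * (so the whole range is a single object with a well-defined size).
 */
#define SKETCH_STRUCT_GROUP(NAME, ...) \
	union { \
		struct { __VA_ARGS__ }; \
		struct { __VA_ARGS__ } NAME; \
	}

struct sketch_pkt {
	int refcnt;
	SKETCH_STRUCT_GROUP(headers,
		unsigned short protocol;
		unsigned short vlan_tci;
		unsigned int mark;
	);
	void *data;
};

/* Copy only the grouped fields; the bounds come from the group itself. */
static void sketch_copy_headers(struct sketch_pkt *dst,
				const struct sketch_pkt *src)
{
	memcpy(&dst->headers, &src->headers, sizeof(dst->headers));
}
```

Compared with empty "start" and "end" marker members, the copy no longer spans several distinct fields from the compiler's point of view, which is what lets fortified memcpy() bounds checks reason about it.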
Herton R. Krzesinski 19ce0cbd76 Merge: bpf, xdp: update to 5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1533

bpf, xdp: update to 5.19

Bugzilla: http://bugzilla.redhat.com/2120968
Bugzilla: http://bugzilla.redhat.com/2130850
Bugzilla: http://bugzilla.redhat.com/2140077


Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-21 20:49:27 +00:00
Antoine Tenart bb3ee6fbc5 net: dropreason: propagate drop_reason to skb_release_data()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181
Upstream Status: net.git

commit 511a3eda2f8d4719114ee3f2c781c37233bd171f
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Oct 29 15:45:17 2022 +0000

    net: dropreason: propagate drop_reason to skb_release_data()

    When an skb with a frag list is consumed, we currently
    pretend all skbs in the frag list were dropped.

    In order to fix this, add a @reason argument to skb_release_data()
    and skb_release_all().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-12-21 15:06:17 +01:00
Antoine Tenart 2dc0e2d4a8 net: dropreason: add SKB_CONSUMED reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181
Upstream Status: net.git

commit 0e84afe8ebfbb9eade3f4f6de4720887bf908e26
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Oct 29 15:45:16 2022 +0000

    net: dropreason: add SKB_CONSUMED reason

    This will allow us to simply use, in the future:

            kfree_skb_reason(skb, reason);

    Instead of repeating sequences like:

            if (dropped)
                kfree_skb_reason(skb, reason);
            else
                consume_skb(skb);

    For instance, following patch in the series is adding
    @reason to skb_release_data() and skb_release_all(),
    so that we can propagate a meaningful @reason whenever
    consume_skb()/kfree_skb() have to take care of a potential frag_list.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-12-21 15:06:17 +01:00
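
The simplification described in the entry above, one free call that takes a reason instead of a branch between two free functions, can be modelled in userspace. This is only a sketch: the enum values, `struct sketch_stats` and `sketch_free_reason()` are hypothetical, and the counters stand in for whatever accounting or tracing the real functions perform.

```c
/* Hypothetical reasons; "consumed" is a reason like any other. */
enum sketch_reason {
	SKETCH_CONSUMED,
	SKETCH_DROP_NOT_SPECIFIED,
	SKETCH_DROP_NO_SOCKET,
};

struct sketch_stats {
	unsigned int consumed; /* packets that were processed normally */
	unsigned int dropped;  /* packets that were discarded */
};

/*
 * Single entry point: the caller passes the reason and the callee
 * decides whether this counts as a consume or as a drop.
 */
static void sketch_free_reason(struct sketch_stats *st,
			       enum sketch_reason reason)
{
	if (reason == SKETCH_CONSUMED)
		st->consumed++;
	else
		st->dropped++;
}
```

Because the reason travels with the call, it can also be passed down unchanged to helpers that free a frag_list, so the members of a consumed skb are not reported as dropped.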
Herton R. Krzesinski 09736a3a30 Merge: udp: some performance optimizations
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test

This series improves UDP protocol RX tput, to keep it on an equal footing with rhel-8.

Patches 1,3,4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling unrelated features.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-13 17:35:03 +00:00
Felix Maurer 13bc0343bd net: Change skb_ensure_writable()'s write_len param to unsigned int type
Bugzilla: https://bugzilla.redhat.com/2120968

commit 92ece28072f18f30099770c5d4b8e300ea6820fa
Author: Liu Jian <liujian56@huawei.com>
Date:   Sat Apr 16 18:58:00 2022 +0800

    net: Change skb_ensure_writable()'s write_len param to unsigned int type
    
    Both pskb_may_pull() and skb_clone_writable()'s length parameters are of
    type unsigned int already. Therefore, change this function's write_len
    param to unsigned int type.
    
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220416105801.88708-3-liujian56@huawei.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:09 +02:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Davide Caratti d011a96301 net: do not sense pfmemalloc status in skb_append_pagefrags()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net.git commit 228ebc41dfab

commit 228ebc41dfab5b5d34cd76835ddb0ca8ee12f513
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Oct 27 04:03:46 2022 +0000

    net: do not sense pfmemalloc status in skb_append_pagefrags()

    skb_append_pagefrags() is used by af_unix and udp sendpage()
    implementation so far.

    In commit 326140063946 ("tcp: TX zerocopy should not sense
    pfmemalloc status") we explained why we should not sense
    pfmemalloc status for pages owned by user space.

    We should also use skb_fill_page_desc_noacc()
    in skb_append_pagefrags() to avoid following KCSAN report:

    BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags

    write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0:
    __list_add include/linux/list.h:73 [inline]
    list_add include/linux/list.h:88 [inline]
    lruvec_add_folio include/linux/mm_inline.h:323 [inline]
    lru_add_fn+0x327/0x410 mm/swap.c:228
    folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
    lru_add_drain_cpu+0x73/0x250 mm/swap.c:669
    lru_add_drain+0x21/0x60 mm/swap.c:773
    free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311
    tlb_batch_pages_flush mm/mmu_gather.c:59 [inline]
    tlb_flush_mmu_free mm/mmu_gather.c:256 [inline]
    tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263
    tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363
    exit_mmap+0x190/0x4d0 mm/mmap.c:3098
    __mmput+0x27/0x1b0 kernel/fork.c:1185
    mmput+0x3d/0x50 kernel/fork.c:1207
    copy_process+0x19fc/0x2100 kernel/fork.c:2518
    kernel_clone+0x166/0x550 kernel/fork.c:2671
    __do_sys_clone kernel/fork.c:2812 [inline]
    __se_sys_clone kernel/fork.c:2796 [inline]
    __x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1:
    page_is_pfmemalloc include/linux/mm.h:1817 [inline]
    __skb_fill_page_desc include/linux/skbuff.h:2432 [inline]
    skb_fill_page_desc include/linux/skbuff.h:2453 [inline]
    skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974
    unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338
    kernel_sendpage+0x184/0x300 net/socket.c:3561
    sock_sendpage+0x5a/0x70 net/socket.c:1054
    pipe_to_sendpage+0x128/0x160 fs/splice.c:361
    splice_from_pipe_feed fs/splice.c:415 [inline]
    __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
    splice_from_pipe fs/splice.c:594 [inline]
    generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
    do_splice_from fs/splice.c:764 [inline]
    direct_splice_actor+0x80/0xa0 fs/splice.c:931
    splice_direct_to_actor+0x305/0x620 fs/splice.c:886
    do_splice_direct+0xfb/0x180 fs/splice.c:974
    do_sendfile+0x3bf/0x910 fs/read_write.c:1255
    __do_sys_sendfile64 fs/read_write.c:1323 [inline]
    __se_sys_sendfile64 fs/read_write.c:1309 [inline]
    __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0x0000000000000000 -> 0xffffea00058fc188

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022

    Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:19:56 +01:00
Paolo Abeni 022665bacd net: skb: introduce and use a single page frag cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 28 10:43:09 2022 +0200

    net: skb: introduce and use a single page frag cache

    After commit 3226b158e6 ("net: avoid 32 x truesize under-estimation
    for tiny skbs") we are observing 10-20% regressions in performance
    tests with small packets. The perf trace points to high pressure on
    the slab allocator.

    This change tries to improve the allocation schema for small packets
    using an idea originally suggested by Eric: a new per CPU page frag is
    introduced and used in __napi_alloc_skb to cope with small allocation
    requests.

    To ensure that the above does not lead to excessive truesize
    underestimation, the frag size for small allocation is inflated to 1K
    and all the above is restricted to build with 4K page size.

    Note that we need to update accordingly the run-time check introduced
    with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").

    Alex suggested a smart page refcount schema to reduce the number
    of atomic operations and deal properly with pfmemalloc pages.

    Under small packet UDP flood, I measure a 15% peak tput increase.

    Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:12:04 +02:00
Jiri Benc cb80b39939 net: fix wrong network header length
Bugzilla: https://bugzilla.redhat.com/2120966

commit cf3ab8d4a797960b4be20565abb3bcd227b18a68
Author: Lina Wang <lina.wang@mediatek.com>
Date:   Thu May 5 13:48:49 2022 +0800

    net: fix wrong network header length

    When clatd starts with eBPF offloading and NETIF_F_GRO_FRAGLIST is enabled,
    several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's
    IPv6 header is changed to IPv4 by bpf_skb_proto_6_to_4, and its
    network_header, transport_header and mac_header are updated to match IPv4,
    but the other skbs in frag_list are not updated and remain IPv6 packets.

    udp_queue_rcv_skb will call skb_segment_list to traverse other skbs in
    frag_list and make sure right udp payload is delivered to user space.
    Unfortunately, other skbs in frag_list who are still ipv6 packets are
    updated like the first skb and will have wrong transport header length.

    For example, before bpf_skb_proto_6_to_4 the first skb and the other skbs
    in frag_list have the same network_header (24) and transport_header (64).
    After bpf_skb_proto_6_to_4 has changed IPv6 to IPv4, the first skb's
    network_header is 44 and its transport_header is 64, while the other skbs
    in frag_list are unchanged. After skb_segment_list, the other skbs in
    frag_list have network_header 24 and transport_header 44, which is 20
    bytes off from the original: the difference between the IPv6 and IPv4
    header lengths. Just restore transport_header to its original value.

    Actually, there are two solutions to fix it, one is traversing all skbs
    and changing every skb header in bpf_skb_proto_6_to_4, the other is
    modifying frag_list skb's header in skb_segment_list. Considering
    efficiency, adopt the second one--- when the first skb and other skbs in
    frag_list has different network_header length, restore them to make sure
    right udp payload is delivered to user space.

    Signed-off-by: Lina Wang <lina.wang@mediatek.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:10 +02:00
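
The offsets quoted in the entry above (24, 44 and 64) can be replayed with a small model. This is a sketch under assumptions and not the skb_segment_list() code: `struct sketch_offsets`, `sketch_l3_len()` and `sketch_fixup_segment()` are hypothetical, and the only thing modelled is the network header length of the head skb being applied to a segment and then corrected by the length difference.

```c
/* Hypothetical stand-in for the two header offsets discussed above. */
struct sketch_offsets {
	unsigned int network_header;   /* offset of the L3 header */
	unsigned int transport_header; /* offset of the L4 header */
};

/* Length of the L3 header: distance between the two offsets. */
static unsigned int sketch_l3_len(const struct sketch_offsets *o)
{
	return o->transport_header - o->network_header;
}

static void sketch_fixup_segment(struct sketch_offsets *seg,
				 const struct sketch_offsets *head)
{
	/* Remember how much longer the segment's own L3 header is. */
	int len_diff = (int)sketch_l3_len(seg) - (int)sketch_l3_len(head);

	/* Buggy state: the segment inherits the head's L3 length. */
	seg->transport_header = seg->network_header + sketch_l3_len(head);

	/* Fix: restore the segment's original transport offset. */
	seg->transport_header += len_diff;
}
```

For a segment at (24, 64) and a translated head at (44, 64), the intermediate transport offset is 44 and the corrected one is 64 again; the 20-byte correction is the difference between a 40-byte IPv6 header and a 20-byte IPv4 header.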
Jiri Benc 8958ba7e1f skbuff: clean up inconsistent indenting
Bugzilla: https://bugzilla.redhat.com/2120966

commit c645fe9bf6ae589ff9163d6c515d3517ec2e32d5
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Thu Sep 2 23:56:23 2021 +0100

    skbuff: clean up inconsistent indenting

    There is a statement that is indented one character too deeply,
    clean this up.

    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:10 +02:00
Jiri Benc e17e09a099 net: Clear mono_delivery_time bit in __skb_tstamp_tx()
Bugzilla: https://bugzilla.redhat.com/2120966

commit d93376f503c7a586707925957592c0f16f4db0b1
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:44 2022 -0800

    net: Clear mono_delivery_time bit in __skb_tstamp_tx()

    __skb_tstamp_tx() may clone the egress skb and queue the clone to
    the sk_error_queue.  The outgoing skb may have the mono delivery_time
    while the (rcv) timestamp is expected for the clone, so the
    skb->mono_delivery_time bit needs to be cleared from the clone.

    This patch adds the skb->mono_delivery_time clearing to the existing
    __net_timestamp() and uses it in __skb_tstamp_tx().
    The __net_timestamp() fast path usage in dev.c is changed to directly
    call ktime_get_real() since the mono_delivery_time bit is not set at
    that point.
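
A minimal userspace sketch of the behaviour described above, assuming a two-field stand-in for the skb. `struct model_skb`, `model_clone()` and `model_net_timestamp()` are hypothetical names, and the real-time clock value is passed in as a parameter instead of calling ktime_get_real().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two sk_buff fields involved; not the kernel struct. */
struct model_skb {
	int64_t tstamp;
	bool mono_delivery_time; /* true: tstamp holds a mono delivery time */
};

/* A clone starts with the same timestamp and flag as the original. */
static struct model_skb model_clone(const struct model_skb *skb)
{
	return *skb;
}

/* Stamp with a real-time (rcv) timestamp and clear the mono flag. */
static void model_net_timestamp(struct model_skb *skb, int64_t now_real)
{
	skb->tstamp = now_real;
	skb->mono_delivery_time = false;
}
```

The egress skb keeps its delivery time and flag; only the clone queued to the error queue is restamped and has the bit cleared.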

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc 2e725d3634 net: Add skb_clear_tstamp() to keep the mono delivery_time
Bugzilla: https://bugzilla.redhat.com/2120966

commit de799101519aad23c6096041ba2744d7b5517e6a
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:31 2022 -0800

    net: Add skb_clear_tstamp() to keep the mono delivery_time

    Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

    If skb->tstamp has the mono delivery_time, clearing it can hurt
    the performance when it finally transmits out to fq@phy-dev.

    The earlier patch added a skb->mono_delivery_time bit to
    flag the skb->tstamp carrying the mono delivery_time.

    This patch adds skb_clear_tstamp() helper which keeps
    the mono delivery_time and clears everything else.

    The delivery_time clearing will be postponed until the stack knows the
    skb will be delivered locally.  It will be done in a later patch.
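
The helper's rule (keep a mono delivery_time, clear any other timestamp) can be sketched in userspace as follows. `struct fwd_skb` and `model_skb_clear_tstamp()` are hypothetical stand-ins for illustration, not the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two sk_buff fields involved; not the kernel struct. */
struct fwd_skb {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* On forward: keep a mono delivery time, reset any other timestamp. */
static void model_skb_clear_tstamp(struct fwd_skb *skb)
{
	if (skb->mono_delivery_time)
		return;
	skb->tstamp = 0;
}
```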

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:59 +02:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Antoine Tenart 6f2e7329d3 net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 9cb252c4c1c53ae58bc565bab76e98133288f23a
Author: Menglong Dong <imagedong@tencent.com>
Date:   Mon Sep 5 11:50:15 2022 +0800

    net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM

    As Eric reported, the 'reason' field is not shown when tracing the
    kfree_skb event with perf:

    $ perf record -e skb:kfree_skb -a sleep 10
    $ perf script
      ip_defrag 14605 [021]   221.614303:   skb:kfree_skb:
      skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
      reason:

    The cause seems to be passing kernel address directly to TP_printk(),
    which is not right. As the enum 'skb_drop_reason' is not exported to
    user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason
    string from the 'reason' field, which is a number.

    Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
    to define the trace enum by TRACE_DEFINE_ENUM(). With the help of
    DEFINE_DROP_REASON(), we can now remove the auto-generation that we
    introduced in the commit ec43908dd556
    ("net: skb: use auto-generation to convert skb drop reason to string"),
    and define the string array 'drop_reasons'.

    This brings us back to having to maintain drop reasons in both
    enum skb_drop_reason and DEFINE_DROP_REASON. But they are both in
    dropreason.h, which makes it easier.
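
The two-list layout described above (a hand-written enum plus a list macro that is expanded into a string table) can be sketched in plain C. All `MODEL_` names below are hypothetical, and only the two reasons visible in the print format are included. The kernel macro also feeds TRACE_DEFINE_ENUM(), which has no userspace equivalent here.

```c
/* Hand-maintained enum; values match the { 1, ... }, { 2, ... } pairs above. */
enum model_drop_reason {
	MODEL_NOT_DROPPED_YET = 0,
	MODEL_DROP_REASON_NOT_SPECIFIED,
	MODEL_DROP_REASON_NO_SOCKET,
	MODEL_DROP_REASON_MAX,
};

/* Second list that must be kept in sync with the enum by hand. */
#define MODEL_DEFINE_DROP_REASON(FN) \
	FN(NOT_SPECIFIED)            \
	FN(NO_SOCKET)

/* Expand the list into a string array indexed by the enum value. */
#define FN(reason) [MODEL_DROP_REASON_##reason] = #reason,
static const char *const model_drop_reasons[MODEL_DROP_REASON_MAX] = {
	MODEL_DEFINE_DROP_REASON(FN)
};
#undef FN
```

Index 0 stays NULL because no list entry is expanded for the "not dropped yet" value.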

    After this commit, the format of kfree_skb looks like this:

    $ cat /tracing/events/skb/kfree_skb/format
    name: kfree_skb
    ID: 1524
    format:
            field:unsigned short common_type;       offset:0;       size:2; signed:0;
            field:unsigned char common_flags;       offset:2;       size:1; signed:0;
            field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
            field:int common_pid;   offset:4;       size:4; signed:1;

            field:void * skbaddr;   offset:8;       size:8; signed:0;
            field:void * location;  offset:16;      size:8; signed:0;
            field:unsigned short protocol;  offset:24;      size:2; signed:0;
            field:enum skb_drop_reason reason;      offset:28;      size:4; signed:0;

    print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......

    Fixes: ec43908dd556 ("net: skb: use auto-generation to convert skb drop reason to string")
    Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
    Reported-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart b45adccfcf net: skb: prevent the split of kfree_skb_reason() by gcc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: net-next.git

commit c205cc7534a97f2d6fbd2a23a94ed7c036c6e2aa
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Aug 21 13:18:58 2022 +0800

    net: skb: prevent the split of kfree_skb_reason() by gcc

    Sometimes, gcc will optimize a function by splitting it into two or
    more functions. In this case, kfree_skb_reason() is split into
    kfree_skb_reason and kfree_skb_reason.part.0. However, the
    function/tracepoint trace_kfree_skb() in it needs the return address
    of kfree_skb_reason().

    This split makes the call chain become:
      kfree_skb_reason() -> kfree_skb_reason.part.0 -> trace_kfree_skb()

    which makes the return address passed to trace_kfree_skb() be
    kfree_skb().

    Therefore, introduce '__fix_address', which is the combination of
    '__noclone' and 'noinline', and apply it to kfree_skb_reason() to
    prevent it from being split or inlined.

    (Is it better to simply apply '__noclone noinline' to kfree_skb_reason?
    I'm thinking maybe other functions have the same problem.)

    Meanwhile, wrap 'skb_unref()' with 'unlikely()', as the compiler thinks
    it is likely to return true and splits kfree_skb_reason().
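
A rough userspace sketch of the idea, assuming GCC or Clang. `model_fix_address`, `model_kfree_skb_reason()` and `last_location` are hypothetical names. The `noclone` attribute is applied only when the compiler is GCC, since that attribute is GCC-specific.

```c
#include <stddef.h>

/* Keep the function out of line and, on GCC, prevent cloning/splitting. */
#if defined(__GNUC__) && !defined(__clang__)
#define model_fix_address __attribute__((noinline, noclone))
#else
#define model_fix_address __attribute__((noinline))
#endif

/* Records what a tracepoint would report as the drop location. */
static const void *last_location;

model_fix_address void model_kfree_skb_reason(int still_referenced)
{
	/* Mirror the unlikely() hint: the early return is the cold path. */
	if (__builtin_expect(still_referenced != 0, 0))
		return;

	/* Return address of this function, i.e. the caller's location. */
	last_location = __builtin_return_address(0);
}
```

Because the function can be neither inlined nor split, `__builtin_return_address(0)` refers to the code that called `model_kfree_skb_reason()` rather than to a compiler-generated part of it.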

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart 21d9800dd4 net: skb: use auto-generation to convert skb drop reason to string
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit ec43908dd556b2292f028c6e412261689405ba6e
Author: Menglong Dong <imagedong@tencent.com>
Date:   Mon Jun 6 10:24:35 2022 +0800

    net: skb: use auto-generation to convert skb drop reason to string

    It is annoying to add new skb drop reasons to both 'enum skb_drop_reason'
    and TRACE_SKB_DROP_REASON in trace/events/skb.h, and it's easy to forget
    to add a new reason to TRACE_SKB_DROP_REASON.

    TRACE_SKB_DROP_REASON is used to convert a numeric drop reason to a
    string. For now, the string we pass to user space is exactly the
    name in 'enum skb_drop_reason' with the 'SKB_DROP_REASON_' prefix
    removed. Therefore, we can use 'auto-generation' to convert these
    drop reasons to strings at build time.

    The new source 'dropreason_str.c' will be auto generated during build
    time, which contains the string array
    'const char * const drop_reasons[]'.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart f301349869 net: skb: check the boundrary of drop reason in kfree_skb_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflicts:
- Can't use DEBUG_NET_WARN_ON_ONCE as upstream commit d268c1f5cfc9
  ("net: add CONFIG_DEBUG_NET") is not in c9s yet. Resolve the conflict
  by using the define used when CONFIG_DEBUG_NET=n upstream,
  BUILD_BUG_ON_INVALID.

commit 20bbcd0a94c6686c2692e6f7081163c233d7ce40
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri May 13 11:03:37 2022 +0800

    net: skb: check the boundrary of drop reason in kfree_skb_reason()

    Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after
    we make it the return value of the functions with return type of enum
    skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can
    be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason().

    So we check the range of drop reason in kfree_skb_reason() with
    DEBUG_NET_WARN_ON_ONCE().
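
The range check amounts to rejecting 0 (not dropped yet) and anything at or above the maximum. A minimal sketch with a hypothetical three-entry enum, where `model_drop_reason_valid()` is an illustrative helper rather than a kernel function:

```c
#include <stdbool.h>

enum model_drop_reason {
	MODEL_NOT_DROPPED_YET = 0, /* invalid as a drop reason */
	MODEL_DROP_REASON_NOT_SPECIFIED,
	MODEL_DROP_REASON_NO_SOCKET,
	MODEL_DROP_REASON_MAX,
};

/* True when the value is a usable drop reason for a free-with-reason call. */
static bool model_drop_reason_valid(int reason)
{
	return reason > MODEL_NOT_DROPPED_YET && reason < MODEL_DROP_REASON_MAX;
}
```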

    Reviewed-by: Jiang Biao <benbjiang@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:00 +02:00
Antoine Tenart 55115540c4 net: skb: introduce the function kfree_skb_list_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 215b0f1963d4e34fccac6992b3debe26f78a6eb8
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:41 2022 +0800

    net: skb: introduce the function kfree_skb_list_reason()

    To report reasons of skb drops, introduce the function
    kfree_skb_list_reason() and make kfree_skb_list() an inline call to
    it. This function will be used in the next commit in
    __dev_xmit_skb().
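
The shape of this change, a list-freeing function that takes a reason plus an inline wrapper that supplies a default, can be sketched in userspace. Every `model_` name is hypothetical, and the counters exist only so the behaviour can be observed.

```c
#include <stdlib.h>

enum { MODEL_DROP_REASON_NOT_SPECIFIED = 1 };

struct model_skb {
	struct model_skb *next;
};

static int freed_count; /* number of skbs freed so far */
static int last_reason; /* reason recorded for the most recent free */

static void model_kfree_skb_reason(struct model_skb *skb, int reason)
{
	last_reason = reason;
	freed_count++;
	free(skb);
}

/* Free every skb on the list, reporting the same reason for each. */
static void model_kfree_skb_list_reason(struct model_skb *segs, int reason)
{
	while (segs) {
		struct model_skb *next = segs->next;

		model_kfree_skb_reason(segs, reason);
		segs = next;
	}
}

/* The old entry point becomes an inline call with a default reason. */
static inline void model_kfree_skb_list(struct model_skb *segs)
{
	model_kfree_skb_list_reason(segs, MODEL_DROP_REASON_NOT_SPECIFIED);
}
```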

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Paolo Abeni 5932d4a818 net: Fix a data-race around sysctl_tstamp_allow_data.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit d2154b0afa73c0159b2856f875c6b4fe7cf6a95e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:50 2022 -0700

    net: Fix a data-race around sysctl_tstamp_allow_data.

    While reading sysctl_tstamp_allow_data, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.
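
As a rough illustration of what READ_ONCE() provides, the following userspace sketch forces a single volatile load of a shared variable. `MODEL_READ_ONCE`, `model_sysctl_tstamp_allow_data` and the reader function are hypothetical, and the macro relies on the `__typeof__` extension available in GCC and Clang.

```c
/* Single volatile load: the compiler may not tear, repeat or cache it. */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Stand-in for a sysctl that another thread may change at any time. */
static int model_sysctl_tstamp_allow_data = 1;

/* Reader takes one snapshot and bases its decision on that value only. */
static int model_tstamp_allow_data_enabled(void)
{
	return MODEL_READ_ONCE(model_sysctl_tstamp_allow_data) != 0;
}
```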

    Fixes: b245be1f4d ("net-timestamp: no-payload only sysctl")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Ivan Vecera 756015f0e8 net: gro: move skb_gro_receive into net/core/gro.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit e456a18a390b96f22b0de2acd4d0f49c72ed2280
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:05:53 2021 -0800

    net: gro: move skb_gro_receive into net/core/gro.c

    net/core/gro.c will contain all core gro functions,
    to shrink net/core/skbuff.c and net/core/dev.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:40 +02:00
Ivan Vecera 554594fd78 net: gro: move skb_gro_receive_list to udp_offload.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit 0b935d7f8c07bf0a192712bdbf76dbf45ef8b115
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:05:52 2021 -0800

    net: gro: move skb_gro_receive_list to udp_offload.c

    This helper is used once, no need to keep it in fat net/core/skbuff.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:40 +02:00
Ivan Vecera 0c79035b3b net: move gro definitions to include/net/gro.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

Conflicts:
- context conflict due to missing 92552d3abd32 ("net/mlx5e: HW_GRO cqe
  handler implementation")

commit 4721031c3559db8eae61df305f10c00099a7c1d0
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:05:51 2021 -0800

    net: move gro definitions to include/net/gro.h

    include/linux/netdevice.h became too big, move gro stuff
    into include/net/gro.h

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:40 +02:00
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00
Patrick Talbert 6da9f3de35 Merge: net: drop_monitor: support drop reason
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/849

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083432

After commit c504e5c2f964 ("net: skb: introduce kfree_skb_reason()"),
drop reasons are supported, so let's add this feature to drop monitor.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-25 09:28:10 +02:00
Patrick Talbert f311aab772 Merge: net: backport core fixes from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/832

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920

A bunch of fixes for net core path.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-18 10:58:56 +02:00
Jiri Benc 1ad710c301 skbuff: fix coalescing for page_pool fragment recycling
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 1effe8ca4e34c34cdd9318436a4232dcb582ebf4
Author: Jean-Philippe Brucker <jean-philippe@linaro.org>
Date:   Thu Mar 31 11:24:41 2022 +0100

    skbuff: fix coalescing for page_pool fragment recycling

    Fix a use-after-free when using page_pool with page fragments. We
    encountered this problem during normal RX in the hns3 driver:

    (1) Initially we have three descriptors in the RX queue. The first one
        allocates PAGE1 through page_pool, and the other two allocate one
        half of PAGE2 each. Page references look like this:

                    RX_BD1 _______ PAGE1
                    RX_BD2 _______ PAGE2
                    RX_BD3 _________/

    (2) Handle RX on the first descriptor. Allocate SKB1, eventually added
        to the receive queue by tcp_queue_rcv().

    (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
        netif_receive_skb():

        netif_receive_skb(SKB2)
          ip_rcv(SKB2)
            SKB3 = skb_clone(SKB2)

        SKB2 and SKB3 share a reference to PAGE2 through
        skb_shinfo()->dataref. The other ref to PAGE2 is still held by
        RX_BD3:

                          SKB2 ---+- PAGE2
                          SKB3 __/   /
                    RX_BD3 _________/

     (3b) Now while handling TCP, coalesce SKB3 with SKB1:

          tcp_v4_rcv(SKB3)
            tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
            kfree_skb_partial(SKB3)
              skb_release_data(SKB3)                // drops one dataref

                          SKB1 _____ PAGE1
                               \____
                          SKB2 _____ PAGE2
                                     /
                    RX_BD3 _________/

        In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
        PAGE2, where it should instead have increased the page_pool frag
        reference, pp_frag_count. Without coalescing, when releasing both
        SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
        when releasing SKB1 and SKB2, two references to PAGE2 will be
        dropped, resulting in underflow.

     (3c) Drop SKB2:

          af_packet_rcv(SKB2)
            consume_skb(SKB2)
              skb_release_data(SKB2)                // drops second dataref
                page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count

                          SKB1 _____ PAGE1
                               \____
                                     PAGE2
                                     /
                    RX_BD3 _________/

    (4) Userspace calls recvmsg()
        Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
        release the SKB3 page as well:

        tcp_eat_recv_skb(SKB1)
          skb_release_data(SKB1)
            page_pool_return_skb_page(PAGE1)
            page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count

    (5) PAGE2 is freed, but the third RX descriptor was still using it!
        In our case this causes IOMMU faults, but it would silently corrupt
        memory if the IOMMU was disabled.

    Change the logic that checks whether pp_recycle SKBs can be coalesced.
    We still reject differing pp_recycle between 'from' and 'to' SKBs, but
    in order to avoid the situation described above, we also reject
    coalescing when both 'from' and 'to' are pp_recycled and 'from' is
    cloned.

    The new logic allows coalescing a cloned pp_recycle SKB into a page
    refcounted one, because in this case the release (4) will drop the right
    reference, the one taken by skb_try_coalesce().
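
The coalescing rule from the last two paragraphs can be written as one predicate. This is a sketch under the assumption that only the `pp_recycle` flag and the cloned state matter. `struct model_skb` and `model_pp_can_coalesce()` are hypothetical names, and the other conditions skb_try_coalesce() checks are omitted.

```c
#include <stdbool.h>

/* Only the two properties the rule depends on; not the kernel struct. */
struct model_skb {
	bool pp_recycle; /* frags are page_pool pages */
	bool cloned;     /* data is shared with a clone */
};

/*
 * 'to' must be pp_recycle exactly when 'from' is an uncloned pp_recycle
 * skb. A cloned pp_recycle 'from' may therefore only be merged into a
 * page-refcounted (non pp_recycle) 'to'.
 */
static bool model_pp_can_coalesce(const struct model_skb *to,
				  const struct model_skb *from)
{
	return to->pp_recycle == (from->pp_recycle && !from->cloned);
}
```

The SKB1/SKB3 case from the walkthrough (both pp_recycle, 'from' cloned) is rejected by this predicate.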

    Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
    Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
    Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:54 +02:00
Jiri Benc 7e6f15045c net: in_irq() cleanup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit afa79d08c6c8e1901cb1547591e3ccd3ec6965d9
Author: Changbin Du <changbin.du@intel.com>
Date:   Fri Aug 13 22:57:49 2021 +0800

    net: in_irq() cleanup

    Replace the obsolete and ambiguous macro in_irq() with the new
    macro in_hardirq().

    Signed-off-by: Changbin Du <changbin.du@gmail.com>
    Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:49 +02:00
Hangbin Liu 546f7472fa net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083432
Upstream Status: net.git commit ef527f968ae0

commit ef527f968ae05c6717c39f49c8709a7e2c19183a
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Feb 20 07:40:52 2022 -0800

    net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends

    Whenever one of these functions pulls all data from an skb in a frag_list,
    use consume_skb() instead of kfree_skb() to avoid polluting drop
    monitoring.

    Fixes: 6fa01ccd88 ("skbuff: Add pskb_extract() helper function")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220220154052.1308469-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-10 11:13:06 +08:00
Hangbin Liu 9de2441ab3 net: fix up skbs delta_truesize in UDP GRO frag_list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 224102de2ff1

commit 224102de2ff105a2c05695e66a08f4b5b6b2d19c
Author: lena wang <lena.wang@mediatek.com>
Date:   Tue Mar 1 19:17:09 2022 +0800

    net: fix up skbs delta_truesize in UDP GRO frag_list

    The truesize of a UDP GRO packet is the sum of the main skb and the
    skbs in the main skb's frag_list:
    skb_gro_receive_list
            p->truesize += skb->truesize;

    The commit 53475c5dd8 ("net: fix use-after-free when UDP GRO with
    shared fraglist") introduced a truesize increase for frag_list skbs.
    When uncloning an skb, it calls pskb_expand_head, and the truesize of
    frag_list skbs may increase. This can occur when the allocator uses
    __netdev_alloc_skb and does not go through __alloc_skb. That flow does
    not use ksize(len) to calculate truesize, while pskb_expand_head does.
    skb_segment_list
    err = skb_unclone(nskb, GFP_ATOMIC);
    pskb_expand_head
            if (!skb->sk || skb->destructor == sock_edemux)
                    skb->truesize += size - osize;

    If we use the increased truesize as delta_truesize, it will be
    larger than before, and even larger than the previous total truesize
    if there are many skbs in frag_list. The main skb truesize will become
    smaller, and even negative, i.e. a huge value for an unsigned int.
    The following memory check will then drop this abnormal skb.

    To avoid this error, we should use the original truesize to segment the
    main skb.
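
The unsigned wrap-around described above is easy to reproduce with plain arithmetic. In this sketch, `model_segment_truesize()` is a hypothetical helper and the byte counts used with it are illustrative assumptions, not values taken from a real trace.

```c
/*
 * Subtract the frag_list skbs' truesize from the aggregated total.
 * 'orig' holds each skb's truesize as accounted at GRO time, 'grown'
 * the value after pskb_expand_head() enlarged it.
 */
static unsigned int model_segment_truesize(unsigned int total,
					   const unsigned int *orig,
					   const unsigned int *grown,
					   int nfrags, int use_original)
{
	unsigned int delta_truesize = 0;
	int i;

	for (i = 0; i < nfrags; i++)
		delta_truesize += use_original ? orig[i] : grown[i];

	/* Wraps to a huge value when delta_truesize exceeds total. */
	return total - delta_truesize;
}
```

For a total of 2304 built from a 768-byte main skb and two 768-byte frag_list skbs, subtracting the original sizes leaves 768. If each frag_list skb grew to 1280, the delta (2560) exceeds the total and the result wraps to a value far larger than 2304.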

    Fixes: 53475c5dd8 ("net: fix use-after-free when UDP GRO with shared fraglist")
    Signed-off-by: lena wang <lena.wang@mediatek.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu 2af4b6bca6 net: preserve skb_end_offset() in skb_unclone_keeptruesize()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 2b88cba55883

commit 2b88cba55883eaafbc9b7cbff0b2c7cdba71ed01
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 21 19:21:13 2022 -0800

    net: preserve skb_end_offset() in skb_unclone_keeptruesize()

    syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len)
    in skb_try_coalesce() [1]

    I was able to root cause the issue to kfence.

    When kfence is in action, the following assertion is no longer true:

    int size = xxxx;
    void *ptr1 = kmalloc(size, gfp);
    void *ptr2 = kmalloc(size, gfp);

    if (ptr1 && ptr2)
            ASSERT(ksize(ptr1) == ksize(ptr2));

    We attempted to fix these issues in the blamed commits, but forgot
    that TCP was possibly shifting data after skb_unclone_keeptruesize()
    has been used, notably from tcp_retrans_try_collapse().

    So we not only need to keep the same skb->truesize value,
    we also need to make sure TCP won't fill new tailroom
    that pskb_expand_head() was able to get from an
    addr = kmalloc(...) followed by ksize(addr)

    Split skb_unclone_keeptruesize() into two parts:

    1) Inline skb_unclone_keeptruesize() for the common case,
       when skb is not cloned.

    2) Out of line __skb_unclone_keeptruesize() for the 'slow path'.

    WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
    Modules linked in:
    CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742eb2b #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
    Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa
    RSP: 0018:ffffc900063af268 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000
    RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003
    RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000
    R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8
    R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155
    FS:  00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline]
     tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630
     tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914
     tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025
     tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947
     tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719
     sk_backlog_rcv include/net/sock.h:1037 [inline]
     __release_sock+0x134/0x3b0 net/core/sock.c:2779
     release_sock+0x54/0x1b0 net/core/sock.c:3311
     sk_wait_data+0x177/0x450 net/core/sock.c:2821
     tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457
     tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572
     inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850
     sock_recvmsg_nosec net/socket.c:948 [inline]
     sock_recvmsg net/socket.c:966 [inline]
     sock_recvmsg net/socket.c:962 [inline]
     ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
     ___sys_recvmsg+0x127/0x200 net/socket.c:2674
     __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fixes: c4777efa751d ("net: add and use skb_unclone_keeptruesize() helper")
    Fixes: 097b9146c0 ("net: fix up truesize of cloned skb in skb_prepare_for_shift()")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu 4c2b91c73f net: add skb_set_end_offset() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 763087dab975

commit 763087dab97547230a6807c865a6a5ae53a59247
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 21 19:21:12 2022 -0800

    net: add skb_set_end_offset() helper

    We have multiple places where this helper is convenient,
    and plan using it in the following patch.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu e333d6a1da net-timestamp: convert sk->sk_tskey to atomic_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit a1cdec57e03a

commit a1cdec57e03a1352e92fbbe7974039dda4efcec0
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 17 09:05:02 2022 -0800

    net-timestamp: convert sk->sk_tskey to atomic_t

    UDP sendmsg() can be lockless, this is causing all kinds
    of data races.

    This patch converts sk->sk_tskey to remove one of these races.
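
In userspace C11 terms, the conversion corresponds to replacing a plain read-then-increment with one atomic read-modify-write. This sketch uses `<stdatomic.h>` in place of the kernel's atomic_t API, and `struct model_sock` and `model_next_tskey()` are hypothetical names.

```c
#include <stdatomic.h>

/* Stand-in for the one socket field involved; not the kernel struct. */
struct model_sock {
	atomic_int sk_tskey;
};

/*
 * Return the current key and advance it in a single atomic step, so two
 * lockless senders can never observe the same key.
 */
static int model_next_tskey(struct model_sock *sk)
{
	return atomic_fetch_add(&sk->sk_tskey, 1);
}
```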

    BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

    read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
     __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
     ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
     udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
     inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
     __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
     ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
     udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
     inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x0000054d -> 0x0000054e

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu 24d15b7d24 net: Fix double 0x prefix print in SKB dump
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 8a03ef676ade

commit 8a03ef676ade55182f9b05115763aeda6dc08159
Author: Gal Pressman <gal@nvidia.com>
Date:   Thu Dec 16 11:28:25 2021 +0200

    net: Fix double 0x prefix print in SKB dump

    When printing netdev features %pNF already takes care of the 0x prefix,
    remove the explicit one.

    Fixes: 6413139dfc ("skbuff: increase verbosity when dumping skb data")
    Signed-off-by: Gal Pressman <gal@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Ivan Vecera 03eba5553a skbuff: introduce skb_pull_data
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2078759

commit 13244cccc2b61ec715f0ac583d3037497004d4a5
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Wed Dec 1 10:54:52 2021 -0800

    skbuff: introduce skb_pull_data

    Like skb_pull but returns the original data pointer before pulling the
    data after performing a check against skb->len.

    This allows changing code that does "struct foo *p = (void *)skb->data;",
    which is hard to audit and error prone, to:

            p = skb_pull_data(skb, sizeof(*p));
            if (!p)
                    return;

    Which is both safer and cleaner.

    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-04-26 09:22:23 +02:00
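As a rough userspace illustration of the checked-pull pattern described in the skb_pull_data commit above: this is not kernel code, and struct toy_skb, struct toy_hdr, toy_pull_data() and toy_parse_type() are hypothetical names invented for this sketch. It only models the idea of checking the remaining length before handing out the old data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct toy_skb {
	const uint8_t *data;	/* start of the not-yet-parsed bytes */
	size_t len;		/* number of not-yet-parsed bytes */
};

/*
 * Checked pull: return the old data pointer and advance past len bytes,
 * or return NULL and leave the buffer untouched if it is too short.
 */
static const void *toy_pull_data(struct toy_skb *skb, size_t len)
{
	const void *old = skb->data;

	if (len > skb->len)
		return NULL;

	skb->data += len;
	skb->len -= len;
	return old;
}

/* Hypothetical two-byte header used only for this example. */
struct toy_hdr {
	uint8_t type;
	uint8_t plen;
};

/* Return the header type, or -1 if the packet is too short to hold one. */
static int toy_parse_type(struct toy_skb *skb)
{
	const struct toy_hdr *h = toy_pull_data(skb, sizeof(*h));

	if (!h)
		return -1;
	return h->type;
}
```

A caller never dereferences the header unless the length check has passed, which is the auditing benefit the commit message points at.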
Herton R. Krzesinski 3e26d2a862 Merge: net: backports before kABI freeze
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/407

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382
Tested: ENRT
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2028420
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2037783

Includes patches that would break kABI without backporting the full
series they are taken from, which we will do later (post-freeze).

The following fixes were omitted as the backport of commit
f35f821935d8 ("tcp: defer skb freeing after socket lock is released")
is a partial one not introducing the issues.

Omitted-fix: ffef737fd037 ("net/tls: Fix skb memory leak when running kTLS traffic")
Omitted-fix: db094aa8140e ("net/tls: Fix another skb memory leak when running kTLS traffic")
Omitted-fix: 79074a72d335 ("net: Flush deferred skb free on socket destroy")
Omitted-fix: ebdc1a030962 ("tcp: add a missing sk_defer_free_flush() in tcp_splice_read()")

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-07 15:11:27 +00:00
Antoine Tenart 73db850d41 net: use sk_is_tcp() in more places
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382
Upstream Status: linux.git
Tested: ENRT

commit 42f67eea3ba36cef2dce2e853de6ddcb2e89eb39
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:33 2021 -0800

    net: use sk_is_tcp() in more places

    Move sk_is_tcp() to include/net/sock.h and use it where we can.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 16:26:18 +01:00
Antoine Tenart 4a0269b225 net: skb: introduce kfree_skb_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931
Upstream Status: linux.git
Tested: Instructions in bz

commit c504e5c2f9648a1e5c2be01e8c3f59d394192bd3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Jan 9 14:36:26 2022 +0800

    net: skb: introduce kfree_skb_reason()

    Introduce the interface kfree_skb_reason(), which is able to pass
    the reason why the skb is dropped to 'kfree_skb' tracepoint.

    Add the 'reason' field to 'trace_kfree_skb', so that users can get
    more detailed information about abnormal skbs with 'drop_monitor' or
    eBPF.

    All drop reasons are defined in the enum 'skb_drop_reason', and
    they will be printed as strings in the 'kfree_skb' tracepoint in the
    format 'reason: XXX'.

    ( Maybe the reasons should be defined in a uapi header file, so that
    user space can use them? )

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 10:05:00 +01:00
Herton R. Krzesinski b8f20958b7 Merge: net: core stable backport for rhel 9.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/212

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

This includes a few critical bugfixes for the core network stack.

Notably it includes 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly") and a whole series of pre-requisites. The bug addressed there is nasty and present even prior to skb_expand_head() introduction.

Commit 719c57197010 ("net: make napi_disable() symmetric with enable") has instead been explicitly excluded, as it's not really a fix, is known to introduce problems, and is still quite new.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-14 16:53:21 +00:00
Paolo Abeni cf96a90b97 net: fix GRO skb truesize update
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit af352460b465d7a8afbeb3be07c0268d1d48a4d7
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Aug 4 21:07:00 2021 +0200

    net: fix GRO skb truesize update

    commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock
    reference") introduces a serious regression at the GRO layer setting
    the wrong truesize for stolen-head skbs.

    Restore the correct truesize: SKB_DATA_ALIGN(...) instead of
    SKB_TRUESIZE(...)

    Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Fixes: 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:58:42 +01:00
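To make the difference between the two macros named in the commit above concrete, here is a small userspace sketch modelled on them: SKB_DATA_ALIGN rounds a size up to the cache line size, while SKB_TRUESIZE adds the aligned sk_buff and skb_shared_info overheads on top of the size it is given. The cache line size and both structure sizes below are assumed values chosen for illustration, and the toy_* names are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration only; real ones depend on arch and config. */
#define TOY_CACHE_BYTES	64u
#define TOY_SKB_SIZE	232u	/* stand-in for sizeof(struct sk_buff) */
#define TOY_SHINFO_SIZE	320u	/* stand-in for sizeof(struct skb_shared_info) */

/* Model of SKB_DATA_ALIGN(): round x up to a multiple of the cache line. */
static unsigned int toy_data_align(unsigned int x)
{
	return (x + TOY_CACHE_BYTES - 1) & ~(TOY_CACHE_BYTES - 1);
}

/* Model of SKB_TRUESIZE(): payload size plus both aligned overheads. */
static unsigned int toy_truesize(unsigned int x)
{
	return x + toy_data_align(TOY_SKB_SIZE) +
	       toy_data_align(TOY_SHINFO_SIZE);
}
```

With these assumed sizes, toy_data_align(TOY_SKB_SIZE) is 256 while toy_truesize(TOY_SKB_SIZE) is 808, so using the wrong macro overstates the memory charged for a stolen-head skb by a wide margin.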
Paolo Abeni 2bea014388 skbuff: allow 'slow_gro' for skb carring sock reference
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 5e10da5385d20c4bae587bc2921e5fdd9655d5fc
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:03 2021 +0200

    skbuff: allow 'slow_gro' for skb carring sock reference

    This change leverages the infrastructure introduced by the previous
    patches to allow soft devices to pass owned skbs to the GRO engine
    without impacting the fast-path.

    It's up to the GRO caller to ensure the slow_gro bit validity before
    invoking the GRO engine. The new helper skb_prepare_for_gro() is
    introduced for that goal.

    On slow_gro, skbs are aggregated only with equal sk.
    Additionally, skb truesize on GRO recycle and free is correctly
    updated so that sk wmem is not changed by the GRO processing.

    rfc-> v1:
     - fixed bad truesize on dev_gro_receive NAPI_FREE
     - use the existing state bit

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:52 +01:00
Paolo Abeni 9ce6ef4e71 net: optimize GRO for the common case.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 9efb4b5baf6ce851b247288992b0632cb4d31c17
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:02 2021 +0200

    net: optimize GRO for the common case.

    After the previous patches, at GRO time, skb->slow_gro is
    usually 0, unless the packet comes from some H/W offload
    slowpath or tunnel.

    We can optimize the GRO code assuming !skb->slow_gro is likely.
    This removes multiple conditionals in the most common path, at the
    price of an additional one when we hit the above "slow-paths".

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:26 +01:00
Paolo Abeni 615b5bcea7 sk_buff: track extension status in slow_gro
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit b0999f385ac30cb17880ae1c1512491fbf0c9542
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:01 2021 +0200

    sk_buff: track extension status in slow_gro

    Similar to the previous one, but tracking the
    active_extensions field status.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:10 +01:00
Paolo Abeni ca25f913f2 net: add and use skb_unclone_keeptruesize() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit c4777efa751d293e369aec464ce6875e957be255
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 1 17:45:55 2021 -0700

    net: add and use skb_unclone_keeptruesize() helper

    While commit 097b9146c0 ("net: fix up truesize of cloned
    skb in skb_prepare_for_shift()") fixed immediate issues found
    when KFENCE was enabled/tested, there are still similar issues,
    when tcp_trim_head() hits KFENCE while the master skb
    is cloned.

    This happens under heavy networking TX workloads,
    when the TX completion might be delayed after incoming ACK.

    This patch fixes the WARNING in sk_stream_kill_queues
    when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.

    Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
Paolo Abeni fcbf308cb4 skb_expand_head() adjust skb->truesize incorrectly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 7f678def99d29c520418607509bb19c7fc96a6db
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Fri Oct 22 13:28:37 2021 +0300

    skb_expand_head() adjust skb->truesize incorrectly

    Christoph Paasch reports [1] an incorrect skb->truesize
    after the skb_expand_head() call in ip6_xmit.
    This may happen for two reasons:
    - skb_set_owner_w() for the newly cloned skb is called too early,
    before pskb_expand_head() where truesize is adjusted for the (!skb->sk) case.
    - pskb_expand_head() does not adjust truesize in the (skb->sk) case.
    In this case sk->sk_wmem_alloc should be adjusted too.

    [1] https://lkml.org/lkml/2021/8/20/1082

    Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
    Fixes: 2d85a1b31d ("ipv6: ip6_finish_output2: set sk into newly allocated nskb")
    Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/644330dd-477e-0462-83bf-9f514c41edd1@virtuozzo.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
Paolo Abeni 17a5777943 skbuff: introduce skb_expand_head()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit f1260ff15a71b8fc122b2c9abd8a7abffb6e0168
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Mon Aug 2 11:52:15 2021 +0300

    skbuff: introduce skb_expand_head()

    Like skb_realloc_headroom(), the new helper increases the headroom of the
    specified skb. Unlike skb_realloc_headroom(), it does not allocate a new skb
    if possible; it copies skb->sk to the new skb when needed and frees the
    original skb in case of failure.

    This helps to simplify ip[6]_finish_output2() and a few other similar cases.

    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:30 +01:00
Pravin B Shelar a17ad09617 net: Fix zero-copy head len calculation.
In some cases the skb head could be locked and the entire header
data is pulled from the skb. When skb_zerocopy() is called in such cases,
the following BUG is triggered. This patch fixes it by copying the entire
skb in such cases.
This could be optimized in case this is a performance bottleneck.

---8<---
kernel BUG at net/core/skbuff.c:2961!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           OE     5.4.0-77-generic #86-Ubuntu
Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:skb_zerocopy+0x37a/0x3a0
RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246
Call Trace:
 <IRQ>
 queue_userspace_packet+0x2af/0x5e0 [openvswitch]
 ovs_dp_upcall+0x3d/0x60 [openvswitch]
 ovs_dp_process_packet+0x125/0x150 [openvswitch]
 ovs_vport_receive+0x77/0xd0 [openvswitch]
 netdev_port_receive+0x87/0x130 [openvswitch]
 netdev_frame_hook+0x4b/0x60 [openvswitch]
 __netif_receive_skb_core+0x2b4/0xc90
 __netif_receive_skb_one_core+0x3f/0xa0
 __netif_receive_skb+0x18/0x60
 process_backlog+0xa9/0x160
 net_rx_action+0x142/0x390
 __do_softirq+0xe1/0x2d6
 irq_exit+0xae/0xb0
 do_IRQ+0x5a/0xf0
 common_interrupt+0xf/0xf

Code that triggered BUG:
int
skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
{
        int i, j = 0;
        int plen = 0; /* length of skb->head fragment */
        int ret;
        struct page *page;
        unsigned int offset;

        BUG_ON(!from->head_frag && !hlen);

Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-18 09:42:17 -07:00
Ilias Apalodimas 2cc3aeb5ec skbuff: Fix a potential race while recycling page_pool packets
As Alexander points out, when we are trying to recycle a cloned/expanded
SKB we might trigger a race.  The recycling code relies on the
pp_recycle bit to trigger,  which we carry over to cloned SKBs.
If that cloned SKB gets expanded or if we get references to the frags,
call skb_release_data() and overwrite skb->head, we are creating separate
instances accessing the same page frags.  Since the skb_release_data()
will first try to recycle the frags,  there's a potential race between
the original and cloned SKB, since both will have the pp_recycle bit set.

Fix this by explicitly marking those SKBs as not recyclable.
The atomic_sub_return effectively limits us to a single release case,
and when we are calling skb_release_data we are also releasing the
option to perform the recycling, or releasing the pages from the page pool.

Fixes: 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB recycling")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 11:37:00 -07:00
Paul Blakey 8550ff8d8c skbuff: Release nfct refcount on napi stolen or re-used skbs
When multiple SKBs are merged to a new skb under napi GRO,
or SKB is re-used by napi, if nfct was set for them in the
driver, it will not be released while freeing their stolen
head state or on re-use.

Release nfct on napi's stolen or re-used SKBs, and
in gro_list_prepare, check conntrack metadata diff.

Fixes: 5c6b946047 ("net/mlx5e: CT: Handle misses after executing CT action")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:26:29 -07:00
Alexander Aring e3ae2365ef net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 11:28:21 -07:00
Jakub Kicinski adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
Willem de Bruijn 3bdd5ee0ec skbuff: fix incorrect msg_zerocopy copy notifications
msg_zerocopy signals if a send operation required copying with a flag
in serr->ee.ee_code.

This field can be incorrect as of the below commit, as a result of
both structs uarg and serr pointing into the same skb->cb[].

uarg->zerocopy must be read before skb->cb[] is reinitialized to hold
serr. Similar to other fields len, hi and lo, use a local variable to
temporarily hold the value.

This was not a problem before, when the value was passed as a function
argument.

Fixes: 75518851a2 ("skbuff: Push status and refcounts into sock_zerocopy_callback")
Reported-by: Talal Ahmad <talalahmad@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10 13:39:57 -07:00
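The ordering problem described in the msg_zerocopy commit above can be sketched in plain C. This is a simplified userspace model, not the kernel code: union toy_cb stands in for the shared skb->cb[] scratch area, the two structs are cut-down hypothetical versions of uarg and serr, and TOY_CODE_COPIED is an invented flag value. The point is only that every field still needed from the old view must be copied into locals before the area is reinitialized for the new view.

```c
#include <assert.h>
#include <string.h>

#define TOY_CODE_COPIED 1u	/* hypothetical "data had to be copied" flag */

/* Cut-down stand-ins for the two structures sharing the scratch area. */
struct toy_uarg {
	unsigned int id;
	unsigned char zerocopy;	/* 1 if no copy was needed */
};

struct toy_serr {
	unsigned int ee_info;
	unsigned char ee_code;
};

/* Both views alias the same bytes, like two structs placed in skb->cb[]. */
union toy_cb {
	struct toy_uarg uarg;
	struct toy_serr serr;
	unsigned char raw[48];
};

/* Turn the completed send state into an error-queue style notification. */
static void toy_report(union toy_cb *cb)
{
	/* Snapshot everything needed from uarg before the bytes are reused. */
	unsigned int id = cb->uarg.id;
	unsigned char was_zerocopy = cb->uarg.zerocopy;

	/* From here on cb->uarg is gone; only the locals remain valid. */
	memset(cb, 0, sizeof(*cb));
	cb->serr.ee_info = id;
	if (!was_zerocopy)
		cb->serr.ee_code |= TOY_CODE_COPIED;
}
```

Reading cb->uarg.zerocopy after the memset() would always yield 0 in this model, which mirrors the kind of incorrect notification the commit fixes.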
Ilias Apalodimas 6a5bcd84e8 page_pool: Allow drivers to hint on SKB recycling
Up to now several high speed NICs have custom mechanisms of recycling
the allocated memory they use for their payloads.
Our page_pool API already has recycling capabilities that are always
used when we are running in 'XDP mode'. So let's tweak the API and the
kernel network stack slightly and allow the recycling to happen even
during the standard operation.
The API doesn't take into account 'split page' policies used by those
drivers currently, but can be extended once we have users for that.

The idea is to be able to intercept the packet on skb_release_data().
If it's a buffer coming from our page_pool API recycle it back to the
pool for further usage or just release the packet entirely.

To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
a field in struct page (page->pp) to store the page_pool pointer.
Storing the information in page->pp allows us to recycle both SKBs and
their fragments.
We could have skipped the skb bit entirely, since identical information
can be derived from struct page. However, in an effort to affect the free path
as little as possible, reading a single bit in the skb, which is already
in cache, is better than trying to derive identical information from the
page-stored data.

The driver or page_pool has to take care of the sync operations on its own
during the buffer recycling since the buffer is, after opting-in to the
recycling, never unmapped.

Since the gain on the drivers depends on the architecture, we are not
enabling recycling by default if the page_pool API is used on a driver.
In order to enable recycling the driver must call skb_mark_for_recycle()
to store the information we need for recycling in page->pp and
enable the recycling bit, or page_pool_store_mem_info() for a fragment.

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 14:11:47 -07:00
Matteo Croce c420c98982 skbuff: add a parameter to __skb_frag_unref
This is a prerequisite patch; the next one enables recycling of
skbs and fragments. Add an extra argument to __skb_frag_unref() to
handle recycling, and update the current users of the function accordingly.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07 14:11:47 -07:00
Linus Torvalds 9d31d23389 Networking changes for 5.13.
Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its header files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifying ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS changes race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and
    vendor-independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074, an 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BCM4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress rate limiting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its header files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.

   - allow specifying ifindex when device is moved to another namespace

   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS changes race with other
     re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs

   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFPs, which
          define EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support

   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver

   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074, an 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BCM4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary

   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic

   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress
     and egress rate limiting.

   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment

   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-29 11:57:23 -07:00
Ingo Molnar d0d252b8ca Linux 5.12-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmB8qHweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGEXIIAILUbsTJsNsvZIkZ
 uQ6SY6gnsPFkRiSRjY0YsZLUnqjTuiiHeTz4gzkonddwdnAp/9g6OIHIEBaeTqBh
 sTUMU/61Fgtrt/IvkA1yJ3rlawqgwdMe2VdimB+EFhufcSKq+5vpd3MVP4IuGx4E
 J3psoTU4gVltFs5t+1QjvI3XmByN0Qm8FMRXR7iL+zov1QTmGwR3G6Rn4AymG+QT
 pdruKboyZPfsrFGSVx7wd3HpFyQcrclEX9rKmBNZqets9d9JGWnqnEN4vQKmwO86
 4MV29ucdMXH0AMB3kzGdVp0Ji2Ykt5W0K+MUWbFLtcSxnpu1OyBKGsEAMlRbD7ik
 gm0bMSw=
 =qHI0
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-20 10:13:58 +02:00
Paolo Abeni 17c3df7078 skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()"
The commit 1ddc3229ad ("skbuff: remove some unnecessary operation
in skb_segment_list()") introduces an issue very similar to the
one already fixed by commit 53475c5dd8 ("net: fix use-after-free when
UDP GRO with shared fraglist").

If the GSO skb goes through skb_clone() and pskb_expand_head() before
entering skb_segment_list(), the latter will unshare the frag_list
skbs and will release the old list. With the reverted commit in place,
when skb_segment_list() completes, skb->next points to the just
released list, and later on the kernel will hit UaF.

Note that since commit e0e3070a9b ("udp: properly complete L4 GRO
over UDP tunnel packet") the critical scenario can also be reproduced
by receiving UDP over vxlan traffic with:

NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink

Attaching a packet socket to the NIC will cause skb_clone() and the
tunnel decapsulation will call pskb_expand_head().

Fixes: 1ddc3229ad ("skbuff: remove some unnecessary operation in skb_segment_list()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 13:54:08 -07:00
Cong Wang 0739cd28f2 net: Introduce skb_send_sock() for sock_map
We only have skb_send_sock_locked() which requires callers
to use lock_sock(). Introduce a variant skb_send_sock()
which locks on its own, callers do not need to lock it
any more. This will save us from adding a ->sendmsg_locked
for each protocol.

To reuse the code, pass function pointers to __skb_send_sock()
and build skb_send_sock() and skb_send_sock_locked() on top.
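
The shape of that refactoring can be sketched in plain C. This is only an
illustration of the pattern, not the kernel code: every toy_* name, the
lock_depth counter standing in for lock_sock(), and the 4-byte chunk size
are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; not the kernel's struct sock. */
struct toy_sock {
	int lock_depth;		/* > 0 while the "lock" is held */
	size_t sent;		/* bytes accepted so far */
};

/* Signature shared by the locked and the self-locking send callbacks. */
typedef size_t (*toy_sendmsg_fn)(struct toy_sock *sk, const char *buf,
				 size_t len);

/* Variant that requires the caller to already hold the lock. */
size_t toy_sendmsg_locked(struct toy_sock *sk, const char *buf, size_t len)
{
	(void)buf;
	assert(sk->lock_depth > 0);
	sk->sent += len;
	return len;
}

/* Variant that takes and releases the lock on its own. */
size_t toy_sendmsg_unlocked(struct toy_sock *sk, const char *buf, size_t len)
{
	size_t ret;

	sk->lock_depth++;
	ret = toy_sendmsg_locked(sk, buf, len);
	sk->lock_depth--;
	return ret;
}

/* Shared worker: walks the buffer in small chunks (4 bytes, chosen
 * arbitrarily) and delegates each chunk to the callback passed in. */
size_t toy_send_common(struct toy_sock *sk, const char *buf, size_t len,
		       toy_sendmsg_fn sendmsg)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done > 4 ? 4 : len - done;
		size_t n = sendmsg(sk, buf + done, chunk);

		if (n == 0)
			break;
		done += n;
	}
	return done;
}

/* The two public entry points differ only in the callback they pass. */
size_t toy_send_sock_locked(struct toy_sock *sk, const char *buf, size_t len)
{
	return toy_send_common(sk, buf, len, toy_sendmsg_locked);
}

size_t toy_send_sock(struct toy_sock *sk, const char *buf, size_t len)
{
	return toy_send_common(sk, buf, len, toy_sendmsg_unlocked);
}
```

A caller that does not hold the lock uses toy_send_sock(); a caller that
already holds it uses toy_send_sock_locked(). The chunking loop exists
only once.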

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Yunsheng Lin 1ddc3229ad skbuff: remove some unnecessary operation in skb_segment_list()
GRO list uses skb_shinfo(skb)->frag_list to link two skbs together,
and NAPI_GRO_CB(p)->last->next is used when there are more skbs,
see skb_gro_receive_list(). GSO expects the segmented skbs to be
linked together using skb->next, so only the first skb->next needs
to be set to skb_shinfo(skb)->frag_list when doing GSO list segmentation.

For the same reason, nskb->next does not need to be set to
list_skb before jumping to the error handling, because nskb->next
already points to list_skb.

And nskb is also the last skb at the end of the loop, so remove the tail
variable and use nskb instead.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:15 -08:00
Sebastian Andrzej Siewior 183f47fcaa kcov: Remove kcov include from sched.h and move it to its users.
The recent addition of in_serving_softirq() to kcov.h results in
compile failure on PREEMPT_RT because it requires
task_struct::softirq_disable_cnt. This is not available if kcov.h is
included from sched.h.

There is no need to include kcov.h from sched.h. All but the net/ user
already include the kcov header file.

Move the include of the kcov.h header from sched.h to its users.
Additionally include sched.h from kcov.h to ensure that everything
task_struct related is available.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Link: https://lkml.kernel.org/r/20210218173124.iy5iyqv3a4oia4vv@linutronix.de
2021-03-06 12:40:21 +01:00
Willem de Bruijn b228c9b058 net: expand textsearch ts_state to fit skb_seq_state
The referenced commit expands the skb_seq_state used by
skb_find_text with a 4B frag_off field, growing it to 48B.

This exceeds container ts_state->cb, causing a stack corruption:

[   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack
is corrupted in: skb_find_text+0xc5/0xd0
[   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
[   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-2 04/01/2014
[   73.260078] Call Trace:
[   73.264677]  dump_stack+0x57/0x6a
[   73.267866]  panic+0xf6/0x2b7
[   73.270578]  ? skb_find_text+0xc5/0xd0
[   73.273964]  __stack_chk_fail+0x10/0x10
[   73.277491]  skb_find_text+0xc5/0xd0
[   73.280727]  string_mt+0x1f/0x30
[   73.283639]  ipt_do_table+0x214/0x410

The struct is passed between skb_find_text and its callbacks
skb_prepare_seq_read, skb_seq_read and skb_abort_seq_read through
the textsearch interface using TS_SKB_CB.

I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.

skb->cb was increased from 40B to 48B after ts_state was introduced,
in commit 3e3850e989 ("[NETFILTER]: Fix xfrm lookup in
ip_route_me_harder/ip6_route_me_harder").

Increase ts_state.cb[] to 48 to fit the struct.

Also add a BUILD_BUG_ON to avoid a repeat.

The alternative is to directly add a dependency from textsearch onto
linux/skbuff.h, but I think the intent is for textsearch to have no such
dependencies on its callers.
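
The idea of guarding an opaque control buffer with a compile-time size
check can be sketched as follows. This is a userspace illustration only:
the toy_* names and the field layout of toy_seq_state are hypothetical,
and C11 _Static_assert stands in for the kernel's BUILD_BUG_ON. Only the
48-byte buffer size is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CB_SIZE 48	/* size that cb[] is grown to in the commit */

/* Hypothetical generic container that knows nothing about its users. */
struct toy_ts_state {
	unsigned int offset;
	char cb[TOY_CB_SIZE];	/* opaque scratch space for the caller */
};

/* Hypothetical caller-private state; the layout is illustrative only. */
struct toy_seq_state {
	unsigned int lower_offset;
	unsigned int upper_offset;
	unsigned int frag_idx;
	unsigned int stepped_offset;
	void *root;
	void *cur;
	void *frag_data;
	unsigned int frag_off;	/* the kind of late addition that overflowed */
};

/* Fails the build, rather than corrupting the stack at run time, if the
 * private state ever outgrows the scratch space. */
_Static_assert(sizeof(struct toy_seq_state) <= TOY_CB_SIZE,
	       "toy_seq_state does not fit in toy_ts_state.cb");

/* Copy the private state into and out of the opaque buffer. memcpy()
 * avoids the alignment and aliasing questions of a pointer cast. */
void toy_state_store(struct toy_ts_state *ts, const struct toy_seq_state *st)
{
	memcpy(ts->cb, st, sizeof(*st));
}

void toy_state_load(const struct toy_ts_state *ts, struct toy_seq_state *st)
{
	memcpy(st, ts->cb, sizeof(*st));
}
```

With the check in place, adding another field to the private struct either
still fits or stops the build, so the container does not need to include
the caller's header to stay safe.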

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
Fixes: 97550f6fa5 ("net: compound page support in skb_seq_read")
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01 15:25:24 -08:00
Alexander Lobakin 9243adfc31 skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside
NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
got NAPI_MERGED_FREE verdict instead of immediate freeing.
Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
to NAPI cache.
As many drivers call napi_alloc_skb()/napi_get_frags() on their
receive path, this becomes especially useful.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin cfb8ec6595 skbuff: allow to use NAPI cache from __napi_alloc_skb()
{,__}napi_alloc_skb() is mostly used either for optional non-linear
receive methods (usually controlled via Ethtool private flags and off
by default) and/or for Rx copybreaks.
Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache
instead of inplace allocations. This includes both kmalloc and page
frag paths.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin d13612b58e skbuff: allow to optionally use NAPI cache from __alloc_skb()
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
an skbuff_head from the NAPI cache instead of inplace allocation
inside __alloc_skb().
This implies that the function is called from softirq or BH-off
context, not for allocating a clone or from a distant node.

Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin f450d539c0 skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
Instead of just bulk-flushing skbuff_heads queued up through
napi_consume_skb() or __kfree_skb_defer(), try to reuse them
on allocation path.
If the cache is empty on allocation, bulk-allocate the first
16 elements, which is more efficient than per-skb allocation.
If the cache is full on freeing, bulk-wipe the second half of
the cache (32 elements).
This also includes custom KASAN poisoning/unpoisoning to be
double sure there are no use-after-free cases.

To not change current behaviour, introduce a new function,
napi_build_skb(), to optionally use a new approach later
in drivers.

Note on selected bulk size, 16:
 - this equals XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE
   and especially VETH_XDP_BATCH, which is also used to
   bulk-allocate skbuff_heads and was tested on powerful
   setups;
 - this also showed the best performance in the actual
   test series (from the array of {8, 16, 32}).
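
The refill and flush policy described above can be sketched in userspace
C. This is a rough model under stated assumptions, not the kernel
implementation: the toy_* names are hypothetical, malloc()/free() stand in
for the slab bulk APIs, KASAN handling is left out, and the total cache
size of 64 is inferred from "second half of the cache (32 elements)".

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CACHE_SIZE	64			/* inferred: two halves of 32 */
#define TOY_CACHE_BULK	16			/* refill batch from the text */
#define TOY_CACHE_HALF	(TOY_CACHE_SIZE / 2)

/* Hypothetical per-CPU style cache of preallocated objects. */
struct toy_cache {
	void *slot[TOY_CACHE_SIZE];
	unsigned int count;
};

/* Take one object. An empty cache is refilled with TOY_CACHE_BULK
 * objects in one go before handing one out. */
void *toy_cache_get(struct toy_cache *c, size_t obj_size)
{
	if (c->count == 0) {
		while (c->count < TOY_CACHE_BULK) {
			void *obj = malloc(obj_size);

			if (!obj)
				break;
			c->slot[c->count++] = obj;
		}
		if (c->count == 0)
			return NULL;
	}
	return c->slot[--c->count];
}

/* Give one object back. If that fills the cache, release the upper
 * half so later puts have room again. */
void toy_cache_put(struct toy_cache *c, void *obj)
{
	c->slot[c->count++] = obj;
	if (c->count == TOY_CACHE_SIZE) {
		while (c->count > TOY_CACHE_HALF)
			free(c->slot[--c->count]);
	}
}
```

Flushing only half of a full cache keeps some objects around for the next
allocations, while the batch refill avoids one allocator call per object
when the cache starts empty.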

Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves
Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
Cc: Dmitry Vyukov <dvyukov@google.com>             # Help with KASAN
Cc: Paolo Abeni <pabeni@redhat.com>                # Reduced batch size
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin 50fad4b543 skbuff: move NAPI cache declarations upper in the file
NAPI cache structures will be used for allocating skbuff_heads,
so move their declarations a bit upper.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00
Alexander Lobakin fec6e49b63 skbuff: remove __kfree_skb_flush()
This function isn't much needed as the NAPI skb queue gets bulk-freed
anyway when there's no more room, and it may even reduce the efficiency
of bulk operations.
It will be even less needed after reusing skb cache on allocation path,
so remove it and this way lighten network softirqs a bit.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00
Alexander Lobakin f9d6725bf4 skbuff: use __build_skb_around() in __alloc_skb()
Just call __build_skb_around() instead of open-coding it.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00
Alexander Lobakin df1ae022af skbuff: simplify __alloc_skb() a bit
Use unlikely() annotations for skbuff_head and data similarly to the
two other allocation functions and remove totally redundant goto.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00