Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Ivan Vecera	497f645693	net: move gso declarations and functions to their own files JIRA: https://issues.redhat.com/browse/RHEL-12679 commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jun 8 19:17:37 2023 +0000 net: move gso declarations and functions to their own files Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 13:35:27 +02:00
Ivan Vecera	b4aa21f5ad	net: introduce and use skb_frag_fill_page_desc() JIRA: https://issues.redhat.com/browse/RHEL-12625 Conflicts: * drivers/net/ethernet/freescale/enetc/enetc.c - context due to missing 8feb020f92a5 ("net: ethernet: enetc: unlock XDP_REDIRECT for XDP non-linear buffers") * drivers/net/ethernet/fungible/funeth/funeth_rx.c - removed hunk for non-existing file * drivers/net/ethernet/marvell/mvneta.c - context due to missing 76a676947b56 ("net: mvneta: update frags bit before passing the xdp buffer to eBPF layer") * drivers/net/ethernet/mellanox/mlx5/core/en_rx.c - adjusted due to missing 27602319e328 ("net/mlx5e: RX, Take shared info fragment addition into a function") commit b51f4113ebb02011f0ca86abc3134b28d2071b6a Author: Yunsheng Lin <linyunsheng@huawei.com> Date: Thu May 11 09:12:12 2023 +0800 net: introduce and use skb_frag_fill_page_desc() Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset, "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that it is assuming offset to be initialized as zero, so call skb_frag_fill_page_desc() with offset being zero for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 12:38:04 +02:00
Ivan Vecera	c756370130	net: introduce skb_poison_list and use in kfree_skb_list JIRA: https://issues.redhat.com/browse/RHEL-12613 commit 9dde0cd3b10f63bc4100ebadc7e32275baabfa68 Author: Jesper Dangaard Brouer <brouer@redhat.com> Date: Fri Feb 3 13:59:29 2023 +0100 net: introduce skb_poison_list and use in kfree_skb_list The first user of skb_poison_list is kfree_skb_list_reason, to catch earlier bugs like the one introduced in commit eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk"). For completeness, that bug has been fixed in commit f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list"). With a bug like the one in that commit, we would have seen an OOPS with: general protection fault, probably for non-canonical address 0xdead000000000870 and the content of one of the registers, e.g. R13: dead000000000800. In this case skb->len is at offset 112 bytes (0x70), which is why the fault happens at 0x800+0x70 = 0x870. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 12:16:14 +02:00
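The fault address quoted in the skb_poison_list commit can be checked with plain arithmetic. The following is a minimal user-space sketch, not kernel code: the poison base is read off the R13 value in the message, the 0x800 delta and the 112-byte offset of skb->len are taken from the message, and every `EXAMPLE_`/`example_` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values taken from the commit message; the names are illustrative only. */
#define EXAMPLE_POISON_BASE  UINT64_C(0xdead000000000000) /* upper bits seen in R13 */
#define EXAMPLE_LIST_POISON  UINT64_C(0x800)              /* low bits seen in R13 */
#define EXAMPLE_SKB_LEN_OFF  UINT64_C(112)                /* offset of skb->len (0x70) */

/* Address a load of ->len would touch if skb->next still held the poison value. */
static uint64_t example_fault_address(void)
{
    uint64_t poisoned_next = EXAMPLE_POISON_BASE + EXAMPLE_LIST_POISON;

    return poisoned_next + EXAMPLE_SKB_LEN_OFF;
}

/* Print the computed address in the same hexadecimal form as the oops line. */
static void example_print_fault(void)
{
    printf("0x%016llx\n", (unsigned long long)example_fault_address());
}
```

Calling `example_print_fault()` from a `main()` prints the address from the general protection fault line, which shows why a poisoned list pointer faults immediately instead of silently corrupting memory.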
Ivan Vecera	1c444c6edb	net: fix kfree_skb_list use of skb_mark_not_on_list JIRA: https://issues.redhat.com/browse/RHEL-12613 commit f72ff8b81ebc6a0a25e41b7e6c1dc42e3aa33e7e Author: Jesper Dangaard Brouer <brouer@redhat.com> Date: Fri Jan 20 11:34:44 2023 +0100 net: fix kfree_skb_list use of skb_mark_not_on_list A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via invoking skb_mark_not_on_list(). In this patch we choose to remove the skb_mark_not_on_list() call as it isn't necessary. It would be possible and correct to call skb_mark_not_on_list() only when __kfree_skb_reason() returns true, meaning the SKB is ready to be free'ed, as it calls/check skb_unref(). This fix is needed as kfree_skb_list() is also invoked on skb_shared_info frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(), which takes a reference on all SKBs in the list. This implies the invariant that all SKBs in the list must have the same refcnt, when using kfree_skb_list(). Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 12:16:03 +02:00
Ivan Vecera	bb18a44e29	net: kfree_skb_list use kmem_cache_free_bulk JIRA: https://issues.redhat.com/browse/RHEL-12613 commit eedade12f4cb7284555c4c0314485e9575c70ab7 Author: Jesper Dangaard Brouer <brouer@redhat.com> Date: Fri Jan 13 14:52:04 2023 +0100 net: kfree_skb_list use kmem_cache_free_bulk The kfree_skb_list function walks SKBs (via skb->next) and frees them individually to the SLUB/SLAB allocator (kmem_cache). It is more efficient to bulk free them via the kmem_cache_free_bulk API. This patch creates a stack local array with SKBs to bulk free while walking the list. Bulk array size is limited to 16 SKBs to trade off stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache" uses objsize 256 bytes, usually in an order-1 page of 8192 bytes, that is 32 objects per slab (can vary on archs and due to SLUB sharing). Thus, for SLUB the optimal bulk free case is 32 objects belonging to the same slab, but at runtime this isn't likely to occur. The expected gain from using the kmem_cache bulk alloc and free API has been assessed via a microbenchmark kernel module [1]. The module 'slab_bulk_test01' results at bulk 16 elements: kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16) kmem-bulk Per elem: 64 cycles(tsc) 17.905 ns (step:16) A more detailed description of the benchmarks is available in [2]. [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org V2: rename function to kfree_skb_add_bulk. Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 12:15:45 +02:00
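The batching idea in the kmem_cache_free_bulk commit (collect objects in a 16-entry stack array while walking the list, then free them in one call) can be sketched in user space. This is an illustration under stated assumptions, not the kernel implementation: `free()` in a loop stands in for kmem_cache_free_bulk(), and all `example_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EXAMPLE_BULK_SIZE 16 /* batch size quoted in the commit message */

struct example_node {
    struct example_node *next;
};

struct example_bulk {
    void  *objs[EXAMPLE_BULK_SIZE];
    size_t count;   /* objects currently queued */
    size_t flushes; /* number of bulk-free calls made so far */
};

/* Stand-in for kmem_cache_free_bulk(): release every queued object at once. */
static void example_flush(struct example_bulk *b)
{
    if (b->count == 0)
        return;
    for (size_t i = 0; i < b->count; i++)
        free(b->objs[i]);
    b->count = 0;
    b->flushes++;
}

/* Queue one object and flush as soon as the array is full. */
static void example_add(struct example_bulk *b, void *obj)
{
    b->objs[b->count++] = obj;
    if (b->count == EXAMPLE_BULK_SIZE)
        example_flush(b);
}

/* Walk the list, batching nodes; returns how many bulk-free calls were needed. */
static size_t example_free_list(struct example_node *head)
{
    struct example_bulk b = { .count = 0, .flushes = 0 };

    while (head) {
        struct example_node *next = head->next; /* read before the node is freed */

        example_add(&b, head);
        head = next;
    }
    example_flush(&b); /* free the final partial batch, if any */
    return b.flushes;
}

/* Helper for the illustration: build a singly linked list of n nodes. */
static struct example_node *example_make_list(size_t n)
{
    struct example_node *head = NULL;

    while (n--) {
        struct example_node *node = malloc(sizeof(*node));

        if (!node)
            abort();
        node->next = head;
        head = node;
    }
    return head;
}
```

A list of 40 nodes is released with three bulk calls (16, 16, and 8 objects) instead of 40 individual frees, which is the trade-off between stack usage and per-object cost that the message describes.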
Ivan Vecera	858a781232	net: skb: move skb_pp_recycle() to skbuff.c JIRA: https://issues.redhat.com/browse/RHEL-12613 commit 4727bab4e9bbeafeff6acdfcb077a7a548cbde30 Author: Yunsheng Lin <linyunsheng@huawei.com> Date: Fri Oct 21 10:58:22 2022 +0800 net: skb: move skb_pp_recycle() to skbuff.c skb_pp_recycle() is only used by skb_free_head() in skbuff.c, so move it to skbuff.c. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 12:15:39 +02:00
Jan Stancek	8b67bdc2b8	Merge: CNB: net: extend drop reasons for multiple subsystems MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2703 Bugzilla: https://bugzilla.redhat.com/2215988 commit 071c0fc6fb919dcf29c676a842dda08a674877d7 Author: Johannes Berg <johannes.berg@intel.com> Date: Wed Apr 19 14:52:53 2023 +0200 net: extend drop reasons for multiple subsystems Extend drop reasons to make them usable by subsystems other than core by reserving the high 16 bits for a new subsystem ID, of which 0 of course is used for the existing reasons immediately. To still be able to have string reasons, restructure that code a bit to do the lookup under RCU; the only user of this (right now) is drop_monitor. Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Approved-by: John B. Wyatt IV <jwyatt@redhat.com> Approved-by: Xin Long <lxin@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-08-14 14:00:39 +02:00
Paolo Abeni	b7607ad33f	net: fix skb leak in __skb_tstamp_tx() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529 Tested: LNST, Tier1 Upstream commit: commit 8a02fb71d7192ff1a9a47c9d937624966c6e09af Author: Pratyush Yadav <ptyadav@amazon.de> Date: Mon May 22 17:30:20 2023 +0200 net: fix skb leak in __skb_tstamp_tx() Commit 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with zerocopy skbs. But it ended up adding a leak of its own. When skb_orphan_frags_rx() fails, the function just returns, leaking the skb it just cloned. Free it before returning. This bug was discovered and resolved using Coverity Static Analysis Security Testing (SAST) by Synopsys, Inc. Fixes: 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.") Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230522153020.32422-1-ptyadav@amazon.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-06-26 16:58:50 +02:00
Paolo Abeni	bfc3e077cb	tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529 Tested: LNST, Tier1 Upstream commit: commit 50749f2dd6854a41830996ad302aef2ffaf011d8 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Mon Apr 24 15:20:22 2023 -0700 tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY skbs. We can reproduce the problem with these sequences: sk = socket(AF_INET, SOCK_DGRAM, 0) sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE) sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1) sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53)) sk.close() sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct ubuf_info_msgzc indirectly holds a refcnt of the socket. When the skb is sent, __skb_tstamp_tx() clones it and puts the clone into the socket's error queue with the TX timestamp. When the original skb is received locally, skb_copy_ubufs() calls skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt. This additional count is decremented while freeing the skb, but struct ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is not called. The last refcnt is not released unless we retrieve the TX timestamped skb by recvmsg(). Since we clear the error queue in inet_sock_destruct() after the socket's refcnt reaches 0, there is a circular dependency. If we close() the socket holding such skbs, we never call sock_put() and leak the count, sk, and skb. TCP has the same problem, and commit e0c8bccd40fc ("net: stream: purge sk_error_queue in sk_stream_kill_queues()") tried to fix it by calling skb_queue_purge() during close(). However, there is a small chance that skb queued in a qdisc or device could be put into the error queue after the skb_queue_purge() call. 
In __skb_tstamp_tx(), the cloned skb should not have a reference to the ubuf to remove the circular dependency, but skb_clone() does not call skb_copy_ubufs() for zerocopy skb. So, we need to call skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs(). [0]: BUG: memory leak unreferenced object 0xffff88800c6d2d00 (size 1152): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................ 02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace: [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024 [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083 [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline] [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245 [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515 [<00000000b9b11231>] sock_create net/socket.c:1566 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline] [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636 [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline] [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline] [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd BUG: memory leak unreferenced object 0xffff888017633a00 (size 240): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m..... 
backtrace: [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497 [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline] [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596 [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline] [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370 [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037 [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652 [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253 [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819 [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline] [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734 [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117 [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline] [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline] [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY") Fixes: b5947e5d1e ("udp: msg_zerocopy") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-06-26 16:58:41 +02:00
Paolo Abeni	a7c60d11db	skbuff: Fix a race between coalescing and releasing SKBs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529 Tested: LNST, Tier1 Upstream commit: commit 0646dc31ca886693274df5749cd0c8c1eaaeb5ca Author: Liang Chen <liangchen.linux@gmail.com> Date: Thu Apr 13 17:03:53 2023 +0800 skbuff: Fix a race between coalescing and releasing SKBs Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment recycling") allowed coalescing to proceed with non page pool page and page pool page when @from is cloned, i.e. to->pp_recycle --> false from->pp_recycle --> true skb_cloned(from) --> true However, it actually requires skb_cloned(@from) to hold true until coalescing finishes in this situation. If the other cloned SKB is released while the merging is in process, from_shinfo->nr_frags will be set to 0 toward the end of the function, causing the increment of frag page _refcount to be unexpectedly skipped resulting in inconsistent reference counts. Later when SKB(@to) is released, it frees the page directly even though the page pool page is still in use, leading to use-after-free or double-free errors. So it should be prohibited. The double-free error message below prompted us to investigate: BUG: Bad page state in process swapper/1 pfn:0e0d1 page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0xe0d1 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000 raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000 page dumped because: nonzero _refcount CPU: 1 PID: 0 Comm: swapper/1 Tainted: G E 6.2.0+ Call Trace: <IRQ> dump_stack_lvl+0x32/0x50 bad_page+0x69/0xf0 free_pcp_prepare+0x260/0x2f0 free_unref_page+0x20/0x1c0 skb_release_data+0x10b/0x1a0 napi_consume_skb+0x56/0x150 net_rx_action+0xf0/0x350 ? 
__napi_schedule+0x79/0x90 __do_softirq+0xc8/0x2b1 __irq_exit_rcu+0xb9/0xf0 common_interrupt+0x82/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:default_idle+0xb/0x20 Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool") Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230413090353.14448-1-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-06-26 16:57:58 +02:00
Íñigo Huguet	c9a53b31b9	net: extend drop reasons for multiple subsystems Bugzilla: https://bugzilla.redhat.com/2215988 Conflicts: context conflict due to missing 78476d315e19 ("mctp: Add flow extension to skb") commit 071c0fc6fb919dcf29c676a842dda08a674877d7 Author: Johannes Berg <johannes.berg@intel.com> Date: Wed Apr 19 14:52:53 2023 +0200 net: extend drop reasons for multiple subsystems Extend drop reasons to make them usable by subsystems other than core by reserving the high 16 bits for a new subsystem ID, of which 0 of course is used for the existing reasons immediately. To still be able to have string reasons, restructure that code a bit to do the lookup under RCU; the only user of this (right now) is drop_monitor. Link: https://lore.kernel.org/netdev/00659771ed54353f92027702c5bbb84702da62ce.camel@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>	2023-06-20 09:18:16 +02:00
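The encoding described in the drop reasons commit (high 16 bits carry a subsystem ID, with ID 0 keeping the existing core reasons unchanged) can be sketched as follows. This is a user-space illustration only; the macro and function names are hypothetical and are not the kernel's identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: the high 16 bits hold the subsystem, the low 16 the reason. */
#define EXAMPLE_SUBSYS_SHIFT 16
#define EXAMPLE_SUBSYS_MASK  UINT32_C(0xffff0000)

/* Combine a subsystem ID and a subsystem-local reason code into one value. */
static uint32_t example_make_reason(uint16_t subsys, uint16_t code)
{
    return ((uint32_t)subsys << EXAMPLE_SUBSYS_SHIFT) | code;
}

/* Extract the subsystem ID; 0 means the pre-existing core reasons. */
static uint16_t example_reason_subsys(uint32_t reason)
{
    return (uint16_t)((reason & EXAMPLE_SUBSYS_MASK) >> EXAMPLE_SUBSYS_SHIFT);
}

/* Extract the subsystem-local reason code. */
static uint16_t example_reason_code(uint32_t reason)
{
    return (uint16_t)(reason & ~EXAMPLE_SUBSYS_MASK);
}
```

Because subsystem 0 leaves the upper half zero, `example_make_reason(0, code)` equals `code`, so numeric values of the existing reasons do not change.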
Antoine Tenart	d48044618a	net: add location to trace_consume_skb() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Upstream Status: linux.git commit dd1b527831a3ed659afa01b672d8e1f7e6ca95a5 Author: Eric Dumazet <edumazet@google.com> Date: Thu Feb 16 15:47:18 2023 +0000 net: add location to trace_consume_skb() kfree_skb() includes the location, it makes sense to add it to consume_skb() as well. After patch: taskd_EventMana 8602 [004] 420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic swapper 0 [011] 422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc discipline 9141 [043] 423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp swapper 0 [010] 423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv borglet 8672 [014] 425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump swapper 0 [028] 426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action wget 14339 [009] 426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-06 11:23:26 +02:00
Antoine Tenart	a49af01c77	net: fix call location in kfree_skb_list_reason Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Upstream Status: linux.git Conflicts: - DEBUG_NET_WARN_ON_ONCE wasn't used in the removed chunk because of a missing dependency in c9s when that chunk was first applied, but now DEBUG_NET_WARN_ON_ONCE is available so we can use it instead. commit a4650da2a2d6150a8ff1ea36fde9f6a26cf5fda3 Author: Jesper Dangaard Brouer <brouer@redhat.com> Date: Fri Jan 13 14:51:59 2023 +0100 net: fix call location in kfree_skb_list_reason The SKB drop reason uses __builtin_return_address(0) to give the call "location" to trace_kfree_skb() tracepoint skb:kfree_skb. To keep this stable for compilers kfree_skb_reason() is annotated with __fix_address (noinline __noclone) as fixed in commit c205cc7534a9 ("net: skb: prevent the split of kfree_skb_reason() by gcc"). The function kfree_skb_list_reason() invokes kfree_skb_reason(), which causes the __builtin_return_address(0) "location" to report the unexpected address of kfree_skb_list_reason. Example output from 'perf script': kpktgend_0 1337 [000] 81.002597: skb:kfree_skb: skbaddr=0xffff888144824700 protocol=2048 location=kfree_skb_list_reason+0x1e reason: QDISC_DROP Patch creates an __always_inline __kfree_skb_reason() helper call that is called from both kfree_skb_list() and kfree_skb_list_reason(). Suggestions for solutions that share code better are welcome. As preparation for next patch move __kfree_skb() invocation out of this helper function. Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-02 14:52:02 +02:00
Jan Stancek	704d11b087	Merge: enable io_uring MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375 # Merge Request Required Information ## Summary of Changes This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option. ## Approved Development Ticket Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214 Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation") This is actually just an optimization, and it has non-trivial conflicts which would require additional backports to resolve. Skip it. Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce") This fix is incorrectly tagged. The code that it applies to is not present in our tree. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Approved-by: John Meneghini <jmeneghi@redhat.com> Approved-by: Ming Lei <ming.lei@redhat.com> Approved-by: Maurizio Lombardi <mlombard@redhat.com> Approved-by: Brian Foster <bfoster@redhat.com> Approved-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-17 07:47:08 +02:00
Jan Stancek	fa72082f2d	Merge: net: core: stable backports for 9.3 phase 1 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2408 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560 Depends: !2404 A bunch of fixes from upstream, affecting the core networking implementation. This also includes a couple of fixes for tun/tap, strictly tied to commit "net: add sock_init_data_uid()" Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Sabrina Dubroca <sdubroca@redhat.com> Approved-by: Andrea Claudi <aclaudi@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-16 11:49:41 +02:00
Jan Stancek	04554d1843	Merge: bpf, xdp: update to 6.2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2317 Rebase bpf and xdp to 6.2. Bugzilla: https://bugzilla.redhat.com/2177177 Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Felix Maurer <fmaurer@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-11 12:12:10 +02:00
Jeff Moyer	5fbf8901c6	net: shrink struct ubuf_info Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit e7d2b510165fff6bedc9cca88c071ad846850c74 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Fri Sep 23 17:39:04 2022 +0100 net: shrink struct ubuf_info We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-05-05 15:25:02 -04:00
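One common way to "leave only mandatory fields and let users decide how they want to extend it" is to embed the small base structure in a larger user-specific one and recover the outer structure from a base pointer. The sketch below illustrates that pattern in user space; the field layout and every `example_` name are assumptions made for illustration and do not reproduce the kernel's struct ubuf_info or struct ubuf_info_msgzc definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small base object: only what every user needs (hypothetical layout). */
struct example_ubuf {
    void   (*callback)(struct example_ubuf *ubuf);
    uint32_t refcnt;
    uint8_t  flags;
};

/* One user's extension: embeds the base and adds its own fields (hypothetical). */
struct example_ubuf_msgzc {
    struct example_ubuf ubuf; /* embedded base */
    uint32_t            id;
    uint16_t            len;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define example_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert a base pointer back to the extended object that embeds it. */
static struct example_ubuf_msgzc *example_to_msgzc(struct example_ubuf *ubuf)
{
    return example_container_of(ubuf, struct example_ubuf_msgzc, ubuf);
}

/* Demonstration: code that only sees the base can still reach the extension. */
static uint32_t example_roundtrip_id(uint32_t id)
{
    struct example_ubuf_msgzc zc = { .id = id, .len = 0 };
    struct example_ubuf *base = &zc.ubuf;

    return example_to_msgzc(base)->id;
}
```

Code that only handles the base type pays for the small structure alone, while each user carries its extra state in its own wrapper.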
Paolo Abeni	4b042c9aa8	net: fix NULL pointer in skb_segment_list Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560 Tested: LNST, Tier1 Upstream commit: commit 876e8ca8366735a604bac86ff7e2732fc9d85d2d Author: Yan Zhai <yan@cloudflare.com> Date: Mon Jan 30 12:51:48 2023 -0800 net: fix NULL pointer in skb_segment_list Commit 3a1296a38d ("net: Support GRO/GSO fraglist chaining.") introduced UDP listified GRO. The segmentation relies on frag_list being untouched when passing through the network stack. This assumption can sometimes be broken, where frag_list itself gets pulled into the linear area, leaving frag_list NULL. When this happens it can trigger the following NULL pointer dereference and panic the kernel. Reversing the test condition should fix it. [19185.577801][ C1] BUG: kernel NULL pointer dereference, address: ... [19185.663775][ C1] RIP: 0010:skb_segment_list+0x1cc/0x390 ... [19185.834644][ C1] Call Trace: [19185.841730][ C1] <TASK> [19185.848563][ C1] __udp_gso_segment+0x33e/0x510 [19185.857370][ C1] inet_gso_segment+0x15b/0x3e0 [19185.866059][ C1] skb_mac_gso_segment+0x97/0x110 [19185.874939][ C1] __skb_gso_segment+0xb2/0x160 [19185.883646][ C1] udp_queue_rcv_skb+0xc3/0x1d0 [19185.892319][ C1] udp_unicast_rcv_skb+0x75/0x90 [19185.900979][ C1] ip_protocol_deliver_rcu+0xd2/0x200 [19185.910003][ C1] ip_local_deliver_finish+0x44/0x60 [19185.918757][ C1] __netif_receive_skb_one_core+0x8b/0xa0 [19185.927834][ C1] process_backlog+0x88/0x130 [19185.935840][ C1] __napi_poll+0x27/0x150 [19185.943447][ C1] net_rx_action+0x27e/0x5f0 [19185.951331][ C1] ? mlx5_cq_tasklet_cb+0x70/0x160 [mlx5_core] [19185.960848][ C1] __do_softirq+0xbc/0x25d [19185.968607][ C1] irq_exit_rcu+0x83/0xb0 [19185.976247][ C1] common_interrupt+0x43/0xa0 [19185.984235][ C1] asm_common_interrupt+0x22/0x40 ... 
[19186.094106][ C1] </TASK> Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/Y9gt5EUizK1UImEP@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-05-02 19:07:41 +02:00
Xin Long	c57185f9a4	tcp: fix skb_copy_ubufs() vs BIG TCP Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290 Tested: compile only commit 7e692df3933628d974acb9f5b334d2b3e885e2a6 Author: Eric Dumazet <edumazet@google.com> Date: Fri Apr 28 04:32:31 2023 +0000 tcp: fix skb_copy_ubufs() vs BIG TCP David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy using hugepages, and skb length bigger than ~68 KB. skb_copy_ubufs() assumed it could copy all payload using up to MAX_SKB_FRAGS order-0 pages. This assumption broke when BIG TCP was able to put up to 512 KB per skb. We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45 and limit gso_max_size to 180000. A solution is to use higher order pages if needed. v2: add missing __GFP_COMP, or we leak memory. Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536") Reported-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Coco Li <lixiaoyan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Xin Long <lxin@redhat.com>	2023-05-02 10:36:11 -04:00
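The "use higher order pages if needed" solution in the skb_copy_ubufs() commit comes down to finding the smallest page order at which a fixed number of fragments can hold the whole payload. The sketch below works through that arithmetic in user space. It assumes 4 KiB pages and a limit of 17 fragments, which matches the "~68 KB" figure in the message (17 * 4096 = 69632 bytes); these constants and the `example_` names are illustrative assumptions, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE     4096u /* assumption: 4 KiB base pages */
#define EXAMPLE_MAX_SKB_FRAGS 17u   /* assumption: fragment limit per skb */

/* Smallest order such that MAX_SKB_FRAGS pages of that order can hold len bytes. */
static unsigned int example_min_order(size_t len)
{
    unsigned int order = 0;

    while (((size_t)EXAMPLE_PAGE_SIZE << order) * EXAMPLE_MAX_SKB_FRAGS < len)
        order++;
    return order;
}

/* Number of pages of the chosen order needed to hold len bytes (rounded up). */
static size_t example_frags_needed(size_t len)
{
    size_t psize = (size_t)EXAMPLE_PAGE_SIZE << example_min_order(len);

    return (len + psize - 1) / psize;
}
```

Under these assumptions a 64 KiB payload still fits in order-0 pages, while a 512 KiB BIG TCP payload needs order-3 (32 KiB) pages and 16 fragments.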
Jeff Moyer	9f4bd88ef7	net: introduce managed frags infrastructure Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit 753f1ca4e1e50248a1b760c9774d6d6b354562cc Author: Pavel Begunkov <asml.silence@gmail.com> Date: Tue Jul 12 21:52:31 2022 +0100 net: introduce managed frags infrastructure Some users like io_uring can do page pinning more efficiently, so we want a way to delegate referencing to other subsystems. For that add a new flag called SKBFL_MANAGED_FRAG_REFS. When set, skb doesn't hold page references and upper layers are responsible for managing page lifetime. It's allowed to convert skbs from managed to normal by calling skb_zcopy_downgrade_managed(). The function will take all needed page references and clear the flag. It's needed, for instance, to avoid mixing managed modes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 08:07:02 -04:00
Jeff Moyer	1a5bb38f72	net: Allow custom iter handler in msghdr Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit ebe73a284f4de8c5d401adeccd9b8fe3183b6e95 Author: David Ahern <dsahern@kernel.org> Date: Tue Jul 12 21:52:30 2022 +0100 net: Allow custom iter handler in msghdr Add support for custom iov_iter handling to msghdr. The idea is that in-kernel subsystems want control over how an SG is split. Signed-off-by: David Ahern <dsahern@kernel.org> [pavel: move callback into msghdr] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 08:06:02 -04:00
Jeff Moyer	f82d792280	skbuff: add SKBFL_DONT_ORPHAN flag Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit 2e07a521e1e424787af3bfc59615de4220856c35 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Tue Jul 12 21:52:28 2022 +0100 skbuff: add SKBFL_DONT_ORPHAN flag We don't want to list every single ubuf_info callback in skb_orphan_frags(), add a flag controlling the behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 08:04:02 -04:00
Jeff Moyer	d19688b83d	net: avoid double accounting for pure zerocopy skbs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit 9b65b17db72313b7a4fe9bc9502928c88be57986 Author: Talal Ahmad <talalahmad@google.com> Date: Tue Nov 2 22:58:44 2021 -0400 net: avoid double accounting for pure zerocopy skbs Track skbs containing only zerocopy data and avoid charging them to kernel memory to correctly account the memory utilization for msg_zerocopy. All of the data in such skbs is held in user pages which are already accounted to user. Before this change, they are charged again in kernel in __zerocopy_sg_from_iter. The charging in kernel is excessive because data is not being copied into skb frags. This excessive charging can lead to kernel going into memory pressure state which impacts all sockets in the system adversely. Mark pure zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove charge/uncharge for data in such skbs. Initially, an skb is marked pure zerocopy when it is empty and in zerocopy path. skb can then change from a pure zerocopy skb to mixed data skb (zerocopy and copy data) if it is at tail of write queue and there is room available in it and non-zerocopy data is being sent in the next sendmsg call. At this time sk_mem_charge is done for the pure zerocopied data and the pure zerocopy flag is unmarked. We found that this happens very rarely on workloads that pass MSG_ZEROCOPY. A pure zerocopy skb can later be coalesced into normal skb if they are next to each other in queue but this patch prevents coalescing from happening. This avoids complexity of charging when skb downgrades from pure zerocopy to mixed. This is also rare. In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in tcp_skb_entail for an skb without data. 
Testing with the msg_zerocopy.c benchmark between two hosts(100G nics) with zerocopy showed that before this patch the 'sock' variable in memory.stat for cgroup2 that tracks sum of sk_forward_alloc, sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this change it is 0. This is due to no charge to sk_forward_alloc for zerocopy data and shows memory utilization for kernel is lowered. With this commit we don't see the warning we saw in previous commit which resulted in commit 84882cf72cd774cf16fd338bdbf00f69ac9f9194. Signed-off-by: Talal Ahmad <talalahmad@google.com> Acked-by: Arjun Roy <arjunroy@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 08:03:02 -04:00
Jeff Moyer	b09042f567	skbuff: don't mix ubuf_info from different sources Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit 1b4b2b09d4fb451029b112f17d34792e0277aeb2 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Tue Jul 12 21:52:27 2022 +0100 skbuff: don't mix ubuf_info from different sources We should not append MSG_ZEROCOPY requests to skbuff with non MSG_ZEROCOPY ubuf_info, they might be not compatible. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 08:01:02 -04:00
Jeff Moyer	3d8947f865	net: inline skb_zerocopy_iter_dgram Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit 657dd5f97b2ed16cdaa6339f42f9130240af1c04 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Thu Apr 28 11:58:45 2022 +0100 net: inline skb_zerocopy_iter_dgram skb_zerocopy_iter_dgram() is a small proxy function, inline it. For that, move __zerocopy_sg_from_iter into linux/skbuff.h Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 07:55:02 -04:00
Jiri Benc	f9d22d44f4	net: skb: remove old comments about frag_size for build_skb() Bugzilla: https://bugzilla.redhat.com/2177177 commit 12c1604ae1a39bef87ac099f106594b4cb433b75 Author: Jakub Kicinski <kuba@kernel.org> Date: Fri Jan 6 18:29:04 2023 -0800 net: skb: remove old comments about frag_size for build_skb() Since commit ce098da1497c ("skbuff: Introduce slab_build_skb()") drivers trying to build skb around slab-backed buffers should go via slab_build_skb() rather than passing frag_size = 0 to the main build_skb(). Remove the copy'n'pasted comments about 0 meaning slab. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2023-04-28 11:43:23 +02:00
Jiri Benc	4c362dc0ad	skbuff: Introduce slab_build_skb() Bugzilla: https://bugzilla.redhat.com/2177177 Conflicts: - Only the networking core changes backported. Drivers will be updated at their own pace. - Removed WARN_ONCE on old API usage to allow gradual change of drivers. Omitted-fix: 8c495270845d ("bnx2x: use the right build_skb() helper") commit ce098da1497c6dee9589fce2c61d1910f4fcf0e7 Author: Kees Cook <keescook@chromium.org> Date: Wed Dec 7 22:02:59 2022 -0800 skbuff: Introduce slab_build_skb() syzkaller reported: BUG: KASAN: slab-out-of-bounds in __build_skb_around+0x235/0x340 net/core/skbuff.c:294 Write of size 32 at addr ffff88802aa172c0 by task syz-executor413/5295 For bpf_prog_test_run_skb(), which uses a kmalloc()ed buffer passed to build_skb(). When build_skb() is passed a frag_size of 0, it means the buffer came from kmalloc. In these cases, ksize() is used to find its actual size, but since the allocation may not have been made to that size, actually perform the krealloc() call so that all the associated buffer size checking will be correctly notified (and use the "new" pointer so that compiler hinting works correctly). Split this logic out into a new interface, slab_build_skb(), but leave the original 0 checking for now to catch any stragglers. 
Reported-by: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Link: https://groups.google.com/g/syzkaller-bugs/c/UnIKxTtU5-0/m/-wbXinkgAQAJ Fixes: 38931d8989b5 ("mm: Make ksize() a reporting-only function") Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: pepsipu <soopthegoop@gmail.com> Cc: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Cc: Vlastimil Babka <vbabka@suse.cz> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: ast@kernel.org Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Hao Luo <haoluo@google.com> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: jolsa@kernel.org Cc: KP Singh <kpsingh@kernel.org> Cc: martin.lau@linux.dev Cc: Stanislav Fomichev <sdf@google.com> Cc: song@kernel.org Cc: Yonghong Song <yhs@fb.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221208060256.give.994-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2023-04-28 11:43:23 +02:00
Marc Dionne	5c216693ef	rxrpc: Save last ACK's SACK table rather than marking txbufs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170099 JIRA: https://issues.redhat.com/browse/RHELPLAN-148774 commit d57a3a151660902091491ac2633134e1be92557f Author: David Howells <dhowells@redhat.com> Date: Sat May 7 10:06:13 2022 +0100 Improve the tracking of which packets need to be transmitted by saving the last ACK packet that we receive that has a populated soft-ACK table rather than marking packets. Then we can step through the soft-ACK table and look at the packets we've transmitted beyond that to determine which packets we might want to retransmit. We also look at the highest serial number that has been acked to try and guess which packets we've transmitted the peer is likely to have seen. If necessary, we send a ping to retrieve that number. One downside that might be a problem is that we can't then compare the previous acked/unacked state so easily in rxrpc_input_soft_acks() - which is a potential problem for the slow-start algorithm. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: Marc Dionne <mdionne@redhat.com>	2023-03-15 13:18:37 -03:00
Herton R. Krzesinski	0c207b7728	Merge: Attend warnings with gcc 11&12 when building kernel and modules MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1852 Bugzilla: https://bugzilla.redhat.com/2159468 Attend the warnings encountered when building CentOS Stream 9 kernel and module for x86_64 and arm64 using GCC11 (cs9) and GCC12 (f37). Some warnings end up being disabled upstream (-Wdangling-pointer, -Warray-bounds for GCC12 and -Wdeprecated for the sign-file.c), so backport these changes to align with this behavior. A few configurations were introduced to deal with -Werror and specific warnings depending on the toolchain and target architecture. This merge-request tries to bring these relevant patches without breaking compilation (e.g., CONFIG_WERROR is introduced, but not set). https://bugzilla.redhat.com/2142659 was already opened separately to attend the -Wstringop-overread in net/core/dev.c. Signed-off-by: Eric Chanudet <echanude@redhat.com> Approved-by: Vladis Dronov <vdronov@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: Lenny Szubowicz <lszubowi@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Jonathan Toppins <jtoppins@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2023-02-15 18:54:57 +00:00
Jiri Benc	ffcea318b6	net: gso: fix panic on frag_list with mixed head alloc types Bugzilla: https://bugzilla.redhat.com/2166641 commit 9e4b7a99a03aefd37ba7bb1f022c8efab5019165 Author: Jiri Benc <jbenc@redhat.com> Date: Wed Nov 2 17:53:25 2022 +0100 net: gso: fix panic on frag_list with mixed head alloc types Since commit `3dcbdb134f` ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list"), it is allowed to change gso_size of a GRO packet. However, that commit assumes that "checking the first list_skb member suffices; i.e if either of the list_skb members have non head_frag head, then the first one has too". It turns out this assumption does not hold. We've seen BUG_ON being hit in skb_segment when skbs on the frag_list had differing head_frag with the vmxnet3 driver. This happens because __netdev_alloc_skb and __napi_alloc_skb can return a skb that is page backed or kmalloced depending on the requested size. As the result, the last small skb in the GRO packet can be kmalloced. There are three different locations where this can be fixed: (1) We could check head_frag in GRO and not allow GROing skbs with different head_frag. However, that would lead to performance regression on normal forward paths with unmodified gso_size, where !head_frag in the last packet is not a problem. (2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating that NETIF_F_SG is undesirable. That would need to eat a bit in sk_buff. Furthermore, that flag can be unset when all skbs on the frag_list are page backed. To retain good performance, bpf_skb_net_grow/shrink would have to walk the frag_list. (3) Walk the frag_list in skb_segment when determining whether NETIF_F_SG should be cleared. This of course slows things down. This patch implements (3). To limit the performance impact in skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set that have gso_size changed. Normal paths thus will not hit it. 
We could check only the last skb but since we need to walk the whole list anyway, let's stay on the safe side. Fixes: `3dcbdb134f` ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list") Signed-off-by: Jiri Benc <jbenc@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/e04426a6a91baf4d1081e1b478c82b5de25fdf21.1667407944.git.jbenc@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2023-02-02 14:56:30 +01:00
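Option (3) from the commit above, walking the frag_list to decide whether scatter-gather has to be cleared, can be illustrated with a small user-space sketch. This is not the kernel code: `struct mock_skb` and `frag_list_needs_sg_clear()` are hypothetical stand-ins that keep only the fields the check needs, and the SKB_GSO_DODGY / changed gso_size precondition is assumed to have been tested by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only what the walk needs. */
struct mock_skb {
	struct mock_skb *next;	/* next skb on the frag_list */
	unsigned int headlen;	/* bytes held in the linear head */
	bool head_frag;		/* true if the head is page backed */
};

/*
 * Return true if scatter-gather should be disabled: at least one skb on
 * the list carries linear data in a kmalloc()ed (non head_frag) head.
 * The whole list is walked instead of trusting the first member.
 */
bool frag_list_needs_sg_clear(const struct mock_skb *list_skb)
{
	const struct mock_skb *iter;

	for (iter = list_skb; iter; iter = iter->next) {
		if (iter->headlen && !iter->head_frag)
			return true;
	}
	return false;
}
```

In this model a GRO packet whose last small skb was kmalloc()ed is caught even when every earlier skb is page backed, which is the case the commit message describes for vmxnet3.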
Herton R. Krzesinski	a63de8eac1	Merge: net: skb free reason sync part 2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1814 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181 Add one extra series of skb free reasons to c9s. Those patches will be nice to have as one is reordering the free reasons enum we backported earlier in this cycle (adding one special reason at the start) and we'll avoid changing the free reason enum values in between versions. Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Sabrina Dubroca <sdubroca@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2023-01-13 14:29:22 +00:00
Eric Chanudet	fe0f041ef7	skbuff: Switch structure bounds to struct_group() Bugzilla: https://bugzilla.redhat.com/2159468 commit 03f61041c17914355dde7261be9ccdc821ddd454 Author: Kees Cook <keescook@chromium.org> Date: Sat Nov 20 16:31:49 2021 -0800 skbuff: Switch structure bounds to struct_group() In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Replace the existing empty member position markers "headers_start" and "headers_end" with a struct_group(). This will allow memcpy() and sizeof() to more easily reason about sizes, and improve readability. "pahole" shows no size nor member offset changes to struct sk_buff. "objdump -d" shows no object code changes (outside of WARNs affected by source line number changes). Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/* Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Chanudet <echanude@redhat.com>	2023-01-09 13:32:41 -05:00
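The idea behind struct_group() in the commit above can be sketched in plain C11. The macro below is a simplified, hypothetical `mock_struct_group()` and not the kernel definition; `struct mock_buf` is likewise an invented type, not the sk_buff layout. The members are declared twice inside an anonymous union, so the old field names keep working while the named group gives memcpy() and sizeof() a single object to reason about.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified illustration: the same members appear once anonymously
 * (existing field names stay valid) and once as a named struct (the
 * group can be addressed and sized as a whole).
 */
#define mock_struct_group(NAME, ...)		\
	union {					\
		struct { __VA_ARGS__ };		\
		struct { __VA_ARGS__ } NAME;	\
	}

/* Hypothetical buffer type used only for this example. */
struct mock_buf {
	void *owner;
	mock_struct_group(headers,
		unsigned short protocol;
		unsigned short queue;
		unsigned int mark;
	);
	unsigned int len;
};

/* Copy only the grouped fields; owner and len are left untouched. */
void mock_copy_headers(struct mock_buf *dst, const struct mock_buf *src)
{
	assert(dst && src);
	memcpy(&dst->headers, &src->headers, sizeof(dst->headers));
}
```

Because the copy is bounded by `sizeof(dst->headers)`, a fortified memcpy() sees one destination object of the right size instead of a write that runs across neighboring fields.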
Herton R. Krzesinski	19ce0cbd76	Merge: bpf, xdp: update to 5.19 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1533 bpf, xdp: update to 5.19 Bugzilla: http://bugzilla.redhat.com/2120968 Bugzilla: http://bugzilla.redhat.com/2130850 Bugzilla: http://bugzilla.redhat.com/2140077 Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-12-21 20:49:27 +00:00
Antoine Tenart	bb3ee6fbc5	net: dropreason: propagate drop_reason to skb_release_data() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181 Upstream Status: net.git commit 511a3eda2f8d4719114ee3f2c781c37233bd171f Author: Eric Dumazet <edumazet@google.com> Date: Sat Oct 29 15:45:17 2022 +0000 net: dropreason: propagate drop_reason to skb_release_data() When an skb with a frag list is consumed, we currently pretend all skbs in the frag list were dropped. In order to fix this, add a @reason argument to skb_release_data() and skb_release_all(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-12-21 15:06:17 +01:00
Antoine Tenart	2dc0e2d4a8	net: dropreason: add SKB_CONSUMED reason Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155181 Upstream Status: net.git commit 0e84afe8ebfbb9eade3f4f6de4720887bf908e26 Author: Eric Dumazet <edumazet@google.com> Date: Sat Oct 29 15:45:16 2022 +0000 net: dropreason: add SKB_CONSUMED reason This will allow to simply use in the future: kfree_skb_reason(skb, reason); Instead of repeating sequences like: if (dropped) kfree_skb_reason(skb, reason); else consume_skb(skb); For instance, following patch in the series is adding @reason to skb_release_data() and skb_release_all(), so that we can propagate a meaningful @reason whenever consume_skb()/kfree_skb() have to take care of a potential frag_list. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-12-21 15:06:17 +01:00
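The simplification described in the commit above can be modeled in user space. In this sketch every name (`enum mock_reason`, `struct mock_stats`, `mock_free_reason()`, `mock_finish()`) is hypothetical, and the two counters merely stand in for the consume and drop tracepoints; the enum values do not mirror the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reason codes; values do not mirror the kernel enum. */
enum mock_reason {
	MOCK_CONSUMED,		/* normal end of life, not a drop */
	MOCK_DROP_NO_SOCKET,	/* example of a genuine drop reason */
};

/* Counters standing in for the consume and drop tracepoints. */
struct mock_stats {
	unsigned int consumed;
	unsigned int dropped;
};

/* Single free path: the reason alone decides how the event is recorded. */
void mock_free_reason(struct mock_stats *st, enum mock_reason reason)
{
	assert(st);
	if (reason == MOCK_CONSUMED)
		st->consumed++;
	else
		st->dropped++;
}

/* The caller no longer branches between two different free functions. */
void mock_finish(struct mock_stats *st, bool dropped)
{
	mock_free_reason(st, dropped ? MOCK_DROP_NO_SOCKET : MOCK_CONSUMED);
}
```

With a "consumed" value available as a reason, helpers that free a frag_list can forward whatever reason they were given, which is what the follow-up patch in the series relies on.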
Herton R. Krzesinski	09736a3a30	Merge: udp: some performance optimizations MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057 Tested: LNST, Tier1, tput test This series improves UDP protocol RX tput, to keep it on equal footing with rhel-8 one. Patches 1,3,4 are there just to reduces the conflicts, and patch 4 is a very partial backport, to avoid pulling unrelated features. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-12-13 17:35:03 +00:00
Felix Maurer	13bc0343bd	net: Change skb_ensure_writable()'s write_len param to unsigned int type Bugzilla: https://bugzilla.redhat.com/2120968 commit 92ece28072f18f30099770c5d4b8e300ea6820fa Author: Liu Jian <liujian56@huawei.com> Date: Sat Apr 16 18:58:00 2022 +0800 net: Change skb_ensure_writable()'s write_len param to unsigned int type Both pskb_may_pull() and skb_clone_writable()'s length parameters are of type unsigned int already. Therefore, change this function's write_len param to unsigned int type. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220416105801.88708-3-liujian56@huawei.com Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-11-30 12:47:09 +02:00
Frantisek Hrbata	1269719102	Merge: BPF and XDP rebase to v5.18 Merge conflicts: ----------------- arch/x86/net/bpf_jit_comp.c - bpf_arch_text_poke() HEAD(!1464) contains `b73b002f7f` ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline") Resolved in favour of !1464, but keep the return statement from !1477 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477 Bugzilla: https://bugzilla.redhat.com/2120966 Rebase BPF and XDP to the upstream kernel version 5.18 Patch applied, then reverted: ``` 544356 selftests/bpf: switch to new libbpf XDP APIs 0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs ``` Taken in the perf rebase: ``` 23fcfc perf: use generic bpf_program__set_type() to set BPF prog type ``` Unsupported arches: ``` 5c1011 libbpf: Fix riscv register names cf0b5b libbpf: Fix accessing syscall arguments on riscv ``` Depends on changes of other subsystems: ``` 7fc8c3 s390/bpf: encode register within extable entry aebfd1 x86/ibt,ftrace: Search for __fentry__ location 589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline ``` Broken selftest: ``` edae34 selftests net: add UDP GRO fraglist + bpf self-tests cf6783 selftests net: fix bpf build error 7b92aa selftests net: fix kselftest net fatal error ``` Out of scope: ``` baebdf net: dev: Makes sure netif_rx() can be invoked in any context. 
5c8166 kbuild: replace $(if A,A,B) with $(or A,B) 1a97ce perf maps: Use a pointer for kmaps 967747 uaccess: remove CONFIG_SET_FS 42b01a s390: always use the packed stack layout bf0882 flow_dissector: Add support for HSR d09a30 s390/extable: move EX_TABLE define to asm-extable.h 3d6671 s390/extable: convert to relative table with data 4efd41 s390: raise minimum supported machine generation to z10 f65e58 flow_dissector: Add support for HSRv0 1a6d7a netdevsim: Introduce support for L3 offload xstats 9b1894 selftests: netdevsim: hw_stats_l3: Add a new test 84005b perf ftrace latency: Add -n/--use-nsec option 36c4a7 kasan, arm64: don't tag executable vmalloc allocations 8df013 docs: netdev: move the netdev-FAQ to the process pages 4d4d00 perf tools: Update copy of libbpf's hashmap.c 0df6ad perf evlist: Rename cpus to user_requested_cpus 1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning 0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf 8994e9 perf test bpf: Skip test if clang is not present 735346 perf build: Fix btf__load_from_kernel_by_id() feature check f037ac s390/stack: merge empty stack frame slots 335220 docs: netdev: update maintainer-netdev.rst reference a0b098 s390/nospec: remove unneeded header includes 34513a netdevsim: Fix hwstats debugfs file permissions ``` Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Torez Smith <torez@redhat.com> Approved-by: Jan Stancek <jstancek@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Felix Maurer <fmaurer@redhat.com> Approved-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-21 05:30:47 -05:00
Davide Caratti	d011a96301	net: do not sense pfmemalloc status in skb_append_pagefrags() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491 Upstream Status: net.git commit 228ebc41dfab commit 228ebc41dfab5b5d34cd76835ddb0ca8ee12f513 Author: Eric Dumazet <edumazet@google.com> Date: Thu Oct 27 04:03:46 2022 +0000 net: do not sense pfmemalloc status in skb_append_pagefrags() skb_append_pagefrags() is used by af_unix and udp sendpage() implementation so far. In commit 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") we explained why we should not sense pfmemalloc status for pages owned by user space. We should also use skb_fill_page_desc_noacc() in skb_append_pagefrags() to avoid following KCSAN report: BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:323 [inline] lru_add_fn+0x327/0x410 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 lru_add_drain_cpu+0x73/0x250 mm/swap.c:669 lru_add_drain+0x21/0x60 mm/swap.c:773 free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311 tlb_batch_pages_flush mm/mmu_gather.c:59 [inline] tlb_flush_mmu_free mm/mmu_gather.c:256 [inline] tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263 tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363 exit_mmap+0x190/0x4d0 mm/mmap.c:3098 __mmput+0x27/0x1b0 kernel/fork.c:1185 mmput+0x3d/0x50 kernel/fork.c:1207 copy_process+0x19fc/0x2100 kernel/fork.c:2518 kernel_clone+0x166/0x550 kernel/fork.c:2671 __do_sys_clone kernel/fork.c:2812 [inline] __se_sys_clone kernel/fork.c:2796 [inline] __x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1817 
[inline] __skb_fill_page_desc include/linux/skbuff.h:2432 [inline] skb_fill_page_desc include/linux/skbuff.h:2453 [inline] skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974 unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea00058fc188 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2022-11-07 10:19:56 +01:00
Paolo Abeni	022665bacd	net: skb: introduce and use a single page frag cache Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057 Tested: LNST, Tier1 Upstream commit: commit dbae2b062824fc2d35ae2d5df2f500626c758e80 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Sep 28 10:43:09 2022 +0200 net: skb: introduce and use a single page frag cache After commit `3226b158e6` ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-27 19:12:04 +02:00
Jiri Benc	cb80b39939	net: fix wrong network header length Bugzilla: https://bugzilla.redhat.com/2120966 commit cf3ab8d4a797960b4be20565abb3bcd227b18a68 Author: Lina Wang <lina.wang@mediatek.com> Date: Thu May 5 13:48:49 2022 +0800 net: fix wrong network header length When clatd starts with eBPF offloading and NETIF_F_GRO_FRAGLIST is enabled, several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's IPv6 header is changed to IPv4 by bpf_skb_proto_6_to_4, and its network_header, transport_header and mac_header are updated accordingly, but the other skbs in frag_list are not updated and remain IPv6 packets. udp_queue_rcv_skb calls skb_segment_list to traverse the other skbs in frag_list and make sure the right UDP payload is delivered to user space. Unfortunately, the other skbs in frag_list, which are still IPv6 packets, are updated like the first skb and end up with a wrong transport header length. For example, before bpf_skb_proto_6_to_4 the first skb and the other skbs in frag_list have the same network_header (24) and transport_header (64). After bpf_skb_proto_6_to_4 the IPv6 protocol has been changed to IPv4: the first skb's network_header is 44 and its transport_header is 64, while the other skbs in frag_list did not change. After skb_segment_list, the other skbs in frag_list have network_header (24) and transport_header (44), so they are off by 20 bytes from the original, which is the difference between the IPv6 header and the IPv4 header. Just change transport_header to be the same as the original. There are two ways to fix it: one is to traverse all skbs and change every skb header in bpf_skb_proto_6_to_4, the other is to modify the frag_list skbs' headers in skb_segment_list. For efficiency, adopt the second one: when the first skb and the other skbs in frag_list have different network_header lengths, restore them to make sure the right UDP payload is delivered to user space. Signed-off-by: Lina Wang <lina.wang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:10 +02:00
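The offset arithmetic in the commit above (24/64 before translation, 44/64 for the translated first skb, and 24/44 for the mangled frag_list members) can be reproduced with a toy model. The sketch assumes that segmentation copies the first skb's offsets onto each member and then shifts them by the difference in header position; that model is inferred from the numbers in the message, and `struct mock_offsets` and both functions are hypothetical names, not kernel API.

```c
#include <assert.h>

/* Hypothetical subset of the header offsets tracked per skb. */
struct mock_offsets {
	int network_header;
	int transport_header;
};

static inline int mock_network_header_len(const struct mock_offsets *o)
{
	return o->transport_header - o->network_header;
}

/*
 * Give a frag_list member the first skb's offsets, shifted to its own
 * header position, then restore the transport header by the difference
 * in network header length (40 bytes of IPv6 versus 20 bytes of IPv4).
 */
void mock_restore_offsets(struct mock_offsets *frag,
			  const struct mock_offsets *head)
{
	int len_diff, shift;

	assert(frag && head);
	len_diff = mock_network_header_len(frag) - mock_network_header_len(head);
	shift = frag->network_header - head->network_header;

	*frag = *head;				/* copy: 44 / 64 */
	frag->network_header += shift;		/* shift: 24 / 44 */
	frag->transport_header += shift;
	frag->transport_header += len_diff;	/* fix:   24 / 64 */
}
```

Without the last line the member would keep a transport_header of 44, which is the 20-byte error the commit describes.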
Jiri Benc	8958ba7e1f	skbuff: clean up inconsistent indenting Bugzilla: https://bugzilla.redhat.com/2120966 commit c645fe9bf6ae589ff9163d6c515d3517ec2e32d5 Author: Colin Ian King <colin.i.king@gmail.com> Date: Thu Sep 2 23:56:23 2021 +0100 skbuff: clean up inconsistent indenting There is a statement that is indented one character too deeply, clean this up. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:10 +02:00
Jiri Benc	e17e09a099	net: Clear mono_delivery_time bit in __skb_tstamp_tx() Bugzilla: https://bugzilla.redhat.com/2120966 commit d93376f503c7a586707925957592c0f16f4db0b1 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:55:44 2022 -0800 net: Clear mono_delivery_time bit in __skb_tstamp_tx() In __skb_tstamp_tx(), it may clone the egress skb and queues the clone to the sk_error_queue. The outgoing skb may have the mono delivery_time while the (rcv) timestamp is expected for the clone, so the skb->mono_delivery_time bit needs to be cleared from the clone. This patch adds the skb->mono_delivery_time clearing to the existing __net_timestamp() and use it in __skb_tstamp_tx(). The __net_timestamp() fast path usage in dev.c is changed to directly call ktime_get_real() since the mono_delivery_time bit is not set at that point. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:00 +02:00
Jiri Benc	2e725d3634	net: Add skb_clear_tstamp() to keep the mono delivery_time Bugzilla: https://bugzilla.redhat.com/2120966 commit de799101519aad23c6096041ba2744d7b5517e6a Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:55:31 2022 -0800 net: Add skb_clear_tstamp() to keep the mono delivery_time Right now, skb->tstamp is reset to 0 whenever the skb is forwarded. If skb->tstamp has the mono delivery_time, clearing it can hurt the performance when it finally transmits out to fq@phy-dev. The earlier patch added a skb->mono_delivery_time bit to flag the skb->tstamp carrying the mono delivery_time. This patch adds skb_clear_tstamp() helper which keeps the mono delivery_time and clears everything else. The delivery_time clearing will be postponed until the stack knows the skb will be delivered locally. It will be done in a later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:57:59 +02:00
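The behaviour that the skb_clear_tstamp() commit describes, keeping a mono delivery_time and clearing any other timestamp, fits in a few lines. The sketch below is a user-space model: `struct mock_skb_ts` and `mock_clear_tstamp()` are hypothetical names, and only the two fields mentioned in the message are represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two sk_buff fields the commit talks about. */
struct mock_skb_ts {
	int64_t tstamp;			/* timestamp value */
	bool mono_delivery_time;	/* tstamp holds a mono delivery_time */
};

/* Keep a mono delivery_time across forwarding, clear anything else. */
void mock_clear_tstamp(struct mock_skb_ts *skb)
{
	assert(skb);
	if (skb->mono_delivery_time)
		return;
	skb->tstamp = 0;
}
```

A forwarded packet with a receive timestamp is reset to 0 as before, while one carrying a delivery time keeps it for the queueing discipline on the egress device.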
Frantisek Hrbata	fa843be1d1	Merge: net: add skb drop reasons MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Sync skb drop reasons with upstream to improve debuggability and visibility in the net stack. This MR helps in understanding why a given packet is being dropped. One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint: ``` # perf record -e skb:kfree_skb -a sleep 10 # perf script swapper 0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED swapper 0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE ``` Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-10-24 14:27:58 -04:00
Antoine Tenart	6f2e7329d3	net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 9cb252c4c1c53ae58bc565bab76e98133288f23a Author: Menglong Dong <imagedong@tencent.com> Date: Mon Sep 5 11:50:15 2022 +0800 net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM As Eric reported, the 'reason' field is not presented when trace the kfree_skb event by perf: $ perf record -e skb:kfree_skb -a sleep 10 $ perf script ip_defrag 14605 [021] 221.614303: skb:kfree_skb: skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1 reason: The cause seems to be passing kernel address directly to TP_printk(), which is not right. As the enum 'skb_drop_reason' is not exported to user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason string from the 'reason' field, which is a number. Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used to define the trace enum by TRACE_DEFINE_ENUM(). With the help of DEFINE_DROP_REASON(), now we can remove the auto-generate that we introduced in the commit ec43908dd556 ("net: skb: use auto-generation to convert skb drop reason to string"), and define the string array 'drop_reasons'. Hmmmm...now we come back to the situation that have to maintain drop reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they are both in dropreason.h, which makes it easier. 
After this commit, now the format of kfree_skb is like this: $ cat /tracing/events/skb/kfree_skb/format name: kfree_skb ID: 1524 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:void * skbaddr; offset:8; size:8; signed:0; field:void * location; offset:16; size:8; signed:0; field:unsigned short protocol; offset:24; size:2; signed:0; field:enum skb_drop_reason reason; offset:28; size:4; signed:0; print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ...... Fixes: ec43908dd556 ("net: skb: use auto-generation to convert skb drop reason to string") Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/ Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-14 17:40:26 +02:00
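The commit above describes one list macro, DEFINE_DROP_REASON(), being expanded for TRACE_DEFINE_ENUM() and for the string array 'drop_reasons'. A minimal user-space sketch of that general "list macro expanded several ways" idea follows. It is not the kernel code: DEMO_REASON_LIST, the demo_* names, and the three-entry list are hypothetical, and unlike the commit (which keeps the enum and the macro as two lists in dropreason.h) this sketch also generates the enum from the list.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, shortened reason list; the real list lives in dropreason.h. */
#define DEMO_REASON_LIST(X)  \
	X(NOT_SPECIFIED)     \
	X(NO_SOCKET)         \
	X(TCP_INVALID_SEQUENCE)

/* Expansion 1: enum constants. 0 is reserved for "not dropped yet". */
#define DEMO_AS_ENUM(name) DEMO_DROP_REASON_##name,
enum demo_drop_reason {
	DEMO_NOT_DROPPED_YET = 0,
	DEMO_REASON_LIST(DEMO_AS_ENUM)
	DEMO_DROP_REASON_MAX
};

/* Expansion 2: string table indexed by the enum value. */
#define DEMO_AS_STRING(name) [DEMO_DROP_REASON_##name] = #name,
static const char *const demo_drop_reasons[DEMO_DROP_REASON_MAX] = {
	DEMO_REASON_LIST(DEMO_AS_STRING)
};

/* Map a numeric reason to its name; out-of-range values get "UNKNOWN". */
static const char *demo_reason_str(int reason)
{
	if (reason <= DEMO_NOT_DROPPED_YET || reason >= DEMO_DROP_REASON_MAX)
		return "UNKNOWN";
	return demo_drop_reasons[reason];
}
```

With this numbering NOT_SPECIFIED is 1 and NO_SOCKET is 2, matching the two __print_symbolic() entries shown in the format output above; the value of the third entry is an artifact of the shortened list.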
Antoine Tenart	b45adccfcf	net: skb: prevent the split of kfree_skb_reason() by gcc Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: net-next.git commit c205cc7534a97f2d6fbd2a23a94ed7c036c6e2aa Author: Menglong Dong <imagedong@tencent.com> Date: Sun Aug 21 13:18:58 2022 +0800 net: skb: prevent the split of kfree_skb_reason() by gcc Sometimes, gcc will optimize the function by spliting it to two or more functions. In this case, kfree_skb_reason() is splited to kfree_skb_reason and kfree_skb_reason.part.0. However, the function/tracepoint trace_kfree_skb() in it needs the return address of kfree_skb_reason(). This split makes the call chains becomes: kfree_skb_reason() -> kfree_skb_reason.part.0 -> trace_kfree_skb() which makes the return address that passed to trace_kfree_skb() be kfree_skb(). Therefore, introduce '__fix_address', which is the combination of '__noclone' and 'noinline', and apply it to kfree_skb_reason() to prevent to from being splited or made inline. (Is it better to simply apply '__noclone oninline' to kfree_skb_reason? I'm thinking maybe other functions have the same problems) Meanwhile, wrap 'skb_unref()' with 'unlikely()', as the compiler thinks it is likely return true and splits kfree_skb_reason(). Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-14 17:40:26 +02:00
Antoine Tenart	21d9800dd4	net: skb: use auto-generation to convert skb drop reason to string Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit ec43908dd556b2292f028c6e412261689405ba6e Author: Menglong Dong <imagedong@tencent.com> Date: Mon Jun 6 10:24:35 2022 +0800 net: skb: use auto-generation to convert skb drop reason to string It is annoying to add new skb drop reasons to 'enum skb_drop_reason' and TRACE_SKB_DROP_REASON in trace/event/skb.h, and it's easy to forget to add the new reasons we added to TRACE_SKB_DROP_REASON. TRACE_SKB_DROP_REASON is used to convert drop reason of type number to string. For now, the string we passed to user space is exactly the same as the name in 'enum skb_drop_reason' with a 'SKB_DROP_REASON_' prefix. Therefore, we can use 'auto-generation' to generate these drop reasons to string at build time. The new source 'dropreason_str.c' will be auto generated during build time, which contains the string array 'const char * const drop_reasons[]'. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-14 17:40:26 +02:00
Antoine Tenart	f301349869	net: skb: check the boundrary of drop reason in kfree_skb_reason() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git Conflicts: - Can't use DEBUG_NET_WARN_ON_ONCE as upstream commit d268c1f5cfc9 ("net: add CONFIG_DEBUG_NET") is not in c9s yet. Resolve the conflict by using the define used when CONFIG_DEBUG_NET=n upstream, BUILD_BUG_ON_INVALID. commit 20bbcd0a94c6686c2692e6f7081163c233d7ce40 Author: Menglong Dong <imagedong@tencent.com> Date: Fri May 13 11:03:37 2022 +0800 net: skb: check the boundrary of drop reason in kfree_skb_reason() Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after we make it the return value of the functions with return type of enum skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason(). So we check the range of drop reason in kfree_skb_reason() with DEBUG_NET_WARN_ON_ONCE(). Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-14 17:40:00 +02:00
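The range check described in the commit above can be sketched in user space as follows. This is an illustration only: the chk_* names and the tiny enum are hypothetical, and a plain "warn once" flag stands in for DEBUG_NET_WARN_ON_ONCE(), which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for enum skb_drop_reason; 0 means "not dropped yet". */
enum chk_reason {
	CHK_NOT_DROPPED_YET = 0,
	CHK_NOT_SPECIFIED,
	CHK_NO_SOCKET,
	CHK_REASON_MAX
};

/* Number of warnings emitted so far (exposed so callers can observe it). */
static int chk_warnings;

/* A reason is valid only strictly between 0 and the MAX sentinel. */
static bool chk_reason_valid(int reason)
{
	return reason > CHK_NOT_DROPPED_YET && reason < CHK_REASON_MAX;
}

/* Model of a "free with reason" entry point that warns once on bad input. */
static void chk_free_with_reason(int reason)
{
	static bool warned;

	if (!chk_reason_valid(reason) && !warned) {
		warned = true;
		chk_warnings++;
		fprintf(stderr, "invalid drop reason %d\n", reason);
	}
	/* Releasing the buffer would happen here in a real implementation. */
}
```

The point of the check is that a leftover 0 (the "not dropped yet" value) reaching the free path signals a caller that forgot to set a reason.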
Antoine Tenart	55115540c4	net: skb: introduce the function kfree_skb_list_reason() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 215b0f1963d4e34fccac6992b3debe26f78a6eb8 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:41 2022 +0800 net: skb: introduce the function kfree_skb_list_reason() To report reasons of skb drops, introduce the function kfree_skb_list_reason() and make kfree_skb_list() an inline call to it. This function will be used in the next commit in __dev_xmit_skb(). Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
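The shape of the change above, a list-freeing function that takes a reason plus an inline wrapper that keeps the old call signature, might look like this in a user-space model. All lst_* names are hypothetical, and the node type is a plain singly linked list rather than struct sk_buff.

```c
#include <assert.h>
#include <stdlib.h>

#define LST_REASON_NOT_SPECIFIED 1	/* hypothetical default reason */

struct lst_buf {
	struct lst_buf *next;
};

static int lst_freed;		/* how many nodes were released */
static int lst_last_reason;	/* reason recorded for the last release */

/* Release one node and record why it was dropped. */
static void lst_free_reason(struct lst_buf *b, int reason)
{
	lst_last_reason = reason;
	lst_freed++;
	free(b);
}

/* Walk the list, reading ->next before the node is freed. */
static void lst_free_list_reason(struct lst_buf *head, int reason)
{
	while (head) {
		struct lst_buf *next = head->next;

		lst_free_reason(head, reason);
		head = next;
	}
}

/* The old entry point becomes an inline call with a default reason. */
static inline void lst_free_list(struct lst_buf *head)
{
	lst_free_list_reason(head, LST_REASON_NOT_SPECIFIED);
}

/* Helper for experiments: build a list of n zeroed nodes (NULL on failure). */
static struct lst_buf *lst_make_list(int n)
{
	struct lst_buf *head = NULL;

	while (n-- > 0) {
		struct lst_buf *b = calloc(1, sizeof(*b));

		if (!b) {
			lst_free_list(head);
			return NULL;
		}
		b->next = head;
		head = b;
	}
	return head;
}
```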
Paolo Abeni	5932d4a818	net: Fix a data-race around sysctl_tstamp_allow_data. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit d2154b0afa73c0159b2856f875c6b4fe7cf6a95e Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:50 2022 -0700 net: Fix a data-race around sysctl_tstamp_allow_data. While reading sysctl_tstamp_allow_data, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: `b245be1f4d` ("net-timestamp: no-payload only sysctl") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:03 +02:00
Ivan Vecera	756015f0e8	net: gro: move skb_gro_receive into net/core/gro.c Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 commit e456a18a390b96f22b0de2acd4d0f49c72ed2280 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 09:05:53 2021 -0800 net: gro: move skb_gro_receive into net/core/gro.c net/core/gro.c will contain all core gro functions, to shrink net/core/skbuff.c and net/core/dev.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:40 +02:00
Ivan Vecera	554594fd78	net: gro: move skb_gro_receive_list to udp_offload.c Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 commit 0b935d7f8c07bf0a192712bdbf76dbf45ef8b115 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 09:05:52 2021 -0800 net: gro: move skb_gro_receive_list to udp_offload.c This helper is used once, no need to keep it in fat net/core/skbuff.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:40 +02:00
Ivan Vecera	0c79035b3b	net: move gro definitions to include/net/gro.h Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 Conflicts: - context conflict due to missing 92552d3abd32 ("net/mlx5e: HW_GRO cqe handler implementation") commit 4721031c3559db8eae61df305f10c00099a7c1d0 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 09:05:51 2021 -0800 net: move gro definitions to include/net/gro.h include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:40 +02:00
Patrick Talbert	8c5b3f7fd9	Merge: XDP and networking eBPF rebase to v5.15 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 Depends: !572 Tested: Using bpf selftests, everything passes. This rebases XDP and networking eBPF to upstream kernel version 5.15. Signed-off-by: Jiri Benc <jbenc@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Toke Høiland-Jørgensen <toke@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-06-03 09:26:25 +02:00
Patrick Talbert	6da9f3de35	Merge: net: drop_monitor: support drop reason MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/849 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083432 After commit c504e5c2f964 ("net: skb: introduce kfree_skb_reason()"), we have supported drop reason. So let's add this feature to drop monitor. Signed-off-by: Hangbin Liu <haliu@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-05-25 09:28:10 +02:00
Patrick Talbert	f311aab772	Merge: net: backport core fixes from upstream MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/832 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 A bunch of fixes for net core path. Signed-off-by: Hangbin Liu <haliu@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-05-18 10:58:56 +02:00
Jiri Benc	1ad710c301	skbuff: fix coalescing for page_pool fragment recycling Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 commit 1effe8ca4e34c34cdd9318436a4232dcb582ebf4 Author: Jean-Philippe Brucker <jean-philippe@linaro.org> Date: Thu Mar 31 11:24:41 2022 +0100 skbuff: fix coalescing for page_pool fragment recycling Fix a use-after-free when using page_pool with page fragments. We encountered this problem during normal RX in the hns3 driver: (1) Initially we have three descriptors in the RX queue. The first one allocates PAGE1 through page_pool, and the other two allocate one half of PAGE2 each. Page references look like this: RX_BD1 _______ PAGE1 RX_BD2 _______ PAGE2 RX_BD3 _________/ (2) Handle RX on the first descriptor. Allocate SKB1, eventually added to the receive queue by tcp_queue_rcv(). (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to netif_receive_skb(): netif_receive_skb(SKB2) ip_rcv(SKB2) SKB3 = skb_clone(SKB2) SKB2 and SKB3 share a reference to PAGE2 through skb_shinfo()->dataref. The other ref to PAGE2 is still held by RX_BD3: SKB2 ---+- PAGE2 SKB3 __/ / RX_BD3 _________/ (3b) Now while handling TCP, coalesce SKB3 with SKB1: tcp_v4_rcv(SKB3) tcp_try_coalesce(to=SKB1, from=SKB3) // succeeds kfree_skb_partial(SKB3) skb_release_data(SKB3) // drops one dataref SKB1 _____ PAGE1 \____ SKB2 _____ PAGE2 / RX_BD3 _________/ In skb_try_coalesce(), __skb_frag_ref() takes a page reference to PAGE2, where it should instead have increased the page_pool frag reference, pp_frag_count. Without coalescing, when releasing both SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now when releasing SKB1 and SKB2, two references to PAGE2 will be dropped, resulting in underflow. 
(3c) Drop SKB2: af_packet_rcv(SKB2) consume_skb(SKB2) skb_release_data(SKB2) // drops second dataref page_pool_return_skb_page(PAGE2) // drops one pp_frag_count SKB1 _____ PAGE1 \____ PAGE2 / RX_BD3 _________/ (4) Userspace calls recvmsg() Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we release the SKB3 page as well: tcp_eat_recv_skb(SKB1) skb_release_data(SKB1) page_pool_return_skb_page(PAGE1) page_pool_return_skb_page(PAGE2) // drops second pp_frag_count (5) PAGE2 is freed, but the third RX descriptor was still using it! In our case this causes IOMMU faults, but it would silently corrupt memory if the IOMMU was disabled. Change the logic that checks whether pp_recycle SKBs can be coalesced. We still reject differing pp_recycle between 'from' and 'to' SKBs, but in order to avoid the situation described above, we also reject coalescing when both 'from' and 'to' are pp_recycled and 'from' is cloned. The new logic allows coalescing a cloned pp_recycle SKB into a page refcounted one, because in this case the release (4) will drop the right reference, the one taken by skb_try_coalesce(). Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool") Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-05-12 17:29:54 +02:00
Jiri Benc	7e6f15045c	net: in_irq() cleanup Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 commit afa79d08c6c8e1901cb1547591e3ccd3ec6965d9 Author: Changbin Du <changbin.du@intel.com> Date: Fri Aug 13 22:57:49 2021 +0800 net: in_irq() cleanup Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Signed-off-by: Changbin Du <changbin.du@gmail.com> Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-05-12 17:29:49 +02:00
Hangbin Liu	546f7472fa	net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083432 Upstream Status: net.git commit ef527f968ae0 commit ef527f968ae05c6717c39f49c8709a7e2c19183a Author: Eric Dumazet <edumazet@google.com> Date: Sun Feb 20 07:40:52 2022 -0800 net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends Whenever one of these functions pull all data from an skb in a frag_list, use consume_skb() instead of kfree_skb() to avoid polluting drop monitoring. Fixes: `6fa01ccd88` ("skbuff: Add pskb_extract() helper function") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220220154052.1308469-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-10 11:13:06 +08:00
Hangbin Liu	9de2441ab3	net: fix up skbs delta_truesize in UDP GRO frag_list Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 224102de2ff1 commit 224102de2ff105a2c05695e66a08f4b5b6b2d19c Author: lena wang <lena.wang@mediatek.com> Date: Tue Mar 1 19:17:09 2022 +0800 net: fix up skbs delta_truesize in UDP GRO frag_list The truesize for a UDP GRO packet is added by main skb and skbs in main skb's frag_list: skb_gro_receive_list p->truesize += skb->truesize; The commit `53475c5dd8` ("net: fix use-after-free when UDP GRO with shared fraglist") introduced a truesize increase for frag_list skbs. When uncloning skb, it will call pskb_expand_head and trusesize for frag_list skbs may increase. This can occur when allocators uses __netdev_alloc_skb and not jump into __alloc_skb. This flow does not use ksize(len) to calculate truesize while pskb_expand_head uses. skb_segment_list err = skb_unclone(nskb, GFP_ATOMIC); pskb_expand_head if (!skb->sk || skb->destructor == sock_edemux) skb->truesize += size - osize; If we uses increased truesize adding as delta_truesize, it will be larger than before and even larger than previous total truesize value if skbs in frag_list are abundant. The main skb truesize will become smaller and even a minus value or a huge value for an unsigned int parameter. Then the following memory check will drop this abnormal skb. To avoid this error we should use the original truesize to segment the main skb. Fixes: `53475c5dd8` ("net: fix use-after-free when UDP GRO with shared fraglist") Signed-off-by: lena wang <lena.wang@mediatek.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:57 +08:00
Hangbin Liu	2af4b6bca6	net: preserve skb_end_offset() in skb_unclone_keeptruesize() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 2b88cba55883 commit 2b88cba55883eaafbc9b7cbff0b2c7cdba71ed01 Author: Eric Dumazet <edumazet@google.com> Date: Mon Feb 21 19:21:13 2022 -0800 net: preserve skb_end_offset() in skb_unclone_keeptruesize() syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len) in skb_try_coalesce() [1] I was able to root cause the issue to kfence. When kfence is in action, the following assertion is no longer true: int size = xxxx; void *ptr1 = kmalloc(size, gfp); void *ptr2 = kmalloc(size, gfp); if (ptr1 && ptr2) ASSERT(ksize(ptr1) == ksize(ptr2)); We attempted to fix these issues in the blamed commits, but forgot that TCP was possibly shifting data after skb_unclone_keeptruesize() has been used, notably from tcp_retrans_try_collapse(). So we not only need to keep same skb->truesize value, we also need to make sure TCP wont fill new tailroom that pskb_expand_head() was able to get from a addr = kmalloc(...) followed by ksize(addr) Split skb_unclone_keeptruesize() into two parts: 1) Inline skb_unclone_keeptruesize() for the common case, when skb is not cloned. 2) Out of line __skb_unclone_keeptruesize() for the 'slow path'. 
WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295 Modules linked in: CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742eb2b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295 Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa RSP: 0018:ffffc900063af268 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000 RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003 RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000 R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8 R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155 FS: 00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline] tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630 tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914 tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025 tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947 tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719 sk_backlog_rcv include/net/sock.h:1037 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2779 release_sock+0x54/0x1b0 net/core/sock.c:3311 sk_wait_data+0x177/0x450 net/core/sock.c:2821 tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457 tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572 inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850 sock_recvmsg_nosec net/socket.c:948 [inline] 
sock_recvmsg net/socket.c:966 [inline] sock_recvmsg net/socket.c:962 [inline] ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632 ___sys_recvmsg+0x127/0x200 net/socket.c:2674 __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: c4777efa751d ("net: add and use skb_unclone_keeptruesize() helper") Fixes: `097b9146c0` ("net: fix up truesize of cloned skb in skb_prepare_for_shift()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:57 +08:00
Hangbin Liu	4c2b91c73f	net: add skb_set_end_offset() helper Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 763087dab975 commit 763087dab97547230a6807c865a6a5ae53a59247 Author: Eric Dumazet <edumazet@google.com> Date: Mon Feb 21 19:21:12 2022 -0800 net: add skb_set_end_offset() helper We have multiple places where this helper is convenient, and plan using it in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:57 +08:00
Hangbin Liu	e333d6a1da	net-timestamp: convert sk->sk_tskey to atomic_t Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit a1cdec57e03a commit a1cdec57e03a1352e92fbbe7974039dda4efcec0 Author: Eric Dumazet <edumazet@google.com> Date: Thu Feb 17 09:05:02 2022 -0800 net-timestamp: convert sk->sk_tskey to atomic_t UDP sendmsg() can be lockless, this is causing all kinds of data races. This patch converts sk->sk_tskey to remove one of these races. BUG: KCSAN: data-race in __ip_append_data / __ip_append_data read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1: __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0: __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 
arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000054d -> 0x0000054e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `09c2d251b7` ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:57 +08:00
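The race reported above is a plain read-modify-write of a shared counter from two lockless senders. A user-space sketch of the fix's idea, using C11 atomics in place of the kernel's atomic_t, is shown below; struct tsk_sock and tsk_next_key() are hypothetical names, and the kernel's own atomic helpers are not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the socket field that holds the timestamp key. */
struct tsk_sock {
	atomic_uint tskey;
};

/*
 * Hand out the current key and advance the counter in one atomic step,
 * so two concurrent senders can never observe the same value.
 */
static unsigned int tsk_next_key(struct tsk_sock *sk)
{
	return atomic_fetch_add(&sk->tskey, 1u);
}
```

Because the increment and the read happen as one operation, the "value changed: 0x0000054d -> 0x0000054e" interleaving from the KCSAN report cannot yield duplicate keys in this model.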
Hangbin Liu	24d15b7d24	net: Fix double 0x prefix print in SKB dump Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 8a03ef676ade commit 8a03ef676ade55182f9b05115763aeda6dc08159 Author: Gal Pressman <gal@nvidia.com> Date: Thu Dec 16 11:28:25 2021 +0200 net: Fix double 0x prefix print in SKB dump When printing netdev features %pNF already takes care of the 0x prefix, remove the explicit one. Fixes: `6413139dfc` ("skbuff: increase verbosity when dumping skb data") Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:41 +08:00
Ivan Vecera	03eba5553a	skbuff: introduce skb_pull_data Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2078759 commit 13244cccc2b61ec715f0ac583d3037497004d4a5 Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Date: Wed Dec 1 10:54:52 2021 -0800 skbuff: introduce skb_pull_data Like skb_pull but returns the original data pointer before pulling the data after performing a check against sbk->len. This allows to change code that does "struct foo *p = (void *)skb->data;" which is hard to audit and error prone, to: p = skb_pull_data(skb, sizeof(*p)); if (!p) return; Which is both safer and cleaner. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-04-26 09:22:23 +02:00
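The "check length, then return the old pointer and advance" pattern described above can be tried out in user space with a simplified buffer. This is a sketch under assumptions: struct pd_pkt, struct pd_hdr, and pd_pull_data() are hypothetical stand-ins, not struct sk_buff or the kernel helper itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified packet view: a cursor into the data and the bytes remaining. */
struct pd_pkt {
	unsigned char *data;
	size_t len;
};

/* Hypothetical two-byte header used to exercise the helper. */
struct pd_hdr {
	unsigned char type;
	unsigned char plen;
};

/*
 * Return the current data pointer and consume len bytes, or return NULL
 * and leave the packet untouched when fewer than len bytes remain.
 */
static void *pd_pull_data(struct pd_pkt *p, size_t len)
{
	void *old = p->data;

	if (p->len < len)
		return NULL;
	p->data += len;
	p->len -= len;
	return old;
}
```

A caller then writes `h = pd_pull_data(&pkt, sizeof(*h)); if (!h) return;`, which mirrors the usage quoted in the commit message and keeps the length check next to the cast.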
Herton R. Krzesinski	3e26d2a862	Merge: net: backports before kABI freeze MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/407 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382 Tested: ENRT Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2028420 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2037783 Includes patches that would break kABI without backporting the full series they are taken from, which we will do later (post-freeze). The following fixes were omitted as the backport of commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") is a partial one not introducing the issues. Omitted-fix: ffef737fd037 ("net/tls: Fix skb memory leak when running kTLS traffic") Omitted-fix: db094aa8140e ("net/tls: Fix another skb memory leak when running kTLS traffic") Omitted-fix: 79074a72d335 ("net: Flush deferred skb free on socket destroy") Omitted-fix: ebdc1a030962 ("tcp: add a missing sk_defer_free_flush() in tcp_splice_read()") Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Sabrina Dubroca <sdubroca@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-02-07 15:11:27 +00:00
Antoine Tenart	73db850d41	net: use sk_is_tcp() in more places Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382 Upstream Status: linux.git Tested: ENRT commit 42f67eea3ba36cef2dce2e853de6ddcb2e89eb39 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 11:02:33 2021 -0800 net: use sk_is_tcp() in more places Move sk_is_tcp() to include/net/sock.h and use it where we can. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-01-21 16:26:18 +01:00
Antoine Tenart	4a0269b225	net: skb: introduce kfree_skb_reason() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931 Upstream Status: linux.git Tested: Instructions in bz commit c504e5c2f9648a1e5c2be01e8c3f59d394192bd3 Author: Menglong Dong <imagedong@tencent.com> Date: Sun Jan 9 14:36:26 2022 +0800 net: skb: introduce kfree_skb_reason() Introduce the interface kfree_skb_reason(), which is able to pass the reason why the skb is dropped to 'kfree_skb' tracepoint. Add the 'reason' field to 'trace_kfree_skb', therefor user can get more detail information about abnormal skb with 'drop_monitor' or eBPF. All drop reasons are defined in the enum 'skb_drop_reason', and they will be print as string in 'kfree_skb' tracepoint in format of 'reason: XXX'. ( Maybe the reasons should be defined in a uapi header file, so that user space can use them? ) Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-01-21 10:05:00 +01:00
Herton R. Krzesinski	b8f20958b7	Merge: net: core stable backport for rhel 9.0 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/212 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 This includes a few critical bugfixes for the core network stack. Notably it includes 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly") and a whole series of pre-requisites. The bug addressed there is nasty and present even prior to skb_expand_head() introduction. commit 719c57197010 ("net: make napi_disable() symmetric with enable") instead has been explicitly excluded, as it's not really a fix, is known to introduce problems and it's still quite new Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Guillaume Nault <gnault@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-14 16:53:21 +00:00
Paolo Abeni	cf96a90b97	net: fix GRO skb truesize update Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit af352460b465d7a8afbeb3be07c0268d1d48a4d7 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Aug 4 21:07:00 2021 +0200 net: fix GRO skb truesize update commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") introduces a serious regression at the GRO layer setting the wrong truesize for stolen-head skbs. Restore the correct truesize: SKB_DATA_ALIGN(...) instead of SKB_TRUESIZE(...) Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Fixes: 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:58:42 +01:00
Paolo Abeni	2bea014388	skbuff: allow 'slow_gro' for skb carring sock reference Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit 5e10da5385d20c4bae587bc2921e5fdd9655d5fc Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Jul 28 18:24:03 2021 +0200 skbuff: allow 'slow_gro' for skb carring sock reference This change leverages the infrastructure introduced by the previous patches to allow soft devices passing to the GRO engine owned skbs without impacting the fast-path. It's up to the GRO caller ensuring the slow_gro bit validity before invoking the GRO engine. The new helper skb_prepare_for_gro() is introduced for that goal. On slow_gro, skbs are aggregated only with equal sk. Additionally, skb truesize on GRO recycle and free is correctly updated so that sk wmem is not changed by the GRO processing. rfc-> v1: - fixed bad truesize on dev_gro_receive NAPI_FREE - use the existing state bit Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:57:52 +01:00
Paolo Abeni	9ce6ef4e71	net: optimize GRO for the common case. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit 9efb4b5baf6ce851b247288992b0632cb4d31c17 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Jul 28 18:24:02 2021 +0200 net: optimize GRO for the common case. After the previous patches, at GRO time, skb->slow_gro is usually 0, unless the packet comes from some H/W offload slowpath or tunnel. We can optimize the GRO code assuming !skb->slow_gro is likely. This removes multiple conditionals in the most common path, at the price of an additional one when we hit the above "slow-paths". Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:57:26 +01:00
Paolo Abeni	615b5bcea7	sk_buff: track extension status in slow_gro Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit b0999f385ac30cb17880ae1c1512491fbf0c9542 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Jul 28 18:24:01 2021 +0200 sk_buff: track extension status in slow_gro Similar to the previous one, but tracking the active_extensions field status. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:57:10 +01:00
Paolo Abeni	ca25f913f2	net: add and use skb_unclone_keeptruesize() helper Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 Upstream commit: commit c4777efa751d293e369aec464ce6875e957be255 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 1 17:45:55 2021 -0700 net: add and use skb_unclone_keeptruesize() helper While commit `097b9146c0` ("net: fix up truesize of cloned skb in skb_prepare_for_shift()") fixed immediate issues found when KFENCE was enabled/tested, there are still similar issues, when tcp_trim_head() hits KFENCE while the master skb is cloned. This happens under heavy networking TX workloads, when the TX completion might be delayed after incoming ACK. This patch fixes the WARNING in sk_stream_kill_queues when sk->sk_mem_queued/sk->sk_forward_alloc are not zero. Fixes: `d3fb45f370` ("mm, kfence: insert KFENCE hooks for SLAB") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 10:44:31 +01:00
Paolo Abeni	fcbf308cb4	skb_expand_head() adjust skb->truesize incorrectly Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 Upstream commit: commit 7f678def99d29c520418607509bb19c7fc96a6db Author: Vasily Averin <vvs@virtuozzo.com> Date: Fri Oct 22 13:28:37 2021 +0300 skb_expand_head() adjust skb->truesize incorrectly Christoph Paasch reports [1] an incorrect skb->truesize after the skb_expand_head() call in ip6_xmit. This may happen for two reasons: - skb_set_owner_w() for the newly cloned skb is called too early, before pskb_expand_head() where truesize is adjusted for the (!skb->sk) case. - pskb_expand_head() does not adjust truesize in the (skb->sk) case. In this case sk->sk_wmem_alloc should be adjusted too. [1] https://lkml.org/lkml/2021/8/20/1082 Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()") Fixes: `2d85a1b31d` ("ipv6: ip6_finish_output2: set sk into newly allocated nskb") Reported-by: Christoph Paasch <christoph.paasch@gmail.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/644330dd-477e-0462-83bf-9f514c41edd1@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 10:44:31 +01:00
Paolo Abeni	17a5777943	skbuff: introduce skb_expand_head() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 Upstream commit: commit f1260ff15a71b8fc122b2c9abd8a7abffb6e0168 Author: Vasily Averin <vvs@virtuozzo.com> Date: Mon Aug 2 11:52:15 2021 +0300 skbuff: introduce skb_expand_head() Like skb_realloc_headroom(), the new helper increases the headroom of the specified skb. Unlike skb_realloc_headroom(), it does not allocate a new skb if possible; it copies skb->sk to the new skb when needed and frees the original skb in case of failure. This helps to simplify ip[6]_finish_output2() and a few other similar cases. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 10:44:30 +01:00
Pravin B Shelar	a17ad09617	net: Fix zero-copy head len calculation. In some cases the skb head could be locked and the entire header data is pulled from the skb. When skb_zerocopy() is called in such cases, the following BUG is triggered. This patch fixes it by copying the entire skb in such cases. This could be optimized in case this is a performance bottleneck. ---8<--- kernel BUG at net/core/skbuff.c:2961! invalid opcode: 0000 [#1] SMP PTI CPU: 2 PID: 0 Comm: swapper/2 Tainted: G OE 5.4.0-77-generic #86-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:skb_zerocopy+0x37a/0x3a0 RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246 Call Trace: <IRQ> queue_userspace_packet+0x2af/0x5e0 [openvswitch] ovs_dp_upcall+0x3d/0x60 [openvswitch] ovs_dp_process_packet+0x125/0x150 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] netdev_port_receive+0x87/0x130 [openvswitch] netdev_frame_hook+0x4b/0x60 [openvswitch] __netif_receive_skb_core+0x2b4/0xc90 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x18/0x60 process_backlog+0xa9/0x160 net_rx_action+0x142/0x390 __do_softirq+0xe1/0x2d6 irq_exit+0xae/0xb0 do_IRQ+0x5a/0xf0 common_interrupt+0xf/0xf Code that triggered BUG: int skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen) { int i, j = 0; int plen = 0; /* length of skb->head fragment */ int ret; struct page *page; unsigned int offset; BUG_ON(!from->head_frag && !hlen); Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-18 09:42:17 -07:00
Ilias Apalodimas	2cc3aeb5ec	skbuff: Fix a potential race while recycling page_pool packets As Alexander points out, when we are trying to recycle a cloned/expanded SKB we might trigger a race. The recycling code relies on the pp_recycle bit to trigger, which we carry over to cloned SKBs. If that cloned SKB gets expanded or if we get references to the frags, call skb_release_data() and overwrite skb->head, we are creating separate instances accessing the same page frags. Since skb_release_data() will first try to recycle the frags, there's a potential race between the original and cloned SKB, since both will have the pp_recycle bit set. Fix this by explicitly marking those SKBs not recyclable. The atomic_sub_return effectively limits us to a single release case, and when we are calling skb_release_data we are also releasing the option to perform the recycling, or releasing the pages from the page pool. Fixes: `6a5bcd84e8` ("page_pool: Allow drivers to hint on SKB recycling") Reported-by: Alexander Duyck <alexanderduyck@fb.com> Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-16 11:37:00 -07:00
Paul Blakey	8550ff8d8c	skbuff: Release nfct refcount on napi stolen or re-used skbs When multiple SKBs are merged to a new skb under napi GRO, or SKB is re-used by napi, if nfct was set for them in the driver, it will not be released while freeing their stolen head state or on re-use. Release nfct on napi's stolen or re-used SKBs, and in gro_list_prepare, check conntrack metadata diff. Fixes: `5c6b946047` ("net/mlx5e: CT: Handle misses after executing CT action") Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-06 10:26:29 -07:00
Alexander Aring	e3ae2365ef	net: sock: introduce sk_error_report This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-29 11:28:21 -07:00
Jakub Kicinski	adc2e56ebe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-18 19:47:02 -07:00
Willem de Bruijn	3bdd5ee0ec	skbuff: fix incorrect msg_zerocopy copy notifications msg_zerocopy signals if a send operation required copying with a flag in serr->ee.ee_code. This field can be incorrect as of the below commit, as a result of both structs uarg and serr pointing into the same skb->cb[]. uarg->zerocopy must be read before skb->cb[] is reinitialized to hold serr. Similar to other fields len, hi and lo, use a local variable to temporarily hold the value. This was not a problem before, when the value was passed as a function argument. Fixes: `75518851a2` ("skbuff: Push status and refcounts into sock_zerocopy_callback") Reported-by: Talal Ahmad <talalahmad@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-10 13:39:57 -07:00
Ilias Apalodimas	6a5bcd84e8	page_pool: Allow drivers to hint on SKB recycling Up to now several high speed NICs have custom mechanisms of recycling the allocated memory they use for their payloads. Our page_pool API already has recycling capabilities that are always used when we are running in 'XDP mode'. So let's tweak the API and the kernel network stack slightly and allow the recycling to happen even during the standard operation. The API doesn't take into account 'split page' policies used by those drivers currently, but can be extended once we have users for that. The idea is to be able to intercept the packet on skb_release_data(). If it's a buffer coming from our page_pool API recycle it back to the pool for further usage or just release the packet entirely. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and a field in struct page (page->pp) to store the page_pool pointer. Storing the information in page->pp allows us to recycle both SKBs and their fragments. We could have skipped the skb bit entirely, since identical information can be derived from struct page. However, in an effort to affect the free path as little as possible, reading a single bit in the skb which is already in cache is better than trying to derive identical information from the page stored data. The driver or page_pool has to take care of the sync operations on its own during the buffer recycling since the buffer is, after opting-in to the recycling, never unmapped. Since the gain on the drivers depends on the architecture, we are not enabling recycling by default if the page_pool API is used on a driver. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->pp and enable the recycling bit, or page_pool_store_mem_info() for a fragment. Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 14:11:47 -07:00
Matteo Croce	c420c98982	skbuff: add a parameter to __skb_frag_unref This is a prerequisite patch, the next one is enabling recycling of skbs and fragments. Add an extra argument on __skb_frag_unref() to handle recycling, and update the current users of the function with that. Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-07 14:11:47 -07:00
Linus Torvalds	9d31d23389	Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - bpf: - allow bpf programs calling kernel functions (initially to reuse TCP congestion control implementations) - enable task local storage for tracing programs - remove the need to store per-task state in hash maps, and allow tracing programs access to task local storage previously added for BPF_LSM - add bpf_for_each_map_elem() helper, allowing programs to walk all map elements in a more robust and easier to verify fashion - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT redirection - lpm: add support for batched ops in LPM trie - add BTF_KIND_FLOAT support - mostly to allow use of BTF on s390 which has floats in its headers files - improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers - libbpf, bpftool: support static linking of BPF ELF files - improve support for encapsulation of L2 packets - xdp: restructure redirect actions to avoid a runtime lookup, improving performance by 4-8% in microbenchmarks - xsk: build skb by page (aka generic zerocopy xmit) - improve performance of software AF_XDP path by 33% for devices which don't need headers in the linear skb part (e.g. virtio) - nexthop: resilient next-hop groups - improve path stability on next-hops group changes (incl. offload for mlxsw) - ipv6: segment routing: add support for IPv4 decapsulation - icmp: add support for RFC 8335 extended PROBE messages - inet: use bigger hash table for IP ID generation - tcp: deal better with delayed TX completions - make sure we don't give up on fast TCP retransmissions only because driver is slow in reporting that it completed transmitting the original - tcp: reorder tcp_congestion_ops for better cache locality - mptcp: - add sockopt support for common TCP options - add support for common TCP msg flags - include multiple address ids in RM_ADDR - add reset option support for resetting one subflow - udp: GRO L4 improvements - improve 'forward' / 'frag_list' co-existence with UDP tunnel GRO, allowing the first to take place correctly even for encapsulated UDP traffic - micro-optimize dev_gro_receive() and flow dissection, avoid retpoline overhead on VLAN and TEB GRO - use less memory for sysctls, add a new sysctl type, to allow using u8 instead of "int" and "long" and shrink networking sysctls - veth: allow GRO without XDP - this allows aggregating UDP packets before handing them off to routing, bridge, OvS, etc. 
- allow specifing ifindex when device is moved to another namespace - netfilter: - nft_socket: add support for cgroupsv2 - nftables: add catch-all set element - special element used to define a default action in case normal lookup missed - use net_generic infra in many modules to avoid allocating per-ns memory unnecessarily - xps: improve the xps handling to avoid potential out-of-bound accesses and use-after-free when XPS change race with other re-configuration under traffic - add a config knob to turn off per-cpu netdev refcnt to catch underflows in testing Device APIs: - add WWAN subsystem to organize the WWAN interfaces better and hopefully start driving towards more unified and vendor- independent APIs - ethtool: - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt support) - allow network drivers to dump arbitrary SFP EEPROM data, current offset+length API was a poor fit for modern SFP which define EEPROM in terms of pages (incl. mlx5 support) - act_police, flow_offload: add support for packet-per-second policing (incl. offload for nfp) - psample: add additional metadata attributes like transit delay for packets sampled from switch HW (and corresponding egress and policy-based sampling in the mlxsw driver) - dsa: improve support for sandwiched LAGs with bridge and DSA - netfilter: - flowtable: use direct xmit in topologies with IP forwarding, bridging, vlans etc. - nftables: counter hardware offload support - Bluetooth: - improvements for firmware download w/ Intel devices - add support for reading AOSP vendor capabilities - add support for virtio transport driver - mac80211: - allow concurrent monitor iface and ethernet rx decap - set priority and queue mapping for injected frames - phy: add support for Clause-45 PHY Loopback - pci/iov: add sysfs MSI-X vector assignment interface to distribute MSI-X resources to VFs (incl. 
mlx5 support) New hardware/drivers: - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit interfaces. - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and BCM63xx switches - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches - ath11k: support for QCN9074 a 802.11ax device - Bluetooth: Broadcom BCM4330 and BMC4334 - phy: Marvell 88X2222 transceiver support - mdio: add BCM6368 MDIO mux bus controller - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips - mana: driver for Microsoft Azure Network Adapter (MANA) - Actions Semi Owl Ethernet MAC - can: driver for ETAS ES58X CAN/USB interfaces Pure driver changes: - add XDP support to: enetc, igc, stmmac - add AF_XDP support to: stmmac - virtio: - page_to_skb() use build_skb when there's sufficient tailroom (21% improvement for 1000B UDP frames) - support XDP even without dedicated Tx queues - share the Tx queues with the stack when necessary - mlx5: - flow rules: add support for mirroring with conntrack, matching on ICMP, GTP, flex filters and more - support packet sampling with flow offloads - persist uplink representor netdev across eswitch mode changes - allow coexistence of CQE compression and HW time-stamping - add ethtool extended link error state reporting - ice, iavf: support flow filters, UDP Segmentation Offload - dpaa2-switch: - move the driver out of staging - add spanning tree (STP) support - add rx copybreak support - add tc flower hardware offload on ingress traffic - ionic: - implement Rx page reuse - support HW PTP time-stamping - octeon: support TC hardware offloads - flower matching on ingress and egress ratelimitting. 
- stmmac: - add RX frame steering based on VLAN priority in tc flower - support frame preemption (FPE) - intel: add cross time-stamping freq difference adjustment - ocelot: - support forwarding of MRP frames in HW - support multiple bridges - support PTP Sync one-step timestamping - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like learning, flooding etc. - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350, SC7280 SoCs) - mt7601u: enable TDLS support - mt76: - add support for 802.3 rx frames (mt7915/mt7615) - mt7915 flash pre-calibration support - mt7921/mt7663 runtime power management fixes" * tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits) net: selftest: fix build issue if INET is disabled net: netrom: nr_in: Remove redundant assignment to ns net: tun: Remove redundant assignment to ret net: phy: marvell: add downshift support for M88E1240 net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255 net/sched: act_ct: Remove redundant ct get and check icmp: standardize naming of RFC 8335 PROBE constants bpf, selftests: Update array map tests for per-cpu batched ops bpf: Add batched ops support for percpu array bpf: Implement formatted output helpers with bstr_printf seq_file: Add a seq_bprintf function sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues net:nfc:digital: Fix a double free in digital_tg_recv_dep_req net: fix a concurrency bug in l2tp_tunnel_register() net/smc: Remove redundant assignment to rc mpls: Remove redundant assignment to err llc2: Remove redundant assignment to rc net/tls: Remove redundant initialization of record rds: Remove redundant assignment to nr_sig dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0 ...	2021-04-29 11:57:23 -07:00
Ingo Molnar	d0d252b8ca	Merge tag 'v5.12-rc8' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-04-20 10:13:58 +02:00
Paolo Abeni	17c3df7078	skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()" the commit `1ddc3229ad` ("skbuff: remove some unnecessary operation in skb_segment_list()") introduces an issue very similar to the one already fixed by commit `53475c5dd8` ("net: fix use-after-free when UDP GRO with shared fraglist"). If the GSO skb goes though skb_clone() and pskb_expand_head() before entering skb_segment_list(), the latter will unshare the frag_list skbs and will release the old list. With the reverted commit in place, when skb_segment_list() completes, skb->next points to the just released list, and later on the kernel will hit UaF. Note that since commit `e0e3070a9b` ("udp: properly complete L4 GRO over UDP tunnel packet") the critical scenario can be reproduced also receiving UDP over vxlan traffic with: NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink Attaching a packet socket to the NIC will cause skb_clone() and the tunnel decapsulation will call pskb_expand_head(). Fixes: `1ddc3229ad` ("skbuff: remove some unnecessary operation in skb_segment_list()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-14 13:54:08 -07:00
Cong Wang	0739cd28f2	net: Introduce skb_send_sock() for sock_map We only have skb_send_sock_locked() which requires callers to use lock_sock(). Introduce a variant skb_send_sock() which locks on its own, callers do not need to lock it any more. This will save us from adding a ->sendmsg_locked for each protocol. To reuse the code, pass function pointers to __skb_send_sock() and build skb_send_sock() and skb_send_sock_locked() on top. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com	2021-04-01 10:56:13 -07:00
Yunsheng Lin	1ddc3229ad	skbuff: remove some unnecessary operation in skb_segment_list() gro list uses skb_shinfo(skb)->frag_list to link two skbs together, and NAPI_GRO_CB(p)->last->next is used when there are more skbs, see skb_gro_receive_list(). gso expects that each segmented skb is linked together using skb->next, so only the first skb->next needs to be set to skb_shinfo(skb)->frag_list when doing gso list segmentation. For the same reason, nskb->next does not need to be set to list_skb before jumping to the error handling, because nskb->next already points to list_skb. And nskb is also the last skb at the end of the loop, so remove the tail variable and use nskb instead. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
Sebastian Andrzej Siewior	183f47fcaa	kcov: Remove kcov include from sched.h and move it to its users. The recent addition of in_serving_softirq() to kcov.h results in a compile failure on PREEMPT_RT because it requires task_struct::softirq_disable_cnt. This is not available if kcov.h is included from sched.h. There is no need to include kcov.h from sched.h. All but the net/ user already include the kcov header file. Move the include of the kcov.h header from sched.h to its users. Additionally include sched.h from kcov.h to ensure that everything task_struct related is available. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Andrey Konovalov <andreyknvl@google.com> Link: https://lkml.kernel.org/r/20210218173124.iy5iyqv3a4oia4vv@linutronix.de	2021-03-06 12:40:21 +01:00
Willem de Bruijn	b228c9b058	net: expand textsearch ts_state to fit skb_seq_state The referenced commit expands the skb_seq_state used by skb_find_text with a 4B frag_off field, growing it to 48B. This exceeds container ts_state->cb, causing a stack corruption: [ 73.238353] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: skb_find_text+0xc5/0xd0 [ 73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4 [ 73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 73.260078] Call Trace: [ 73.264677] dump_stack+0x57/0x6a [ 73.267866] panic+0xf6/0x2b7 [ 73.270578] ? skb_find_text+0xc5/0xd0 [ 73.273964] __stack_chk_fail+0x10/0x10 [ 73.277491] skb_find_text+0xc5/0xd0 [ 73.280727] string_mt+0x1f/0x30 [ 73.283639] ipt_do_table+0x214/0x410 The struct is passed between skb_find_text and its callbacks skb_prepare_seq_read, skb_seq_read and skb_abort_seq_read through the textsearch interface using TS_SKB_CB. I assumed that this mapped to skb->cb like other .._SKB_CB wrappers. skb->cb is 48B. But it maps to ts_state->cb, which is only 40B. skb->cb was increased from 40B to 48B after ts_state was introduced, in commit 3e3850e989 ("[NETFILTER]: Fix xfrm lookup in ip_route_me_harder/ip6_route_me_harder"). Increase ts_state.cb[] to 48 to fit the struct. Also add a BUILD_BUG_ON to avoid a repeat. The alternative is to directly add a dependency from textsearch onto linux/skbuff.h, but I think the intent is textsearch to have no such dependencies on its callers. Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911 Fixes: 97550f6fa5 ("net: compound page support in skb_seq_read") Reported-by: Kris Karas <bugs-a17@moonlit-rail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-01 15:25:24 -08:00
Alexander Lobakin	9243adfc31	skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:04 -08:00
Alexander Lobakin	cfb8ec6595	skbuff: allow to use NAPI cache from __napi_alloc_skb() {,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here for obtaining skbuff_heads from NAPI cache instead of inplace allocations. This includes both kmalloc and page frag paths. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:04 -08:00
Alexander Lobakin	d13612b58e	skbuff: allow to optionally use NAPI cache from __alloc_skb() Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of inplace allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone or from a distant node. Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:04 -08:00
Alexander Lobakin	f450d539c0	skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:04 -08:00
Alexander Lobakin	50fad4b543	skbuff: move NAPI cache declarations upper in the file NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:03 -08:00
Alexander Lobakin	fec6e49b63	skbuff: remove __kfree_skb_flush() This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:03 -08:00
Alexander Lobakin	f9d6725bf4	skbuff: use __build_skb_around() in __alloc_skb() Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:03 -08:00
Alexander Lobakin	df1ae022af	skbuff: simplify __alloc_skb() a bit Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions and remove totally redundant goto. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:03 -08:00

930 Commits