Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Ivan Vecera	d5968be5cd	net: add atomic_long_t to net_device_stats fields JIRA: https://issues.redhat.com/browse/RHEL-862 commit 6c1c5097781f563b70a81683ea6fdac21637573b Author: Eric Dumazet <edumazet@google.com> Date: Tue Nov 15 08:53:55 2022 +0000 net: add atomic_long_t to net_device_stats fields Long standing KCSAN issues are caused by data-race around some dev->stats changes. Most performance critical paths already use per-cpu variables, or per-queue ones. It is reasonable (and more correct) to use atomic operations for the slow paths. This patch adds an union for each field of net_device_stats, so that we can convert paths that are not yet protected by a spinlock or a mutex. netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64 Note that the memcpy() we were using on 64bit arches had no provision to avoid load-tearing, while atomic_long_read() is providing the needed protection at no cost. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-12-05 15:29:59 +01:00
Scott Weaver	f51e07d91d	Merge: CNB94: xsk: Multi-buffer support MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3310 JIRA: https://issues.redhat.com/browse/RHEL-15250 Tested: Using attached self-tests [Results in JIRA] The series adds support for multi-buffer to XSK. It is based on upstream series `3226e3139dfe ("Merge branch 'xsk-multi-buffer-support'")` and contains also commits from upstream series `34e78bab67c5 ("Merge branch 'seltests/xsk: prepare for AF_XDP multi-buffer testing'")` to make attached self-tests applicable. Commits: ``` 0c5f48599bed ("xsk: Simplify xp_aligned_validate_desc implementation") f2f167583601 ("xsk: Remove unused xsk_buff_discard") e2fa5c2068fb ("xsk: Remove unused inline function xsk_buff_discard()") 63a64a56bc3f ("xsk: prepare 'options' in xdp_desc for multi-buffer use") 81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket") 556444c4e683 ("xsk: prepare both copy and zero-copy modes to co-exist") faa91b839b09 ("xsk: move xdp_buff's data length check to xsk_rcv_check") 804627751b42 ("xsk: add support for AF_XDP multi-buffer on Rx path") b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path") 1b725b0c8163 ("xsk: allow core/drivers to test EOP bit") cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") 07428da9e25a ("xsk: discard zero length descriptors in Tx path") 13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags") 24ea50127ecf ("xsk: support mbuf on ZC RX") d5581966040f ("xsk: support ZC Tx multi-buffer in batch API") 49ca37d0d825 ("xsk: add multi-buffer documentation") 9a321fd3308e ("selftests/xsk: add xdp populate metadata test") 68e7322142f5 ("selftests: xsk: Deflakify STATS_RX_DROPPED test") 7a2050df244e ("selftests: xsk: Use correct UMEM size in testapp_invalid_desc") ccd1b2933f8c ("selftests: xsk: Add test case for packets at end of UMEM") c0801598e543 ("selftests: xsk: Add test 
UNALIGNED_INV_DESC_4K1_FRAME_SIZE") d2e541494935 ("selftests/xsk: do not change XDP program when not necessary") df82d2e89c41 ("selftests/xsk: generate simpler packets with variable length") feb973a9094f ("selftests/xsk: add varying payload pattern within packet") 7a8a6762822a ("selftests/xsk: dump packet at error") 69fc03d220a3 ("selftests/xsk: add packet iterator for tx to packet stream") d9f6d9709f87 ("selftests/xsk: store offset in pkt instead of addr") 041b68f688a3 ("selftests/xsx: test for huge pages only once") 86e41755b432 ("selftests/xsk: populate fill ring based on frags needed") 2f6eae0df1a8 ("selftests/xsk: generate data for multi-buffer packets") 7cd6df4f5ec2 ("selftests/xsk: adjust packet pacing for multi-buffer support") 17f1034dd76d ("selftests/xsk: transmit and receive multi-buffer packets") f540d44e05cf ("selftests/xsk: add basic multi-buffer test") 1005a226da9a ("selftests/xsk: add unaligned mode test for multi-buffer") 697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer") f80ddbec4762 ("selftests/xsk: add metadata copy test for multi-buff") 807bf4da2049 ("selftests/xsk: add test for too many frags") 3666bccab43a ("selftests/xsk: reset NIC settings to default after running test suite") d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem") 9d0a67b9d42c ("xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()") a097627dcadd ("net: add missing net_device::xdp_zc_max_segs description") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	2023-11-29 14:08:06 -05:00
Scott Weaver	971351c941	Merge: CNB94: net: add check for current MAC address in dev_set_mac_address MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3398 JIRA: https://issues.redhat.com/browse/RHEL-16986 JIRA: https://issues.redhat.com/browse/RHEL-6368 This prevents network drivers' .ndo_set_mac_address method from being called when the MAC address is already the current one. There are drivers that more or less assume that this is how the network core already behaves. For example, iavf will send a virtchnl message to the PF requesting to add the new address and then a message to remove the old address. This logic is broken if old and new are the same address. Tested: I used the reproducer steps from RHEL-6368, with VFs on Intel E810. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	2023-11-28 10:54:46 -05:00
Michal Schmidt	2c3e95de5a	net: fix net device address assign type JIRA: https://issues.redhat.com/browse/RHEL-16986 JIRA: https://issues.redhat.com/browse/RHEL-6368 commit 0ec92a8f56ff07237dbe8af7c7a72aba7f957baf Author: Piotr Gardocki <piotrx.gardocki@intel.com> Date: Wed Jun 21 15:21:06 2023 +0200 net: fix net device address assign type Commit ad72c4a06acc introduced optimization to return from function quickly if the MAC address is not changing at all. It was reported that such change causes dev->addr_assign_type to not change to NET_ADDR_SET from _PERM or _RANDOM. Restore the old behavior and skip only call to ndo_set_mac_address. Fixes: ad72c4a06acc ("net: add check for current MAC address in dev_set_mac_address") Reported-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230621132106.991342-1-piotrx.gardocki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Michal Schmidt <mschmidt@redhat.com>	2023-11-21 23:16:06 +01:00
Michal Schmidt	37100466e2	net: add check for current MAC address in dev_set_mac_address JIRA: https://issues.redhat.com/browse/RHEL-16986 JIRA: https://issues.redhat.com/browse/RHEL-6368 commit ad72c4a06acc6762e84994ac2f722da7a07df34e Author: Piotr Gardocki <piotrx.gardocki@intel.com> Date: Wed Jun 14 16:53:00 2023 +0200 net: add check for current MAC address in dev_set_mac_address In some cases it is possible for kernel to come with request to change primary MAC address to the address that is already set on the given interface. Add proper check to return fast from the function in these cases. An example of such case is adding an interface to bonding channel in balance-alb mode: modprobe bonding mode=balance-alb miimon=100 max_bonds=1 ip link set bond0 up ifenslave bond0 <eth> Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Michal Schmidt <mschmidt@redhat.com>	2023-11-21 23:16:06 +01:00
Antoine Tenart	b2c4833a40	net: skbuff: update and rename __kfree_skb_defer() JIRA: https://issues.redhat.com/browse/RHEL-14554 Upstream Status: linux.git commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Apr 19 19:00:05 2023 -0700 net: skbuff: update and rename __kfree_skb_defer() __kfree_skb_defer() uses the old naming where "defer" meant slab bulk free/alloc APIs. In the meantime we also made __kfree_skb_defer() feed the per-NAPI skb cache, which implies bulk APIs. So take away the 'defer' and add 'napi'. While at it add a drop reason. This only matters on the tx_action path, if the skb has a frag_list. But getting rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net benefit so why not. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-11-10 17:40:29 +01:00
Scott Weaver	6cf5659031	Merge: CNB94: page_pool: allow caching from safely localized NAPI MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196 JIRA: https://issues.redhat.com/browse/RHEL-12613 Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644] Commits: ``` 4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c") eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk") f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list") 9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list") b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths") 8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI") dd64b232deb8 ("page_pool: unlink from napi during destroy") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	2023-11-09 07:22:35 -05:00
Ivan Vecera	96ba8afe11	xsk: add new netlink attribute dedicated for ZC max frags JIRA: https://issues.redhat.com/browse/RHEL-15250 commit 13ce2daa259a3bfbc9a5aeeee8b9a87058703731 Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Date: Wed Jul 19 15:24:07 2023 +0200 xsk: add new netlink attribute dedicated for ZC max frags Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will carry maximum fragments that underlying ZC driver is able to handle on TX side. It is going to be included in netlink response only when driver supports ZC. Any value higher than 1 implies multi-buffer ZC support on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-11-01 14:56:57 +01:00
Ivan Vecera	d80ce17d20	page_pool: allow caching from safely localized NAPI JIRA: https://issues.redhat.com/browse/RHEL-12613 Conflicts: - simple context conflict in net/core/dev.c due to absence of commit 8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ") that is out of scope of this series commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784 Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Apr 12 21:26:04 2023 -0700 page_pool: allow caching from safely localized NAPI Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). 
The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-31 15:09:26 +01:00
Scott Weaver	d05495aca0	Merge: CNB94: tc: update tc subsystem to the upstream v6.5 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3067 JIRA: https://issues.redhat.com/browse/RHEL-1773 Depends: https://issues.redhat.com/browse/RHEL-860 Depends: https://issues.redhat.com/browse/RHEL-3646 Update TC (net/sched) to the upstream v6.5 Omitted-fix: cad7526f33ce ("net: dsa: ocelot: unlock on error in vsc9959_qos_port_tas_set()") Not needed, DSA as well as ocelot driver is not enabled/supported in RHEL Commits: ``` 1b808993e194 ("flow_dissector: fix false-positive __read_overflow2_field() warning") f743f16c548b ("treewide: use get_random_{u8,u16}() when possible, part 2") 7e3cf0843fe5 ("treewide: use get_random_{u8,u16}() when possible, part 1") 8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function") 62423bd2d2e2 ("net: sched: remove qdisc_watchdog->last_expires") c66b2111c9c9 ("selftests: tc-testing: add tests for action binding") f5fca219ad45 ("net: do not use skb_mac_header() in qdisc_pkt_len_init()") e495a9673caf ("sch_cake: do not use skb_mac_header() in cake_overhead()") b3be94885af4 ("net/sched: remove two skb_mac_header() uses") fcb3a4653bc5 ("net/sched: act_api: use the correct TCA_ACT attributes in dump") 4170f0ef582c ("fix typos in net/sched/) 8b0f256530d9 ("net/sched: sch_mqprio: use netlink payload helpers") 3dd0c16ec93e ("net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS") 57f21bf85400 ("net/sched: mqprio: add extack to mqprio_parse_nlattr()") ab277d2084ba ("net/sched: mqprio: add an extack message to mqprio_parse_opt()") c54876cd5961 ("net/sched: pass netlink extack to mqprio and taprio offload") f62af20bed2d ("net/sched: mqprio: allow per-TC user input of FP adminStatus") a721c3e54b80 ("net/sched: taprio: allow per-TC user input of FP adminStatus") 8c966a10eb84 ("flow_dissector: Address kdoc warnings") 54e906f1639e ("selftests: forwarding: sch_tbf_*: Add a pre-run hook") 
2f0f9465ad9f ("net: sched: Print msecs when transmit queue time out") 5036034572b7 ("net/sched: act_pedit: use NLA_POLICY for parsing 'ex' keys") 0c83c5210e18 ("net/sched: act_pedit: use extack in 'ex' parsing errors") e1201bc781c2 ("net/sched: act_pedit: check static offsets a priori") 577140180ba2 ("net/sched: act_pedit: remove extra check for key type") e3c9673e2f6e ("net/sched: act_pedit: rate limit datapath messages") 807cfded92b0 ("net/sched: sch_htb: use extack on errors messages") c69a9b023f65 ("net/sched: sch_qfq: use extack on errors messages") 25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters") 7eb060a51a3b ("selftests: tc-testing: add more tests for sch_qfq") 1b483d9f5805 ("net/sched: act_pedit: free pedit keys on bail from offset check") 526f28bd0fbd ("net/sched: act_mirred: Add carrier check") 12e7789ad5b4 ("sch_htb: Allow HTB priority parameter in offload mode") c7cfbd115001 ("net/sched: sch_ingress: Only create under TC_H_INGRESS") 5eeebfe6c493 ("net/sched: sch_clsact: Only create under TC_H_CLSACT") f85fa45d4a94 ("net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs") 9de95df5d15b ("net/sched: Prohibit regrafting ingress or clsact Qdiscs") 7b4858df3bf7 ("skbuff: bridge: Add layer 2 miss indication") d5ccfd90df7f ("flow_dissector: Dissect layer 2 miss from tc skb extension") 1a432018c0cd ("net/sched: flower: Allow matching on layer 2 miss") f4356947f029 ("flow_offload: Reject matching on layer 2 miss") 8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases") dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()") 2d800bc500fb ("net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum") 6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters") a395b8d1c7c3 ("selftests/tc-testing: replace mq with invalid parent ID") 8cde87b007da ("net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINE") cd2b8113c2e8 
("net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values") d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping") 886bc7d6ed33 ("net: sched: move rtm_tca_policy declaration to include file") 682881ee45c8 ("net: sched: act_police: fix sparse errors in tcf_police_dump()") 6c02568fd1ae ("net/sched: act_pedit: Parse L3 Header for L4 offset") 26e35370b976 ("net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy") 2b84960fc5dd ("net/sched: taprio: report class offload stats per TXQ, not per TC") d7ad70b5ef5a ("net: flow_dissector: add support for cfm packets") 7cfffd5fed3e ("net: flower: add support for matching cfm fields") 1668a55a73f5 ("selftests: net: add tc flower cfm test") c29e012eae29 ("selftests: forwarding: Fix layer 2 miss test syntax") aef6e908b542 ("selftests/tc-testing: Fix Error: Specified qdisc kind is unknown.") b849c566ee9c ("selftests/tc-testing: Fix Error: failed to find target LOG") b39d8c41c7a8 ("selftests/tc-testing: Fix SFB db test") 11b8b2e70a9b ("selftests/tc-testing: Remove configs that no longer exist") 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple") 2d5f6a8d7aef ("net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs") 84ad0af0bccd ("net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting") e16ad981e2a1 ("net: sched: Remove unused qdisc_l2t()") ca4fa8743537 ("selftests: tc-testing: add one test for flushing explicitly created chain") b4ee93380b3c ("net/sched: act_ipt: add sanity checks on table name and hook locations") b2dc32dcba08 ("net/sched: act_ipt: add sanity checks on skb before calling target") 93d75d475c5d ("net/sched: act_ipt: zero skb->cb before calling target") 30c45b5361d3 ("net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX") 989b52cdc849 ("net: sched: Replace strlcpy with strscpy") d3f87278bcb8 ("net/sched: flower: Ensure both minimum and maximum ports are specified") 150e33e62c1f ("net/sched: make psched_mtu() RTNL-less 
safe") 158810b261d0 ("net/sched: sch_qfq: reintroduce lmax bound check for MTU") c5a06fdc618d ("selftests: tc-testing: add tests for qfq mtu sanity check") 3e337087c3b5 ("net/sched: sch_qfq: account for stab overhead in qfq_enqueue") 137f6219da59 ("selftests: tc-testing: add test for qfq with stab overhead") d1cca974548d ("pie: fix kernel-doc notation warning") b3d0e0489430 ("net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms") 9cb36faedeaf ("net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode") e8d3d78c19be ("net: sched: cls_u32: Undo refcount decrement in case update failed") 26a22194927e ("net: sched: cls_bpf: Undo tcf_bind_filter in case of an error") ac177a330077 ("net: sched: cls_flower: Undo tcf_bind_filter in case of an error") fda05798c22a ("selftests: tc: set timeout to 15 minutes") 719b4774a8cb ("selftests: tc: add 'ct' action kconfig dep") 031c99e71fed ("selftests: tc: add ConnTrack procfs kconfig") 4914109a8e1e ("netfilter: allow exp not to be removed in nf_ct_find_expectation") 76622ced50a1 ("net: sched: set IPS_CONFIRMED in tmpl status only when commit is set in act_ct") 8c8b73320805 ("openvswitch: set IPS_CONFIRMED in tmpl status only when commit is set in conntrack") 9fe63d5f1da9 ("sch_htb: Allow HTB quantum parameter in offload mode") 6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64") 4d50e50045aa ("net: flower: fix stack-out-of-bounds in fl_set_key_cfm()") e68409db9953 ("net: sched: cls_u32: Fix match key mis-addressing") e739718444f7 ("net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.") 21a72166abb9 ("selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Approved-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	
2023-10-24 13:29:05 -04:00
Scott Weaver	03206d751a	Merge: CNB94: net: move gso declarations and functions to their own files MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3198 JIRA: https://issues.redhat.com/browse/RHEL-12679 Tested: Just built... no functional change Commits: ``` d457a0e329b0 ("net: move gso declarations and functions to their own files") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Sabrina Dubroca <sdubroca@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	2023-10-19 10:36:22 -04:00
Scott Weaver	ec70982f69	Merge: ice: Enable DPLL support MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2961 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515 This feature request is to add and enable the DPLL subsystem and DPLL support in the ice driver Signed-off-by: Petr Oros <poros@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Scott Weaver <scweaver@redhat.com>	2023-10-19 10:36:20 -04:00
Ivan Vecera	92e020fb45	net: sched: add rcu annotations around qdisc->qdisc_sleeping JIRA: https://issues.redhat.com/browse/RHEL-1773 Conflicts: - resolved conflict in net/sched/sch_taprio.c the same way like in 449f6bc17a51 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") commit d636fc5dd692c8f4e00ae6e0359c0eceeb5d9bdb Author: Eric Dumazet <edumazet@google.com> Date: Tue Jun 6 11:19:29 2023 +0000 net: sched: add rcu annotations around qdisc->qdisc_sleeping syzbot reported a race around qdisc->qdisc_sleeping [1] It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sleeping. [1] BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1: qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331 __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174 tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547 rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0: dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115 qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103 tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693 rtnetlink_rcv_msg+0x807/0x8c0 
net/core/rtnetlink.c:6395 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023 Fixes: `3a7d0d07a3` ("net: sched: extend Qdisc with rcu") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-13 09:03:10 +02:00
Ivan Vecera	f43c4f5429	net: do not use skb_mac_header() in qdisc_pkt_len_init() JIRA: https://issues.redhat.com/browse/RHEL-1773 commit f5fca219ad4548bc45f0221f9857ad22cb8136a1 Author: Eric Dumazet <edumazet@google.com> Date: Tue Mar 21 16:45:17 2023 +0000 net: do not use skb_mac_header() in qdisc_pkt_len_init() We want to remove our use of skb_mac_header() in tx paths, eg remove skb_reset_mac_header() from __dev_queue_xmit(). Idea is that ndo_start_xmit() can get the mac header simply looking at skb->data. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-13 09:03:06 +02:00
Ivan Vecera	497f645693	net: move gso declarations and functions to their own files JIRA: https://issues.redhat.com/browse/RHEL-12679 commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jun 8 19:17:37 2023 +0000 net: move gso declarations and functions to their own files Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-10-11 13:35:27 +02:00
Petr Oros	104234d3d2	netdev: expose DPLL pin handle for netdevice Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515 Upstream commit(s): commit 5f18426928800c59fb0f9bc8fb0c182bb6f5ee24 Author: Jiri Pirko <jiri@nvidia.com> Date: Wed Sep 13 21:49:39 2023 +0100 netdev: expose DPLL pin handle for netdevice In case netdevice represents a SyncE port, the user needs to understand the connection between netdevice and associated DPLL pin. There might be multiple netdevices pointing to the same pin, in case of VF/SF implementation. Add an IFLA Netlink attribute to nest the DPLL pin handle, similar to how it is implemented for devlink port. Add a struct dpll_pin pointer to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set() and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear the DPLL pin relationship to netdev. Note that during the lifetime of struct dpll_pin the pin handle does not change. Therefore it is safe to access it locklessly. It is the driver's responsibility to call netdev_dpll_pin_clear() before dpll_pin_put(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2023-09-18 15:13:24 +02:00
Ivan Vecera	3cc9e8b28b	random32: use real rng for non-deterministic randomness JIRA: https://issues.redhat.com/browse/RHEL-3646 commit d4150779e60fb6c49be25572596b2cdfc5d46a09 Author: Jason A. Donenfeld <Jason@zx2c4.com> Date: Wed May 11 16:11:29 2022 +0200 random32: use real rng for non-deterministic randomness random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further need be). However, when doing real benchmarks of the net functions that actually use these random numbers, the mean cycles actually decreased slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. 
Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-09-13 18:39:29 +02:00
Jan Stancek	645597c064	Merge: net: core: stable backport form upstream for 9.3 phase 2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2731 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529 Tested: LNST, Tier1 A bunch of fixes for relevant issues. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-07-07 07:38:20 +02:00
Jan Stancek	e341c7e709	Merge: bpf, xdp: update to 6.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583 Rebase bpf and xdp to 6.3. Bugzilla: https://bugzilla.redhat.com/2178930 Signed-off-by: Viktor Malik <vmalik@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Jason Wang <jasowang@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Jan Stancek <jstancek@redhat.com> Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-06-28 07:52:45 +02:00
Paolo Abeni	e4256bf256	net: add vlan_get_protocol_and_depth() helper Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529 Tested: LNST, Tier1 Upstream commit: commit 4063384ef762cc5946fc7a3f89879e76c6ec51e2 Author: Eric Dumazet <edumazet@google.com> Date: Tue May 9 13:18:57 2023 +0000 net: add vlan_get_protocol_and_depth() helper Before blamed commit, pskb_may_pull() was used instead of skb_header_pointer() in __vlan_get_protocol() and friends. Few callers depended on skb->head being populated with MAC header, syzbot caught one of them (skb_mac_gso_segment()) Add vlan_get_protocol_and_depth() to make the intent clearer and use it where sensible. This is a more generic fix than commit e9d3f80935b6 ("net/af_packet: make sure to pull mac header") which was dealing with a similar issue. kernel BUG at include/linux/skbuff.h:2655 ! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023 RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline] RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136 Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286 RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9 R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012 R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7 FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> [<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419 [<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline] [<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725 [<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313 [<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029 [<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline] [<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142 [<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline] [<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline] [<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139 [<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline] [<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline] [<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147 [<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80 [<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: `469aceddfa` ("vlan: consolidate VLAN parsing code and limit max parsing depth") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-06-26 16:58:59 +02:00
Jan Stancek	9d37206873	Merge: net: sync skb free reasons MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2627 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Did not include commit 071c0fc6fb91 ("net: extend drop reasons for multiple subsystems") as it would be appropriate to backport it in its own MR, it would have no user for now, and it's not clear to me how trace_kfree_skb deals with non-core free reasons once applied. Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Approved-by: Xin Long <lxin@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-06-14 13:27:31 +02:00
Felix Maurer	b576afd91a	netdev-genl: create a simple family for netdev stuff Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930 Conflicts: - include/linux/netdevice.h: Context difference in includes due to missing 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.") - net/core/Makefile: Context difference due to missing 2c193f2cb110 ("net: kunit: add a test for dev_addr_lists") commit d3d854fd6a1d97157f790604e07f6386e8df8fe4 Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Feb 1 11:24:17 2023 +0100 netdev-genl: create a simple family for netdev stuff Add a Netlink spec-compatible family for netdevs. This is a very simple implementation without much thought going into it. It allows us to reap all the benefits of Netlink specs, one can use the generic client to issue the commands: $ ./cli.py --spec netdev.yaml --dump dev_get [{'ifindex': 1, 'xdp-features': set()}, {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}}, {'ifindex': 3, 'xdp-features': {'rx-sg'}}] the generic python library does not have flags-by-name support, yet, but we also don't have to carry strings in the messages, as user space can get the names from the spec. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2023-06-13 22:45:50 +02:00
Felix Maurer	e630642b6b	bpf: Introduce device-bound XDP programs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930 commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1 Author: Stanislav Fomichev <sdf@google.com> Date: Thu Jan 19 14:15:24 2023 -0800 bpf: Introduce device-bound XDP programs New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way to associate a netdev with a BPF program at load time. netdevsim checks are dropped in favor of generic check in dev_xdp_attach. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2023-06-13 22:45:13 +02:00
Felix Maurer	c0febc32b2	bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930 commit 9d03ebc71a027ca495c60f6e94d3cda81921791f Author: Stanislav Fomichev <sdf@google.com> Date: Thu Jan 19 14:15:21 2023 -0800 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded BPF offloading infra will be reused to implement bound-but-not-offloaded bpf programs. Rename existing helpers for clarity. No functional changes. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2023-06-13 22:45:12 +02:00
Ivan Vecera	1cb324e3cc	net: Remove the obsolte u64_stats_fetch_*_irq() users (net). Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170 Conflicts: * net/netfilter/ipvs/ip_vs_ctl.c - the change was already applied by RHEL commit `914c1e31d9` ("ipvs: use u64_stats_t for the per-cpu counters") * net/core/devlink.c - hunk was applied in different file (net/devlink/leftover.c) commit d120d1a63b2c484d6175873d8ee736a633f74b70 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed Oct 26 15:22:15 2022 +0200 net: Remove the obsolte u64_stats_fetch_*_irq() users (net). Now that the 32bit UP oddity is gone and 32bit uses always a sequence count, there is no need for the fetch_irq() variants anymore. Convert to the regular interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-06-08 13:38:11 +02:00
Ivan Vecera	41bf85273b	net: adopt u64_stats_t in struct pcpu_sw_netstats Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170 commit 9962acefbcb92736c268aafe5f52200948f60f3e Author: Eric Dumazet <edumazet@google.com> Date: Wed Jun 8 08:46:37 2022 -0700 net: adopt u64_stats_t in struct pcpu_sw_netstats As explained in commit `316580b69d` ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2023-06-08 13:37:00 +02:00
Antoine Tenart	f2ed106175	net: remove enum skb_free_reason Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Upstream Status: linux.git commit 40bbae583ec38ea31e728bf42a4ea72bded22ab6 Author: Eric Dumazet <edumazet@google.com> Date: Mon Mar 6 20:43:13 2023 +0000 net: remove enum skb_free_reason enum skb_drop_reason is more generic, we can adopt it instead. Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason(). This means drivers can use more precise drop reasons if they want to. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-06 11:23:26 +02:00
Antoine Tenart	d48044618a	net: add location to trace_consume_skb() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073 Upstream Status: linux.git commit dd1b527831a3ed659afa01b672d8e1f7e6ca95a5 Author: Eric Dumazet <edumazet@google.com> Date: Thu Feb 16 15:47:18 2023 +0000 net: add location to trace_consume_skb() kfree_skb() includes the location, it makes sense to add it to consume_skb() as well. After patch: taskd_EventMana 8602 [004] 420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic swapper 0 [011] 422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc discipline 9141 [043] 423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp swapper 0 [010] 423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv borglet 8672 [014] 425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump swapper 0 [028] 426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action wget 14339 [009] 426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-06-06 11:23:26 +02:00
Jan Stancek	6318ae37c7	Merge: ovs: stable backports for 9.3 phase 1 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2438 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207 Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Andrea Claudi <aclaudi@redhat.com> Approved-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Eelco Chaudron <echaudro@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-06-01 07:25:53 +02:00
Jan Stancek	91e631150d	Merge: Bonding: rebase to linux v6.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2419 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406 Depends: !2418 Signed-off-by: Hangbin Liu <haliu@redhat.com> Approved-by: Andrea Claudi <aclaudi@redhat.com> Approved-by: Xin Long <lxin@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-17 07:47:13 +02:00
Jan Stancek	704d11b087	Merge: enable io_uring MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375 # Merge Request Required Information ## Summary of Changes This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option. ## Approved Development Ticket Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014 Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214 Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation") This is actually just an optimization, and it has non-trivial conflicts which would require additional backports to resolve. Skip it. Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce") This fix is incorrectly tagged. The code that it applies to is not present in our tree. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Approved-by: John Meneghini <jmeneghi@redhat.com> Approved-by: Ming Lei <ming.lei@redhat.com> Approved-by: Maurizio Lombardi <mlombard@redhat.com> Approved-by: Brian Foster <bfoster@redhat.com> Approved-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-05-17 07:47:08 +02:00
Jeff Moyer	2595bc4d80	net: fix kdoc on __dev_queue_xmit() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit be76955dea93fe7ee9e0a6f961a7185290a2417f Author: Jakub Kicinski <kuba@kernel.org> Date: Mon May 9 10:04:12 2022 -0700 net: fix kdoc on __dev_queue_xmit() Commit c526fd8f9f4f21 ("net: inline dev_queue_xmit()") exported __dev_queue_xmit(), now it's being rendered in html docs, triggering: Documentation/networking/kapi:92: net/core/dev.c:4101: WARNING: Missing matching underline for section title overline. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lore.kernel.org/linux-next/20220503073420.6d3f135d@canb.auug.org.au/ Fixes: c526fd8f9f4f21 ("net: inline dev_queue_xmit()") Link: https://lore.kernel.org/r/20220509170412.1069190-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-05-05 15:23:02 -04:00
Paolo Abeni	d0ff450947	net: fix __dev_kfree_skb_any() vs drop monitor Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560 Tested: LNST, Tier1 Upstream commit: commit ac3ad19584b26fae9ac86e4faebe790becc74491 Author: Eric Dumazet <edumazet@google.com> Date: Thu Feb 23 08:38:45 2023 +0000 net: fix __dev_kfree_skb_any() vs drop monitor dev_kfree_skb() is aliased to consume_skb(). When a driver is dropping a packet by calling dev_kfree_skb_any() we should propagate the drop reason instead of pretending the packet was consumed. Note: Now we have enum skb_drop_reason we could remove enum skb_free_reason (for linux-6.4) v2: added an unlikely(), suggested by Yunsheng Lin. Fixes: `e6247027e5` ("net: introduce dev_consume_skb_any()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-05-02 19:07:41 +02:00
Xin Long	2db946b2f7	net: add gso_ipv4_max_size and gro_ipv4_max_size per device Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290 Tested: compile only Conflicts: - move netif_set_gro_max_size() from include/linux/netdevice.h to net/core/dev.h, then make the change, as commit 744d49daf8bd was backported earlier than eac1b93c14d6. netif_set_gro_max_size() missed the opportunity to be moved to net/core/dev.h. - different context in net/core/dev.h, rps_cpumask_housekeeping() is added due to 370ca718fd5e already in RHEL-9. commit 9eefedd58ae1daece2ba907849a44db2941fb4b0 Author: Xin Long <lucien.xin@gmail.com> Date: Sat Jan 28 10:58:38 2023 -0500 net: add gso_ipv4_max_size and gro_ipv4_max_size per device This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch. To not break the old application using "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing will change even if userspace doesn't realize the new netlink attributes. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Xin Long <lxin@redhat.com>	2023-05-02 10:36:11 -04:00
Jeff Moyer	82f65d6ce4	net: inline dev_queue_xmit() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237 commit c526fd8f9f4f21cb83c0b1c9a1ee9c0ac9be9e2e Author: Pavel Begunkov <asml.silence@gmail.com> Date: Thu Apr 28 11:58:46 2022 +0100 net: inline dev_queue_xmit() Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small proxy functions doing nothing but redirecting the control flow to __dev_queue_xmit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-04-29 07:56:02 -04:00
Antoine Tenart	af98894a33	net: openvswitch: fix race on port output Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207 Upstream Status: linux.git commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8 Author: Felix Huettner <felix.huettner@mail.schwarz> Date: Wed Apr 5 07:53:41 2023 +0000 net: openvswitch: fix race on port output assume the following setup on a single machine: 1. An openvswitch instance with one bridge and default flows 2. two network namespaces "server" and "client" 3. two ovs interfaces "server" and "client" on the bridge 4. for each ovs interface a veth pair with a matching name and 32 rx and tx queues 5. move the ends of the veth pairs to the respective network namespaces 6. assign ip addresses to each of the veth ends in the namespaces (needs to be the same subnet) 7. start some http server on the server network namespace 8. test if a client in the client namespace can reach the http server when following the actions below the host has a chance of getting a cpu stuck in a infinite loop: 1. send a large amount of parallel requests to the http server (around 3000 curls should work) 2. in parallel delete the network namespace (do not delete interfaces or stop the server, just kill the namespace) there is a low chance that this will cause the below kernel cpu stuck message. If this does not happen just retry. Below there is also the output of bpftrace for the functions mentioned in the output. The series of events happening here is: 1. the network namespace is deleted calling `unregister_netdevice_many_notify` somewhere in the process 2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and then runs `synchronize_net` 3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER` 4. this is then handled by `dp_device_event` which calls `ovs_netdev_detach_dev` (if a vport is found, which is the case for the veth interface attached to ovs) 5. 
this removes the rx_handlers of the device but does not prevent packages to be sent to the device 6. `dp_device_event` then queues the vport deletion to work in background as a ovs_lock is needed that we do not hold in the unregistration path 7. `unregister_netdevice_many_notify` continues to call `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0 8. port deletion continues (but details are not relevant for this issue) 9. at some future point the background task deletes the vport If after 7. but before 9. a packet is send to the ovs vport (which is not deleted at this point in time) which forwards it to the `dev_queue_xmit` flow even though the device is unregistering. In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is a while loop (if the packet has a rx_queue recorded) that is infinite if `dev->real_num_tx_queues` is zero. To prevent this from happening we update `do_output` to handle devices without carrier the same as if the device is not found (which would be the code path after 9. is done). Additionally we now produce a warning in `skb_tx_hash` if we will hit the infinite loop. 
bpftrace (first word is function name): __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2 ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2 netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2 netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 broken device server: real_num_tx_queues: 0, cpu: 2, pid: 
28024, tid: 28024 ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604 stuck message: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279] Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:netdev_pick_tx+0xf1/0x320 Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01 RSP: 0018:ffffb78b40298820 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000 RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000 R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000 FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0 Call Trace: <IRQ> 
netdev_core_pick_tx+0xa4/0xb0 __dev_queue_xmit+0xf8/0x510 ? __bpf_prog_exit+0x1e/0x30 dev_queue_xmit+0x10/0x20 ovs_vport_send+0xad/0x170 [openvswitch] do_output+0x59/0x180 [openvswitch] do_execute_actions+0xa80/0xaa0 [openvswitch] ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch] ovs_execute_actions+0x4c/0x120 [openvswitch] ovs_dp_process_packet+0xa1/0x200 [openvswitch] ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch] ? ovs_ct_fill_key+0x1d/0x30 [openvswitch] ? ovs_flow_key_extract+0x2db/0x350 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] ? __htab_map_lookup_elem+0x4e/0x60 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714 ? trace_call_bpf+0xc8/0x150 ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? kprobe_perf_func+0x4f/0x2b0 ? __mod_memcg_lruvec_state+0x63/0xe0 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23d/0xf00 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x15/0x60 process_backlog+0x9e/0x170 __napi_poll+0x33/0x180 net_rx_action+0x126/0x280 ? ttwu_do_activate+0x72/0xf0 __do_softirq+0xd9/0x2e7 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0 do_softirq+0x7d/0xb0 </IRQ> <TASK> __local_bh_enable_ip+0x54/0x60 ip_finish_output2+0x191/0x460 __ip_finish_output+0xb7/0x180 ip_finish_output+0x2e/0xc0 ip_output+0x78/0x100 ? __ip_finish_output+0x180/0x180 ip_local_out+0x5e/0x70 __ip_queue_xmit+0x184/0x440 ? tcp_syn_options+0x1f9/0x300 ip_queue_xmit+0x15/0x20 __tcp_transmit_skb+0x910/0x9c0 ? __mod_memcg_state+0x44/0xa0 tcp_connect+0x437/0x4e0 ? ktime_get_with_offset+0x60/0xf0 tcp_v4_connect+0x436/0x530 __inet_stream_connect+0xd4/0x3a0 ? kprobe_perf_func+0x4f/0x2b0 ? aa_sk_perm+0x43/0x1c0 inet_stream_connect+0x3b/0x60 __sys_connect_file+0x63/0x70 __sys_connect+0xa6/0xd0 ? setfl+0x108/0x170 ? 
do_fcntl+0xe8/0x5a0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x5c/0xc0 ? __x64_sys_fcntl+0xa9/0xd0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? do_syscall_64+0x69/0xc0 ? __sys_setsockopt+0xea/0x1e0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? __x64_sys_setsockopt+0x1f/0x30 ? do_syscall_64+0x69/0xc0 ? irqentry_exit+0x1d/0x30 ? exc_page_fault+0x89/0x170 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f7b8101c6a7 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7 RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005 RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410 R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000 </TASK> Fixes: `7f8a436eaa` ("openvswitch: Add conntrack action") Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2023-04-27 16:30:10 +02:00
Jan Stancek	8e94775eed	Merge: CNB: rebase/update devlink for RHEL 9.3 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2191 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273 Tested: selftests, basic devlink features on ice and mlx5 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175249 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175250 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2176150 Update devlink up to v6.3. Signed-off-by: Petr Oros <poros@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Herbert Xu <zxu@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com>	2023-04-27 07:47:22 +02:00
Hangbin Liu	a149ec5e7d	net/core: Allow live renaming when an interface is up Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406 Upstream Status: net.git commit bd039b5ea2a9 commit bd039b5ea2a91ea707ee8539df26456bd5be80af Author: Andy Ren <andy.ren@getcruise.com> Date: Mon Nov 7 09:42:42 2022 -0800 net/core: Allow live renaming when an interface is up Allow a network interface to be renamed when the interface is up. As described in the netconsole documentation [1], when netconsole is used as a built-in, it will bring up the specified interface as soon as possible. As a result, user space will not be able to rename the interface since the kernel disallows renaming of interfaces that are administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set by the kernel. The original solution [2] to this problem was to add a new parameter to the netconsole configuration parameters that allows renaming of the interface used by netconsole while it is administratively up. However, during the discussion that followed, it became apparent that we have no reason to keep the current restriction and instead we should allow user space to rename interfaces regardless of their administrative state: 1. The restriction was put in place over 20 years ago when renaming was only possible via IOCTL and before rtnetlink started notifying user space about such changes like it does today. 2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version 5.2 and no regressions were reported. 3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about the administrative state of interface. Therefore, allow user space to rename running interfaces by removing the restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in possible triage by emitting a message to the kernel log that an interface was renamed while UP. 
[1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/ Signed-off-by: Andy Ren <andy.ren@getcruise.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2023-04-25 15:26:55 +08:00
Petr Oros	59e7861deb	devlink: Fix netdev notifier chain corruption Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273 Conflicts: - adjusted upstream merge conflict which was resolved in 675f176b4dcc2b ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net") Upstream commit(s): commit b20b8aec6ffc07bb547966b356780cd344f20f5b Author: Ido Schimmel <idosch@nvidia.com> Date: Wed Feb 15 09:31:39 2023 +0200 devlink: Fix netdev notifier chain corruption Cited commit changed devlink to register its netdev notifier block on the global netdev notifier chain instead of on the per network namespace one. However, when changing the network namespace of the devlink instance, devlink still tries to unregister its notifier block from the chain of the old namespace and register it on the chain of the new namespace. This results in corruption of the notifier chains, as the same notifier block is registered on two different chains: The global one and the per network namespace one. In turn, this causes other problems such as the inability to dismantle namespaces due to netdev reference count issues. Fix by preventing devlink from moving its notifier block between namespaces. Reproducer: # echo "10 1" > /sys/bus/netdevsim/new_device # ip netns add test123 # devlink dev reload netdevsim/netdevsim10 netns test123 # ip netns del test123 [ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 71.938348] leaked reference. Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com>	2023-04-04 11:12:28 +02:00
Petr Oros	8df3e0fd3b	net: introduce a helper to move notifier block to different namespace Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273 Upstream commit(s): commit 3e52fba03a20234abc65a656cef063a1045d9723 Author: Jiri Pirko <jiri@nvidia.com> Date: Tue Nov 8 14:22:06 2022 +0100 net: introduce a helper to move notifier block to different namespace Currently, net_dev() netdev notifier variant follows the netdev with per-net notifier from namespace to namespace. This is implemented by move_netdevice_notifiers_dev_net() helper. For devlink it is needed to re-register per-net notifier during devlink reload. Introduce a new helper called move_netdevice_notifier_net() and share the unregister/register code with existing move_netdevice_notifiers_dev_net() helper. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Petr Oros <poros@redhat.com>	2023-04-03 14:05:59 +02:00
Petr Oros	afc2a59634	net: devlink: track netdev with devlink_port assigned Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273 Upstream commit(s): commit 02a68a47eadedf95748facfca6ced31fb0181d52 Author: Jiri Pirko <jiri@nvidia.com> Date: Wed Nov 2 17:02:03 2022 +0100 net: devlink: track netdev with devlink_port assigned Currently, ethernet drivers are using devlink_port_type_eth_set() and devlink_port_type_clear() to set devlink port type and link to related netdev. Instead of calling them directly, let the driver use SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let devlink to track it. Note the devlink port pointer is static during the time netdevice is registered. In devlink code, use per-namespace netdev notifier to track the netdevices with devlink_port assigned and change the internal devlink_port type and related type pointer accordingly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Petr Oros <poros@redhat.com>	2023-04-03 10:57:13 +02:00
Íñigo Huguet	3a91b473a8	net: rename reference+tracking helpers Bugzilla: https://bugzilla.redhat.com/2175258 Conflicts: - Removed chunks of unsupported protocol AX.25 - Renamed the functions also in ipvlan. Commit 40b9d1ab63f5 ("ipvlan: hold lower dev to avoid possible use-after-free") was backported out of order so it had to use the old function names. commit d62607c3fe45911b2331fac073355a8c914bbde2 Author: Jakub Kicinski <kuba@kernel.org> Date: Tue Jun 7 21:39:55 2022 -0700 net: rename reference+tracking helpers Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>	2023-03-23 16:19:21 +01:00
Xin Long	3a75ec1506	net: avoid quadratic behavior in netdev_wait_allrefs_any() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612 Tested: compile only Conflicts: - context difference due to cc26c2661fef already in RHEL-9. commit 86213f80da1b1d007721cc22e04b5f5d0da33127 Author: Eric Dumazet <edumazet@google.com> Date: Thu Feb 17 22:54:30 2022 -0800 net: avoid quadratic behavior in netdev_wait_allrefs_any() If the list of devices has N elements, netdev_wait_allrefs_any() is called N times, and linkwatch_forget_dev() is called N*(N-1)/2 times. Fix this by calling linkwatch_forget_dev() only once per device. Fixes: faab39f63c1f ("net: allow out-of-order netdev unregistration") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220218065430.2613262-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Xin Long <lxin@redhat.com>	2023-03-21 17:39:40 -04:00
Xin Long	b1a4490d48	net: allow out-of-order netdev unregistration Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612 Tested: compile only Conflicts: - context difference due to 05e49cfc89e4 already in RHEL-9. commit faab39f63c1fc4bcdf135690f03bd596b578c67e Author: Jakub Kicinski <kuba@kernel.org> Date: Tue Feb 15 14:53:10 2022 -0800 net: allow out-of-order netdev unregistration Sprinkle for each loops to allow netdevices to be unregistered out of order, as their refs are released. This prevents problems caused by dependencies between netdevs which want to release references in their ->priv_destructor. See commit d6ff94afd90b ("vlan: move dev_put into vlan_dev_uninit") for example. Eric has removed the only known ordering requirement in commit c002496babfd ("Merge branch 'ipv6-loopback'") so let's try this and see if anything explodes... Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20220215225310.3679266-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Xin Long <lxin@redhat.com>	2023-03-21 17:39:26 -04:00
Xin Long	bfdcece7f8	net: transition netdev reg state earlier in run_todo Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612 Tested: compile only Conflicts: - context difference due to cc26c2661fef already in RHEL-9. commit ae68db14b6164ce46beffaf35eb7c9bb2f92fee3 Author: Jakub Kicinski <kuba@kernel.org> Date: Tue Feb 15 14:53:09 2022 -0800 net: transition netdev reg state earlier in run_todo In prep for unregistering netdevs out of order move the netdev state validation and change outside of the loop. While at it modernize this code and use WARN() instead of pr_err() + dump_stack(). Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20220215225310.3679266-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Xin Long <lxin@redhat.com>	2023-03-21 17:38:42 -04:00
Herton R. Krzesinski	05d2a7216e	Merge: CNB: net: add netdev_sw_irq_coalesce_default_on() MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1970 Bugzilla: https://bugzilla.redhat.com/2161921 commit d93607082e982223cf92750f2d9039ff365b9d24 Author: Heiner Kallweit <hkallweit1@gmail.com> Date: Wed Nov 30 23:28:26 2022 +0100 net: add netdev_sw_irq_coalesce_default_on() Add a helper for drivers wanting to set SW IRQ coalescing by default. The related sysfs attributes can be used to override the default values. Follow Jakub's suggestion and put this functionality into net core so that drivers wanting to use software interrupt coalescing per default don't have to open-code it. Note that this function needs to be called before the netdevice is registered. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Campbell <dacampbe@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Michal Schmidt <mschmidt@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: Andrea Claudi <aclaudi@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2023-02-08 01:41:42 +00:00
Dan Campbell	bee4544aab	net: add netdev_sw_irq_coalesce_default_on() Bugzilla: https://bugzilla.redhat.com/2161921 commit d93607082e982223cf92750f2d9039ff365b9d24 Author: Heiner Kallweit <hkallweit1@gmail.com> Date: Wed Nov 30 23:28:26 2022 +0100 net: add netdev_sw_irq_coalesce_default_on() Add a helper for drivers wanting to set SW IRQ coalescing by default. The related sysfs attributes can be used to override the default values. Follow Jakub's suggestion and put this functionality into net core so that drivers wanting to use software interrupt coalescing per default don't have to open-code it. Note that this function needs to be called before the netdevice is registered. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Campbell <dacampbe@redhat.com>	2023-01-27 12:28:55 -06:00
Paolo Abeni	af86e36c42	net: Fix return value of qdisc ingress handling on success Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2162711 Tested: vs bz reproducer Upstream commit: commit 672e97ef689a38cb20c2cc6a1814298fea34461e Author: Paul Blakey <paulb@nvidia.com> Date: Tue Oct 18 10:34:38 2022 +0300 net: Fix return value of qdisc ingress handling on success Currently qdisc ingress handling (sch_handle_ingress()) doesn't set a return value and it is left to the old return value of the caller (__netif_receive_skb_core()) which is RX drop, so if the packet is consumed, caller will stop and return this value as if the packet was dropped. This causes a problem in the kernel tcp stack when having a egress tc rule forwarding to a ingress tc rule. The tcp stack sending packets on the device having the egress rule will see the packets as not successfully transmitted (although they actually were), will not advance it's internal state of sent data, and packets returning on such tcp stream will be dropped by the tcp stack with reason ack-of-unsent-data. See reproduction in [0] below. Fix that by setting the return value to RX success if the packet was handled successfully. 
[0] Reproduction steps: $ ip link add veth1 type veth peer name peer1 $ ip link add veth2 type veth peer name peer2 $ ifconfig peer1 5.5.5.6/24 up $ ip netns add ns0 $ ip link set dev peer2 netns ns0 $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up $ ifconfig veth2 0 up $ ifconfig veth1 0 up #ingress forwarding veth1 <-> veth2 $ tc qdisc add dev veth2 ingress $ tc qdisc add dev veth1 ingress $ tc filter add dev veth2 ingress prio 1 proto all flower \ action mirred egress redirect dev veth1 $ tc filter add dev veth1 ingress prio 1 proto all flower \ action mirred egress redirect dev veth2 #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe $ tc qdisc add dev peer1 clsact $ tc filter add dev peer1 egress prio 20 proto ip flower \ action mirred ingress redirect dev veth1 #run iperf and see connection not running $ iperf3 -s& $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 #delete egress rule, and run again, now should work $ tc filter del dev peer1 egress $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 Fixes: `f697c3e8b3` ("[NET]: Avoid unnecessary cloning for ingress filtering") Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-20 16:33:01 +01:00
Herton R. Krzesinski	19ce0cbd76	Merge: bpf, xdp: update to 5.19 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1533 bpf, xdp: update to 5.19 Bugzilla: http://bugzilla.redhat.com/2120968 Bugzilla: http://bugzilla.redhat.com/2130850 Bugzilla: http://bugzilla.redhat.com/2140077 Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-12-21 20:49:27 +00:00
Herton R. Krzesinski	09736a3a30	Merge: udp: some performance optimizations MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057 Tested: LNST, Tier1, tput test This series improves UDP protocol RX tput, to keep it on equal footing with the rhel-8 one. Patches 1,3,4 are there just to reduce conflicts, and patch 4 is a very partial backport, to avoid pulling unrelated features. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-12-13 17:35:03 +00:00
Felix Maurer	1e3ab14088	xdp: Fix spurious packet loss in generic XDP TX path Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120968 commit 1fd6e5675336daf4747940b4285e84b0c114ae32 Author: Johan Almbladh <johan.almbladh@anyfinetworks.com> Date: Tue Jul 5 10:23:45 2022 +0200 xdp: Fix spurious packet loss in generic XDP TX path The byte queue limits (BQL) mechanism is intended to move queuing from the driver to the network stack in order to reduce latency caused by excessive queuing in hardware. However, when transmitting or redirecting a packet using generic XDP, the qdisc layer is bypassed and there are no additional queues. Since netif_xmit_stopped() also takes BQL limits into account, but without having any alternative queuing, packets are silently dropped. This patch modifies the drop condition to only consider cases when the driver itself cannot accept any more packets. This is analogous to the condition in __dev_direct_xmit(). Dropped packets are also counted on the device. Bypassing the qdisc layer in the generic XDP TX path means that XDP packets are able to starve other packets going through a qdisc, and DDOS attacks will be more effective. In-driver-XDP use dedicated TX queues, so they do not have this starvation issue. Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-11-30 12:47:11 +02:00
Felix Maurer	b06bbd83be	net: Use this_cpu_inc() to increment net->core_stats Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850 commit 6510ea973d8d9d4a0cb2fb557b36bd1ab3eb49f6 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Mon Apr 25 18:39:46 2022 +0200 net: Use this_cpu_inc() to increment net->core_stats The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes netdev_core_stats_alloc() to return a per-CPU pointer. netdev_core_stats_alloc() will allocate memory on its first invocation which breaks on PREEMPT_RT because it requires non-atomic context for memory allocation. This can be avoided by enabling preemption in netdev_core_stats_alloc() assuming the caller always disables preemption. It might be better to replace local_inc() with this_cpu_inc() now that dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does not rely on already disabled preemption. This results in less instructions on x86-64: local_inc: \| incl %gs:__preempt_count(%rip) # __preempt_count \| movq 488(%rdi), %rax # _1->core_stats, _22 \| testq %rax, %rax # _22 \| je .L585 #, \| add %gs:this_cpu_off(%rip), %rax # this_cpu_off, tcp_ptr__ \| .L586: \| testq %rax, %rax # _27 \| je .L587 #, \| incq (%rax) # _6->a.counter \| .L587: \| decl %gs:__preempt_count(%rip) # __preempt_count this_cpu_inc(), this patch: \| movq 488(%rdi), %rax # _1->core_stats, _5 \| testq %rax, %rax # _5 \| je .L591 #, \| .L585: \| incq %gs:(%rax) # _18->rx_dropped Use unsigned long as type for the counter. Use this_cpu_inc() to increment the counter. Use a plain read of the counter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-11-30 12:47:10 +02:00
Felix Maurer	a320271336	net: add per-cpu storage and net->core_stats Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850 Conflicts: - drivers/net/vxlan.c: file is not moved to drivers/net/vxlan/vxlan_core.c due to missing 6765393614ea8 ("vxlan: move to its own directory"); context difference due to missing 4095e0e1328a3 ("drivers: vxlan: vnifilter: per vni stats") - net/core/dev.c: code difference in __netif_receive_skb_core due to already applied 9f8ed577c2881 ("net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT"). Result is like upstream now. - net/core/gro_cells.c: context difference due to already applied 5dcd08cd1991 ("net: Fix data-races around netdev_max_backlog.") commit 625788b5844511cf4c30cffa7fa0bc3a69cebc82 Author: Eric Dumazet <edumazet@google.com> Date: Thu Mar 10 21:14:20 2022 -0800 net: add per-cpu storage and net->core_stats Before adding yet another possibly contended atomic_long_t, it is time to add per-cpu storage for existing ones: dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler Because many devices do not have to increment such counters, allocate the per-cpu storage on demand, so that dev_get_stats() does not have to spend considerable time folding zero counters. Note that some drivers have abused these counters which were supposed to be only used by core networking stack. v4: should use per_cpu_ptr() in dev_get_stats() (Jakub) v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo) v2: add a missing include (reported by kernel test robot <lkp@intel.com>) Change in netdev_core_stats_alloc() (Jakub) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-11-30 12:47:10 +02:00
Frantisek Hrbata	a03fbb1743	Merge: CNB: Update TC subsystem to upstream v6.0 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1567 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170 Tested: Using self-tests, results present in the BZ Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2133511 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128185 Commits: ``` b20dc3c68458 ("gtp: Allow to create GTP device without FDs") 9af41cc33471 ("gtp: Implement GTP echo response") d33bd757d362 ("gtp: Implement GTP echo request") e3acda7ade0a ("net/sched: Allow flower to match on GTP options") 81dd9849fa49 ("gtp: Add support for checking GTP device type") 02f393381d14 ("gtp: Fix inconsistent indenting") 4c096ea2d67c ("net/sched: matchall: Take verbose flag into account when logging error messages") 11c95317bc1a ("net/sched: flower: Take verbose flag into account when logging error messages") c2ccf84ecb71 ("net/sched: act_api: Add extack to offload_act_setup() callback") 69642c2ab2f5 ("net/sched: act_gact: Add extack messages for offload failure") 4dcaa50d0292 ("net/sched: act_mirred: Add extack message for offload failure") bca3821d19d9 ("net/sched: act_mpls: Add extack messages for offload failure") bf3b99e4f9ce ("net/sched: act_pedit: Add extack message for offload failure") b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure") a9c64939b669 ("net/sched: act_skbedit: Add extack messages for offload failure") ee367d44b936 ("net/sched: act_tunnel_key: Add extack message for offload failure") f8fab3169464 ("net/sched: act_vlan: Add extack message for offload failure") c440615ffbcb ("net/sched: cls_api: Add extack message for unsupported action offload") 0cba5c34b8f4 ("net/sched: matchall: Avoid overwriting error messages") fd23e0e250c6 ("net/sched: flower: Avoid overwriting error messages") c9a40d1c87e9 ("net_sched: make qdisc_reset() smaller") 7463acfbe52a ("netfilter: Rename ingress hook include 
file") 17d20784223d ("netfilter: Generalize ingress hook include file") 42df6e1d221d ("netfilter: Introduce egress hook") 2f1e85b1aee4 ("net: sched: use queue_mapping to pick tx queue") 38a6f0865796 ("net: sched: support hash selecting tx queue") 285ba06b0edb ("net/sched: flower: Helper function for vlan ethtype checks") 6ee59e554d33 ("net/sched: flower: Reduce identation after is_key_vlan refactoring") b40003128226 ("net/sched: flower: Add number of vlan tags filter") 99fdb22bc5e9 ("net/sched: flower: Consider the number of tags for vlan filters") b57c7e8b76c6 ("selftests: forwarding: tc_actions: allow mirred egress test to run on non-offloaded h2") 70f87de9fa0d ("net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()") a2b1a5d40bd1 ("net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms") 1da9e27415bf ("tc-testing: gitignore, delete plugins directory") 6deb209dc6b0 ("net: Print hashed skb addresses for all net and qdisc events") 76b39b94382f ("net/sched: act_api: Notify user space if any actions were flushed before error") 88153e29c1e0 ("selftests: tc-testing: Add testcases to test new flush behaviour") 837ced3a1a5d ("time64.h: consolidate uses of PSEC_PER_NSEC") d7be266adbfd ("net: sched: provide shim definitions for taprio_offload_{get,free}") fc54d9065f90 ("net/sched: act_ct: set 'net' pointer when creating new nf_flow_table") b038177636f8 ("netfilter: nf_flow_table: count pending offload workqueue tasks") b06ada6df9cf ("netfilter: flowtable: fix incorrect Kconfig dependencies") 83d85bb06915 ("net: extract port range fields from fl_flow_key") bc5c8260f411 ("net/sched: remove return value of unregister_tcf_proto_ops") 88b3822cdf2f ("net/sched: sch_cbq: Delete unused delay_timer") ca0cab119288 ("net/sched: remove qdisc_root_lock() helper") c0f47c2822aa ("net/sched: cls_api: Fix flow action initialization") 5008750eff5d ("net/sched: flower: Add PPPoE filter") a482d47d33ac ("net/sched: sch_cbq: change the type of cbq_set_lss to void") 
06799a9085e1 ("net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS") 4873a1b2024d ("net/sched: remove hacks added to dev_trans_start() for bonding to work") 9ad36309e271 ("net_sched: cls_route: remove from list when handle is 0") 02799571714d ("net_sched: cls_route: disallow handle of 0") b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock") f612466ebecb ("net/sched: fix netdevice reference leaks in attach_default_qdiscs()") 9efd23297cca ("sch_sfb: Don't assume the skb is still around after enqueueing to child") 2f09707d0c97 ("sch_sfb: Also store skb len before calling child enqueue") db46e3a88a09 ("net/sched: taprio: avoid disabling offload when it was never enabled") 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs") c2e1cfefcac3 ("net: sched: fix possible refcount leak in tc_new_tfilter()") 6e23ec0ba92d ("net: sched: act_ct: fix possible refcount leak in tcf_ct_init()") ffdd33dd9c12 ("netfilter: core: Fix clang warnings about unused static inlines") 6316136ec6e3 ("netfilter: egress: avoid a lockdep splat") d645552e9bd9 ("netfilter: egress: Report interface as outgoing") af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") 8bdc2acd420c ("net: sched: Fix use after free in red_enqueue()") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-23 02:46:05 -05:00
Frantisek Hrbata	1269719102	Merge: BPF and XDP rebase to v5.18 Merge conflicts: ----------------- arch/x86/net/bpf_jit_comp.c - bpf_arch_text_poke() HEAD(!1464) contains `b73b002f7f` ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline") Resolved in favour of !1464, but keep the return statement from !1477 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477 Bugzilla: https://bugzilla.redhat.com/2120966 Rebase BPF and XDP to the upstream kernel version 5.18 Patch applied, then reverted: ``` 544356 selftests/bpf: switch to new libbpf XDP APIs 0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs ``` Taken in the perf rebase: ``` 23fcfc perf: use generic bpf_program__set_type() to set BPF prog type ``` Unsuported arches: ``` 5c1011 libbpf: Fix riscv register names cf0b5b libbpf: Fix accessing syscall arguments on riscv ``` Depends on changes of other subsystems: ``` 7fc8c3 s390/bpf: encode register within extable entry aebfd1 x86/ibt,ftrace: Search for __fentry__ location 589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline ``` Broken selftest: ``` edae34 selftests net: add UDP GRO fraglist + bpf self-tests cf6783 selftests net: fix bpf build error 7b92aa selftests net: fix kselftest net fatal error ``` Out of scope: ``` baebdf net: dev: Makes sure netif_rx() can be invoked in any context. 
5c8166 kbuild: replace $(if A,A,B) with $(or A,B) 1a97ce perf maps: Use a pointer for kmaps 967747 uaccess: remove CONFIG_SET_FS 42b01a s390: always use the packed stack layout bf0882 flow_dissector: Add support for HSR d09a30 s390/extable: move EX_TABLE define to asm-extable.h 3d6671 s390/extable: convert to relative table with data 4efd41 s390: raise minimum supported machine generation to z10 f65e58 flow_dissector: Add support for HSRv0 1a6d7a netdevsim: Introduce support for L3 offload xstats 9b1894 selftests: netdevsim: hw_stats_l3: Add a new test 84005b perf ftrace latency: Add -n/--use-nsec option 36c4a7 kasan, arm64: don't tag executable vmalloc allocations 8df013 docs: netdev: move the netdev-FAQ to the process pages 4d4d00 perf tools: Update copy of libbpf's hashmap.c 0df6ad perf evlist: Rename cpus to user_requested_cpus 1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning 0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf 8994e9 perf test bpf: Skip test if clang is not present 735346 perf build: Fix btf__load_from_kernel_by_id() feature check f037ac s390/stack: merge empty stack frame slots 335220 docs: netdev: update maintainer-netdev.rst reference a0b098 s390/nospec: remove unneeded header includes 34513a netdevsim: Fix hwstats debugfs file permissions ``` Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Approved-by: Torez Smith <torez@redhat.com> Approved-by: Jan Stancek <jstancek@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Felix Maurer <fmaurer@redhat.com> Approved-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-21 05:30:47 -05:00
Frantisek Hrbata	27a89b8946	Merge: tcp: BIG TCP implementation MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1 The series introduces support for BIG TCP. - Patch 1-2: Preliminary dependencies - Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16) - Patch 15-19: Follow-ups Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-15 07:30:55 -05:00
Frantisek Hrbata	6fd36e2149	Merge: CNB: net: drop the weight argument from netif_napi_add MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1577 Bugzilla: https://bugzilla.redhat.com/2139498 Tested: build, boot Change the API of the netif_napi_add function family so that `netif_napi_add` and `netif_napi_add_tx` use weight = NAPI_POLL_WEIGHT by default (as most drivers were already doing in one way or another), and add `netif_napi_add_weight` and `netif_napi_add_tx_weight` for drivers that want to specify a custom NAPI weight. Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Torez Smith <torez@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Tony Camuso <tcamuso@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-14 10:28:04 -05:00
Ivan Vecera	f31181025a	net: sched: use queue_mapping to pick tx queue Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170 commit 2f1e85b1aee459b7d0fd981839042c6a38ffaf0c Author: Tonghao Zhang <xiangxia.m.yue@gmail.com> Date: Sat Apr 16 00:40:45 2022 +0800 net: sched: use queue_mapping to pick tx queue This patch fixes issue: * If we install tc filters with act_skbedit in clsact hook. It doesn't work, because netdev_core_pick_tx() overwrites queue_mapping. $ tc filter ... action skbedit queue_mapping 1 And this patch is useful: * We can use FQ + EDT to implement efficient policies. Tx queues are picked by xps, ndo_select_queue of netdev driver, or skb hash in netdev_core_pick_tx(). In fact, the netdev driver, and skb hash are _not_ under control. xps uses the CPUs map to select Tx queues, but we can't figure out which task_struct of pod/containter running on this cpu in most case. We can use clsact filters to classify one pod/container traffic to one Tx queue. Why ? In containter networking environment, there are two kinds of pod/ containter/net-namespace. One kind (e.g. P1, P2), the high throughput is key in these applications. But avoid running out of network resource, the outbound traffic of these pods is limited, using or sharing one dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods (e.g. Pn), the low latency of data access is key. And the traffic is not limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc. This choice provides two benefits. First, contention on the HTB/FQ Qdisc lock is significantly reduced since fewer CPUs contend for the same queue. More importantly, Qdisc contention can be eliminated completely if each CPU has its own FIFO Qdisc for the second kind of pods. There must be a mechanism in place to support classifying traffic based on pods/container to different Tx queues. Note that clsact is outside of Qdisc while Qdisc can run a classifier to select a sub-queue under the lock. 
In general, recording the decision in the skb seems a little heavy-handed. This patch introduces a per-CPU variable, suggested by Eric. The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit(). - The Tx Qdisc may install skbedit actions; the xmit.skip_txqueue flag is then set in qdisc->enqueue() even though the tx queue has already been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). Clearing the flag first in __dev_queue_xmit() is useful to: - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next netdev in a case such as: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy. For example, eth0, a macvlan in a pod whose root Qdisc installs skbedit queue_mapping, sends packets to eth0.3, a vlan in the host. In __dev_queue_xmit() of eth0.3, the flag is cleared and the tx queue is not selected according to skb->queue_mapping, because there are no filters in clsact or the tx Qdisc of this netdev. The same action is taken in eth0, the ixgbe device in the host. - Avoid picking the Tx queue for the next packet. If we set xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way to clear it is in __dev_queue_xmit() when processing the next packets. For performance reasons, a static key is used. If the user does not configure NET_EGRESS, the patch will not be compiled.
    +----+      +----+      +----+
    | P1 |      | P2 |      | Pn |
    +----+      +----+      +----+
      |           |           |
      +-----------+-----------+
                  |
                  | clsact/skbedit
                  |      MQ
                  v
      +-----------+-----------+
      | q0        | q1        | qn
      v           v           v
    HTB/FQ      HTB/FQ  ...  FIFO
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. 
Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-13 16:59:02 +01:00
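The flag handling described in this commit message can be illustrated with a small user-space sketch. It is only an approximation under stated assumptions: `struct mock_tx_skb`, `mock_skip_txqueue`, `mock_skbedit_queue_mapping()` and `mock_xmit_pick_queue()` are hypothetical names, a plain global stands in for the per-CPU `xmit.skip_txqueue` variable, and a modulo on the hash stands in for whatever `netdev_core_pick_tx()` would choose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few sk_buff fields this sketch needs. */
struct mock_tx_skb {
	unsigned int queue_mapping;
	unsigned int hash;
};

/* Assumption: a global replaces the per-CPU xmit.skip_txqueue flag. */
static bool mock_skip_txqueue;

/* Models an skbedit action: record the queue and ask xmit to honor it. */
static void mock_skbedit_queue_mapping(struct mock_tx_skb *skb, unsigned int q)
{
	skb->queue_mapping = q;
	mock_skip_txqueue = true;
}

/* Models the queue selection step of one transmit call. */
static unsigned int mock_xmit_pick_queue(struct mock_tx_skb *skb,
					 unsigned int num_queues,
					 bool run_skbedit,
					 unsigned int skbedit_queue)
{
	assert(skb != 0 && num_queues > 0);

	/* The flag is cleared first, so a previous packet or an upper
	 * device cannot leak its decision into this transmit. */
	mock_skip_txqueue = false;

	/* Egress classification may run an skbedit action. */
	if (run_skbedit)
		mock_skbedit_queue_mapping(skb, skbedit_queue);

	/* Honor queue_mapping only if a filter set it in this call. */
	if (mock_skip_txqueue)
		return skb->queue_mapping % num_queues;

	/* Otherwise fall back to a hash-based pick (simplified). */
	return skb->hash % num_queues;
}
```

Calling `mock_xmit_pick_queue()` twice for the same skb, once with and once without the filter, shows that the second call does not inherit the earlier queue decision.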
Ivan Vecera	d545c120ec	netfilter: Introduce egress hook Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170 commit 42df6e1d221dddc0f2acf2be37e68d553ad65f96 Author: Lukas Wunner <lukas@wunner.de> Date: Fri Oct 8 22:06:03 2021 +0200 netfilter: Introduce egress hook Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit `e687ad60af` ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. 
If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet is subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. 
Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-13 16:59:01 +01:00
Ivan Vecera	866706749c	netfilter: Generalize ingress hook include file Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170 commit 17d20784223d52bf1671f984c9e8d5d9b8ea171b Author: Lukas Wunner <lukas@wunner.de> Date: Fri Oct 8 22:06:02 2021 +0200 netfilter: Generalize ingress hook include file Prepare for addition of a netfilter egress hook by generalizing the ingress hook include file. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-13 16:59:01 +01:00
Ivan Vecera	3ccbb377fc	netfilter: Rename ingress hook include file Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170 commit 7463acfbe52ae8b7e0ea6890c1886b3f8ba8bddd Author: Lukas Wunner <lukas@wunner.de> Date: Fri Oct 8 22:06:01 2021 +0200 netfilter: Rename ingress hook include file Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-13 16:59:01 +01:00
Frantisek Hrbata	0fe0e3e4d8	Merge: CNB: net: HW counters for soft devices MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1580 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149 Tested: Using netdevsim hw_stats_l3.sh self-test Commits: ``` 22b67d17194f ("net: rtnetlink: rtnl_stats_get(): Emit an extack for unset filter_mask") 6b524a1d012b ("net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_") f6e0fb812988 ("net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_ are dev-backed") 46efc97b7306 ("net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests") 05415bccbb09 ("net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()") 216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns") 9309f97aef6d ("net: dev: Add hardware stats support") 0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats") 03ba35667091 ("net: rtnetlink: Add RTM_SETSTATS") 5fd0b838efac ("net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS") ba95e7930957 ("selftests: forwarding: hw_stats_l3: Add a new test") 57d29a2935c9 ("net: rtnetlink: fix error handling in rtnl_fill_statsinfo()") 23cfe941b52e ("rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS replies") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-08 09:08:22 -05:00
Frantisek Hrbata	5ac5a1dfd0	Merge: CNB: net: disambiguate the TSO and GSO limits MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1419 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Tested: Using iperf3 and toggling gso/tso offloading knobs Commits: ``` 2106efda785b ("net: remove .ndo_change_proto_down") 2cc6cdd44a16 ("net: unexport a handful of dev_* functions") 6264f58ca0e5 ("net: extract a few internals from netdevice.h") 6df6398f7c8b ("net: add netif_inherit_tso_max()") 14d7b8122fd5 ("net: don't allow user space to lift the device limits") ee8b7a1156f3 ("net: make drivers set the TSO limit not the GSO limit") 744d49daf8bd ("net: move netif_set_gso_max helpers") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-11-05 02:54:07 -04:00
Ivan Vecera	a5a7be252a	net: dev: Add hardware stats support Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149 commit 9309f97aef6d8250bb484dabeac925c3a7c57716 Author: Petr Machata <petrm@nvidia.com> Date: Wed Mar 2 18:31:20 2022 +0200 net: dev: Add hardware stats support Offloading switch device drivers may be able to collect statistics of the traffic taking place in the HW datapath that pertains to a certain soft netdevice, such as VLAN. Add the necessary infrastructure to allow exposing these statistics to the offloaded netdevice in question. The API was shaped by the following considerations: - Collection of HW statistics is not free: there may be a finite number of counters, and the act of counting may have a performance impact. It is therefore necessary to allow toggling whether HW counting should be done for any particular SW netdevice. - As the drivers are loaded and removed, a particular device may get offloaded and unoffloaded again. At the same time, the statistics values need to stay monotonic (modulo the eventual 64-bit wraparound), increasing only to reflect traffic measured in the device. To that end, the netdevice keeps around a lazily-allocated copy of struct rtnl_link_stats64. Device drivers then contribute to the values kept therein at various points. Even as the driver goes away, the struct stays around to maintain the statistics values. - Different HW devices may be able to count different things. The motivation behind this patch in particular is exposure of HW counters on Nvidia Spectrum switches, where the only practical approach to counting traffic on offloaded soft netdevices currently is to use router interface counters, and count L3 traffic. Correspondingly that is the statistics suite added in this patch. Other devices may be able to measure different kinds of traffic, and for that reason, the APIs are built to allow uniform access to different statistics suites. 
- Because soft netdevices and offloading drivers are only loosely bound, a netdevice uses a notifier chain to communicate with the drivers. Several new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages to the offloading drivers. - Devices can have various conditions for when a particular counter is available. As the device is configured and reconfigured, the device offload may become or cease being suitable for counter binding. A netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to ping offloading drivers and determine whether anyone currently implements a given statistics suite. This information can then be propagated to user space. When the driver decides to unoffload a netdevice, it can use a newly-added function, netdev_offload_xstats_report_delta(), to record outstanding collected statistics, before destroying the HW counter. This patch adds a helper, call_netdevice_notifiers_info_robust(), for dispatching a notifier with the possibility of unwind when one of the consumers bails. Given the wish to eventually get rid of the global notifier block altogether, this helper only invokes the per-netns notifier block. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-04 17:15:40 +01:00
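The "record the delta before the HW counter goes away" idea from this commit message can be sketched as follows. This is a simplified user-space model, not the kernel API: `struct mock_hw_stats`, `struct mock_netdev_stats` and `mock_report_delta()` are hypothetical names, and only two counters are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the counters a HW statistics suite could expose. */
struct mock_hw_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

/* Per-netdevice copy that outlives any single offloading driver. */
struct mock_netdev_stats {
	struct mock_hw_stats accumulated;
};

/* Fold the traffic counted since the last report into the persistent copy.
 * Because values are only ever added, the totals stay monotonic even when
 * the device is unoffloaded and offloaded again. */
static void mock_report_delta(struct mock_netdev_stats *dev,
			      const struct mock_hw_stats *delta)
{
	assert(dev != 0 && delta != 0);
	dev->accumulated.rx_packets += delta->rx_packets;
	dev->accumulated.tx_packets += delta->tx_packets;
}
```

In this model a driver would call `mock_report_delta()` once with its outstanding counts right before it destroys the HW counter, so nothing collected so far is lost.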
Íñigo Huguet	4ed32c17b9	netdev: reshuffle netif_napi_add() APIs to allow dropping weight Bugzilla: https://bugzilla.redhat.com/2139498 commit 58caed3dacb4354a25a1aa8d2febc3e9648ba1f4 Author: Jakub Kicinski <kuba@kernel.org> Date: Mon May 2 16:27:03 2022 -0700 netdev: reshuffle netif_napi_add() APIs to allow dropping weight Most drivers should not have to worry about selecting the right weight for their NAPI instances and pass NAPI_POLL_WEIGHT. It'd be best if we didn't require the argument at all and selected the default internally. This change prepares the ground for such reshuffling, allowing for a smooth transition. The following API should remain after the next release cycle: netif_napi_add() netif_napi_add_weight() netif_napi_add_tx() netif_napi_add_tx_weight() Where the _weight() variants take an explicit weight argument. I opted for a _weight() suffix rather than a __ prefix, because we use __ in places to mean that caller needs to also issue a synchronize_net() call. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>	2022-11-04 16:46:33 +01:00
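The API split described in this commit can be sketched in user space as follows. This is a minimal illustration rather than the kernel implementation: `struct mock_napi`, `mock_napi_add()`, `mock_napi_add_weight()` and `mock_poll()` are hypothetical stand-ins, and the default of 64 is assumed to match `NAPI_POLL_WEIGHT` in mainline kernels.

```c
#include <assert.h>

/* Assumed to match the kernel's default NAPI weight. */
#define MOCK_NAPI_POLL_WEIGHT 64

/* Hypothetical stand-in for struct napi_struct. */
struct mock_napi {
	int (*poll)(struct mock_napi *napi, int budget);
	int weight;
};

/* Explicit-weight variant, in the spirit of netif_napi_add_weight(). */
static void mock_napi_add_weight(struct mock_napi *napi,
				 int (*poll)(struct mock_napi *, int),
				 int weight)
{
	assert(napi != 0 && poll != 0 && weight > 0);
	napi->poll = poll;
	napi->weight = weight;
}

/* Default variant: the caller no longer passes a weight at all. */
static void mock_napi_add(struct mock_napi *napi,
			  int (*poll)(struct mock_napi *, int))
{
	mock_napi_add_weight(napi, poll, MOCK_NAPI_POLL_WEIGHT);
}

/* Trivial poll callback that reports no completed work. */
static int mock_poll(struct mock_napi *napi, int budget)
{
	(void)napi;
	(void)budget;
	return 0;
}
```

Keeping the explicit-weight function as the single implementation and layering the default on top means the common case stays short while drivers with unusual needs keep full control.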
Ivan Vecera	fccce056fa	net: allow gro_max_size to exceed 65536 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501 commit 0fe79f28bfaf73b66b7b1562d2468f94aa03bd12 Author: Alexander Duyck <alexanderduyck@fb.com> Date: Fri May 13 11:34:03 2022 -0700 net: allow gro_max_size to exceed 65536 Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-02 18:56:09 +01:00
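The capping rule in this commit message can be written out as a short sketch. It is a user-space approximation under stated assumptions: the function and macro names are hypothetical, and the decision of whether a flow is TCP over IPv6 is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* 64K cap for flows limited by a 16-bit length field (from the commit text). */
#define MOCK_GRO_LEGACY_MAX_SIZE 65536u
/* Upper bound for gro_max_size, per the [edumazet] note: 8 * 65535. */
#define MOCK_GRO_MAX_SIZE (8u * 65535u)

/* Return the size limit that would apply to one aggregated packet. */
static unsigned int mock_gro_effective_limit(unsigned int gro_max_size,
					     bool is_tcp_over_ipv6)
{
	assert(gro_max_size > 0);

	/* The configured value may never exceed the global ceiling. */
	if (gro_max_size > MOCK_GRO_MAX_SIZE)
		gro_max_size = MOCK_GRO_MAX_SIZE;

	/* IPv4 and non-TCP flows stay at the legacy constant. */
	if (!is_tcp_over_ipv6 && gro_max_size > MOCK_GRO_LEGACY_MAX_SIZE)
		return MOCK_GRO_LEGACY_MAX_SIZE;

	return gro_max_size;
}
```

With a configured `gro_max_size` of 200000, this model allows the full value for TCP over IPv6 and falls back to 65536 for everything else.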
Ivan Vecera	d513603ec1	net: allow gso_max_size to exceed 65536 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501 commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d Author: Alexander Duyck <alexanderduyck@fb.com> Date: Fri May 13 11:33:57 2022 -0700 net: allow gso_max_size to exceed 65536 The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-02 18:55:52 +01:00
Ivan Vecera	017d0aca36	gro: add ability to control gro max packet size Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501 Conflicts: - context due to existing backport of 14d7b8122fd5 ("net: don't allow user space to lift the device limits") commit eac1b93c14d645ef147b049ace0d5230df755548 Author: Coco Li <lixiaoyan@google.com> Date: Wed Jan 5 02:48:38 2022 -0800 gro: add ability to control gro max packet size Eric Dumazet suggested allowing users to modify the max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some workarounds in sch_cake to split GRO/GSO packets. Instead of disabling GRO completely, one can choose to limit the maximum size of GRO packets, depending on their latency constraints. This patch adds a per-device gro_max_size attribute that can be changed with the ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-11-02 18:55:37 +01:00
Paolo Abeni	022665bacd	net: skb: introduce and use a single page frag cache Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057 Tested: LNST, Tier1 Upstream commit: commit dbae2b062824fc2d35ae2d5df2f500626c758e80 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Sep 28 10:43:09 2022 +0200 net: skb: introduce and use a single page frag cache After commit `3226b158e6` ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-27 19:12:04 +02:00
Paolo Abeni	7822d83322	net: add napi_get_frags_check() helper Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057 Tested: LNST, Tier1 Upstream commit: commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148 Author: Eric Dumazet <edumazet@google.com> Date: Wed Jun 8 09:04:38 2022 -0700 net: add napi_get_frags_check() helper This is a follow up of commit `3226b158e6` ("net: avoid 32 x truesize under-estimation for tiny skbs") When/if we increase MAX_SKB_FRAGS, we better make sure the old bug will not come back. Adding a check in napi_get_frags() would be costly, even if using DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-27 19:10:48 +02:00
Jiri Benc	2da69cb317	net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Bugzilla: https://bugzilla.redhat.com/2120966 Conflicts: - [minor] context difference in __netif_receive_skb_core due to missing 42df6e1d221d ("netfilter: Introduce egress hook") commit cd14e9b7b8d312dfbf75ce1f78552902e51b9045 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:56:22 2022 -0800 net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally The previous patches handled the delivery_time in the ingress path before the routing decision is made. This patch can postpone clearing delivery_time in a skb until knowing it is delivered locally and also set the (rcv) timestamp if needed. This patch moves the skb_clear_delivery_time() from dev.c to ip_local_deliver_finish() and ip6_input_finish(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:00 +02:00
Jiri Benc	e0f797236e	net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Bugzilla: https://bugzilla.redhat.com/2120966 Conflicts: - [minor] context difference in __netif_receive_skb_core due to missing 42df6e1d221d ("netfilter: Introduce egress hook") commit d98d58a002619b5c165f1eedcd731e2fe2c19088 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:55:50 2022 -0800 net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() The previous patches handled the delivery_time before sch_handle_ingress(). This patch can now set skb->mono_delivery_time to flag that skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp, and also clear it with skb_clear_delivery_time() after sch_handle_ingress(). This lets bpf_redirect_*() keep the mono delivery_time so that it can be used by a qdisc (fq) of the egress interface. A later patch will postpone the skb_clear_delivery_time() until the stack learns that the skb is being delivered locally, which will allow other kernel forwarding paths (ip[6]_forward) to keep the delivery_time as well. Thus, like the previous patches that use the skb->mono_delivery_time bit, the call to skb_clear_delivery_time() is not confined to CONFIG_NET_INGRESS, to avoid too much code churn within this set. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:00 +02:00
Jiri Benc	e17e09a099	net: Clear mono_delivery_time bit in __skb_tstamp_tx() Bugzilla: https://bugzilla.redhat.com/2120966 commit d93376f503c7a586707925957592c0f16f4db0b1 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:55:44 2022 -0800 net: Clear mono_delivery_time bit in __skb_tstamp_tx() In __skb_tstamp_tx(), it may clone the egress skb and queues the clone to the sk_error_queue. The outgoing skb may have the mono delivery_time while the (rcv) timestamp is expected for the clone, so the skb->mono_delivery_time bit needs to be cleared from the clone. This patch adds the skb->mono_delivery_time clearing to the existing __net_timestamp() and use it in __skb_tstamp_tx(). The __net_timestamp() fast path usage in dev.c is changed to directly call ktime_get_real() since the mono_delivery_time bit is not set at that point. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:00 +02:00
Jiri Benc	c387356f8d	net: Handle delivery_time in skb->tstamp during network tapping with af_packet Bugzilla: https://bugzilla.redhat.com/2120966 commit 27942a15209f564ed8ee2a9e126cb7b105181355 Author: Martin KaFai Lau <kafai@fb.com> Date: Wed Mar 2 11:55:38 2022 -0800 net: Handle delivery_time in skb->tstamp during network tapping with af_packet A latter patch will set the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp. skb_clear_tstamp() will then keep this delivery_time during forwarding. This patch is to make the network tapping (with af_packet) to handle the delivery_time stored in skb->tstamp. Regardless of tapping at the ingress or egress, the tapped skb is received by the af_packet socket, so it is ingress to the af_packet socket and it expects the (rcv) timestamp. When tapping at egress, dev_queue_xmit_nit() is used. It has already expected skb->tstamp may have delivery_time, so it does skb_clone()+net_timestamp_set() to ensure the cloned skb has the (rcv) timestamp before passing to the af_packet sk. This patch only adds to clear the skb->mono_delivery_time bit in net_timestamp_set(). When tapping at ingress, it currently expects the skb->tstamp is either 0 or the (rcv) timestamp. Meaning, the tapping at ingress path has already expected the skb->tstamp could be 0 and it will get the (rcv) timestamp by ktime_get_real() when needed. There are two cases for tapping at ingress: One case is af_packet queues the skb to its sk_receive_queue. The skb is either not shared or new clone created. The newly added skb_clear_delivery_time() is called to clear the delivery_time (if any) and set the (rcv) timestamp if needed before the skb is queued to the sk_receive_queue. Another case, the ingress skb is directly copied to the rx_ring and tpacket_get_timestamp() is used to get the (rcv) timestamp. 
The newly added skb_tstamp() is used in tpacket_get_timestamp() to check the skb->mono_delivery_time bit before returning skb->tstamp. As mentioned earlier, the tapping@ingress has already expected the skb may not have the (rcv) timestamp (because no sk has asked for it) and has handled this case by directly calling ktime_get_real(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-10-25 14:58:00 +02:00
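The two helpers named in this series, `skb_tstamp()` and `skb_clear_delivery_time()`, can be modeled roughly as below. This is a user-space sketch under stated assumptions: `struct mock_ts_skb` is a hypothetical stand-in for the relevant sk_buff fields, the "is a receive timestamp wanted" decision is passed in as a boolean, and the current time is passed in as `now` instead of being read from a clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the timestamp-related fields of an skb. */
struct mock_ts_skb {
	int64_t tstamp;           /* either (rcv) timestamp or delivery time */
	bool mono_delivery_time;  /* true: tstamp holds a mono delivery time */
};

/* Reader side: never hand a delivery time to code expecting a (rcv)
 * timestamp; report 0 ("no timestamp yet") instead. */
static int64_t mock_skb_tstamp(const struct mock_ts_skb *skb)
{
	assert(skb != 0);
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}

/* Clear a delivery time once the skb is known to be consumed locally,
 * replacing it with a receive timestamp only if one is wanted. */
static void mock_skb_clear_delivery_time(struct mock_ts_skb *skb,
					 bool rcv_tstamp_wanted,
					 int64_t now)
{
	assert(skb != 0);
	if (!skb->mono_delivery_time)
		return;	/* tstamp is already 0 or a (rcv) timestamp */

	skb->mono_delivery_time = false;
	skb->tstamp = rcv_tstamp_wanted ? now : 0;
}
```

A consumer that sees 0 from `mock_skb_tstamp()` would then fetch the current time itself, which mirrors the ingress tapping behavior described in the commit message.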
Frantisek Hrbata	fa843be1d1	Merge: net: add skb drop reasons MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Sync skb drop reasons with upstream to improve debuggability and visibility in the net stack. This MR helps in understanding why a given packet is being dropped. One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint: ``` # perf record -e skb:kfree_skb -a sleep 10 # perf script swapper 0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED swapper 0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE ``` Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-10-24 14:27:58 -04:00
Ivan Vecera	4ba4dadfe4	net: make drivers set the TSO limit not the GSO limit Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Conflicts: * drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c * drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c - small context conflicts * drivers/net/usb/ax88179_178a.c - hunk removed, the driver does not call netif_set_gso_max_size() * drivers/net/usb/lan78xx.c - modified due to absence of commits d383216a7efe ("lan78xx: Introduce Tx URB processing improvements") and 0dd87266c133 ("lan78xx: Remove hardware-specific header update") commit ee8b7a1156f357613646d6c69d07ac5a087a1071 Author: Jakub Kicinski <kuba@kernel.org> Date: Thu May 5 19:51:33 2022 -0700 net: make drivers set the TSO limit not the GSO limit Drivers should call the TSO setting helper, GSO is controllable by user space. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-18 10:27:21 +02:00
Ivan Vecera	8f95afcecf	net: don't allow user space to lift the device limits Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Conflicts: - small context conflict due to missing eac1b93c14d6 ("gro: add ability to control gro max packet size") commit 14d7b8122fd591693a2388b98563707ba72c6780 Author: Jakub Kicinski <kuba@kernel.org> Date: Thu May 5 19:51:32 2022 -0700 net: don't allow user space to lift the device limits Up until commit `46e6b992c2` ("rtnetlink: allow GSO maximums to be set on device creation") the gso_max_segs and gso_max_size of a device were not controlled from user space. The quoted commit added the ability to control them because of the following setup:
     netns A    |     netns B
         veth<->veth   eth0
If eth0 has TSO limitations and user wants to efficiently forward traffic between eth0 and the veths they should copy the TSO limitations of eth0 onto the veths. This would happen automatically for macvlans or ipvlan but veth users are not so lucky (given the loose coupling). Unfortunately the commit in question allowed users to also override the limits on real HW devices. It may be useful to control the max GSO size and someone may be using that ability (not that I know of any user), so create a separate set of knobs to reliably record the TSO limitations. Validate the user requests. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-18 10:27:21 +02:00
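The separation between a driver-owned TSO limit and a user-adjustable GSO limit can be sketched as follows. This is a simplified model under stated assumptions: `struct mock_dev` and both setter functions are hypothetical names, and the 65536 ceiling is assumed to reflect the limit in force at the time of this commit.

```c
#include <assert.h>
#include <errno.h>

/* Assumed overall ceiling at the time of this commit. */
#define MOCK_GSO_MAX_SIZE 65536u

/* Hypothetical stand-in for the two limits kept on a net device. */
struct mock_dev {
	unsigned int tso_max_size; /* what the hardware can do (driver-set) */
	unsigned int gso_max_size; /* what the stack will build (user-set) */
};

/* Driver side: record the hardware limit and pull the GSO limit down
 * if it would otherwise exceed what the hardware supports. */
static void mock_set_tso_max_size(struct mock_dev *dev, unsigned int size)
{
	assert(dev != 0);
	if (size > MOCK_GSO_MAX_SIZE)
		size = MOCK_GSO_MAX_SIZE;
	dev->tso_max_size = size;
	if (dev->gso_max_size > size)
		dev->gso_max_size = size;
}

/* User side: requests are validated and may only lower the limit
 * relative to the device's TSO capability, never lift it. */
static int mock_user_set_gso_max_size(struct mock_dev *dev,
				      unsigned int requested)
{
	assert(dev != 0);
	if (requested == 0 || requested > dev->tso_max_size)
		return -EINVAL;
	dev->gso_max_size = requested;
	return 0;
}
```

In this model a user can still shrink `gso_max_size` on a veth to match a constrained physical device, but a request above the driver's limit is rejected instead of silently overriding it.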
Ivan Vecera	f9b471a989	net: add netif_inherit_tso_max() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Conflicts: - small context conflict due to missing eac1b93c14d6 ("gro: add ability to control gro max packet size") commit 6df6398f7c8b481ce83f28143bc08a5231616deb Author: Jakub Kicinski <kuba@kernel.org> Date: Thu May 5 19:51:31 2022 -0700 net: add netif_inherit_tso_max() To make later patches smaller create a helper for inheriting the TSO limitations of a lower device. The TSO in the name is not an accident, subsequent patches will replace GSO with TSO in more names. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-18 10:27:21 +02:00
Ivan Vecera	5a0eef8003	net: extract a few internals from netdevice.h Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Conflicts: - slightly modified due to missing 0b5c21bbc01e ("net: ensure net_todo_list is processed quickly") and d07b26f5bbea ("dev_addr: add a modification check") commit 6264f58ca0e54e41d63c2d00334a48bac28fbf30 Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Apr 6 14:37:54 2022 -0700 net: extract a few internals from netdevice.h There are a number of functions and static variables used under net/core/ but not from the outside. We currently dump most of them into netdevice.h. That's bad for many reasons: - netdevice.h is very cluttered, hard to figure out what the APIs are; - netdevice.h is very long; - we have to touch netdevice.h more which causes expensive incremental builds. Create a header under net/core/ and move some declarations. The new header is also a bit of a catch-all but that's fine; if we create more specific headers, people will likely over-think where their declarations fit best and end up putting them in netdevice.h again. More work should be done on splitting netdevice.h into more targeted headers, but that'd be more time consuming so small steps. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-18 10:27:16 +02:00
Antoine Tenart	d3b8b917fb	net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git Conflict:\ - In __netif_receive_skb_core due to missing upstream commit 625788b58445 ("net: add per-cpu storage and net->core_stats") in c9s. commit 9f8ed577c28813410614b418bad42285840c1a00 Author: Menglong Dong <imagedong@tencent.com> Date: Thu Apr 7 14:20:50 2022 +0800 net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT As David Ahern suggested, the reasons for skb drops should be more general and not be code based. Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no L3 protocol handler, no L4 protocol handler, version extensions, etc. From previous discussion, now we have the aim to make these reasons more abstract and users based, avoiding code based. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:24 +02:00
Antoine Tenart	3f421c9474	net: dev: use kfree_skb_reason() for __netif_receive_skb_core() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 6c2728b7c14164928cb7cb9c847dead101b2d503 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:46 2022 +0800 net: dev: use kfree_skb_reason() for __netif_receive_skb_core() Add reason for skb drops to __netif_receive_skb_core() when packet_type not found to handle the skb. For this purpose, the drop reason SKB_DROP_REASON_PTYPE_ABSENT is introduced. Take ether packets for example, this case mainly happens when L3 protocol is not supported. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Antoine Tenart	4fa8044e89	net: dev: use kfree_skb_reason() for sch_handle_ingress() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit a568aff26ac03ee9eb1482683514914a5ec3b4c3 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:45 2022 +0800 net: dev: use kfree_skb_reason() for sch_handle_ingress() Replace kfree_skb() used in sch_handle_ingress() with kfree_skb_reason(). Following drop reasons are introduced: SKB_DROP_REASON_TC_INGRESS Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Antoine Tenart	9c9aa3ee0a	net: dev: use kfree_skb_reason() for do_xdp_generic() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 7e726ed81e1ddd5fdc431e02b94fcfe2a9876d42 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:44 2022 +0800 net: dev: use kfree_skb_reason() for do_xdp_generic() Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason(). The drop reason SKB_DROP_REASON_XDP is introduced for this case. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Antoine Tenart	db388f3375	net: dev: use kfree_skb_reason() for enqueue_to_backlog() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 44f0bd40803c0e04f1c8cd59df3c7acce783ae9c Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:43 2022 +0800 net: dev: use kfree_skb_reason() for enqueue_to_backlog() Replace kfree_skb() used in enqueue_to_backlog() with kfree_skb_reason(). The skb drop reason SKB_DROP_REASON_CPU_BACKLOG is introduced for the case of failing to enqueue the skb to the per CPU backlog queue. The further reason can be backlog queue full or RPS flow limitation, and I think we needn't make further distinctions. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Antoine Tenart	b63c068d65	net: dev: add skb drop reasons to __dev_xmit_skb() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 7faef0547f4c29031a68d058918b031a8e520d49 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:42 2022 +0800 net: dev: add skb drop reasons to __dev_xmit_skb() Add reasons for skb drops to __dev_xmit_skb() by replacing kfree_skb_list() with kfree_skb_list_reason(). The drop reason of SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Antoine Tenart	694219a303	net: dev: use kfree_skb_reason() for sch_handle_egress() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161 Upstream Status: linux.git commit 98b4d7a4e7374a44c4afd9f08330e72f6ad0d644 Author: Menglong Dong <imagedong@tencent.com> Date: Fri Mar 4 14:00:40 2022 +0800 net: dev: use kfree_skb_reason() for sch_handle_egress() Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason(). The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering the code path of tc egress, we make it distinct with the drop reason of SKB_DROP_REASON_QDISC_DROP in the next commit. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-10-13 14:53:23 +02:00
Paolo Abeni	7403d40195	net: Fix a data-race around netdev_unregister_timeout_secs. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Conflicts: chunk applied into netdev_wait_allrefs() instead of \ netdev_wait_allrefs_any() and with different context as rhel-9 \ lacks the upstream commit faab39f63c1fc ("net: allow out-of-order \ netdev unregistration") Upstream commit: commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:59 2022 -0700 net: Fix a data-race around netdev_unregister_timeout_secs. While reading netdev_unregister_timeout_secs, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: `5aa3afe107` ("net: make unregister netdev warning timeout configurable") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:04 +02:00
Paolo Abeni	48e48d197a	net: Fix a data-race around netdev_budget_usecs. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:55 2022 -0700 net: Fix a data-race around netdev_budget_usecs. While reading netdev_budget_usecs, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: `7acf8a1e8a` ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:04 +02:00
Paolo Abeni	3d0c78c5c1	net: Fix a data-race around netdev_budget. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:53 2022 -0700 net: Fix a data-race around netdev_budget. While reading netdev_budget, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: `51b0bdedb8` ("[NET]: Separate two usages of netdev_max_backlog.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:04 +02:00
Paolo Abeni	08060d0717	net: Fix data-races around netdev_tstamp_prequeue. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit 61adf447e38664447526698872e21c04623afb8e Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:47 2022 -0700 net: Fix data-races around netdev_tstamp_prequeue. While reading netdev_tstamp_prequeue, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: `3b098e2d7c` ("net: Consistent skb timestamping") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:03 +02:00
Paolo Abeni	13d50816f6	net: Fix data-races around netdev_max_backlog. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit 5dcd08cd19912892586c6082d56718333e2d19db Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:46 2022 -0700 net: Fix data-races around netdev_max_backlog. While reading netdev_max_backlog, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. While at it, we remove the unnecessary spaces in the doc. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:03 +02:00
Paolo Abeni	05d6206bdc	net: Fix data-races around weight_p and dev_weight_[rt]x_bias. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161 Tested: LNST, Tier1 Upstream commit: commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829 Author: Kuniyuki Iwashima <kuniyu@amazon.com> Date: Tue Aug 23 10:46:45 2022 -0700 net: Fix data-races around weight_p and dev_weight_[rt]x_bias. While reading weight_p, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Also, dev_[rt]x_weight can be read/written at the same time. So, we need to use READ_ONCE() and WRITE_ONCE() for its access. Moreover, to use the same weight_p while changing dev_[rt]x_weight, we add a mutex in proc_do_dev_weight(). Fixes: `3d48b53fb2` ("net: dev_weight: TX/RX orthogonality") Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-10-13 13:00:03 +02:00
Ivan Vecera	7ca7843425	net: unexport a handful of dev_* functions Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 commit 2cc6cdd44a1655ac5a9863529a2fd6dbed2d092c Author: Jakub Kicinski <kuba@kernel.org> Date: Wed Apr 6 14:37:53 2022 -0700 net: unexport a handful of dev_* functions We have a bunch of functions which are only used under net/core/ yet they get exported. Remove the exports. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-03 17:03:08 +02:00
Ivan Vecera	616826f600	net: remove .ndo_change_proto_down Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180 Conflicts: - small context conflict due to existing backport of 3b89b511ea0c ("net: fix IFF_TX_SKB_NO_LINEAR definition") commit 2106efda785b55a8957efed9a52dfa28ee0d7280 Author: Jakub Kicinski <kuba@kernel.org> Date: Mon Nov 22 17:24:47 2021 -0800 net: remove .ndo_change_proto_down .ndo_change_proto_down was added seemingly to enable out-of-tree implementations. Over 2.5yrs later we still have no real users upstream. Hardwire the generic implementation for now, we can revert once real users materialize. (rocker is a test vehicle, not a user.) We need to drop the optimization on the sysfs side, because unlike ndos priv_flags will be changed at runtime, so we'd need READ_ONCE/WRITE_ONCE everywhere.. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-10-03 17:02:55 +02:00
Felix Maurer	8611666ff2	xdp: check prog type before updating BPF link Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620 commit 382778edc8262b7535f00523e9eb22edba1b9816 Author: Toke Høiland-Jørgensen <toke@redhat.com> Date: Fri Jan 7 23:11:13 2022 +0100 xdp: check prog type before updating BPF link The bpf_xdp_link_update() function didn't check the program type before updating the program, which made it possible to install any program type as an XDP program, which is obviously not good. Syzbot managed to trigger this by swapping in an LWT program on the XDP hook which would crash in a helper call. Fix this by adding a check and bailing out if the types don't match. Fixes: `026a4c28e1` ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link") Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Felix Maurer <fmaurer@redhat.com>	2022-08-24 16:56:03 +02:00
Patrick Talbert	95ad1a9fa6	Merge: CNB: bpf: Let bpf_warn_invalid_xdp_action() report more info MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1070 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454 Tested: Build, boot. The commit let bpf_warn_invalid_xdp_action() report more info Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Corinna Vinschen <vinschen@redhat.com> Approved-by: Kamal Heib <kheib@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Mohamed Gamal Morsy <mgamal@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-07-15 09:40:47 +02:00
Patrick Talbert	5f85d33e47	Merge: net/core: backport fixes from upstream for 9.1 P2 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1057 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278 The latest patch depends on the second latest patch. Signed-off-by: Hangbin Liu <haliu@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Approved-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-07-14 12:07:49 +02:00
Patrick Talbert	c2f72a65cf	Merge: CNB: gro: get out of core files MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1066 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 Tested: Just built - there is no functional change The series moves GRO related definitions, declarations and code from core files into net/core/gro.h and include/net/gro.h and reduces too big files include/linux/netdevice.h and net/core/dev.c. Backport of this series provides <net/gro.h> for NIC drivers and avoids conflicts in future GRO related backports and fixes. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Kamal Heib <kheib@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Conflicts: - include/linux/netdevice.h: fuzz. Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-07-12 10:36:03 +02:00
Patrick Talbert	f063b56239	Merge: net: backport netdevice and netns refcount tracking and enable them for debug kernels MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1003 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377 Tested: Basic networking tasks using namespaces, vlans, veths, macvlans etc. with kernel-debug flavor Upstream kernel recently introduces refcount tracking infrastructure for network devices and namespaces to help to avoid resource leaks and use-after-free issues. This infrastructure should be helpful for our support teams to debug customers' issues. The series backports the following commits and enables both trackers for kernel debug flavors: ``` 95d1d2490c27 ("netdevice: move xdp_rxq within netdev_rx_queue") 2a12ae5d433d ("net: inline sock_prot_inuse_add()") d477eb900484 ("net: make sock_inuse_add() available") 4199bae10c49 ("net: merge net->core.prot_inuse and net->core.sock_inuse") b3cb764aa1d7 ("net: drop nopreempt requirement on sock_prot_inuse_add()") 4e66934eaadc ("lib: add reference counting tracking infrastructure") 914a7b5000d0 ("lib: add tests for reference tracker") 4d92b95ff2f9 ("net: add net device refcount tracker infrastructure") 80e8921b2b72 ("net: add net device refcount tracker to struct netdev_rx_queue") 0b688f24b7d6 ("net: add net device refcount tracker to struct netdev_queue") 5ae2195088d0 ("net: add net device refcount tracker to ethtool_phys_id()") 14ed029b5eb5 ("net: add net device refcount tracker to dev_ifsioc()") 4dbd24f65c60 ("drop_monitor: add net device refcount tracker") 9038c320001d ("net: dst: add net device refcount tracking to dst_entry") fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()") c0fd407a0666 ("sit: add net device refcount tracking to ip_tunnel") 56c1c77948ba ("ipv6: add net device refcount tracker to struct ip6_tnl") 85662c9f8cbd ("net: add net device refcount tracker to struct neighbour") 77a23b1f9543 ("net: add net 
device refcount tracker to struct pneigh_entry") 08d622568e5a ("net: add net device refcount tracker to struct neigh_parms") f77159a348f2 ("net: add net device refcount tracker to struct netdev_adjacent") 8c727003c4d0 ("ipv6: add net device refcount tracker to struct inet6_dev") c04438f58d14 ("ipv4: add net device refcount tracker to struct in_device") 606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc") 63f13937cbe9 ("net: linkwatch: add net device refcount tracker") 095e200f175f ("net: failover: add net device refcount tracker") 42120a864383 ("ipmr, ip6mr: add net device refcount tracker to struct vif_device") 5fa5ae605821 ("netpoll: add net device refcount tracker to struct netpoll") c0e5e11af12b ("vrf: use dev_replace_track() for better tracking") 08f0b22d731f ("net: eql: add net device refcount tracker") 19c9ebf6ed70 ("vlan: add net device refcount tracker") b2dcdc7f731d ("net: bridge: add net device refcount tracker") f12bf6f3f942 ("net: watchdog: add net device refcount tracker") 4fc003fe0313 ("net: switchdev: add net device refcount tracker") e44b14ebae10 ("inet: add net device refcount tracker to struct fib_nh_common") 66ce07f7802b ("ax25: add net device refcount tracker") 615d069dcf12 ("llc: add net device refcount tracker") 035f1f2b96ae ("pktgen add net device refcount tracker") b60645248af3 ("net/smc: add net device tracker to struct smc_pnetentry") e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info") e7c8ab8419d7 ("openvswitch: add net device refcount tracker to struct vport") ada066b2e02c ("net: sched: act_mirred: add net device refcount tracker") 4177e4960594 ("xfrm: use net device refcount tracker helpers") 9ba74e6c9e9d ("net: add networking namespace refcount tracker") ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") 04a931e58d19 ("net: add netns refcount tracker to struct seq_net_private") dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts") 285ec2fef4b8 
("l2tp: add netns refcount tracker to l2tp_dfs_seq_data") 11b311a867b6 ("ppp: add netns refcount tracker") 0976b888a150 ("ethtool: fix null-ptr-deref on ref tracker") e1b539bd73a7 ("xfrm: add net device refcount tracker to struct xfrm_state_offload") 8b40a9d53d4f ("ipv6: use GFP_ATOMIC in rt6_probe()") 1d2f3d3c6268 ("mptcp: adjust to use netns refcount tracker") 123e495ecc25 ("net: linkwatch: be more careful about dev->linkwatch_dev_tracker") 9280ac2e6f19 ("net: dev_replace_track() cleanup") 34ac17ecbf57 ("ethtool: use ethnl_parse_header_dev_put()") f1d9268e0618 ("net: add net device refcount tracker to struct packet_type") 3bc14ea0d12a ("ethtool: always write dev in ethnl_parse_header_dev_get") a9382d9389a0 ("netfilter: nfnetlink: add netns refcount tracker to struct nfulnl_instance") 30db406923b9 ("netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic") 7970a19b7104 ("netfilter: nf_nat_masquerade: defer conntrack walk to work queue") fc0d026a2fad ("netfilter: nf_nat_masquerade: add netns refcount tracker to masq_dev_work") 88248c357c2a ("net/sched: add missing tracker information in qdisc_create()") 2d6ec25539b0 ("netlink: do not allocate a device refcount tracker in ethnl_default_notify()") bf44077c1b3a ("af_packet: fix tracking issues in packet_do_bind()") cb963a19d99f ("net: sched: do not allocate a tracker in tcf_exts_init()") c12837d1bb31 ("ref_tracker: use __GFP_NOFAIL more carefully") fcfb894d5952 ("net: bridge: fix net device refcount tracking issue in error path") 7b9b1d449a7c ("net/smc: fix possible NULL deref in smc_pnet_add_eth()") 6cdef8a6ee74 ("SUNRPC: add netns refcount tracker to struct svc_xprt") 9b1831e56c7f ("SUNRPC: add netns refcount tracker to struct gss_auth") b9a0d6d143ec ("SUNRPC: add netns refcount tracker to struct rpc_xprt") e3ececfe668f ("ref_tracker: implement use-after-free detection") 8fd5522f44dc ("ref_tracker: add a count of untracked references") 4c6c11ea0f7b ("net: refine dev_put()/dev_hold() debugging") 
28f922213886 ("net/smc: fix ref_tracker issue in smc_pnet_add()") 94fdd7c02a56 ("net/smc: use GFP_ATOMIC allocation in smc_pnet_add_eth()") b2309a71c1f2 ("net: add dev->dev_registered_tracker") 3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()") ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-07-01 09:17:32 +02:00
Ivan Vecera	ca7c7d9c0c	bpf: Let bpf_warn_invalid_xdp_action() report more info Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454 Conflicts: - N/A hunk for unsupported octeontx2 driver omitted commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca Author: Paolo Abeni <pabeni@redhat.com> Date: Tue Nov 30 11:08:07 2021 +0100 bpf: Let bpf_warn_invalid_xdp_action() report more info In non trivial scenarios, the action id alone is not sufficient to identify the program causing the warning. Before the previous patch, the generated stack-trace pointed out at least the involved device driver. Let's additionally include the program name and id, and the relevant device name. If the user needs additional infos, he can fetch them via a kernel probe, leveraging the arguments added here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 16:13:14 +02:00
Ivan Vecera	7ba9ae4395	net: gro: populate net/core/gro.c Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 Conflicts: - adjusted due to existing backport of 7881453e4adf ("net: gro: avoid re-computing truesize twice on recycle") commit 587652bbdd06ab38a4c1b85e40f933d2cf4a1147 Author: Eric Dumazet <edumazet@google.com> Date: Mon Nov 15 09:05:54 2021 -0800 net: gro: populate net/core/gro.c Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:41 +02:00
Ivan Vecera	e9721641ed	net:dev: Change napi_gro_complete return type to void Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 commit 1643771eeb2db9b487cbbde12e2a3f6ed0171490 Author: Gyumin Hwang <hkm73560@gmail.com> Date: Sat Oct 2 08:11:36 2021 +0000 net:dev: Change napi_gro_complete return type to void napi_gro_complete always returned the same value, NET_RX_SUCCESS And the value was not used anywhere Signed-off-by: Gyumin Hwang <hkm73560@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:40 +02:00
Ivan Vecera	2119ff5330	move netdev_boot_setup into Space.c Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789 commit 5ea2f5ffde39251115ef9a566262fb9e52b91cb7 Author: Arnd Bergmann <arnd@arndb.de> Date: Tue Aug 3 13:40:46 2021 +0200 move netdev_boot_setup into Space.c This is now only used by a handful of old ISA drivers, and can be moved into the file they already all depend on. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-28 13:28:39 +02:00
Hangbin Liu	e4c3a2b313	net: fix data-race in dev_isalive() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278 Upstream Status: net.git commit cc26c2661fef Conflicts: context conflicts due to missing ae68db14b616 ("net: transition netdev reg state earlier in run_todo") and 86213f80da1b ("net: avoid quadratic behavior in netdev_wait_allrefs_any()") commit cc26c2661fefea215f41edb665193324a5f99021 Author: Eric Dumazet <edumazet@google.com> Date: Thu Jun 16 00:34:34 2022 -0700 net: fix data-race in dev_isalive() dev_isalive() is called under RTNL or dev_base_lock protection. This means that changes to dev->reg_state should be done with both locks held. syzbot reported: BUG: KCSAN: data-race in register_netdevice / type_show write to 0xffff888144ecf518 of 1 bytes by task 20886 on cpu 0: register_netdevice+0xb9f/0xdf0 net/core/dev.c:10050 lapbeth_new_device drivers/net/wan/lapbether.c:414 [inline] lapbeth_device_event+0x4a0/0x6c0 drivers/net/wan/lapbether.c:456 notifier_call_chain kernel/notifier.c:87 [inline] raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:455 __dev_notify_flags+0x1d6/0x3a0 dev_change_flags+0xa2/0xc0 net/core/dev.c:8607 do_setlink+0x778/0x2230 net/core/rtnetlink.c:2780 __rtnl_newlink net/core/rtnetlink.c:3546 [inline] rtnl_newlink+0x114c/0x16a0 net/core/rtnetlink.c:3593 rtnetlink_rcv_msg+0x811/0x8c0 net/core/rtnetlink.c:6089 netlink_rcv_skb+0x13e/0x240 net/netlink/af_netlink.c:2501 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6107 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x58a/0x660 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x661/0x750 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] __sys_sendto+0x21e/0x2c0 net/socket.c:2119 __do_sys_sendto net/socket.c:2131 [inline] __se_sys_sendto net/socket.c:2127 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2127 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 
arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 read to 0xffff888144ecf518 of 1 bytes by task 20423 on cpu 1: dev_isalive net/core/net-sysfs.c:38 [inline] netdev_show net/core/net-sysfs.c:50 [inline] type_show+0x24/0x90 net/core/net-sysfs.c:112 dev_attr_show+0x35/0x90 drivers/base/core.c:2095 sysfs_kf_seq_show+0x175/0x240 fs/sysfs/file.c:59 kernfs_seq_show+0x75/0x80 fs/kernfs/file.c:162 seq_read_iter+0x2c3/0x8e0 fs/seq_file.c:230 kernfs_fop_read_iter+0xd1/0x2f0 fs/kernfs/file.c:235 call_read_iter include/linux/fs.h:2052 [inline] new_sync_read fs/read_write.c:401 [inline] vfs_read+0x5a5/0x6a0 fs/read_write.c:482 ksys_read+0xe8/0x1a0 fs/read_write.c:620 __do_sys_read fs/read_write.c:630 [inline] __se_sys_read fs/read_write.c:628 [inline] __x64_sys_read+0x3e/0x50 fs/read_write.c:628 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 20423 Comm: udevd Tainted: G W 5.19.0-rc2-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-06-27 16:39:41 +08:00
Hangbin Liu	ca3a0598a6	net: Write lock dev_base_lock without disabling bottom halves. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278 Upstream Status: net.git commit fd888e85fe6b commit fd888e85fe6b661e78044dddfec0be5271afa626 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Fri Nov 26 17:15:29 2021 +0100 net: Write lock dev_base_lock without disabling bottom halves. The writer acquires dev_base_lock with disabled bottom halves. The reader can acquire dev_base_lock without disabling bottom halves because there is no writer in softirq context. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves (as in write_lock_bh()) and somewhere else with enabled bottom halves (as by read_lock() in netstat_show()) followed by disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout() -> spin_lock_bh()). This is the reverse locking order. All read_lock() invocation are from sysfs callback which are not invoked from softirq context. Therefore there is no need to disable bottom halves while acquiring a write lock. Acquire the write lock of dev_base_lock without disabling bottom halves. Reported-by: Pei Zhang <pezhang@redhat.com> Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-06-27 16:37:14 +08:00
Hangbin Liu	7b9f2507ce	net: fix dev_fill_forward_path with pppoe + bridge Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278 Upstream Status: net.git commit cf2df74e202d commit cf2df74e202d81b09f09d84c2d8903e0e87e9274 Author: Felix Fietkau <nbd@nbd.name> Date: Mon May 9 14:26:15 2022 +0200 net: fix dev_fill_forward_path with pppoe + bridge When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx Fixes: `f6efc675c9` ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-06-27 12:26:09 +08:00
Patrick Talbert	164ce13234	Merge: CNB: Update TC subsystem to upstream v5.18 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/971 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410 Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2094002 Tested: Using TC related kernel self-tests The series rebases TC subsystem to upstream v5.18 Commits: ``` f79a3bcb1a50 ("net/sched: Remove unnecessary if statement") 409f386b8e5d ("qdisc: add new field for qdisc_enqueue tracepoint") 56af5e749f20 ("net/sched: act_skbmod: Add SKBMOD_F_ECN option support") 68f9884837c6 ("tc-testing: Add control-plane selftest for skbmod SKBMOD_F_ECN option") 695176bfe5de ("net_sched: refactor TC action init API") 625af9f0298b ("tc-testing: Add control-plane selftests for sch_mq") a5397d68b2db ("net/sched: cls_api, reset flags on replay") efe487fce306 ("fix array-index-out-of-bounds in taprio_change") 1e080f17750d ("net: sched: update default qdisc visibility after Tx queue cnt changes") 2e367522ce6b ("netdevsim: add ability to change channel count") 2d6a58996ee2 ("selftests: net: test ethtool -L vs mq") f7116fb46085 ("net: sched: move and reuse mq_change_real_num_tx()") b193e15ac69d ("net: prevent user from passing illegal stab size") 69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers") 129291980f49 ("net: sched: Use struct_size() helper in kvmalloc()") fbf307c89eb0 ("gen_stats: Add instead Set the value in __gnet_stats_copy_basic().") 448e163f8b9b ("gen_stats: Add gnet_stats_add_queue().") 7361df4606ba ("mq, mqprio: Use gnet_stats_add_queue().") 10940eb746d4 ("gen_stats: Move remaining users to gnet_stats_add_queue().") f2efdb179289 ("u64_stats: Introduce u64_stats_set()") 67c9e6270f30 ("net: sched: Protect Qdisc::bstats with u64_stats") f56940daa5a7 ("net: sched: Use _bstats_update/set() instead of raw writes") 50dc9a8572aa ("net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types") 29cbcd858283 ("net: sched: 
Remove Qdisc::running sequence counter") 4c57e2fac41c ("net: sched: fix logic error in qdisc_run_begin()") 97604c65bcda ("net: sched: remove one pair of atomic operations") 6b3efbfa4e68 ("net: sch_tbf: Add a graft command") e22db7bd552f ("net: sched: Allow statistics reads from softirq.") c5c6e589a8c8 ("net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding.") f25c0515c521 ("net: sched: gred: dynamically allocate tc_gred_qopt_offload") 267463823adb ("net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()") 85c0c3eb9a66 ("net: sch: simplify condtion for selecting mini_Qdisc_pair buffer") 648a991cf316 ("sch_htb: Add extack messages for EOPNOTSUPP errors") 6de6e46d27ef ("cls_flower: Fix inability to match GRE/IPIP packets") af0a51113cb7 ("selftests: forwarding: Fix packet matching in mirroring selftests") cb3ef7b00042 ("net: sched: sch_netem: Refactor code in 4-state loss generator") bdf1565fe03d ("selftests/tc-testing: match any qdisc type") b43c2793f5e9 ("netfilter: nfnetlink_queue: silence bogus compiler warning") 43332cf97425 ("net/sched: act_ct: Offload only ASSURED connections") 40bd094d65fc ("flow_offload: fill flags to action structure") 144d4c9e800d ("flow_offload: reject to offload tc actions in offload drivers") 5a9959008fb6 ("flow_offload: add index to flow_action_entry structure") 9c1c0e124ca2 ("flow_offload: rename offload functions with offload instead of flow") c54e1d920f04 ("flow_offload: add ops to tc_action_ops for flow action setup") 8cbfe939abe9 ("flow_offload: allow user to offload tc action to net device") 7adc57651211 ("flow_offload: add skip_hw and skip_sw to control if offload the action") bcd64368584b ("flow_offload: rename exts stats update functions with hw") c7a66f8d8a94 ("flow_offload: add process to update action stats from hardware") e8cb5bcf6ed6 ("net: sched: save full flags for tc action") 13926d19a11e ("flow_offload: add reoffload process to update hw_count") c86e0209dc77 ("flow_offload: 
validate flags of filter and actions") eb473bac4a4b ("selftests: tc-testing: add action offload selftest for action and filter") c48c94b0ab75 ("net/sched: use min() macro instead of doing it manually") 963178a06352 ("flow_offload: fix suspicious RCU usage when offloading tc action") 9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx") b702436a51df ("net: openvswitch: Fill act ct extension") 7d18a07897d0 ("sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc") c25af830ab26 ("sch_cake: revise Diffserv docs") 719774377622 ("netfilter: conntrack: convert to refcount_t api") 3fce16493dc1 ("netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook") 285c8a7a5815 ("netfilter: make function op structures const") 6ae7989c9af0 ("netfilter: conntrack: avoid useless indirection during conntrack destruction") 408bdcfce8df ("net: prefer nf_ct_put instead of nf_conntrack_put") fb80445c438c ("net_sched: restore "mpu xxx" handling") 973bf8fdd12f ("net: sched: Clarify error message when qdisc kind is unknown") bb62a765b1b5 ("netfilter: conntrack: make all extensions 8-byte alignned") 5f31edc0676b ("netfilter: conntrack: move extension sizes into core") 1bc91a5ddf3e ("netfilter: conntrack: handle ->destroy hook via nat_ops instead") 1015c3de23ee ("netfilter: conntrack: remove extension register api") 34243b9ec856 ("netfilter: nft_ct: fix use after free when attaching zone template") 429c3be8a5e2 ("sch_htb: Fail on unsupported parameters when offload is requested") 98b608629746 ("net: sched: remove psched_tdiff_bounded()") a459bc9a3a68 ("net: sched: remove qdisc_qlen_cpu()") 04c2a47ffb13 ("net: sched: fix use-after-free in tc_new_tfilter()") 35d39fecbc24 ("net/sched: Enable tc skb ext allocation on chain miss only when needed") 4ddc844eb81d ("net/sched: act_police: more accurate MTU policing") 5891cd5ec46c ("net_sched: add __rcu annotation to netdev->qdisc") 5740d0689096 ("net: sched: limit TC_ACT_REPEAT loops") 2f131de361f6 ("net/sched: act_ct: Fix flow table 
lookup after ct clear or switching zones") ecf4a24cf978 ("net: sched: avoid newline at end of message in NL_SET_ERR_MSG_MOD") b8cd5831c61c ("net: flow_offload: add tc police action parameters") d97b4b105ce7 ("flow_offload: reject offload for all drivers with invalid police parameters") fcb6aa86532c ("act_ct: Support GRE offload") db6140e5e35a ("net/sched: act_ct: Fix flow table lookup failure with no originating ifindex") d922a99b96d0 ("flow_offload: improve extack msg for user when adding invalid filter") ab95465cde23 ("net/sched: add vlan push_eth and pop_eth action to the hardware IR") 054d5575cd6e ("net/sched: fix incorrect vlan_push_eth dest field") bcb74e132a76 ("net/sched: act_ct: fix ref leak when switching zones") 2105f700b53c ("net/sched: flower: fix parsing of ethertype following VLAN header") e65812fd22eb ("net/sched: fix initialization order when updating chain 0 head") e8a64bbaaad1 ("net/sched: taprio: Check if socket flags are valid") 3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()") ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()") 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") 4d42d54a7d6a ("net/sched: act_pedit: sanitize shift argument before usage") 86360030cc51 ("net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()") ``` Signed-off-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Corinna Vinschen <vinschen@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Petr Oros <poros@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-06-21 10:07:08 +02:00
Ivan Vecera	056507f0cb	net: add dev->dev_registered_tracker Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377 commit b2309a71c1f2fc841feb184195b2e46b2e139bf4 Author: Eric Dumazet <edumazet@google.com> Date: Mon Feb 7 10:41:07 2022 -0800 net: add dev->dev_registered_tracker Convert one dev_hold()/dev_put() pair in register_netdevice() and unregister_netdevice_many() to dev_hold_track() and dev_put_track(). This would allow to detect a rogue dev_put() a bit earlier. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-13 18:39:34 +02:00
Ivan Vecera	859ed7a9a3	net: refine dev_put()/dev_hold() debugging Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377 commit 4c6c11ea0f7b00a1894803efe980dfaf3b074886 Author: Eric Dumazet <edumazet@google.com> Date: Fri Feb 4 14:42:37 2022 -0800 net: refine dev_put()/dev_hold() debugging We are still chasing some syzbot reports where we think a rogue dev_put() is called with no corresponding prior dev_hold(). Unfortunately it eats a reference on dev->dev_refcnt taken by innocent dev_hold_track(), meaning that the refcount saturation splat comes too late to be useful. Make sure that 'not tracked' dev_put() and dev_hold() better use CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure: Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free() to be called with a NULL @trackerp parameter, and to use a separate refcount only to detect too many put() even in the following case: dev_hold_track(dev, tracker_1, GFP_ATOMIC); dev_hold(dev); dev_put(dev); dev_put(dev); // Should complain loudly here. dev_put_track(dev, tracker_1); // instead of here Add clarification about netdev_tracker_alloc() role. v2: I replaced the dev_put() in linkwatch_do_dev() with __dev_put() because callers called netdev_tracker_free(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-13 18:39:33 +02:00
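The "rogue dev_put()" scenario in the entry above can be modeled in user space. The following is a minimal sketch, not kernel code: `struct untracked_model`, `um_hold_tracked()`, `um_hold()`, and `um_put()` are hypothetical names, and the separate counter only imitates the idea of a dedicated refcount for holds taken without a tracker. Unlike the kernel, which warns, this model simply refuses the extra put.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct untracked_model {
	int refcnt;      /* overall device reference count */
	int no_tracker;  /* outstanding holds taken without a tracker */
};

/* Models dev_hold_track(): the reference is owned by some tracker. */
void um_hold_tracked(struct untracked_model *dev)
{
	dev->refcnt++;
}

/* Models an untracked dev_hold(): counted separately. */
void um_hold(struct untracked_model *dev)
{
	dev->refcnt++;
	dev->no_tracker++;
}

/*
 * Models an untracked dev_put(). Returns false when there is no matching
 * untracked hold, so the imbalance is reported at the offending put
 * instead of eating a reference that belongs to a tracker.
 */
bool um_put(struct untracked_model *dev)
{
	if (dev->no_tracker == 0)
		return false; /* rogue put: complain here */
	assert(dev->refcnt > 0);
	dev->no_tracker--;
	dev->refcnt--;
	return true;
}
```

Replaying the sequence from the commit message (one tracked hold, one untracked hold, two untracked puts) makes the second `um_put()` fail while the tracked reference stays intact.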
Ivan Vecera	6ce56701da	net: add net device refcount tracker to struct netdev_adjacent Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377 commit f77159a348f2d6078af7fe4933a60229d7c7aae2 Author: Eric Dumazet <edumazet@google.com> Date: Sat Dec 4 20:22:10 2021 -0800 net: add net device refcount tracker to struct netdev_adjacent Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-13 18:38:19 +02:00
Ivan Vecera	f516b70a26	net: add net device refcount tracker infrastructure Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377 Conflicts: - context conflict due to missing commit 5ea2f5ffde392 ("move netdev_boot_setup into Space.c") commit 4d92b95ff2f95f13df9bad0b5a25a9f60e72758d Author: Eric Dumazet <edumazet@google.com> Date: Sat Dec 4 20:21:57 2021 -0800 net: add net device refcount tracker infrastructure Net devices are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point where dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-13 18:36:42 +02:00
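The hold/put pairing described in the entry above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct tracked_dev`, `struct model_tracker`, and the `tm_*` functions are hypothetical, and a plain string stands in for the stack trace that the real infrastructure records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's types. */
struct tracked_dev {
	int refcnt;
};

struct model_tracker {
	bool active;        /* true between the paired hold and put */
	const char *owner;  /* stands in for the saved stack trace */
};

/* Models dev_hold_track(): take a reference and remember who took it. */
void tm_hold_track(struct tracked_dev *dev, struct model_tracker *t,
		   const char *owner)
{
	dev->refcnt++;
	t->active = true;
	t->owner = owner;
}

/* Models dev_put_track(): returns false on a put without a prior hold. */
bool tm_put_track(struct tracked_dev *dev, struct model_tracker *t)
{
	if (!t->active)
		return false; /* too many puts for this tracker */
	assert(dev->refcnt > 0);
	t->active = false;
	dev->refcnt--;
	return true;
}

/* At dismantle time, a still-active tracker names the leaking owner. */
const char *tm_leak_owner(const struct model_tracker *t)
{
	return t->active ? t->owner : NULL;
}
```

Because each owner keeps its own tracker, an imbalance is attributed to a specific owner rather than showing up only as a wrong global count.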
Patrick Talbert	0b353d8be8	Merge: CNB: net: consolidate netif_rx() and make it callable from any context MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/968 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Tested: basic network tests on lo, tun, veth The series consolidates netif_rx() and makes it callable from any context. It is a backport of these upstream series: da54d75bebf4d8 ("Merge branch 'netdev-RT'") 9f9919f73c94ae ("Merge branch 'netif_rx'") 83b7b77af37a89 ("Merge branch 'netif_rx-conversions-part2'") e21af12622c0fb ("Merge branch 'netif_rx-part3'") Omitted-fix: b903117b48681e12fae38e09c874f38c45186dc6 Omitted-fix: e1f9e434617fb28097223d9484de66218bc0b52d Signed-off-by: Petr Oros <poros@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-06-10 09:44:49 +02:00
Ivan Vecera	0cdfbe9c70	net: sched: update default qdisc visibility after Tx queue cnt changes Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410 commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20 Author: Jakub Kicinski <kuba@kernel.org> Date: Mon Sep 13 15:53:30 2021 -0700 net: sched: update default qdisc visibility after Tx queue cnt changes mq / mqprio make the default child qdiscs visible. They only do so for the qdiscs which are within real_num_tx_queues when the device is registered. Depending on order of calls in the driver, or if user space changes config via ethtool -L the number of qdiscs visible under tc qdisc show will differ from the number of queues. This is confusing to users and potentially to system configuration scripts which try to make sure qdiscs have the right parameters. Add a new Qdisc_ops callback and make relevant qdiscs TTRT. Note that this uncovers the "shortcut" created by commit `1f27cde313` ("net: sched: use pfifo_fast for non real queues") The default child qdiscs beyond initial real_num_tx are always pfifo_fast, no matter what the sysfs setting is. Fixing this gets a little tricky because we'd need to keep a reference on whatever the default qdisc was at the time of creation. In practice this is likely a non-issue: the qdiscs likely have to be configured to non-default settings, so whatever user space is doing such configuration can replace the pfifos... now that it will see them. Reported-by: Matthew Massey <matthewmassey@fb.com> Reviewed-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-06 16:29:55 +02:00
Ivan Vecera	bfa8b4c7ce	net: add netif_set_real_num_queues() for device reconfig Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2094002 commit 271e5b7d00aeff7c61fb6c5415d14dbedb783b68 Author: Jakub Kicinski <kuba@kernel.org> Date: Tue Aug 3 06:05:26 2021 -0700 net: add netif_set_real_num_queues() for device reconfig netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues() can fail which breaks drivers trying to implement reconfiguration in a way that can't leave the device half-broken. In other words those functions are incompatible with prepare/commit approach. Luckily setting real number of queues can fail only if the number is increased, meaning that if we order operations correctly we can guarantee ending up with either new config (success), or the old one (on error). Provide a helper implementing such logic so that drivers don't have to duplicate it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-06-06 16:24:58 +02:00
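The ordering argument in the entry above (do the increases first, so any rollback consists only of decreases that cannot fail) can be sketched in user space. This is an illustrative model, not the kernel helper: `struct model_nic`, the `fail_*_increase` test knobs, and the `mq_*` functions are hypothetical names introduced here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a device with RX/TX queue counts. */
struct model_nic {
	unsigned int real_rx, real_tx; /* queues currently in use */
	unsigned int max_rx, max_tx;   /* queues allocated */
	int fail_rx_increase;          /* test knob: RX increase fails */
	int fail_tx_increase;          /* test knob: TX increase fails */
};

/* Per the commit message, only an increase is allowed to fail. */
int mq_set_rx(struct model_nic *n, unsigned int rxq)
{
	if (rxq > n->real_rx && n->fail_rx_increase)
		return -ENOMEM;
	n->real_rx = rxq;
	return 0;
}

int mq_set_tx(struct model_nic *n, unsigned int txq)
{
	if (txq > n->real_tx && n->fail_tx_increase)
		return -ENOMEM;
	n->real_tx = txq;
	return 0;
}

/* Ends with either the new configuration (0) or the old one (error). */
int mq_set_real_num_queues(struct model_nic *n, unsigned int txq,
			   unsigned int rxq)
{
	unsigned int old_rx = n->real_rx;
	int err, undo;

	if (txq < 1 || txq > n->max_tx || rxq < 1 || rxq > n->max_rx)
		return -EINVAL;

	/* Increases first: they are the only steps that can fail. */
	if (rxq > n->real_rx) {
		err = mq_set_rx(n, rxq);
		if (err)
			return err;
	}
	if (txq > n->real_tx) {
		err = mq_set_tx(n, txq);
		if (err) {
			/* Rollback is a decrease or a no-op, so it succeeds. */
			undo = mq_set_rx(n, old_rx);
			assert(undo == 0);
			(void)undo;
			return err;
		}
	}

	/* Decreases last; by assumption they always succeed. */
	if (rxq < n->real_rx) {
		undo = mq_set_rx(n, rxq);
		assert(undo == 0);
	}
	if (txq < n->real_tx) {
		undo = mq_set_tx(n, txq);
		assert(undo == 0);
	}
	(void)undo;
	return 0;
}
```

If the TX increase fails after the RX increase has succeeded, the RX count is restored to its previous value, so a caller never observes a half-applied configuration.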
Petr Oros	6f8d815bcf	net: dev: Use netif_rx(). Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Upstream commit(s): commit ad0a043fc26c17522ede3cc986d559f05ece20f4 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Thu Mar 3 18:15:05 2022 +0100 net: dev: Use netif_rx(). Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2022-06-06 11:54:24 +02:00
Petr Oros	ee3d25c7a3	net: Correct wrong BH disable in hard-interrupt. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Upstream commit(s): commit 167053f8dd0ed60287858448696b4784d7e1d899 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed Feb 16 18:50:46 2022 +0100 net: Correct wrong BH disable in hard-interrupt. I missed the obvious case where netif_rx() is invoked from hard-IRQ context. Disabling bottom halves is only needed in process context. This ensures that the code remains on the current CPU and that the soft-interrupts are processed at local_bh_enable() time. In hard- and soft-interrupt context this is already the case and the soft-interrupts will be processed once the context is left (at irq-exit time). Disable bottom halves if neither hard-interrupts nor soft-interrupts are disabled. Update the kernel-doc, mention that interrupts must be enabled if invoked from process context. Fixes: baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/Yg05duINKBqvnxUc@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Petr Oros <poros@redhat.com>	2022-06-06 11:54:20 +02:00
Petr Oros	32c9187bad	net: dev: Make rps_lock() disable interrupts. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Upstream commit(s): commit e722db8de6e6932267457ace2657a19015f3db4a Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Sat Feb 12 00:38:39 2022 +0100 net: dev: Make rps_lock() disable interrupts. Disabling interrupts and in the RPS case locking input_pkt_queue is split into local_irq_disable() and optional spin_lock(). This breaks on PREEMPT_RT because the spinlock_t typed lock can not be acquired with disabled interrupts. The sections in which the lock is acquired are usually short in the sense that they do not cause long and unbounded latencies. One exception is the skb_flow_limit() invocation which may invoke a BPF program (and may require sleeping locks). By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels. Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens as part of local_bh_disable() on the local CPU. ____napi_schedule() is only invoked if sd is from the local CPU. Replace it with __napi_schedule_irqoff() which already disables interrupts on PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the function to napi_schedule_rps as suggested by Jakub. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2022-06-06 11:25:38 +02:00
Petr Oros	56766d1469	net: dev: Makes sure netif_rx() can be invoked in any context. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Conflicts: - drivers/net/amt.c Unmerged because file missing in rhel Upstream commit(s): commit baebdf48c360080710f80699eea3affbb13d6c65 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Sat Feb 12 00:38:38 2022 +0100 net: dev: Makes sure netif_rx() can be invoked in any context. Dave suggested a while ago (eleven years by now) "Let's make netif_rx() work in all contexts and get rid of netif_rx_ni()". Eric agreed and pointed out that modern devices should use netif_receive_skb() to avoid the overhead. In the meantime someone added another variant, netif_rx_any_context(), which behaves as suggested. netif_rx() must be invoked with disabled bottom halves to ensure that pending softirqs, which were raised within the function, are handled. netif_rx_ni() can be invoked only from process context (bottom halves must be enabled) because the function handles pending softirqs without checking if bottom halves were disabled or not. netif_rx_any_context() invokes one of the former functions by checking in_interrupt(). netif_rx() could be taught to handle both cases (disabled and enabled bottom halves) by simply disabling bottom halves while invoking netif_rx_internal(). The local_bh_enable() invocation will then invoke pending softirqs only if the BH-disable counter drops to zero. Eric is concerned about the overhead of BH-disable+enable especially in regard to the loopback driver. As critical as this driver is, it will receive a shortcut to avoid the additional overhead which is not needed. Add a local_bh_disable() section in netif_rx() to ensure softirqs are handled if needed. Provide __netif_rx() which does not disable BH and has a lockdep assert to ensure that interrupts are disabled. Use this shortcut in the loopback driver and in drivers/net/*.c. 
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they can be removed once there are no more users left. Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2022-06-06 11:25:37 +02:00
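The counter behaviour described in the entry above, where pending softirqs run only when the BH-disable count drops back to zero, can be shown with a toy user-space model. Everything here is hypothetical (`struct model_cpu` and the `bh_*` functions are illustration names); the model also ignores hard-interrupt context, which the real code has to treat separately.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; a toy stand-in for the kernel's counters. */
struct model_cpu {
	int bh_disable_count;  /* nesting depth of BH-disabled sections */
	bool softirq_pending;  /* a softirq was raised and not yet run */
	int softirqs_run;      /* how many times pending work was processed */
};

void bh_model_disable(struct model_cpu *c)
{
	c->bh_disable_count++;
}

/* Pending softirqs run only when the outermost section is left. */
void bh_model_enable(struct model_cpu *c)
{
	assert(c->bh_disable_count > 0);
	if (--c->bh_disable_count == 0 && c->softirq_pending) {
		c->softirq_pending = false;
		c->softirqs_run++;
	}
}

/* Stands in for netif_rx_internal(): queue a packet, raise a softirq. */
void bh_model_rx_internal(struct model_cpu *c)
{
	c->softirq_pending = true;
}

/* Models the reworked netif_rx(): wrap the work in a BH-disabled section. */
void bh_model_netif_rx(struct model_cpu *c)
{
	bh_model_disable(c);
	bh_model_rx_internal(c);
	bh_model_enable(c);
}
```

Called with bottom halves enabled, the model processes the raised softirq before returning; called inside an outer BH-disabled section, the work stays pending until that outer section ends.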
Petr Oros	c15df5c592	net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal(). Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703 Upstream commit(s): commit f234ae2947612825686b25cae3e9579188a6ba95 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Sat Feb 12 00:38:37 2022 +0100 net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal(). The preempt_disable() section was introduced in commit `cece1945bf` ("net: disable preemption before call smp_processor_id()") in case this function is invoked from preemptible context, and get_cpu() was added later on. The get_cpu() usage was added in commit `b0e28f1eff` ("net: netif_rx() must disable preemption") because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption causing a warning in smp_processor_id(). The function netif_rx() should only be invoked from an interrupt context which implies disabled preemption. The commit `e30b38c298` ("ip: Fix ip_dev_loopback_xmit()") was addressing this and replaced netif_rx() with netif_rx_ni() in ip_dev_loopback_xmit(). Based on the discussion on the list, the former patch (`b0e28f1eff`) should not have been applied, only the latter (`e30b38c298`). Remove get_cpu() and preempt_disable() since the function is supposed to be invoked from context with stable per-CPU pointers. Bottom halves have to be disabled at this point because the function may raise softirqs which need to be processed. Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2022-06-06 11:25:37 +02:00
Patrick Talbert	8c5b3f7fd9	Merge: XDP and networking eBPF rebase to v5.15 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 Depends: !572 Tested: Using bpf selftests, everything passes. This rebases XDP and networking eBPF to upstream kernel version 5.15. Signed-off-by: Jiri Benc <jbenc@redhat.com> Approved-by: Hangbin Liu <haliu@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Toke Høiland-Jørgensen <toke@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-06-03 09:26:25 +02:00
Patrick Talbert	092af648a0	Merge: bpf: update to v5.15 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/572 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041365 Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Approved-by: Rado Vrbovsky <rvrbovsk@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com> Approved-by: Artem Savkov <asavkov@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>	2022-05-26 09:27:25 +02:00
Jiri Benc	7e6f15045c	net: in_irq() cleanup Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 commit afa79d08c6c8e1901cb1547591e3ccd3ec6965d9 Author: Changbin Du <changbin.du@intel.com> Date: Fri Aug 13 22:57:49 2021 +0800 net: in_irq() cleanup Replace the obsolete and ambiguous macro in_irq() with new macro in_hardirq(). Signed-off-by: Changbin Du <changbin.du@gmail.com> Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-05-12 17:29:49 +02:00
Jiri Benc	c773bf00b4	net, core: Allow netdev_lower_get_next_private_rcu in bh context Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 commit 689186699931313c7a42462602bd5c03eef77f9f Author: Jussi Maki <joamaki@gmail.com> Date: Sat Jul 31 05:57:36 2021 +0000 net, core: Allow netdev_lower_get_next_private_rcu in bh context For the XDP bonding slave lookup to work in the NAPI poll context in which the redundant rcu_read_lock() has been removed we have to follow the same approach as in `694cea395f` ("bpf: Allow RCU-protected lookups to happen from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held(). Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-05-12 17:29:48 +02:00
Jiri Benc	88b4e5f8ea	net, core: Add support for XDP redirection to slave device Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618 Conflicts: - Using lower case __bpf_prog_run in bpf_prog_run_xdp due to out of order backport of fb7dd8bca013 ("bpf: Refactor BPF_PROG_RUN into a function") commit 879af96ffd72706c6e3278ea6b45b0b0e37ec5d7 Author: Jussi Maki <joamaki@gmail.com> Date: Sat Jul 31 05:57:33 2021 +0000 net, core: Add support for XDP redirection to slave device This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX into XDP_REDIRECT after BPF program run when the ingress device is a bond slave. The dev_xdp_prog_count is exposed so that slave devices can be checked for loaded XDP programs in order to avoid the situation where both bond master and slave have programs loaded according to xdp_state. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com Signed-off-by: Jiri Benc <jbenc@redhat.com>	2022-05-12 17:29:47 +02:00
Hangbin Liu	b2ce8f1b0b	net: initialize init_net earlier Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 9c1be1935fb6 Conflicts: context conflicts due to missing commit 41467d2ff4df ("net: net_namespace: Optimize the code") commit 9c1be1935fb68b2413796cdc03d019b8cf35ab51 Author: Eric Dumazet <edumazet@google.com> Date: Sat Feb 5 09:01:25 2022 -0800 net: initialize init_net earlier While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized. This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS); This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net) We had too many prior patches [1] trying to work around the root cause. Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time. Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init() As a bonus, this patch saves around 4KB in .data section. [1] `f8c46cb390` ("netns: do not call pernet ops for not yet set up init_net namespace") `b5082df801` ("net: Initialise init_net.count to 1") `734b65417b` ("net: Statically initialize init_net.dev_base_head") v2: fixed a build error reported by kernel build bots (CONFIG_NET=n) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:57 +08:00
Hangbin Liu	970a02e10a	net: gro: avoid re-computing truesize twice on recycle Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 7881453e4adf Conflicts: there is no net/core/gro.c due to missing commit 587652bbdd06 ("net: gro: populate net/core/gro.c") commit 7881453e4adf497cf9109c84fa21eedda9ac6164 Author: Paolo Abeni <pabeni@redhat.com> Date: Fri Feb 4 12:28:36 2022 +0100 net: gro: avoid re-computing truesize twice on recycle After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") and commit af352460b465 ("net: fix GRO skb truesize update") the truesize of the skb with stolen head is properly updated by the GRO engine, we don't need anymore resetting it at recycle time. v1 -> v2: - clarify the commit message (Alexander) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:41 +08:00
Hangbin Liu	9ef759e929	net: annotate data-races on txq->xmit_lock_owner Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 7a10d8c810cf commit 7a10d8c810cfad3e79372d7d1c77899d86cd6662 Author: Eric Dumazet <edumazet@google.com> Date: Tue Nov 30 09:01:55 2021 -0800 net: annotate data-races on txq->xmit_lock_owner syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 
kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 
kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:41 +08:00
Hangbin Liu	1928aa8364	net: multicast: calculate csum of looped-back and forwarded packets Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920 Upstream Status: net.git commit 9122a70a6333 commit 9122a70a6333705c0c35614ddc51c274ed1d3637 Author: Cyril Strejc <cyril.strejc@skoda.cz> Date: Sun Oct 24 22:14:25 2021 +0200 net: multicast: calculate csum of looped-back and forwarded packets While testing a user-space application that transmits UDP multicast datagrams and uses multicast routing to send them out of defined network interfaces, I found that a multicast router does not fill in the UDP checksum of locally produced, looped-back and forwarded UDP datagrams if the original output NIC the datagrams are sent to has UDP TX checksum offload enabled. The datagrams are sent malformed out of the NIC the datagrams have been forwarded to. It is because: 1. If TX checksum offload is enabled on the output NIC, the UDP checksum is not calculated by the kernel and is not filled into skb data. 2. dev_loopback_xmit(), which is called solely by ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY unconditionally. 3. Since `35fc92a9` ("[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE"), the ip_summed value is preserved during forwarding. 4. If ip_summed != CHECKSUM_PARTIAL, the checksum is not calculated during packet egress. The minimum fix in dev_loopback_xmit(): 1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the case when the original output NIC has TX checksum offload enabled. The effects are: a) If the forwarding destination interface supports TX checksum offloading, the NIC driver is responsible for filling in the checksum. b) If the forwarding destination interface does NOT support TX checksum offloading, checksums are filled in by the kernel before the skb is submitted to the NIC driver. c) For local delivery, checksum validation is skipped as in the case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary(). 2. 
Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It means, for CHECKSUM_NONE, the behavior is unmodified and is there to skip a looped-back packet local delivery checksum validation. Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Hangbin Liu <haliu@redhat.com>	2022-05-05 12:26:41 +08:00
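The checksum-state handling described in the message above can be modeled in user space. This is a minimal sketch, not the kernel code: the enum, its values, and both helper names are hypothetical stand-ins, and passing states other than CHECKSUM_NONE through unchanged is an assumption based on the two cases the message describes.

```c
#include <assert.h>

/* User-space stand-ins for the kernel's skb->ip_summed states.
 * The names and numeric values are illustrative only. */
enum ip_summed_model {
	CSUM_NONE,        /* models CHECKSUM_NONE */
	CSUM_UNNECESSARY, /* models CHECKSUM_UNNECESSARY */
	CSUM_COMPLETE,    /* models CHECKSUM_COMPLETE */
	CSUM_PARTIAL      /* models CHECKSUM_PARTIAL */
};

/* Hypothetical helper: the state a looped-back packet carries after the fix.
 * NONE is translated to UNNECESSARY so local delivery skips validation;
 * PARTIAL is preserved so a later egress still fills in the checksum.
 * Other states pass through unchanged (an assumption of this model). */
static enum ip_summed_model loopback_ip_summed(enum ip_summed_model in)
{
	assert(in <= CSUM_PARTIAL);
	if (in == CSUM_NONE)
		return CSUM_UNNECESSARY;
	return in;
}

/* Hypothetical helper for point 4 of the message: the checksum is only
 * computed on egress when the state is PARTIAL. */
static int egress_fills_checksum(enum ip_summed_model state)
{
	return state == CSUM_PARTIAL;
}
```

In this model, a packet that left the stack with offload pending (PARTIAL) still gets its checksum filled in when forwarded, which is the behavior the unconditional overwrite used to break.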
Jerome Marchand	850123ac6a	bpf: devmap: Implement devmap prog execution for generic XDP Bugzilla: http://bugzilla.redhat.com/2041365 commit 2ea5eabaf04a1829383aefe98ac38a2e5ae2d698 Author: Kumar Kartikeya Dwivedi <memxor@gmail.com> Date: Fri Jul 2 16:48:24 2021 +0530 bpf: devmap: Implement devmap prog execution for generic XDP This lifts the restriction on running devmap BPF progs in generic redirect mode. To match native XDP behavior, it is invoked right before generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/ XDP_DROP actions. We also return 0 even if devmap program drops the packet, as semantically redirect has already succeeded and the devmap prog is the last point before TX of the packet to device where it can deliver a verdict on the packet. This also means it must take care of freeing the skb, as xdp_do_generic_redirect callers only do that in case an error is returned. Since devmap entry prog is supported, remove the check in generic_xdp_install entirely. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2022-04-29 18:14:30 +02:00
Jerome Marchand	01fb58edc6	bpf: cpumap: Implement generic cpumap Bugzilla: http://bugzilla.redhat.com/2041365 commit 11941f8a85362f612df61f4aaab0e41b64d2111d Author: Kumar Kartikeya Dwivedi <memxor@gmail.com> Date: Fri Jul 2 16:48:23 2021 +0530 bpf: cpumap: Implement generic cpumap This change implements CPUMAP redirect support for generic XDP programs. The idea is to reuse the cpu map entry's queue that is used to push native xdp frames for redirecting skb to a different CPU. This will match native XDP behavior (in that RPS is invoked again for packet reinjected into networking stack). To be able to determine whether the incoming skb is from the driver or cpumap, we reuse skb->redirected bit that skips generic XDP processing when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on it has been lifted and it is always available. From the redirect side, we add the skb to ptr_ring with its lowest bit set to 1. This should be safe as skb is not 1-byte aligned. This allows kthread to discern between xdp_frames and sk_buff. On consumption of the ptr_ring item, the lowest bit is unset. In the end, the skb is simply added to the list that kthread is anyway going to maintain for xdp_frames converted to skb, and then received again by using netif_receive_skb_list. Bulking optimization for generic cpumap is left as an exercise for a future patch for now. Since cpumap entry progs are now supported, also remove check in generic_xdp_install for the cpumap. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2022-04-29 18:14:30 +02:00
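The low-bit tagging that lets the cpumap kthread tell an sk_buff from an xdp_frame in the same ptr_ring can be illustrated with plain pointers. This is a user-space sketch under stated assumptions: `SKB_TAG`, `tag_skb()`, `is_tagged_skb()` and `untag()` are hypothetical names, not the kernel's helpers, and the object is assumed to be at least 2-byte aligned so bit 0 is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of an aligned pointer is always zero, so it can carry a flag. */
#define SKB_TAG ((uintptr_t)1)

/* Producer side: mark the queued item as an skb by setting the lowest bit. */
static void *tag_skb(void *skb)
{
	assert(((uintptr_t)skb & SKB_TAG) == 0); /* requires an aligned pointer */
	return (void *)((uintptr_t)skb | SKB_TAG);
}

/* Consumer side: check the flag to decide which type the item is. */
static bool is_tagged_skb(const void *item)
{
	return ((uintptr_t)item & SKB_TAG) != 0;
}

/* Consumer side: clear the flag to recover the original pointer. */
static void *untag(void *item)
{
	return (void *)((uintptr_t)item & ~SKB_TAG);
}
```

The design choice is that no extra per-item metadata is needed: the queue still stores one pointer-sized word per entry, and the consumer pays only a mask and a branch.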
Jerome Marchand	9d29c832f5	net: core: Split out code to run generic XDP prog Bugzilla: http://bugzilla.redhat.com/2041365 commit fe21cb91ae7bca1ae7805454be80b6d03bec85f7 Author: Kumar Kartikeya Dwivedi <memxor@gmail.com> Date: Fri Jul 2 16:48:21 2021 +0530 net: core: Split out code to run generic XDP prog This helper can later be utilized in code that runs cpumap and devmap programs in generic redirect mode and adjust skb based on changes made to xdp_buff. When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a generic redirect path invokes devmap/cpumap prog if set, it must __skb_pull again as we expect mac header to be pulled. It also drops the skb_reset_mac_len call after do_xdp_generic, as the mac_header and network_header are advanced by the same offset, so the difference (mac_len) remains constant. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com>	2022-04-29 18:14:30 +02:00
Ivan Vecera	85520fc44a	net: annotate accesses to dev->gso_max_segs Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073465 Conflicts: - small context conflicts in octeontx2 driver commit 6d872df3e3b91532b142de9044e5b4984017a55f Author: Eric Dumazet <edumazet@google.com> Date: Fri Nov 19 07:43:32 2021 -0800 net: annotate accesses to dev->gso_max_segs dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ivan Vecera <ivecera@redhat.com>	2022-04-08 16:46:08 +02:00
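The READ_ONCE()/WRITE_ONCE() pairing mentioned above can be sketched in user space. The macros below are simplified volatile-cast versions written for this example (the kernel's definitions are more elaborate), `struct net_device_model` and both helpers are hypothetical, and `__typeof__` assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros: force exactly one access
 * through a volatile-qualified lvalue so the compiler cannot tear,
 * fuse, or repeat it. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical model of the one field this commit annotates. */
struct net_device_model {
	unsigned int gso_max_segs;
};

/* Writer side: models a setter helper, called with the writer-side
 * protection (RTNL in the commit message) already held. */
static void model_set_gso_max_segs(struct net_device_model *dev,
				   unsigned int segs)
{
	assert(dev);
	WRITE_ONCE(dev->gso_max_segs, segs);
}

/* Reader side: lockless read paired with the annotated write. */
static unsigned int model_get_gso_max_segs(const struct net_device_model *dev)
{
	assert(dev);
	return READ_ONCE(dev->gso_max_segs);
}
```

Funneling every write through one setter, as the commit does with its helper, keeps the annotation in a single place and documents that readers may run concurrently.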
Herton R. Krzesinski	90182f8b73	Merge: ovs: backports P2 for 9.0 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/431 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048 Tested: Sanity only A bit large for a P2 backport; but those patches are needed and were requested by members of the OVS team. Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-02-15 22:45:52 +00:00
Herton R. Krzesinski	4f893751ba	Merge: net: introduce kfree_skb_reason MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/405 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931 Tested: Instructions in bz Signed-off-by: Antoine Tenart <atenart@redhat.com> Approved-by: Florian Westphal <fwestpha@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-26 22:28:46 +00:00
Herton R. Krzesinski	adc4082e23	Merge: CNB: net: Remove redundant if statements MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/328 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315 Series moving dev NULL check into dev_put()/dev_hold() Signed-off-by: Petr Oros <poros@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Andrea Claudi <aclaudi@redhat.com> Approved-by: Corinna Vinschen <vinschen@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-26 22:11:25 +00:00
Antoine Tenart	b5e24650b7	net/sched: Extend qdisc control block with tc control block Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048 Upstream Status: linux.git Tested: Sanity only commit ec624fe740b416fb68d536b37fb8eef46f90b5c2 Author: Paul Blakey <paulb@nvidia.com> Date: Tue Dec 14 19:24:33 2021 +0200 net/sched: Extend qdisc control block with tc control block BPF layer extends the qdisc control block via struct bpf_skb_data_end and because of that there is no more room to add variables to the qdisc layer control block without going over the skb->cb size. Extend the qdisc control block with a tc control block, and move all tc related variables to there as a pre-step for extending the tc control block with additional members. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-01-26 16:54:01 +01:00
Antoine Tenart	4a0269b225	net: skb: introduce kfree_skb_reason() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931 Upstream Status: linux.git Tested: Instructions in bz commit c504e5c2f9648a1e5c2be01e8c3f59d394192bd3 Author: Menglong Dong <imagedong@tencent.com> Date: Sun Jan 9 14:36:26 2022 +0800 net: skb: introduce kfree_skb_reason() Introduce the interface kfree_skb_reason(), which is able to pass the reason why the skb is dropped to the 'kfree_skb' tracepoint. Add the 'reason' field to 'trace_kfree_skb', so that users can get more detailed information about abnormal skbs with 'drop_monitor' or eBPF. All drop reasons are defined in the enum 'skb_drop_reason', and they are printed as strings in the 'kfree_skb' tracepoint in the format 'reason: XXX'. ( Maybe the reasons should be defined in a uapi header file, so that user space can use them? ) Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Antoine Tenart <atenart@redhat.com>	2022-01-21 10:05:00 +01:00
Herton R. Krzesinski	b8f20958b7	Merge: net: core stable backport for rhel 9.0 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/212 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 This includes a few critical bugfixes for the core network stack. Notably it includes 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly") and a whole series of pre-requisites. The bug addressed there is nasty and present even prior to skb_expand_head() introduction. commit 719c57197010 ("net: make napi_disable() symmetric with enable") instead has been explicitly excluded, as it's not really a fix, is known to introduce problems and it's still quite new Signed-off-by: Paolo Abeni <pabeni@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Guillaume Nault <gnault@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-14 16:53:21 +00:00
Herton R. Krzesinski	911d813798	Merge: net/sched: 9.0 P1 backports from upstream MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/197 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552 Upstream Status: all mainline in net.git Conflicts: None Tested: boot-tested only Signed-off-by: Davide Caratti <dcaratti@redhat.com> Approved-by: Kamal Heib <kheib@redhat.com> Approved-by: Ivan Vecera <ivecera@redhat.com> Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Approved-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-12 15:43:04 +00:00
Petr Oros	ea6b084bc4	net: Remove redundant if statements Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315 Upstream commit(s): commit 1160dfa178eb848327e9dec39960a735f4dc1685 Author: Yajun Deng <yajun.deng@linux.dev> Date: Thu Aug 5 19:55:27 2021 +0800 net: Remove redundant if statements The 'if (dev)' check has already moved into dev_{put,hold}(), so remove the redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Petr Oros <poros@redhat.com>	2022-01-10 16:20:08 +01:00
Herton R. Krzesinski	adc818bf26	Merge: Replace deprecated CPU-hotplug functions for kernel-rt MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/134 Bugzilla: http://bugzilla.redhat.com/2023079 Depends: https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/99 The kernel-rt variant requires these changes in order to make future changes to the RHEL9 kernel. These changes were found by code inspection and affect not only kernel-rt but the regular kernel variants as well. Signed-off-by: Prarit Bhargava <prarit@redhat.com> RH-Acked-by: Rafael Aquini <aquini@redhat.com> RH-Acked-by: John W. Linville <linville@redhat.com> RH-Acked-by: David Arcari <darcari@redhat.com> RH-Acked-by: Vladis Dronov <vdronov@redhat.com> RH-Acked-by: Jiri Benc <jbenc@redhat.com> RH-Acked-by: Jarod Wilson <jarod@redhat.com> RH-Acked-by: Waiman Long <longman@redhat.com> RH-Acked-by: Phil Auld <pauld@redhat.com> RH-Acked-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2022-01-10 11:46:27 -03:00
Paolo Abeni	d27bdebcab	sk_buff: avoid potentially clearing 'slow_gro' field Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit a432934a30679c0e3c47b87f13e4901bc1a3fc03 Author: Paolo Abeni <pabeni@redhat.com> Date: Fri Jul 30 18:30:53 2021 +0200 sk_buff: avoid potentially clearing 'slow_gro' field If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro' field is cleared, too. That could lead to wrong behavior if the skb later enters the GRO stage. Fix the potential issue by preserving a non-zero value of the 'slow_gro' field. Additionally, fix a comment typo. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:58:21 +01:00
Paolo Abeni	2bea014388	skbuff: allow 'slow_gro' for skb carring sock reference Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit 5e10da5385d20c4bae587bc2921e5fdd9655d5fc Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Jul 28 18:24:03 2021 +0200 skbuff: allow 'slow_gro' for skb carring sock reference This change leverages the infrastructure introduced by the previous patches to allow soft devices passing to the GRO engine owned skbs without impacting the fast-path. It's up to the GRO caller ensuring the slow_gro bit validity before invoking the GRO engine. The new helper skb_prepare_for_gro() is introduced for that goal. On slow_gro, skbs are aggregated only with equal sk. Additionally, skb truesize on GRO recycle and free is correctly updated so that sk wmem is not changed by the GRO processing. rfc-> v1: - fixed bad truesize on dev_gro_receive NAPI_FREE - use the existing state bit Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:57:52 +01:00
Paolo Abeni	9ce6ef4e71	net: optimize GRO for the common case. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927 Tested: LNST, Tier1 Upstream commit: commit 9efb4b5baf6ce851b247288992b0632cb4d31c17 Author: Paolo Abeni <pabeni@redhat.com> Date: Wed Jul 28 18:24:02 2021 +0200 net: optimize GRO for the common case. After the previous patches, at GRO time, skb->slow_gro is usually 0, unless the packet comes from some H/W offload slowpath or tunnel. We can optimize the GRO code assuming !skb->slow_gro is likely. This removes multiple conditionals in the most common path, at the price of an additional one when we hit the above "slow-paths". Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 18:57:26 +01:00
Prarit Bhargava	286c7df21b	net: Replace deprecated CPU-hotplug functions. Bugzilla: http://bugzilla.redhat.com/2023079 commit 372bbdd5bb3fc454d9c280dc0914486a3c7419d5 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Tue Aug 3 16:16:06 2021 +0200 net: Replace deprecated CPU-hotplug functions. The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Prarit Bhargava <prarit@redhat.com>	2021-12-09 09:04:08 -05:00
Davide Caratti	bee2c235ef	net/sched: store the last executed chain also for clsact egress Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552 Upstream Status: net-next.git commit 3aa260559455 commit 3aa2605594556c676fb88744bd9845acae60683d Author: Davide Caratti <dcaratti@redhat.com> Date: Wed Jul 28 20:08:00 2021 +0200 net/sched: store the last executed chain also for clsact egress Currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain id' in the skb extension. However, userspace programs (like ovs) are able to set up egress rules, and the datapath gets confused in case it doesn't find the 'chain id' for a packet that's "recirculated" by tc. Change tcf_classify() to have the same semantics as tcf_classify_ingress() so that a single function can be called in ingress / egress, using the tc ingress / egress block respectively. Suggested-by: Alaa Hleilel <alaa@nvidia.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Davide Caratti <dcaratti@redhat.com>	2021-12-09 12:01:45 +01:00
Paolo Abeni	96d14cbcf2	net: Prevent infinite while loop in skb_tx_hash() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 Upstream commit: commit 0c57eeecc559ca6bc18b8c4e2808bc78dbe769b0 Author: Michael Chan <michael.chan@broadcom.com> Date: Mon Oct 25 05:05:28 2021 -0400 net: Prevent infinite while loop in skb_tx_hash() Drivers call netdev_set_num_tc() and then netdev_set_tc_queue() to set the queue count and offset for each TC. So the queue count and offset for the TCs may be zero for a short period after dev->num_tc has been set. If a TX packet is being transmitted at this time in the code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see nonzero dev->num_tc but zero qcount for the TC. The while loop that keeps looping while hash >= qcount will not end. Fix it by checking the TC's qcount to be nonzero before using it. Fixes: `eadec877ce` ("net: Add support for subordinate traffic classes to netdev_pick_tx") Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 10:44:31 +01:00
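The non-terminating loop described above is easy to reproduce in a user-space model: subtracting a zero `qcount` never brings `hash` below it. The sketch below uses a hypothetical `pick_queue()` with plain integers in place of the device and skb; falling back to the full range of real TX queues when the TC's count is zero is an assumption of this model, since the message only says the count is checked for being nonzero.

```c
#include <assert.h>

/* Hypothetical model of the queue pick: reduce `hash` into the traffic
 * class's [qoffset, qoffset + qcount) range. */
static unsigned int pick_queue(unsigned int hash,
			       unsigned int qcount, unsigned int qoffset,
			       unsigned int real_num_tx_queues)
{
	assert(real_num_tx_queues > 0);

	/* The guard: a TC whose count has not been configured yet would
	 * make the loop below spin forever, so fall back to all queues
	 * (fallback choice is an assumption of this sketch). */
	if (qcount == 0) {
		qoffset = 0;
		qcount = real_num_tx_queues;
	}

	/* Without the guard, qcount == 0 means hash >= qcount always holds
	 * and hash -= 0 makes no progress. */
	while (hash >= qcount)
		hash -= qcount;

	return hash + qoffset;
}
```

For example, `pick_queue(7, 4, 8, 16)` reduces 7 to 3 and returns queue 11, while `pick_queue(7, 0, 0, 4)` takes the fallback and returns queue 3 instead of looping.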
Paolo Abeni	a1950c1dcf	napi: fix race inside napi_enable Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276 Tested: LNST, Tier1 Upstream commit: commit 3765996e4f0b8a755cab215a08df744490c76052 Author: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Date: Sat Sep 18 16:52:32 2021 +0800 napi: fix race inside napi_enable The process will cause napi.state to contain NAPI_STATE_SCHED and not in the poll_list, which will cause napi_disable() to get stuck. The prefix "NAPI_STATE_" is removed in the figure below, and NAPI_STATE_HASHED is ignored in napi.state. CPU0 \| CPU1 \| napi.state =============================================================================== napi_disable() \| \| SCHED \| NPSVC napi_enable() \| \| { \| \| smp_mb__before_atomic(); \| \| clear_bit(SCHED, &n->state); \| \| NPSVC \| napi_schedule_prep() \| SCHED \| NPSVC \| napi_poll() \| \| napi_complete_done() \| \| { \| \| if (n->state & (NPSVC \| \| (1) \| _BUSY_POLL))) \| \| return false; \| \| ................ \| \| } \| SCHED \| NPSVC \| \| clear_bit(NPSVC, &n->state); \| \| SCHED } \| \| \| \| napi_schedule_prep() \| \| SCHED \| MISSED (2) (1) Here return direct. Because of NAPI_STATE_NPSVC exists. (2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list Since NAPI_STATE_SCHED already exists and napi is not in the sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always exist. 1. This will cause this queue to no longer receive packets. 2. If you encounter napi_disable under the protection of rtnl_lock, it will cause the entire rtnl_lock to be locked, affecting the overall system. This patch uses cmpxchg to implement napi_enable(), which ensures that there will be no race due to the separation of clear two bits. Fixes: `2d8bff1269` ("netpoll: Close race condition between poll_one_napi and napi_disable") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2021-12-09 10:44:31 +01:00
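The race above comes from clearing two state bits in two separate steps, which leaves a window where another CPU can observe and act on the intermediate state. A compare-and-exchange loop clears both in one atomic transition. This is a user-space sketch using C11 atomics rather than the kernel's cmpxchg(); the `ST_*` bit values and `enable_model()` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit assignments; not the kernel's NAPI_STATE_* values. */
#define ST_SCHED (1UL << 0)
#define ST_NPSVC (1UL << 1)
#define ST_OTHER (1UL << 2) /* any unrelated bit that must survive */

/* Clear SCHED and NPSVC together. If another thread changes the word
 * between our load and the exchange, the exchange fails, `old` is
 * refreshed with the current value, and the loop retries. No observer
 * can see a state with only one of the two bits cleared. */
static void enable_model(atomic_ulong *state)
{
	unsigned long old = atomic_load(state);
	unsigned long next;

	do {
		assert(old & ST_SCHED); /* precondition: currently disabled */
		next = old & ~(ST_SCHED | ST_NPSVC);
	} while (!atomic_compare_exchange_weak(state, &old, next));
}
```

Compared with two independent bit clears, the retry loop costs an extra iteration only under contention, and it removes the intermediate state that let the scheduling path set SCHED without the instance ever reaching the poll list.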
David S. Miller	20192d9c9f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Andrii Nakryiko says: ==================== pull-request: bpf 2021-07-15 The following pull-request contains BPF updates for your net tree. We've added 9 non-merge commits during the last 5 day(s) which contain a total of 9 files changed, 37 insertions(+), 15 deletions(-). The main changes are: 1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and BPF_XDP_CPUMAP programs, from Xuan Zhuo. 2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo. 3) Follow-up fix to subprog poke descriptor use-after-free problem, from Daniel Borkmann and John Fastabend. 4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King. 5) Fix memory leak in BPF sockmap, from John Fastabend. 6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend and Jakub Sitnicki. 7) Fix NULL pointer dereference in bpftool, from Tobias Klauser. 8) AF_XDP documentation fixes, from Baruch Siach. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-15 14:39:45 -07:00
Qitao Xu	70713dddf3	net_sched: introduce tracepoint trace_qdisc_enqueue() Tracepoint trace_qdisc_enqueue() is introduced to trace skb at the entrance of TC layer on TX side. This is similar to trace_qdisc_dequeue(): 1. For both we only trace successful cases. The failure cases can be traced via trace_kfree_skb(). 2. They are called at entrance or exit of TC layer, not for each ->enqueue() or ->dequeue(). This is intentional, because we want to make trace_qdisc_enqueue() symmetric to trace_qdisc_dequeue(), which is easier to use. The return value of qdisc_enqueue() is not interesting here, we have Qdisc's drop packets in ->dequeue(), it is impossible to trace them even if we have the return value, the only way to trace them is tracing kfree_skb(). We only add information we need to trace ring buffer. If any other information is needed, it is easy to extend it without breaking ABI, see commit `3dd344ea84` ("net: tracepoint: exposing sk_family in all tcp:tracepoints"). Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Qitao Xu <qitao.xu@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-15 10:32:38 -07:00
Xuan Zhuo	5acc7d3e8d	xdp, net: Fix use-after-free in bpf_xdp_link_release The problem occurs between dev_get_by_index() and dev_xdp_attach_link(). At this point, dev_xdp_uninstall() is called. Then xdp link will not be detached automatically when dev is released. But link->dev already points to dev, when xdp link is released, dev will still be accessed, but dev has been released. dev_get_by_index() \| link->dev = dev \| \| rtnl_lock() \| unregister_netdevice_many() \| dev_xdp_uninstall() \| rtnl_unlock() rtnl_lock(); \| dev_xdp_attach_link() \| rtnl_unlock(); \| \| netdev_run_todo() // dev released bpf_xdp_link_release() \| /* access dev. \| use-after-free */ \| [ 45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0 [ 45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732 [ 45.968297] [ 45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22 [ 45.969222] Hardware name: linux,dummy-virt (DT) [ 45.969795] Call trace: [ 45.970106] dump_backtrace+0x0/0x4c8 [ 45.970564] show_stack+0x30/0x40 [ 45.970981] dump_stack_lvl+0x120/0x18c [ 45.971470] print_address_description.constprop.0+0x74/0x30c [ 45.972182] kasan_report+0x1e8/0x200 [ 45.972659] __asan_report_load8_noabort+0x2c/0x50 [ 45.973273] bpf_xdp_link_release+0x3b8/0x3d0 [ 45.973834] bpf_link_free+0xd0/0x188 [ 45.974315] bpf_link_put+0x1d0/0x218 [ 45.974790] bpf_link_release+0x3c/0x58 [ 45.975291] __fput+0x20c/0x7e8 [ 45.975706] ____fput+0x24/0x30 [ 45.976117] task_work_run+0x104/0x258 [ 45.976609] do_notify_resume+0x894/0xaf8 [ 45.977121] work_pending+0xc/0x328 [ 45.977575] [ 45.977775] The buggy address belongs to the page: [ 45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998 [ 45.979522] flags: 0x7fffe0000000000(node=0\|zone=0\|lastcpupid=0x3ffff) [ 45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000 [ 45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 
45.982259] page dumped because: kasan: bad access detected [ 45.982948] [ 45.983153] Memory state around the buggy address: [ 45.983753] ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 45.984645] ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.986419] ^ [ 45.987112] ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.988006] ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 45.988895] ================================================================== [ 45.989773] Disabling lock debugging due to kernel taint [ 45.990552] Kernel panic - not syncing: panic_on_warn set ... [ 45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G B 5.13.0+ #22 [ 45.991929] Hardware name: linux,dummy-virt (DT) [ 45.992448] Call trace: [ 45.992753] dump_backtrace+0x0/0x4c8 [ 45.993208] show_stack+0x30/0x40 [ 45.993627] dump_stack_lvl+0x120/0x18c [ 45.994113] dump_stack+0x1c/0x34 [ 45.994530] panic+0x3a4/0x7d8 [ 45.994930] end_report+0x194/0x198 [ 45.995380] kasan_report+0x134/0x200 [ 45.995850] __asan_report_load8_noabort+0x2c/0x50 [ 45.996453] bpf_xdp_link_release+0x3b8/0x3d0 [ 45.997007] bpf_link_free+0xd0/0x188 [ 45.997474] bpf_link_put+0x1d0/0x218 [ 45.997942] bpf_link_release+0x3c/0x58 [ 45.998429] __fput+0x20c/0x7e8 [ 45.998833] ____fput+0x24/0x30 [ 45.999247] task_work_run+0x104/0x258 [ 45.999731] do_notify_resume+0x894/0xaf8 [ 46.000236] work_pending+0xc/0x328 [ 46.000697] SMP: stopping secondary CPUs [ 46.001226] Dumping ftrace buffer: [ 46.001663] (ftrace buffer empty) [ 46.002110] Kernel Offset: disabled [ 46.002545] CPU features: 0x00000001,23202c00 [ 46.003080] Memory Limit: none Fixes: `aa8d3a716b` ("bpf, xdp: Add bpf_link-based XDP attachment API") Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Dust Li 
<dust.li@linux.alibaba.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com	2021-07-13 08:22:31 -07:00
Antoine Tenart	28b34f01a7	net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache Some socket buffers allocated in the fclone cache (in __alloc_skb) can end up in the following path[1]: napi_skb_finish __kfree_skb_defer napi_skb_cache_put The issue is napi_skb_cache_put is not fclone friendly and will put those skbuff in the skb cache to be reused later, although this cache only expects skbuff allocated from skbuff_head_cache. When this happens the skbuff is eventually freed using the wrong origin cache, and we can see traces similar to: [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0 [ 1223.950211] Modules linked in: [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474 [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014 [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0 Leading sometimes to other memory related issues. Fix this by using __kfree_skb for fclone skbuff, similar to what is done in the other place where __kfree_skb_defer is called. [1] At least in setups using veth pairs and tunnels. Building a kernel with KASAN we can for example see packets allocated in sk_stream_alloc_skb hit the above path and later the issue arises when the skbuff is reused. Fixes: `9243adfc31` ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing") Cc: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-09 11:26:27 -07:00
Florian Fainelli	9615fe36b3	skbuff: Fix build with SKB extensions disabled We will fail to build with CONFIG_SKB_EXTENSIONS disabled after `8550ff8d8c` ("skbuff: Release nfct refcount on napi stolen or re-used skbs") since there is an unconditional use of skb_ext_find() without an appropriate stub. Simply build the code conditionally and properly guard against both CONFIG_SKB_EXTENSIONS as well as CONFIG_NET_TC_SKB_EXT being disabled. Fixes: `8550ff8d8c` ("skbuff: Release nfct refcount on napi stolen or re-used skbs") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-08 00:07:14 -07:00
Paul Blakey	8550ff8d8c	skbuff: Release nfct refcount on napi stolen or re-used skbs When multiple SKBs are merged to a new skb under napi GRO, or SKB is re-used by napi, if nfct was set for them in the driver, it will not be released while freeing their stolen head state or on re-use. Release nfct on napi's stolen or re-used SKBs, and in gro_list_prepare, check conntrack metadata diff. Fixes: `5c6b946047` ("net/mlx5e: CT: Handle misses after executing CT action") Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-07-06 10:26:29 -07:00
Linus Torvalds	dbe69e4337	Networking changes for 5.14. Core: - BPF: - add syscall program type and libbpf support for generating instructions and bindings for in-kernel BPF loaders (BPF loaders for BPF), this is a stepping stone for signed BPF programs - infrastructure to migrate TCP child sockets from one listener to another in the same reuseport group/map to improve flexibility of service hand-off/restart - add broadcast support to XDP redirect - allow bypass of the lockless qdisc to improving performance (for pktgen: +23% with one thread, +44% with 2 threads) - add a simpler version of "DO_ONCE()" which does not require jump labels, intended for slow-path usage - virtio/vsock: introduce SOCK_SEQPACKET support - add getsocketopt to retrieve netns cookie - ip: treat lowest address of a IPv4 subnet as ordinary unicast address allowing reclaiming of precious IPv4 addresses - ipv6: use prandom_u32() for ID generation - ip: add support for more flexible field selection for hashing across multi-path routes (w/ offload to mlxsw) - icmp: add support for extended RFC 8335 PROBE (ping) - seg6: add support for SRv6 End.DT46 behavior - mptcp: - DSS checksum support (RFC 8684) to detect middlebox meddling - support Connection-time 'C' flag - time stamping support - sctp: packetization Layer Path MTU Discovery (RFC 8899) - xfrm: speed up state addition with seq set - WiFi: - hidden AP discovery on 6 GHz and other HE 6 GHz improvements - aggregation handling improvements for some drivers - minstrel improvements for no-ack frames - deferred rate control for TXQs to improve reaction times - switch from round robin to virtual time-based airtime scheduler - add trace points: - tcp checksum errors - openvswitch - action execution, upcalls - socket errors via sk_error_report Device APIs: - devlink: add rate API for hierarchical control of max egress rate of virtual devices (VFs, SFs etc.) 
- don't require RCU read lock to be held around BPF hooks in NAPI context - page_pool: generic buffer recycling New hardware/drivers: - mobile: - iosm: PCIe Driver for Intel M.2 Modem - support for Qualcomm MSM8998 (ipa) - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU) - NXP SJA1110 Automotive Ethernet 10-port switch - Qualcomm QCA8327 switch support (qca8k) - Mikrotik 10/25G NIC (atl1c) Driver changes: - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP (our first foray into MAC/PHY description via ACPI) - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx - Mellanox/Nvidia NIC (mlx5) - NIC VF offload of L2 bridging - support IRQ distribution to Sub-functions - Marvell (prestera): - add flower and match all - devlink trap - link aggregation - Netronome (nfp): connection tracking offload - Intel 1GE (igc): add AF_XDP support - Marvell DPU (octeontx2): ingress ratelimit offload - Google vNIC (gve): new ring/descriptor format support - Qualcomm mobile (rmnet & ipa): inline checksum offload support - MediaTek WiFi (mt76) - mt7915 MSI support - mt7915 Tx status reporting - mt7915 thermal sensors support - mt7921 decapsulation offload - mt7921 enable runtime pm and deep sleep - Realtek WiFi (rtw88) - beacon filter support - Tx antenna path diversity support - firmware crash information via devcoredump - Qualcomm 60GHz WiFi (wcn36xx) - Wake-on-WLAN support with magic packets and GTK rekeying - Micrel PHY (ksz886x/ksz8081): add cable test support Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - BPF: - add syscall program type and libbpf support for generating instructions and bindings for in-kernel BPF loaders (BPF loaders for BPF), this is a stepping stone for signed BPF programs - infrastructure to migrate TCP child sockets from one listener to another in the same reuseport group/map to improve flexibility of service hand-off/restart - add broadcast support to XDP redirect - allow bypass of the lockless qdisc to improving performance (for pktgen: +23% with one thread, +44% with 2 threads) - add a simpler version of "DO_ONCE()" which does not require jump labels, intended for slow-path usage - virtio/vsock: introduce SOCK_SEQPACKET support - add getsocketopt to retrieve netns cookie - ip: treat lowest address of a IPv4 subnet as ordinary unicast address allowing reclaiming of precious IPv4 addresses - ipv6: use prandom_u32() for ID generation - ip: add support for more flexible field selection for hashing across multi-path routes (w/ offload to mlxsw) - icmp: add support for extended RFC 8335 PROBE (ping) - seg6: add support for SRv6 End.DT46 behavior - mptcp: - DSS checksum support (RFC 8684) to detect middlebox meddling - support Connection-time 'C' flag - time stamping support - sctp: packetization Layer Path MTU Discovery (RFC 8899) - xfrm: speed up state addition with seq set 
- WiFi: - hidden AP discovery on 6 GHz and other HE 6 GHz improvements - aggregation handling improvements for some drivers - minstrel improvements for no-ack frames - deferred rate control for TXQs to improve reaction times - switch from round robin to virtual time-based airtime scheduler - add trace points: - tcp checksum errors - openvswitch - action execution, upcalls - socket errors via sk_error_report Device APIs: - devlink: add rate API for hierarchical control of max egress rate of virtual devices (VFs, SFs etc.) - don't require RCU read lock to be held around BPF hooks in NAPI context - page_pool: generic buffer recycling New hardware/drivers: - mobile: - iosm: PCIe Driver for Intel M.2 Modem - support for Qualcomm MSM8998 (ipa) - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU) - NXP SJA1110 Automotive Ethernet 10-port switch - Qualcomm QCA8327 switch support (qca8k) - Mikrotik 10/25G NIC (atl1c) Driver changes: - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP (our first foray into MAC/PHY description via ACPI) - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx - Mellanox/Nvidia NIC (mlx5) - NIC VF offload of L2 bridging - support IRQ distribution to Sub-functions - Marvell (prestera): - add flower and match all - devlink trap - link aggregation - Netronome (nfp): connection tracking offload - Intel 1GE (igc): add AF_XDP support - Marvell DPU (octeontx2): ingress ratelimit offload - Google vNIC (gve): new ring/descriptor format support - Qualcomm mobile (rmnet & ipa): inline checksum offload support - MediaTek WiFi (mt76) - mt7915 MSI support - mt7915 Tx status reporting - mt7915 thermal sensors support - mt7921 decapsulation offload - mt7921 enable runtime pm and deep sleep - Realtek WiFi (rtw88) - beacon filter support - Tx antenna path diversity support - firmware crash information via 
devcoredump - Qualcomm WiFi (wcn36xx) - Wake-on-WLAN support with magic packets and GTK rekeying - Micrel PHY (ksz886x/ksz8081): add cable test support" * tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits) tcp: change ICSK_CA_PRIV_SIZE definition tcp_yeah: check struct yeah size at compile time gve: DQO: Fix off by one in gve_rx_dqo() stmmac: intel: set PCI_D3hot in suspend stmmac: intel: Enable PHY WOL option in EHL net: stmmac: option to enable PHY WOL with PMT enabled net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del} net: use netdev_info in ndo_dflt_fdb_{add,del} ptp: Set lookup cookie when creating a PTP PPS source. net: sock: add trace for socket errors net: sock: introduce sk_error_report net: dsa: replay the local bridge FDB entries pointing to the bridge dev too net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev net: dsa: include fdb entries pointing to bridge in the host fdb list net: dsa: include bridge addresses which are local in the host fdb list net: dsa: sync static FDB entries on foreign interfaces to hardware net: dsa: install the host MDB and FDB entries in the master's RX filter net: dsa: reference count the FDB addresses at the cross-chip notifier level net: dsa: introduce a separate cross-chip notifier type for host FDBs net: dsa: reference count the MDB entries at the cross-chip notifier level ...	2021-06-30 15:51:09 -07:00
Jakub Kicinski	b6df00789e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Trivial conflict in net/netfilter/nf_tables_api.c. Duplicate fix in tools/testing/selftests/net/devlink_port_split.py - take the net-next version. skmsg, and L4 bpf - keep the bpf code but remove the flags and err params. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-06-29 15:45:27 -07:00
Tanner Love	127d7355ab	net: update netdev_rx_csum_fault() print dump only once Printing this stack dump multiple times does not provide additional useful information, and consumes time in the data path. Printing once is sufficient. Changes v2: Format indentation properly Signed-off-by: Tanner Love <tannerlove@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-28 15:54:57 -07:00
Yunsheng Lin	c4fef01ba4	net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK flags set, but queue discipline bypass does not work for a lockless qdisc because the skb is always enqueued to the qdisc even when the qdisc is empty, see __dev_xmit_skb(). This patch calls sch_direct_xmit() to transmit the skb directly to the driver for an empty lockless qdisc, which avoids the enqueuing and dequeuing operations. qdisc->empty is not a reliable indicator of an empty qdisc because there is a time window between enqueuing and setting qdisc->empty, so we use the MISSED state added in commit `a90c57f2ce` ("net: sched: fix packet stuck problem for lockless qdisc"), which indicates there is lock contention, suggesting that it is better not to do the qdisc bypass in order to avoid the packet out-of-order problem. In order to make the MISSED state a reliable indicator of an empty qdisc, we need to ensure that testing and clearing of the MISSED state is within the protection of qdisc->seqlock; only setting the MISSED state can be done without the protection of qdisc->seqlock. A MISSED state test is added without the protection of qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the contention case. As the enqueuing is not within the protection of qdisc->seqlock, there is still a potential data race as mentioned by Jakub [1]: thread1 thread2 thread3 qdisc_run_begin() # true qdisc_run_begin(q) set(MISSED) pfifo_fast_dequeue clear(MISSED) # recheck the queue qdisc_run_end() enqueue skb1 qdisc empty # true qdisc_run_begin() # true sch_direct_xmit() # skb2 qdisc_run_begin() set(MISSED) When the above happens, skb1 enqueued by thread2 is transmitted after skb2 is transmitted by thread3 because setting the MISSED state and enqueuing are not under the qdisc->seqlock. If qdisc bypass is disabled, skb1 has a better chance of being transmitted sooner than skb2. 
This patch does not take care of the above data race, because we view it as similar to the following: even if CPU1 and CPU2 write skbs at the same time to two sockets that both head to the same qdisc, there is no guarantee which skb will hit the qdisc first, because many factors such as interrupt/softirq/cache miss/scheduling affect that. The cases below need special handling: 1. When the MISSED state is cleared before another round of dequeuing in pfifo_fast_dequeue(), __qdisc_run() might not be able to dequeue all skbs in one round and may call __netif_schedule(), which might result in a non-empty qdisc without MISSED set. In order to avoid this, the MISSED state is set for a lockless qdisc and __netif_schedule() will be called at the end of qdisc_run_end. 2. The MISSED state also needs to be set for a lockless qdisc instead of calling __netif_schedule() directly when requeuing a skb, for a similar reason. 3. For the netdev queue stopped case, the MISSED state needs clearing while the netdev queue is stopped, otherwise there may be unnecessary __netif_schedule() calls. So a new DRAINING state is added to indicate this case, which also indicates a non-empty qdisc. 4. There is already netif_xmit_frozen_or_stopped() checking in dequeue_skb() and sch_direct_xmit(), both within the protection of qdisc->seqlock, but the same check in __dev_xmit_skb() is without the protection, which might make the empty indication of a lockless qdisc unreliable. So remove the check in __dev_xmit_skb(); the check under the protection of qdisc->seqlock seems enough to avoid the cpu consumption problem for the netdev queue stopped case. 1. https://lkml.org/lkml/2021/5/29/215 Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-23 12:17:35 -07:00
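The bypass decision described in c4fef01ba4 can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel code: `struct toy_qdisc`, `toy_can_bypass()`, `toy_run_end()` and the `TOY_STATE_*` flags are hypothetical stand-ins for the MISSED and DRAINING state bits and for qdisc->seqlock, and the qdisc is assumed to count as empty only when neither bit is set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the MISSED and DRAINING state bits. */
enum {
	TOY_STATE_MISSED   = 1u << 0,	/* a sender lost the trylock race */
	TOY_STATE_DRAINING = 1u << 1,	/* queue stopped with skbs still queued */
};
#define TOY_STATE_NON_EMPTY (TOY_STATE_MISSED | TOY_STATE_DRAINING)

struct toy_qdisc {
	atomic_uint state;	/* TOY_STATE_* bits */
	atomic_flag seqlock;	/* models qdisc->seqlock as a trylock */
};

/* The qdisc is treated as empty only if neither state bit is set. */
static bool toy_qdisc_is_empty(struct toy_qdisc *q)
{
	return !(atomic_load(&q->state) & TOY_STATE_NON_EMPTY);
}

/*
 * Returns true if the caller may hand the skb directly to the driver.
 * On success the caller holds the lock and must call toy_run_end().
 */
static bool toy_can_bypass(struct toy_qdisc *q)
{
	/* Cheap unlocked test first, to skip the trylock under contention. */
	if (!toy_qdisc_is_empty(q))
		return false;

	if (atomic_flag_test_and_set(&q->seqlock)) {
		/* Lock is busy: record the contention so the owner rechecks. */
		atomic_fetch_or(&q->state, TOY_STATE_MISSED);
		return false;
	}

	/* Recheck under the lock; the state may have changed meanwhile. */
	if (!toy_qdisc_is_empty(q)) {
		atomic_flag_clear(&q->seqlock);
		return false;
	}
	return true;
}

/* Release the lock taken by a successful toy_can_bypass(). */
static void toy_run_end(struct toy_qdisc *q)
{
	atomic_flag_clear(&q->seqlock);
}
```

A caller that gets `false` would fall back to the normal enqueue path. The sketch deliberately omits the enqueue race between threads that the commit message says it does not address.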
Sebastian Andrzej Siewior	2b4cd14fd9	net/netif_receive_skb_core: Use migrate_disable() The preempt disable around do_xdp_generic() has been introduced in commit `bbbe211c29` ("net: rcu lock and preempt disable missing around generic xdp") For BPF it is enough to use migrate_disable() and the code was updated as it can be seen in commit `3c58482a38` ("bpf: Provide bpf_prog_run_pin_on_cpu() helper") This is a leftover which was not converted. Use migrate_disable() before invoking do_xdp_generic(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-06-21 12:08:02 -07:00
Peter Zijlstra	2f064a59a1	sched: Change task_struct::state Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org	2021-06-18 11:43:09 +02:00
Jakub Kicinski	5ada57a9a6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net cdc-wdm: s/kill_urbs/poison_urbs/ to fix build Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-05-27 09:55:10 -07:00
Yunsheng Lin	dcad9ee9e0	net: sched: fix tx action reschedule issue with stopped queue The netdev queue might be stopped when the byte queue limit has been reached or the tx hw ring is full; net_tx_action() may still be rescheduled if STATE_MISSED is set, which consumes unnecessary cpu without dequeuing and transmitting any skb because the netdev queue is stopped, see qdisc_run_end(). This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if the netdev queue is stopped during qdisc_run(); net_tx_action() is rescheduled again when the netdev queue is restarted, see netif_tx_wake_queue(). As there is a time window between the netif_xmit_frozen_or_stopped() check and STATE_MISSED clearing, during which STATE_MISSED may be set by net_tx_action() scheduled by netif_tx_wake_queue(), set STATE_MISSED again if the netdev queue is restarted. Fixes: `6b3ba9146f` ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-14 15:05:46 -07:00
Yunsheng Lin	102b55ee92	net: sched: fix tx action rescheduling issue during deactivation Currently qdisc_run() checks the STATE_DEACTIVATED of lockless qdisc before calling __qdisc_run(), which ultimately clear the STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be rescheduling of net_tx_action() at the end of qdisc_run_end(), see below: CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate) . . . . set STATE_MISSED . . __netif_schedule() . . . set STATE_DEACTIVATED . . qdisc_reset() . . . .<--------------- . synchronize_net() clear __QDISC_STATE_SCHED \| . . . \| . . . \| . some_qdisc_is_busy() . \| . return false . \| . . test STATE_DEACTIVATED \| . . __qdisc_run() not called \| . . . \| . . test STATE_MISS \| . . __netif_schedule()--------\| . . . . . . . . __qdisc_run() is not called by net_tx_atcion() in CPU0 because CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule is called at the end of qdisc_run_end(), causing tx action rescheduling problem. qdisc_run() called by net_tx_action() runs in the softirq context, which should has the same semantic as the qdisc_run() called by __dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a synchronize_net() between STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely bail out for the deactived lockless qdisc in net_tx_action(), and qdisc_reset() will reset all skb not dequeued yet. So add the rcu_read_lock() explicitly to protect the qdisc_run() and do the STATE_DEACTIVATED checking in net_tx_action() before calling qdisc_run_begin(). 
Another option is to do the checking in the qdisc_run_end(), but it will add unnecessary overhead for non-tx_action case, because __dev_queue_xmit() will not see qdisc with STATE_DEACTIVATED after synchronize_net(), the qdisc with STATE_DEACTIVATED can only be seen by net_tx_action() because of __netif_schedule(). The STATE_DEACTIVATED checking in qdisc_run() is to avoid race between net_tx_action() and qdisc_reset(), see: commit `d518d2ed86` ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bailout added above for deactived lockless qdisc in net_tx_action() provides better protection for the race without calling qdisc_run() at all, so remove the STATE_DEACTIVATED checking in qdisc_run(). After qdisc_reset(), there is no skb in qdisc to be dequeued, so clear the STATE_MISSED in dev_reset_queue() too. Fixes: `6b3ba9146f` ("net: sched: allow qdiscs to handle locking") Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> V8: Clearing STATE_MISSED before calling __netif_schedule() has avoid the endless rescheduling problem, but there may still be a unnecessary rescheduling, so adjust the commit log. Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-14 15:05:46 -07:00
Sebastian Andrzej Siewior	8380c81d5c	net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT __napi_schedule_irqoff() is an optimized version of __napi_schedule() which can be used where it is known that interrupts are disabled, e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer callbacks. On PREEMPT_RT enabled kernels this assumption is not true. Force-threaded interrupt handlers and spinlocks are not disabling interrupts and the NAPI hrtimer callback is forced into softirq context which runs with interrupts enabled as well. Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole game so make __napi_schedule_irqoff() invoke __napi_schedule() for PREEMPT_RT kernels. The callers of ____napi_schedule() in the networking core have been audited and are correct on PREEMPT_RT kernels as well. Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-05-13 13:11:19 -07:00
David S. Miller	6876a18d33	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2021-04-26 12:00:00 -07:00
David S. Miller	5f6c2f536d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-04-23 The following pull-request contains BPF updates for your net-next tree. We've added 69 non-merge commits during the last 22 day(s) which contain a total of 69 files changed, 3141 insertions(+), 866 deletions(-). The main changes are: 1) Add BPF static linker support for extern resolution of global, from Andrii. 2) Refine retval for bpf_get_task_stack helper, from Dave. 3) Add a bpf_snprintf helper, from Florent. 4) A bunch of miscellaneous improvements from many developers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-25 18:02:32 -07:00
Martin Willi	22b6034323	net, xdp: Update pkt_type if generic XDP changes unicast MAC If a generic XDP program changes the destination MAC address from/to multicast/broadcast, the skb->pkt_type is updated to properly handle the packet when passed up the stack. When changing the MAC from/to the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making the behavior different from that of native XDP. Remember the PACKET_HOST/OTHERHOST state before calling the program in generic XDP, and update pkt_type accordingly if the destination MAC address has changed. As eth_type_trans() assumes a default pkt_type of PACKET_HOST, restore that before calling it. The use case for this is when a XDP program wants to push received packets up the stack by rewriting the MAC to the NICs MAC, for example by cluster nodes sharing MAC addresses. Fixes: `2972495699` ("net: fix generic XDP to handle if eth header was mangled") Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org	2021-04-22 23:18:02 +02:00
Alexander Lobakin	7ad18ff644	gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check Commit `38ec4944b5` ("gro: ensure frag0 meets IP header alignment") did the right thing, but missed the fact that the napi_gro_frags() logic calls skb_gro_reset_offset() before pulling the Ethernet header to the skb linear space. That said, the introduced check for frag0 address being aligned to 4 always fails for it as the Ethernet header is obviously 14 bytes long, and in case with NET_IP_ALIGN its start is not aligned to 4. Fix this by adding @nhoff argument to skb_gro_reset_offset() which tells if an IP header is placed right at the start of frag0 or not. This restores Fast GRO for napi_gro_frags() that became very slow after the mentioned commit, and preserves the introduced check to avoid silent unaligned accesses. From v1 [0]: - inline tiny skb_gro_reset_offset() to let the code be optimized more effectively (esp. for the !NET_IP_ALIGN case) (Eric); - pull in Reviewed-by from Eric. [0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me Fixes: `38ec4944b5` ("gro: ensure frag0 meets IP header alignment") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-19 16:03:32 -07:00
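The alignment arithmetic behind 7ad18ff644 and 38ec4944b5 can be illustrated with a small userspace sketch. It is not the kernel implementation: `toy_frag0_usable()`, its parameters and `TOY_ETH_HLEN` are hypothetical names, and the `ip_align` argument merely models NET_IP_ALIGN (0 on architectures where the check is a no-op).

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14	/* length of an Ethernet header in bytes */

/*
 * frag_off: offset of frag0 within its page (hypothetical parameter).
 * nhoff:    bytes between the start of frag0 and the network header,
 *           e.g. TOY_ETH_HLEN when the Ethernet header is still in frag0.
 * ip_align: models NET_IP_ALIGN; 0 means unaligned access is tolerated.
 */
static bool toy_frag0_usable(size_t frag_off, size_t nhoff, unsigned int ip_align)
{
	if (ip_align == 0)
		return true;	/* the check is a no-op on such architectures */

	/* The network header, not the start of frag0, must be 4-byte aligned. */
	return ((frag_off + nhoff) & 3) == 0;
}
```

With a 2-byte reserve before a 14-byte Ethernet header, `toy_frag0_usable(2, TOY_ETH_HLEN, 2)` accepts the fragment because the IP header lands at offset 16. Passing `nhoff = 0` for the same fragment rejects it, which mirrors the always-failing check that the commit describes for napi_gro_frags().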
Jakub Kicinski	8203c7ce4e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/stmicro/stmmac/stmmac_main.c - keep the ZC code, drop the code related to reinit net/bridge/netfilter/ebtables.c - fix build after move to net_generic Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-17 11:08:07 -07:00
Eric Dumazet	38ec4944b5	gro: ensure frag0 meets IP header alignment After commit `0f6925b3e8` ("virtio_net: Do not pull payload in skb->head") Guenter Roeck reported one failure in his tests using sh architecture. After much debugging, we have been able to spot silent unaligned accesses in inet_gro_receive() The issue at hand is that upper networking stacks assume their header is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN bytes before the Ethernet header to make that happen. This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path if the fragment is not properly aligned. Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN as 0, this extra check will be a NOP for them. Note that if frag0 is not used, GRO will call pskb_may_pull() as many times as needed to pull network and transport headers. Fixes: `0f6925b3e8` ("virtio_net: Do not pull payload in skb->head") Fixes: `78a478d0ef` ("gro: Inline skb_gro_header and cache frag0 virtual address") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-13 15:09:31 -07:00
Jakub Kicinski	8859a44ea0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-09 20:48:35 -07:00
Paolo Abeni	27f0ad7169	net: fix hangup on napi_disable for threaded napi napi_disable() is subject to a hangup when the threaded mode is enabled and the napi is under heavy traffic. If the relevant napi has been scheduled and the napi_disable() kicks in before the next napi_thread_wait() completes - so that the latter quits due to the napi_disable_pending() condition, the existing code leaves the NAPI_STATE_SCHED bit set and the napi_disable() loop waiting for such bit will hang. This patch addresses the issue by dropping the NAPI_STATE_DISABLE bit test in napi_thread_wait(). The later napi_threaded_poll() iteration will take care of clearing the NAPI_STATE_SCHED. This also addresses a related problem reported by Jakub: before this patch a napi_disable()/napi_enable() pair killed the napi thread, effectively disabling the threaded mode. On the patched kernel napi_disable() simply stops scheduling the relevant thread. v1 -> v2: - let the main napi_threaded_poll() loop clear the SCHED bit Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: `29863d41bb` ("net: implement threaded-able napi poll loop support") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-04-09 12:50:31 -07:00
Andrei Vagin	0854fa82c9	net: remove the new_ifindex argument from dev_change_net_namespace Here is only one place where we want to specify new_ifindex. In all other cases, callers pass 0 as new_ifindex. It looks reasonable to add a low-level function with new_ifindex and to convert dev_change_net_namespace to a static inline wrapper. Fixes: `eeb85a14ee` ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-07 14:43:28 -07:00
Andrei Vagin	eeb85a14ee	net: Allow to specify ifindex when device is moved to another namespace Currently, we can specify ifindex on link creation. This change allows to specify ifindex when a device is moved to another network namespace. Even now, a device ifindex can be changed if there is another device with the same ifindex in the target namespace. So this change doesn't introduce completely new behavior, it adds more control to the process. CRIU users want to restore containers with pre-created network devices. A user will provide network devices and instructions where they have to be restored, then CRIU will restore network namespaces and move devices into them. The problem is that devices have to be restored with the same indexes that they have before C/R. Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-04-05 14:49:40 -07:00
Dmitry Vyukov	6c996e1994	net: change netdev_unregister_timeout_secs min value to 1 netdev_unregister_timeout_secs=0 can lead to printing the "waiting for dev to become free" message every jiffy. This is too frequent and unnecessary. Set the min value to 1 second. Also fix the merge issue introduced by "net: make unregister netdev warning timeout configurable": it changed "refcnt != 1" to "refcnt". Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: `5aa3afe107` ("net: make unregister netdev warning timeout configurable") Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-25 17:24:06 -07:00
David S. Miller	efd13b71a3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-25 15:31:22 -07:00
Pablo Neira Ayuso	ddb94eafab	net: resolve forwarding path from virtual netdevice and HW destination address This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology: IP forwarding / \ br0 eth0 / \ eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1 .ndo_fill_forward_path for br0 looks up at the FDB for the bridge port from the destination MAC address to get the bridge port eth1. This information allows to create a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa. fast path .------------------------. / \ \| IP forwarding \| \| / \ \/ \| br0 eth0 . / \ -> eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-24 12:48:38 -07:00
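The walk that ddb94eafab describes, from a virtual device down to the real one, can be sketched as follows. This is a userspace illustration only: `struct toy_dev`, `struct toy_path`, `toy_bridge_lookup()` and `toy_fill_forward_path()` are hypothetical names, the bridge FDB is reduced to a single entry, and the real dev_fill_forward_path() signature and data structures are not reproduced here.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PATH 5	/* arbitrary depth cap for this sketch */

struct toy_dev {
	const char *name;
	/* Returns the lower device for this destination MAC, or NULL if none.
	 * A NULL callback marks a real (non-virtual) device. */
	struct toy_dev *(*fill_forward_path)(struct toy_dev *dev,
					     const unsigned char *daddr);
	void *priv;
};

struct toy_path {
	int num;
	const struct toy_dev *devs[TOY_MAX_PATH];
};

/* Single-entry FDB, for brevity: maps one MAC address to one bridge port. */
struct toy_fdb_entry {
	unsigned char mac[6];
	struct toy_dev *port;
};

static struct toy_dev *toy_bridge_lookup(struct toy_dev *dev,
					 const unsigned char *daddr)
{
	const struct toy_fdb_entry *e = dev->priv;

	return (e && memcmp(e->mac, daddr, 6) == 0) ? e->port : NULL;
}

/* Walk down from dev, recording each hop, until a real device is reached.
 * Returns 0 on success, -1 if the lookup fails or the path is too deep. */
static int toy_fill_forward_path(struct toy_dev *dev,
				 const unsigned char *daddr,
				 struct toy_path *path)
{
	path->num = 0;
	while (dev) {
		if (path->num == TOY_MAX_PATH)
			return -1;
		path->devs[path->num++] = dev;
		if (!dev->fill_forward_path)
			return 0;	/* reached a real device */
		dev = dev->fill_forward_path(dev, daddr);
	}
	return -1;	/* no lower device known for this MAC */
}
```

For the topology in the commit message, a `br0` device whose FDB maps ab:cd:ef:ab:cd:ef to `eth1` yields the two-hop path br0 -> eth1.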
Dmitry Vyukov	5aa3afe107	net: make unregister netdev warning timeout configurable netdev_wait_allrefs() issues a warning if refcount does not drop to 0 after 10 seconds. While 10 second wait generally should not happen under normal workload in normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand if it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flake failures. Add net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-23 17:22:50 -07:00
Eric Dumazet	add2d73631	net: set initial device refcount to 1 When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0) refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4 Fix is to change initial (and final) refcount to be 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell. Fixes: `919067cc84` ("net: add CONFIG_PCPU_DEV_REFCNT") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <groeck@google.com> Tested-by: Guenter Roeck <groeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-22 16:57:36 -07:00
Vladimir Oltean	5da9ace340	net: make xps_needed and xps_rxqs_needed static Since their introduction in commit `04157469b7` ("net: Use static_key for XPS maps"), xps_needed and xps_rxqs_needed were never used outside net/core/dev.c, so I don't really understand why they were exported as symbols in the first place. This is needed in order to silence a "make W=1" warning about these static keys not being declared as static variables, but not having a previous declaration in a header file nonetheless. Cc: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-22 13:13:55 -07:00
Eric Dumazet	919067cc84	net: add CONFIG_PCPU_DEV_REFCNT I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1 unregister_netdevice: waiting for sit0 to become free. Usage count = -1 It would be nice if syzbot could trigger a warning at the time this reference count became negative. This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds. v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <lkp@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-19 13:38:46 -07:00
Antoine Tenart	75b2758abc	net: NULL the old xps map entries when freeing them In __netif_set_xps_queue, old map entries from the old dev_maps are freed but their corresponding entry in the old dev_maps aren't NULLed. Fix this. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	2d05bf0153	net: fix use after free in xps When setting up a new dev_maps in __netif_set_xps_queue, we remove and free maps from unused CPUs/rx-queues near the end of the function, by calling remove_xps_queue. However it's possible those maps are also part of the old not-freed-yet dev_maps, which might be used concurrently. When that happens, a map can be freed while its corresponding entry in the old dev_maps table isn't NULLed, leading to: "BUG: KASAN: use-after-free" in different places. This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL the map entries from the old dev_maps table. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	132f743b01	net: improve queue removal readability in __netif_set_xps_queue Improve the readability of the loop removing tx-queue from unused CPUs/rx-queues in __netif_set_xps_queue. The change should only be cosmetic. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	402fbb992e	net: add an helper to copy xps maps to the new dev_maps This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps to new_dev_maps at a given index. The logic should be the same, with an improved code readability and maintenance. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	044ab86d43	net: move the xps maps to an array Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That will simplify a lot the code removing the need for lots of if/else conditionals as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	6f36158e05	net: remove the xps possible_mask Remove the xps possible_mask. It was an optimization but we can just loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That simplifies the code a bit. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	5478fcd0f4	net: embed nr_ids in the xps maps Embed nr_ids (the number of CPUs for the xps cpus map, and the number of rxqs for the xps rxqs map) in dev_maps. That helps avoid out-of-bounds memory accesses if those values change after dev_maps was allocated. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Antoine Tenart	255c04a87f	net: embed num_tc in the xps maps The xps cpus/rxqs map is accessed using dev->num_tc, which is used when allocating the map. But later updates of dev->num_tc can lead to having a mismatch between the maps and how they're accessed. In such cases the map values do not make any sense and out of bound accesses can occur (that can be easily seen using KASAN). This patch aims at fixing this by embedding num_tc into the maps, using the value at the time the map is created. This brings two improvements: - The maps can be accessed using the embedded num_tc, so we know for sure we won't have out of bound accesses. - Checks can be made before accessing the maps so we know the values retrieved will make sense. We also update __netif_set_xps_queue to conditionally copy old maps from dev_maps in the new one only if the number of traffic classes from both maps match. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:56:22 -07:00
Jiri Bohac	6c015a2256	net: check all name nodes in __dev_alloc_name __dev_alloc_name(), when supplied with a name containing '%d', will search for the first available device number to generate a unique device name. Since commit `ff92741270` ("net: introduce name_node struct to be used in hashlist") network devices may have alternate names. __dev_alloc_name() does not take these alternate names into account, possibly generating a name that is already taken and failing with -ENFILE as a result. This demonstrates the bug: # rmmod dummy 2>/dev/null # ip link property add dev lo altname dummy0 # modprobe dummy numdummies=1 modprobe: ERROR: could not insert 'dummy': Too many open files in system Instead of creating a device named dummy1, modprobe fails. Fix this by checking all the names in the d->name_node list, not just d->name. Signed-off-by: Jiri Bohac <jbohac@suse.cz> Fixes: `ff92741270` ("net: introduce name_node struct to be used in hashlist") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-18 14:40:53 -07:00
Wei Wang	cb03835793	net: fix race between napi kthread mode and busy poll Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to determine if the kthread owns this napi and could call napi->poll() on it. However, if socket busy poll is enabled, it is possible that the busy poll thread grabs this SCHED bit (after the previous napi->poll() invokes napi_complete_done() and clears SCHED bit) and tries to poll on the same napi. napi_disable() could grab the SCHED bit as well. This patch tries to fix this race by adding a new bit NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in ____napi_schedule() if the threaded mode is enabled, and gets cleared in napi_complete_done(), and we only poll the napi in kthread if this bit is set. This helps distinguish the ownership of the napi between kthread and other scenarios and fixes the race issue. Fixes: `29863d41bb` ("net: implement threaded-able napi poll loop support") Reported-by: Martin Zaharinov <micron10@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Wei Wang <weiwan@google.com> Cc: Alexander Duyck <alexanderduyck@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-17 14:31:17 -07:00
Martin Willi	3a5ca85707	can: dev: Move device back to init netns on owning netns delete When a non-initial netns is destroyed, the usual policy is to delete all virtual network interfaces contained, but move physical interfaces back to the initial netns. This keeps the physical interface visible on the system. CAN devices are somewhat special, as they define rtnl_link_ops even if they are physical devices. If a CAN interface is moved into a non-initial netns, destroying that netns lets the interface vanish instead of moving it back to the initial netns. default_device_exit() skips CAN interfaces due to having rtnl_link_ops set. Reproducer: ip netns add foo ip link set can0 netns foo ip netns delete foo WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60 CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1 Workqueue: netns cleanup_net [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14) [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8) [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114) [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac) [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60) [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380) [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438) [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8) [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c) [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c) To properly restore physical CAN devices to the initial netns on owning netns exit, introduce a flag on rtnl_link_ops that can be set by drivers. For CAN devices setting this flag, default_device_exit() considers them non-virtual, applying the usual namespace move. The issue was introduced in the commit mentioned below, as at that time CAN devices did not have a dellink() operation. 
Fixes: `e008b5fc8d` ("net: Simplfy default_device_exit and improve batching.") Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2021-03-16 08:40:04 +01:00
Lorenzo Bianconi	8f64860f8b	net: export dev_set_threaded symbol For wireless devices (e.g. mt76 driver) multiple net_devices belongs to the same wireless phy and the napi object is registered in a dummy netdevice related to the wireless phy. Export dev_set_threaded in order to be reused in device drivers enabling threaded NAPI. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-15 12:35:23 -07:00
Alexander Lobakin	d0eed5c325	gro: give 'hash' variable in dev_gro_receive() a less confusing name 'hash' stores not the flow hash, but the index of the GRO bucket corresponding to it. Change its name to 'bucket' to avoid confusion while reading lines like '__set_bit(hash, &napi->gro_bitmask)'. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:09 -07:00
Alexander Lobakin	9dc2c31337	gro: consistentify napi->gro_hash[x] access in dev_gro_receive() GRO bucket index doesn't change through the entire function. Store a pointer to the corresponding bucket instead of its member and use it consistently through the function. It is performance-safe since &gro_list->list == gro_list. Misc: remove superfluous braces around single-line branches. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:08 -07:00
Alexander Lobakin	0ccf4d50d1	gro: simplify gro_list_prepare() gro_list_prepare() always returns &napi->gro_hash[bucket].list, without any variations. Moreover, it uses 'napi' argument only to have access to this list, and calculates the bucket index for the second time (firstly it happens at the beginning of dev_gro_receive()) to do that. Given that dev_gro_receive() already has an index to the needed list, just pass it as the first argument to eliminate redundant calculations, and make gro_list_prepare() return void. Also, both arguments of gro_list_prepare() can be constified since this function can only modify the skbs from the bucket list. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-14 14:41:08 -07:00
Gustavo A. R. Silva	b1866bfff9	net: core: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-03-10 12:45:15 -08:00
David S. Miller	b8af417e4d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2021-02-16 The following pull-request contains BPF updates for your net-next tree. There's a small merge conflict between `7eeba1706e` ("tcp: Add receive timestamp support for receive zerocopy.") from net-next tree and `9cacf81f81` ("bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows: [...] lock_sock(sk); err = tcp_zerocopy_receive(sk, &zc, &tss); err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname, &zc, &len, err); release_sock(sk); [...] We've added 116 non-merge commits during the last 27 day(s) which contain a total of 156 files changed, 5662 insertions(+), 1489 deletions(-). The main changes are: 1) Adds support of pointers to types with known size among global function args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov. 2) Add bpf_iter for task_vma which can be used to generate information similar to /proc/pid/maps, from Song Liu. 3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow rewriting bind user ports from BPF side below the ip_unprivileged_port_start range, both from Stanislav Fomichev. 4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map as well as per-cpu maps for the latter, from Alexei Starovoitov. 5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer for sleepable programs, both from KP Singh. 6) Extend verifier to enable variable offset read/write access to the BPF program stack, from Andrei Matei. 7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to query device MTU from programs, from Jesper Dangaard Brouer. 8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF tracing programs, from Florent Revest. 
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when otherwise too many passes are required, from Gary Lin. 10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function verification both related to zero-extension handling, from Ilya Leoshkevich. 11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa. 12) Batch of AF_XDP selftest cleanups and small performance improvement for libbpf's xsk map redirect for newer kernels, from Björn Töpel. 13) Follow-up BPF doc and verifier improvements around atomics with BPF_FETCH, from Brendan Jackman. 14) Permit zero-sized data sections e.g. if ELF .rodata section contains read-only data from local variables, from Yonghong Song. 15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-16 13:14:06 -08:00
Alexander Lobakin	9243adfc31	skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing napi_frags_finish() and napi_skb_finish() can only be called inside NAPI Rx context, so we can feed NAPI cache with skbuff_heads that got NAPI_MERGED_FREE verdict instead of immediate freeing. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs to NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:04 -08:00
Alexander Lobakin	fec6e49b63	skbuff: remove __kfree_skb_flush() This function isn't much needed as NAPI skb queue gets bulk-freed anyway when there's no more room, and even may reduce the efficiency of bulk operations. It will be even less needed after reusing skb cache on allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-02-13 14:32:03 -08:00

2210 Commits