JIRA: https://issues.redhat.com/browse/RHEL-30902
commit f6e0a4984c2e7244689ea87b62b433bed9d07e94
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Mar 14 20:08:45 2024 +0000
net: move dev->state into net_device_read_txrx group
dev->state can be read in rx and tx fast paths.
netif_running() which needs dev->state is called from
- enqueue_to_backlog() [RX path]
- __dev_direct_xmit() [TX path]
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Context difference due to missing 34d21de99cea
("net: Move {l,t,d}stats allocation to core and convert veth & vrf");
this does not change the fact that the stats pointer union itself is read
in the rx and tx fast paths.
commit c353c7b7ffb7ae6ed8f3339906fe33c8be6cf344
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 8 14:43:23 2024 +0000
net-device: move lstats in net_device_read_txrx
dev->lstats is notably used from loopback ndo_start_xmit()
and other virtual drivers.
Per cpu stats updates are dirtying per-cpu data,
but the pointer itself is read-only.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Simon Horman <horms@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: code difference because we are maintaining kABI
exclusions.
commit d3d344a1ca69d8fb2413e29e6400f3ad58a05c06
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Jan 2 16:22:20 2024 +0000
net-device: move xdp_prog to net_device_read_rx
xdp_prog is used in receive path, both from XDP enabled drivers
and from netif_elide_gro().
This patch also removes two 4-byte holes.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
commit 993498e537af9260e697219ce41b41b22b6199cc
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Dec 21 14:07:47 2023 +0000
net-device: move gso_partial_features to net_device_read_tx
dev->gso_partial_features is read from tx fast path for GSO packets.
Move it to appropriate section to avoid a cache line miss.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Conflicts due to kABI exclusions in the
struct. Reordering kABI excluded fields maintains the kABI exclusion.
- include/linux/netdevice.h: Context differences due to missing patches
from upstream.
commit 43a71cd66b9c0a4af3d15d8644359fde35bdbed0
Author: Coco Li <lixiaoyan@google.com>
Date: Mon Dec 4 20:12:30 2023 +0000
net-device: reorganize net_device fast path variables
Reorganize fast path variables in tx-txrx-rx order.
Fast path variables end after npinfo.
Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 4
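The effect of grouping can be illustrated with a small userspace sketch. It is not the real struct net_device: the `toy_dev_*` structs, their fields, and the 64-byte line size are assumptions chosen only to show how moving hot fields next to each other shrinks the number of cache lines they span.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical device with hot fields interleaved with cold ones. */
struct toy_dev_scattered {
	unsigned long tx_flags;  /* hot: read on TX */
	char name[120];          /* cold: control path only */
	unsigned long state;     /* hot: read on TX and RX */
	char alias[120];         /* cold */
	void *rx_prog;           /* hot: read on RX */
};

/* Same fields, with the hot ones grouped in tx, txrx, rx order. */
struct toy_dev_grouped {
	unsigned long tx_flags;  /* tx group */
	unsigned long state;     /* txrx group */
	void *rx_prog;           /* rx group */
	char name[120];          /* cold fields follow the hot groups */
	char alias[120];
};

/* Cache lines covered by the byte range [first, end), assuming the
 * struct itself starts on a cache line boundary. */
static size_t toy_lines_spanned(size_t first, size_t end)
{
	return (end - 1) / TOY_CACHE_LINE - first / TOY_CACHE_LINE + 1;
}

static size_t toy_hot_lines_scattered(void)
{
	return toy_lines_spanned(offsetof(struct toy_dev_scattered, tx_flags),
				 offsetof(struct toy_dev_scattered, rx_prog) +
				 sizeof(void *));
}

static size_t toy_hot_lines_grouped(void)
{
	return toy_lines_spanned(offsetof(struct toy_dev_grouped, tx_flags),
				 offsetof(struct toy_dev_grouped, rx_prog) +
				 sizeof(void *));
}
```

In this toy layout the grouped hot fields fit in a single line, while the scattered layout stretches them across several. The real numbers quoted above come from pahole on the actual structure, not from a sketch like this.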
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4236
JIRA: https://issues.redhat.com/browse/RHEL-36217
Commits:
```
b534dc46c8ae ("net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP")
70f7457ad6d6 ("net: create device lookup API with reference tracking")
3515440df461 ("ipv6: also use netdev_hold() in ip6_route_check_nh()")
108a36d07c01 ("ethtool: Fix mod state of verbose no_mask bitset")
524515020f25 ("Revert "ethtool: Fix mod state of verbose no_mask bitset"")
f55d8e60f109 ("net: ethtool: Fix documentation of ethtool_sprintf()")
65c9fde15a65 ("net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
0bca3f7f9acd ("net: macvlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
c0dabeb4c666 ("net: bonding: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
ef5eb9c5ce45 ("net: fec: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
547b006d1922 ("net: fec: delete fec_ptp_disable_hwts()")
fd770e856e22 ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
c35e927cbe09 ("net: omit ndo_hwtstamp_get() call when possible in dev_set_hwtstamp_phylib()")
446e2305827b ("net: Convert PHYs hwtstamp callback to use kernel_hwtstamp_config")
430dc3256d57 ("net: phy: Remove the call to phy_mii_ioctl in phy_hwstamp_get/set")
b8768dc40777 ("net: ethtool: Refactor identical get_ts_info implementations.")
202cb220026e ("net: macb: Convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
011dd3b3f83f ("net: Make dev_set_hwtstamp_phylib accessible")
915d25a9d69b ("net: phy: micrel: fix ts_info value in case of no phc")
acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
289354f21b2c ("net: partial revert of the "Make timestamping selectable: series")
cc124ad39288 ("Documentation: networking: add missing PLCA messages from the message list")
d0c3891db2d2 ("ethtool: reformat kerneldoc for struct ethtool_link_settings")
1271ca00aa7f ("ethtool: reformat kerneldoc for struct ethtool_fec_stats")
f1172f3ee3a9 ("ethtool: netlink: Add missing ethnl_ops_begin/complete")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to RH KABI reservations for z-stream
commit 26793bfb5d6072326d1465343e7cbf6156abca4f
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:29:07 2023 -0800
net: Add NAPI IRQ support
Add support to associate the interrupt vector number for a
NAPI instance.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147334728.5260.13221803396905901904.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to missing 9a675ba55a96 ("net, bpf: Add
a warning if NAPI cb missed xdp_do_flush().")
commit 27f91aaf49b3a50e5a02ad5fa27b7c453d029a72
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:28:56 2023 -0800
netdev-genl: Add netlink framework functions for napi
Implement the netdev netlink framework functions for
napi support. The netdev structure tracks all the napi
instances and napi fields. The napi instances and associated
parameters can be retrieved this way.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147333637.5260.14807433239805550815.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to RH KABI reservations for z-stream
commit 2a502ff0c4e42a739b5aa550c901bf3852795532
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:28:34 2023 -0800
net: Add queue and napi association
Add the napi pointer in netdev queue for tracking the napi
instance for each queue. This achieves the queue<->napi mapping.
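A minimal sketch of such a queue<->napi mapping, in userspace C. All `toy_*` names are hypothetical and do not reproduce the kernel structures or helpers; the sketch only shows a per-queue pointer that is set by index and queue type and read back later.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_QUEUES 4

struct toy_napi {
	unsigned int id;
};

/* Each queue carries a pointer to the NAPI instance that services it. */
struct toy_queue {
	struct toy_napi *napi;
};

enum toy_queue_type { TOY_QUEUE_RX, TOY_QUEUE_TX };

struct toy_dev {
	struct toy_queue rx[TOY_MAX_QUEUES];
	struct toy_queue tx[TOY_MAX_QUEUES];
	unsigned int num_rx, num_tx;
};

/* Return the queue slot for (type, idx), or NULL if out of range. */
static struct toy_queue *toy_queue_lookup(struct toy_dev *dev,
					  unsigned int idx,
					  enum toy_queue_type type)
{
	if (type == TOY_QUEUE_RX && idx < dev->num_rx)
		return &dev->rx[idx];
	if (type == TOY_QUEUE_TX && idx < dev->num_tx)
		return &dev->tx[idx];
	return NULL;
}

/* Associate a queue with a NAPI instance; returns 0 or -1. */
static int toy_queue_set_napi(struct toy_dev *dev, unsigned int idx,
			      enum toy_queue_type type, struct toy_napi *napi)
{
	struct toy_queue *q = toy_queue_lookup(dev, idx, type);

	if (!q)
		return -1;
	q->napi = napi;
	return 0;
}

/* Read the association back; NULL if unset or out of range. */
static struct toy_napi *toy_queue_get_napi(struct toy_dev *dev,
					   unsigned int idx,
					   enum toy_queue_type type)
{
	struct toy_queue *q = toy_queue_lookup(dev, idx, type);

	return q ? q->napi : NULL;
}
```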
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4235
JIRA: https://issues.redhat.com/browse/RHEL-36218
Note that patch 2 is needed for patch 3 to avoid compiler warnings and patch 1 is a dependency for patch 2.
Commits:
```
4eb6bd55cfb2 ("compiler.h: drop fallback overflow checkers")
d219d2a9a92e ("overflow: Allow mixed type arguments")
8798481b667f ("net/sched: wrap open coded Qdics class filter counter")
daf8d9181b9b ("net/sched: sch_drr: warn about class in use while deleting")
e20e75017c5a ("net/sched: sch_qfq: warn about class in use while deleting")
a57c34a80cbe ("net: flow_dissector: Add IPSEC dissector")
4c13eda757e3 ("tc: flower: support for SPI")
c8915d7329d6 ("tc: flower: Enable offload support IPSEC SPI field.")
992b47851be9 ("net: pkt_cls: Remove unused inline helpers")
09e0c3bbde90 ("net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()")
25b0d4e4e41f ("net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode")
98766add2d55 ("net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()")
6e0ec800c174 ("net/sched: taprio: delete misleading comment about preallocating child qdiscs")
665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
40b0425f8ba1 ("net: ptp: create a mock-up PTP Hardware Clock driver")
b63e78fca889 ("net: netdevsim: use mock PHC driver")
35da47fe1c47 ("net: netdevsim: mimic tc-taprio offload")
355adce3010b ("selftests/tc-testing: add ptp_mock Kconfig dependency")
1890cf08bd99 ("selftests/tc-testing: test that taprio can only be attached as root")
29c298d2bc82 ("selftests/tc-testing: verify that a qdisc can be grafted onto a taprio class")
4072d97ddc44 ("netem: add prng attribute to netem_sched_data")
9c87b2aeccf1 ("netem: use a seeded PRNG for generating random losses")
3cad70bc74ef ("netem: use seeded PRNG for correlated loss events")
8c21ab1bae94 ("net/sched: fq_pie: avoid stalls in fq_pie_timer()")
8fc134fee27f ("net: sched: sch_qfq: Fix UAF in qfq_dequeue()")
a5e2151ff9d5 ("net/ipv6: SKB symmetric hash should incorporate transport ports")
70ad43333cbe ("selftests/tc-testing: cls_fw: add tests for classid")
7c339083616c ("selftests/tc-testing: cls_route: add tests for classid")
e2f2fb3c352d ("selftests/tc-testing: cls_u32: add tests for classid")
ef765c258759 ("net/sched: cls_route: make netlink errors meaningful")
98cfbe4234a4 ("selftests/tc-testing: localize test resources")
d227cc0b1ee1 ("selftests/tc-testing: update test definitions for local resources")
ac9b82930964 ("selftests/tc-testing: implement tdc parallel test run")
d3fc4eea9742 ("selftests/tc-testing: update tdc documentation")
1add90738cf5 ("net_sched: constify qdisc_priv()")
54ff8ad69c6e ("net_sched: sch_fq: struct sched_data reorg")
ee9af4e14d16 ("net_sched: sch_fq: change how @inactive is tracked")
076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
8f6c4ff9e052 ("net_sched: sch_fq: always garbage collect")
2ae45136a938 ("net_sched: sch_fq: remove q->ktime_cache")
5579ee462dfe ("net_sched: export pfifo_fast prio2band[]")
29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
49e7265fd098 ("net_sched: sch_fq: add TCA_FQ_WEIGHTS attribute")
0fef0907d6fa ("netem: Annotate struct disttable with __counted_by")
c4d49196ceec ("net: sched: cls_u32: Fix allocation size in u32_init()")
54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
39d08b91646d ("net, sched: Add tcf_set_drop_reason for {__,}tcf_classify")
f157b73d5114 ("selftests: tc-testing: add missing Kconfig options to 'config'")
35027c790970 ("selftests: tc-testing: move auxiliary scripts to a dedicated folder")
ee3d12285471 ("selftests: tc-testing: add test for 'rt' upgrade on hfsc")
06e4dd18f868 ("net_sched: sch_fq: fix off-by-one error in fq_dequeue()")
81a416985698 ("net_sched: sch_fq: fastpath needs to take care of sk->sk_pacing_status")
6d25d1dc76bf ("net: sched: sch_qfq: Use non-work-conserving warning handler")
70f06c115bcc ("sched: act_ct: switch to per-action label counting")
49b02a19c23a ("net: sched: Fill in MODULE_DESCRIPTION for act_gate")
a9c92771fa23 ("net: sched: Fill in missing MODULE_DESCRIPTION for classifiers")
f96118c5d86f ("net: sched: Fill in missing MODULE_DESCRIPTION for qdiscs")
40cb2fdfed34 ("net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config")
f1a3b283f852 ("net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP")
e316dd1cf135 ("net: don't dump stack on queue timeout")
9ffa01cab069 ("selftests: tc-testing: drop '-N' argument from nsPlugin")
fa63d353ddfb ("selftests: tc-testing: rework namespaces and devices setup")
bb9623c337f5 ("selftests: tc-testing: preload all modules in kselftests")
04fd47bf70f9 ("selftests: tc-testing: use parallel tdc in kselftests")
6b78debe1c07 ("net/sched: cls_u32: replace int refcounts with proper refcounts")
54293e4d6a62 ("selftests/tc-testing: add hashtable tests for u32")
025de7b6a6dd ("selftests: tc-testing: cap parallel tdc to 4 cores")
50a5988a7a54 ("selftests: tc-testing: move back to per test ns setup")
3d5026fc5adb ("selftests: tc-testing: use netns delete from pyroute2")
3f2d94a4ff48 ("selftests: tc-testing: leverage -all in suite ns teardown")
4b480cfb1066 ("selftests: tc-testing: timeout on unbounded loops")
4968afa0143d ("selftests: tc-testing: report number of workers in use")
a79d8ba734bd ("selftests: tc-testing: remove buildebpf plugin")
8059e68b9928 ("selftests: tc-testing: remove unnecessary time.sleep")
56e16bc69bb7 ("selftests: tc-testing: prefix iproute2 functions with "ipr2"")
501679f5d4a4 ("selftests: tc-testing: cleanup on Ctrl-C")
ed346fccfc40 ("selftests: tc-testing: remove unused import")
000db9e9ad42 ("net/sched: cbs: Use units.h instead of the copy of a definition")
f7580f00cc6e ("selftests: tc-testing: remove spurious nsPlugin usage")
74f7e7eeb1d2 ("selftests: tc-testing: remove spurious './' from Makefile")
7de8b2efafeb ("selftests: tc-testing: rename concurrency.json to flower.json")
0fbb5a54f941 ("selftests: tc-testing: remove filters/tests.json")
3872347e0a16 ("net/sched: act_api: use tcf_act_for_each_action")
a0e947c9ccff ("net/sched: act_api: avoid non-contiguous action array")
e09ac779f736 ("net/sched: act_api: stop loop over ops array on NULL in tcf_action_init")
f9bfc8eb1342 ("net/sched: act_api: use tcf_act_for_each_action in tcf_idr_insert_many")
c5e2a973448d ("rtnl: add helper to check if rtnl group has listeners")
8439109b76a3 ("rtnl: add helper to check if a notification is needed")
ddb6b284bdc3 ("rtnl: add helper to send if skb is not null")
c73724bfde09 ("net/sched: act_api: don't open code max()")
8d4390f51920 ("net/sched: act_api: conditional notification of events")
e522755520ef ("net/sched: cls_api: remove 'unicast' argument from delete notification")
93775590b1ee ("net/sched: cls_api: conditional notification of events")
4b55e86736d5 ("net/sched: act_api: rely on rcu in tcf_idr_check_alloc")
1dd7f18fc0ed ("net/sched: act_api: skip idr replace on bound actions")
fb2780721ca5 ("net: sched: Move drop_reason to struct tc_skb_cb")
b6a3c6066afc ("net: sched: Make tc-related drop reason more flexible for remaining qdiscs")
2f57dd94bdef ("packet: add a generic drop reason for receive")
4cf24dc89340 ("net: sched: Add initial TC error skb drop reasons")
913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
a7042cf8f231 ("net/sched: cls_api: Expose tc block to the datapath")
415e38bf1d8d ("net/sched: act_mirred: Add helper function tcf_mirred_replace_dev")
42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
8fcb0382af6f ("net: sched: em_text: fix possible memory leak in em_text_destroy()")
ba24ea129126 ("net/sched: Retire ipt action")
6d6d80e4f6bc ("net/sched: Remove CONFIG_NET_ACT_IPT from default configs")
41bc3e8fc1f7 ("net/sched: Remove uapi support for rsvp classifier")
82b2545ed9a4 ("net/sched: Remove uapi support for tcindex classifier")
fe3b739a5472 ("net/sched: Remove uapi support for dsmark qdisc")
26cc8714fc7f ("net/sched: Remove uapi support for ATM qdisc")
33241dca4862 ("net/sched: Remove uapi support for CBQ qdisc")
2ab1efad60ad ("net/sched: cls_api: complement tcf_tfilter_dump_policy")
c2a67de9bb54 ("net/sched: introduce ACT_P_BOUND return code")
530496985cea ("net/sched: sch_api: conditional netlink notifications")
94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
405cd9fc6f44 ("net/sched: simplify tc_action_load_ops parameters")
2ffca83aa39c ("net/sched: Remove ipt action tests")
e18405d0be80 ("net: sched: track device in tcf_block_get/put_ext() only for clsact binder types")
ea937f772083 ("net: netdevsim: don't try to destroy PHC on VFs")
93590849a05e ("selftests: forwarding: Fix layer 2 miss test flakiness")
aae09a6c7783 ("net/sched: act_mirred: Don't zero blockid when net device is being deleted")
a46c31bf2744 ("net: fill in MODULE_DESCRIPTION()s for net/sched")
86fe596b588f ("net: sched: Remove NET_ACT_IPT from Kconfig")
eb2c11b27c58 ("net: bql: fix building with BQL disabled")
51270d573a8d ("tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29681
Upstream Status: linux.git
commit ee403248fa6db5ca23031fc51b06284d6855cd02
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 7 20:50:38 2022 -0800
net: remove default_device_exit()
For some reason default_device_ops kept two exit methods:
1) default_device_exit() is called for each netns being dismantled in
a cleanup_net() round. This acquires rtnl for each invocation.
2) default_device_exit_batch() is called once with the list of all netns
in the batch, allowing for a single rtnl invocation.
Get rid of the .exit() method to handle the logic from
default_device_exit_batch(), to decrease the number of rtnl acquisition
to one.
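The difference between the two exit styles can be sketched in userspace C. The lock here is only a counter and every `toy_*` name is hypothetical; the point is the number of acquisitions for a batch of N namespaces.

```c
#include <assert.h>
#include <stddef.h>

struct toy_netns {
	int cleaned;
};

/* Stand-in for rtnl: only counts how often the lock is taken. */
static unsigned int toy_lock_acquisitions;

static void toy_rtnl_lock(void)
{
	toy_lock_acquisitions++;
}

static void toy_rtnl_unlock(void)
{
}

static void toy_cleanup_one(struct toy_netns *ns)
{
	ns->cleaned = 1;
}

/* Per-netns exit: one lock round trip for every namespace. */
static void toy_exit_each(struct toy_netns *list, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		toy_rtnl_lock();
		toy_cleanup_one(&list[i]);
		toy_rtnl_unlock();
	}
}

/* Batched exit: the whole list is handled under a single lock hold. */
static void toy_exit_batch(struct toy_netns *list, size_t n)
{
	toy_rtnl_lock();
	for (size_t i = 0; i < n; i++)
		toy_cleanup_one(&list[i]);
	toy_rtnl_unlock();
}
```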
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-35058
CVE: CVE-2024-27010
Upstream Status: net.git commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Apr 15 18:07:28 2024 -0300
net/sched: Fix mirred deadlock on device recursion
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.
[..... other info removed for brevity....]
[ 82.890906]
[ 82.890906] ============================================
[ 82.890906] WARNING: possible recursive locking detected
[ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W
[ 82.890906] --------------------------------------------
[ 82.890906] ping/418 is trying to acquire lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] but task is already holding lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] other info that might help us debug this:
[ 82.890906] Possible unsafe locking scenario:
[ 82.890906]
[ 82.890906] CPU0
[ 82.890906] ----
[ 82.890906] lock(&sch->q.lock);
[ 82.890906] lock(&sch->q.lock);
[ 82.890906]
[ 82.890906] *** DEADLOCK ***
[ 82.890906]
[..... other info removed for brevity....]
Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
Another example (eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth1
tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.
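The owner check can be sketched in userspace C as follows. This is a simplified model, not the kernel code: `toy_qdisc`, `toy_xmit()` and the recursion standing in for a mirred redirect are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_qdisc {
	int owner;           /* CPU id currently inside the qdisc, -1 if none */
	unsigned int sent;
	unsigned int drops;
};

/*
 * Transmit through the qdisc on the given CPU. When redirect_to_self is
 * set, the "action" sends the packet back into the same qdisc, which
 * models the mirred loop. Returns 0 on success, -1 if dropped.
 */
static int toy_xmit(struct toy_qdisc *q, int cpu, bool redirect_to_self)
{
	int ret = 0;

	/* Same CPU is already inside: taking the lock again would deadlock. */
	if (q->owner == cpu) {
		q->drops++;
		return -1;
	}

	/* The root lock would be acquired here. */
	q->owner = cpu;

	if (redirect_to_self)
		ret = toy_xmit(q, cpu, redirect_to_self);
	else
		q->sent++;

	q->owner = -1;
	/* The root lock would be released here. */
	return ret;
}
```

A normal packet passes through and leaves the owner cleared. A packet redirected back to the same qdisc is dropped on re-entry instead of spinning on a lock the CPU already holds.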
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4 ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c025 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32098
Conflicts:
- drivers/net/ethernet/mellanox/mlx5/core/dpll.c: chunk omitted due
to missing 496fd0a26bbf73 ("mlx5: Implement SyncE support using DPLL
infrastructure")
Upstream commit(s):
commit 289e922582af5b4721ba02e86bde4d9ba918158a
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Mar 4 17:35:32 2024 -0800
dpll: move all dpll<>netdev helpers to dpll code
Older versions of GCC really want to know the full definition
of the type involved in rcu_assign_pointer().
struct dpll_pin is defined in a local header, net/core can't
reach it. Move all the netdev <> dpll code into dpll, where
the type is known. Otherwise we'd need multiple function calls
to jump between the compilation units.
This is the same problem the commit under fixes was trying to address,
but with rcu_assign_pointer() not rcu_dereference().
Some of the exports are not needed: networking core can't
be a module, so we only need exports for the helpers used by
drivers.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32098
Upstream commit(s):
commit 0d60d8df6f493bb46bf5db40d39dd60a1bafdd4e
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 23 12:32:08 2024 +0000
dpll: rely on rcu for netdev_dpll_pin()
This fixes a possible UAF in if_nlmsg_size(),
which can run without RTNL.
Add rcu protection to "struct dpll_pin"
Move netdev_dpll_pin() from netdevice.h to dpll.h to
decrease name pollution.
Note: It looks possible to stop acquiring RTNL in
netdev_dpll_pin_assign() later in net-next.
v2: do not force rcu_read_lock() in rtnl_dpll_pin_size() (Jiri Pirko)
Fixes: 5f1842692880 ("netdev: expose DPLL pin handle for netdevice")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240223123208.3543319-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
Conflicts:
- hunk for lan966x removed as it does not exist in RHEL
- context conflict caused by presence of RH_KABI macros
commit 289354f21b2c3fac93e956efd45f256a88a4d997
Author: Jakub Kicinski <kuba@kernel.org>
Date: Sat Nov 18 18:38:05 2023 -0800
net: partial revert of the "Make timestamping selectable: series
Revert following commits:
commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")
They need more time for reviews.
Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
Conflicts:
- context conflict caused by presence of RH_KABI macros
commit 0f7f463d4821a4f52fa5c0a961389e651d50c384
Author: Kory Maincent <kory.maincent@bootlin.com>
Date: Tue Nov 14 12:28:41 2023 +0100
net: Change the API of PHY default timestamp to MAC
Change the API to select MAC time stamping by default instead of the PHY.
The PHY is closer to the wire, so in theory it adds less delay than MAC
timestamping, but the reality is different. Due to a lower time stamping
clock frequency, latency on the MDIO bus and no PHC hardware
synchronization between different PHYs, PHY PTP is often less precise
than the MAC. The exception is PHYs designed specifically for PTP, but
these devices are not very widespread. To avoid breaking compatibility, I
introduce a default_timestamp flag in phy_device that is set by the phy
driver to indicate that the old API behavior is in use.
The phy_set_timestamp function is called at each call of phy_attach_direct.
In case of MAC driver using phylink this function is called when the
interface is turned up. Then if the interface goes down and up again the
last choice of timestamp will be overwritten by the default choice.
A solution could be to cache the timestamp status but it can bring other
issues. In case of SFP, if we change the module, it doesn't make sense to
blindly re-set the timestamp back to PHY, if the new module has a PHY with
mediocre timestamping capabilities.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
commit 70f7457ad6d655e65f1b93cbba2a519e4b11c946
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Jun 12 14:49:43 2023 -0700
net: create device lookup API with reference tracking
New users of dev_get_by_index() and dev_get_by_name() keep
getting added and it would be nice to steer them towards
the APIs with reference tracking.
Add variants of those calls which allocate the reference
tracker and use them in a couple of places.
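The idea of a lookup that hands back a tracked reference can be sketched in userspace C. The `toy_*` names and the tracker layout are hypothetical and much simpler than the kernel's reference tracking; the sketch only shows a tracker being filled in at lookup time and checked at release time.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev {
	const char *name;
	int ifindex;
	int refcnt;   /* total references held */
	int tracked;  /* references that have a tracker attached */
};

/* One tracker identifies one holder of a reference. */
struct toy_tracker {
	struct toy_dev *dev;
};

static struct toy_dev toy_devs[] = {
	{ "lo", 1, 0, 0 },
	{ "eth0", 2, 0, 0 },
};

/* Untracked lookup: takes a reference, but nothing records who holds it. */
static struct toy_dev *toy_get_by_index(int ifindex)
{
	for (size_t i = 0; i < sizeof(toy_devs) / sizeof(toy_devs[0]); i++) {
		if (toy_devs[i].ifindex == ifindex) {
			toy_devs[i].refcnt++;
			return &toy_devs[i];
		}
	}
	return NULL;
}

/* Tracked lookup: same reference, plus a tracker naming the holder. */
static struct toy_dev *toy_get_by_index_tracked(int ifindex,
						struct toy_tracker *t)
{
	struct toy_dev *dev = toy_get_by_index(ifindex);

	t->dev = dev;
	if (dev)
		dev->tracked++;
	return dev;
}

/* Release through the tracker; a mismatched or repeated put trips here. */
static void toy_put_tracked(struct toy_dev *dev, struct toy_tracker *t)
{
	if (!dev)
		return;
	assert(t->dev == dev);
	t->dev = NULL;
	dev->tracked--;
	dev->refcnt--;
}
```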
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4000
JIRA: https://issues.redhat.com/browse/RHEL-30145
Depends: !3939
The series updates netlink and devlink core to upstream version v6.8.
Both have to be updated at once due to circular dependencies.
Signed-off-by: Petr Oros <poros@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit b6a3c6066afc2cb7b92f45c67ab0b12ded81cb11
Author: Victor Nogueira <victor@mojatatu.com>
Date: Sat Dec 16 17:44:35 2023 -0300
net: sched: Make tc-related drop reason more flexible for remaining qdiscs
Building on Daniel's patch [1], make tc-related drop reason more
flexible for remaining qdiscs - that is, all qdiscs aside from clsact.
In essence, the drop reason will be set by cls_api and act_api in case
any error occurred in the data path. With that, we can give the user more
detailed information so that they can distinguish between a policy drop
or an error drop.
[1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit fb2780721ca5e9f78bbe4544b819b929a982df9c
Author: Victor Nogueira <victor@mojatatu.com>
Date: Sat Dec 16 17:44:34 2023 -0300
net: sched: Move drop_reason to struct tc_skb_cb
Move drop_reason from struct tcf_result to skb cb - more specifically to
struct tc_skb_cb. With that, we'll be able to also set the drop reason for
the remaining qdiscs (aside from clsact) that do not have access to
tcf_result when time comes to set the skb drop reason.
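A rough userspace sketch of keeping the drop reason in a per-packet control block rather than in a classifier result. The `toy_*` types are hypothetical; only the 48-byte size of the cb scratch area mirrors sk_buff, and the sketch copies in and out of it with memcpy to stay within standard C.

```c
#include <assert.h>
#include <string.h>

#define TOY_CB_SIZE 48 /* sk_buff provides a 48-byte cb scratch area */

enum toy_drop_reason {
	TOY_DROP_NOT_SPECIFIED = 0,
	TOY_DROP_TC_INGRESS,
	TOY_DROP_TC_ERROR,
};

/* Hypothetical packet: only the control block matters here. */
struct toy_skb {
	char cb[TOY_CB_SIZE];
};

/* Layout that the traffic-control layer overlays on the cb area. */
struct toy_tc_cb {
	enum toy_drop_reason drop_reason;
};

static_assert(sizeof(struct toy_tc_cb) <= TOY_CB_SIZE,
	      "tc control block must fit in the cb area");

/* Store the reason in the packet itself. */
static void toy_set_drop_reason(struct toy_skb *skb, enum toy_drop_reason r)
{
	struct toy_tc_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	cb.drop_reason = r;
	memcpy(skb->cb, &cb, sizeof(cb));
}

/* Any code holding the packet can read it, without a classifier result. */
static enum toy_drop_reason toy_get_drop_reason(const struct toy_skb *skb)
{
	struct toy_tc_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return cb.drop_reason;
}
```

Because the reason travels with the packet, a qdisc that never sees the classifier's result structure can still report why the packet was dropped.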
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit 54a59aed395ce0f4177b5212e5746a6462de3ad9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon Oct 9 11:26:54 2023 +0200
net, sched: Make tc-related drop reason more flexible
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.
Victor kicked off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. For that case, the proposal
added two new drop reasons, SKB_DROP_REASON_TC_INGRESS_ERROR and
SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required attaching to tcf_classify via kprobe/kretprobe
to debug the skb and the returned error in more depth.
In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straightforward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.
Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.
For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.
While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.
With regard to the latter:
[...] I think Alastair (bpftrace) is working on auto-prettifying enums when
bpftrace outputs maps. So we can do something like:
$ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
Attaching 1 probe...
^C
@[SKB_DROP_REASON_TC_INGRESS]: 2
@[SKB_CONSUMED]: 34
^^^^^^^^^^^^ names!!
Auto-magically. [...]
Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.
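
The fallback behaviour described above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel implementation: the enum values are illustrative, `struct tcf_result_model` carries only the one field of interest, and `ingress_drop_reason()` is a hypothetical handler. Only the shape of the helper (store a reason in the result) follows the commit message.

```c
#include <assert.h>

/* Simplified stand-ins for enum skb_drop_reason values.
 * The numeric values here are illustrative only. */
enum drop_reason_model {
	DROP_REASON_NOT_SPECIFIED = 0,
	DROP_REASON_TC_INGRESS,
	DROP_REASON_TC_EGRESS,
	DROP_REASON_TC_ERROR,
};

/* Model of struct tcf_result carrying only the field of interest. */
struct tcf_result_model {
	enum drop_reason_model drop_reason;
};

/* Mirrors the helper named in the commit message: store the reason
 * in the classification result. */
static inline void tcf_set_drop_reason_model(struct tcf_result_model *res,
					      enum drop_reason_model reason)
{
	res->drop_reason = reason;
}

/* Hypothetical ingress handler: pre-load the generic fallback, let an
 * annotated error location override it, then report the reason the
 * packet would be freed with. */
static enum drop_reason_model ingress_drop_reason(int classifier_failed)
{
	struct tcf_result_model res = { DROP_REASON_NOT_SPECIFIED };

	tcf_set_drop_reason_model(&res, DROP_REASON_TC_INGRESS);
	if (classifier_failed)
		tcf_set_drop_reason_model(&res, DROP_REASON_TC_ERROR);
	return res.drop_reason;
}
```

With this layout the verdict logic stays untouched: a policy drop keeps the generic ingress reason, and only code paths that were explicitly annotated report an error reason.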
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3966
# Merge Request Required Information
This is the first pass at drm dependencies for backporting 6.8 or 6.9 into RHEL 9.5
Marked as draft as I think there will be a few more patches needed, and maybe some other teams might be in the same area (e.g. kunit).
JIRA: https://issues.redhat.com/browse/RHEL-24101
Signed-off-by: Dave Airlie <airlied@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Merged-by: Patrick Talbert <ptalbert@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30145
Upstream commit(s):
commit 9f30831390ede02d9fcd54fd9ea5a585ab649f4a
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 9 18:12:48 2024 +0000
net: add rcu safety to rtnl_prop_list_size()
rtnl_prop_list_size() can be called while alternative names
are added or removed concurrently.
if_nlmsg_size() / rtnl_calcit() can indeed be called
without RTNL held.
Use explicit RCU protection to avoid UAF.
Fixes: 88f4fb0c74 ("net: rtnetlink: put alternative names to getlink message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240209181248.96637-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3939
JIRA: https://issues.redhat.com/browse/RHEL-30656
Tested: LNST
Depends: !3918
The series updates netlink and devlink core to upstream version v6.6. Both have to be updated at once due to circular dependencies.
Omitted-fix: 83f2df9d66bc
The fix needs additional devlink dependencies and will be applied in the next rebase series, covered by RHEL-30145.
Commits:
```
6978052448f9 ("netlink: remove unused 'compare' function")
74bf6477c18b ("netlink-specs: add partial specification for devlink")
82b3297009b6 ("netlink: specs: allow uapi-header in genetlink")
56c874f7dbca ("tools: ynl: skip the explicit op array size when not needed")
8da3a5598f75 ("ynl: allow to encode u8 attr")
bc77f7318da8 ("tools: ynl: add the Python requirements.txt file")
dd3a7d58dcc2 ("tools: ynl: Add missing types to encode/decode")
4c6170d1ae2c ("tools: ynl: default to treating enums as flags for mask generation")
bec0b7a2db35 ("tools: ynl: Add struct parsing to nlspec")
b423c3c86325 ("tools: ynl: Add C array attribute decoding to ynl")
2607191395bd ("tools: ynl: Add struct attr decoding to ynl")
f036d936ca57 ("tools: ynl: Add fixed-header support to ynl")
643ef4a676e3 ("netlink: specs: add partial specification for openvswitch")
88e288968412 ("docs: netlink: document struct support for genetlink-legacy")
04eac39361d3 ("docs: netlink: document the sub-type attribute property")
9f7cc57fe550 ("tools: ynl: support byte-order in cli")
a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
48993e22d23a ("tools: ynl: replace print with NlError")
f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
ebe3bdc4359e ("tools: ynl: throw a more meaningful exception if family not supported")
3ea31e66644b ("tools: ynl: Remove absolute paths to yaml files from ethtool testing tool")
85a4abed1554 ("tools: ynl: Rename ethtool to ethtool.py")
d913d32cc270 ("netlink: Use copy_to_user() for optval in netlink_getsockopt().")
a939d14919b7 ("netlink: annotate accesses to nlk->cb_running")
7c2435ef76e5 ("tools: ynl: Use dict of predefined Structs to decode scalar types")
bddd2e561b0a ("tools: ynl: Handle byte-order in struct members")
081e8df68199 ("tools: ynl: avoid dict errors on older Python versions")
9b66ee06e5ca ("net: ynl: prefix uAPI header include with uapi/")
0684f29a89e5 ("netlink: specs: correct types of legacy arrays")
6d6bae63053d ("doc: ynl: Add doc attr to struct members in genetlink-legacy spec")
5ac18889bde0 ("tools: ynl: Initialise fixed headers to 0 in genetlink-legacy")
313a7a808ca8 ("tools: ynl: Support enums in struct members in genetlink-legacy")
93b230b549bc ("netlink: specs: add ynl spec for ovs_flow")
f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
91dfaef243cd ("tools: ynl-gen: add extra headers for user space")
6ad49839ba9b ("tools: ynl-gen: fix unused / pad attribute handling")
67c65ce762ad ("tools: ynl-gen: don't override pure nested struct")
5605f102378f ("tools: ynl-gen: loosen type consistency check for events")
eef9b794eac8 ("tools: ynl-gen: add error checking for nested structs")
21b6e302789c ("tools: ynl-gen: generate enum-to-string helpers")
dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
5d58f911c755 ("tools: ynl-gen: generate alloc and free helpers for req")
8cb6afb33541 ("tools: ynl-gen: switch to family struct")
59d814f0f285 ("tools: ynl-gen: generate static descriptions of notifications")
a99bfdf64795 ("tools: ynl-gen: clean up stray new lines at the end of reply-less requests")
86878f14d71a ("tools: ynl: user space helpers")
d75fdfbc6f26 ("tools: ynl: support fou and netdev in C")
ee0202e2e731 ("tools: ynl: add sample for netdev")
f6ca5baf2a86 ("netlink: specs: ethtool: fix random typos")
2cc9671a82e3 ("tools: ynl-gen: fill in support for MultiAttr scalars")
58da455b31ba ("tools: ynl-gen: improve unwind on parsing errors")
7a11f70ce882 ("tools: ynl: generate code for the handshake family")
8947e5037371 ("netlink: specs: devlink: fill in some details important for C")
9858bfc271de ("tools: ynl-gen: use enum names in op strmap more carefully")
6f115d4575ab ("tools: ynl-gen: refactor strmap helper generation")
ff6db4b58c93 ("tools: ynl-gen: enable code gen for directional specs")
6afaa0ef9b0e ("tools: ynl-gen: try to sort the types more intelligently")
37487f93b125 ("tools: ynl-gen: inherit struct use info")
eae7af21bdb9 ("tools: ynl-gen: walk nested types in depth")
168dea20ecef ("tools: ynl-gen: don't generate forward declarations for policies")
0a9471219672 ("tools: ynl-gen: don't generate forward declarations for policies - regen")
5d1a30eb989a ("tools: ynl: generate code for the devlink family")
fff8660b5425 ("tools: ynl: add sample for devlink")
30b5c720e1a9 ("tools: ynl-gen: cleanup user space header includes")
9b52fd4b6305 ("tools: ynl: regen: cleanup user space header includes")
820343ccbb2e ("tools: ynl-gen: complete the C keyword list")
2c0f1466867c ("tools: ynl-gen: combine else with closing bracket")
e4ea3cc68472 ("tools: ynl-gen: get attr type outside of if()")
7234415b8f86 ("tools: ynl: regen: regenerate the if ladders")
f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
d0915d64c3a6 ("tools: ynl: regen: stop generating common notification handlers")
ced1568862bd ("tools: ynl-gen: sanitize notification tracking")
6da3424fd629 ("tools: ynl-gen: support code gen for events")
6f96ec73cb5a ("tools: ynl-gen: don't pass op_name to RenderInfo")
76abff37f0d7 ("tools: ynl-gen: support / skip pads on the way to kernel")
008bcd6835a2 ("tools: ynl-gen: support excluding tricky ops")
33eedb0071c8 ("tools: ynl-gen: record extra args for regen")
ed2042cc77f1 ("netlink: specs: support setting prefix-name per attribute")
d4813b11d679 ("netlink: specs: ethtool: add C render hints")
dddc9f53da3e ("tools: ynl-gen: don't generate enum types if unnamed")
2c9d47a095f7 ("tools: ynl-gen: resolve enum vs struct name conflicts")
180ad455273a ("netlink: specs: ethtool: add empty enum stringset")
37c852222712 ("netlink: specs: ethtool: untangle UDP tunnels and cable test a bit")
709d0c3b3d4c ("netlink: specs: ethtool: untangle stats-get")
68335713d2ea ("netlink: specs: ethtool: mark pads as pads")
2d7be507d65e ("tools: ynl: generate code for the ethtool family")
f561ff232a6b ("tools: ynl: add sample for ethtool")
10c4d2a7b88d ("tools: ynl-gen: correct enum policies")
be093a80dff0 ("tools: ynl-gen: inherit policy in multi-attr")
fa0e21fa4443 ("rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFO")
89da780aa4c7 ("rtnetlink: move validate_linkmsg out of do_setlink")
f0ec58d557d6 ("tools: ynl: work around stale system headers")
6907217a8054 ("netlink: specs: fixup openvswitch specs for code generation")
8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
0c3d6fd4b89c ("tools: ynl: improve the direct-include header guard logic")
737eab775d36 ("netlink: specs: add display-hint to schema definitions")
d8eea68d913c ("tools: ynl: add display-hint support to ynl")
334f39ce17ef ("netlink: specs: add display hints to ovs_flow")
25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
b8e39b38487e ("netlink: Make use of __assign_bit() API")
633d76ad01ad ("devlink: remove reload failed checks in params get/set callbacks")
4a59cdfd6699 ("rtnetlink: Move nesting cancellation rollback to proper function")
5766946ea511 ("genetlink: add explicit ordering break check for split ops")
a3377386b564 ("netlink: Reverse the patch which removed filtering")
a4c9a56e6a2c ("netlink: Add new netlink_release function")
d7ddf5f4269f ("tools: ynl-gen: fix enum index in _decode_enum(..)")
df15c15e6c98 ("tools: ynl-gen: fix parse multi-attr enum attribute")
5fac9b7c16c5 ("netlink: allow be16 and be32 types in all uint policy checks")
e5c157f081ab ("ynl: expose xdp-zc-max-segs")
37844828d290 ("ynl: mark max/mask as private for kdoc")
25b5a2a1905f ("ynl: regenerate all headers")
26fdb67e8b4a ("ynl: print xdp-zc-max-segs in the sample")
759ab1edb56c ("net: store netdevs in an xarray")
84e00d9bd4e4 ("net: convert some netlink netdev iterators to depend on the xarray")
2628d40899d1 ("devlink: Remove unused extern declaration devlink_port_region_destroy()")
78c96d7b7c9a ("netlink: specs: add dump-strict flag for dont-validate property")
dc7b81a828db ("ynl-gen-c.py: filter rendering of validate field values for split ops")
eab7be688b44 ("ynl-gen-c.py: allow directional model for kernel mode")
fa8ba3502ade ("ynl-gen-c.py: render netlink policies static for split ops")
ba0f66c95fa6 ("devlink: rename devlink_nl_ops to devlink_nl_small_ops")
d61aedcf628e ("devlink: rename couple of doit netlink callbacks to match generated names")
491a24872a64 ("devlink: introduce couple of dumpit callbacks for split ops")
8300dce542e4 ("devlink: un-static devlink_nl_pre/post_doit()")
759f661012d1 ("netlink: specs: devlink: add info-get dump op")
6b7c486cae81 ("devlink: add split ops generated according to spec")
b2551b1517d8 ("devlink: include the generated netlink header")
6e067d0cab68 ("devlink: use generated split ops and remove duplicated commands from small ops")
b876b71a6ac2 ("devlink: Remove unused devlink_dpipe_table_resource_set() declaration")
2c0e9f3806c4 ("tools: ynl-gen: avoid rendering empty validate field")
832140804e3b ("devlink: clear flag on port register error path")
cd3112ebbaf4 ("tools: ynl-gen: add missing empty line between policies")
8fe08d70a2b6 ("netlink: convert nlk->flags to atomic flags")
63618463cb94 ("devlink: parse linecard attr in doit() callbacks")
41a1d4d1399a ("devlink: parse rate attrs in doit() callbacks")
ee6d78ac28c7 ("devlink: introduce devlink_nl_pre_doit_port*() helper functions")
8fa995ad1f7f ("devlink: rename doit callbacks for per-instance dump commands")
24c8e56d4f98 ("devlink: introduce dumpit callbacks for split ops")
7d3c6fec6135 ("devlink: pass flags as an arg of dump_one() callback")
7199c86247e9 ("netlink: specs: devlink: add commands that do per-instance dump")
ddff283280ba ("devlink: remove duplicate temporary netlink callback prototypes")
833e479d330c ("devlink: remove converted commands from small ops")
4a1b5aa8b5c7 ("devlink: allow user to narrow per-instance dumps by passing handle attrs")
34493336e7d3 ("netlink: specs: devlink: extend per-instance dump commands to accept instance attributes")
b03f13cb67a5 ("devlink: extend health reporter dump selector by port index")
0149bca17262 ("netlink: specs: devlink: extend health reporter dump attributes by port index")
84817d8c6042 ("genetlink: push conditional locking into dumpit/done")
fde9bd4a4d41 ("genetlink: make genl_info->nlhdr const")
bffcc6882a1b ("genetlink: remove userhdr from struct genl_info")
9272af109fe6 ("genetlink: add struct genl_info to struct genl_dumpit_info")
7288dd2fd488 ("genetlink: use attrs from struct genl_info")
5c670a010de4 ("genetlink: add a family pointer to struct genl_info")
5aa51d9f889c ("genetlink: add genlmsg_iput() API")
0e19d3108aea ("netdev-genl: use struct genl_info for reply construction")
ec0e5b09b834 ("ethtool: netlink: simplify arguments to ethnl_default_parse()")
f946270d05c2 ("ethtool: netlink: always pass genl_info to .prepare_data")
956db0a13b47 ("net: warn about attempts to register negative ifindex")
ded67d90815a ("netlink: specs: add ovs_vport new command")
7582113c6917 ("tools: ynl: add more info to KeyErrors on missing attrs")
d56b699d76d1 ("Documentation: Fix typos")
f65f305ae008 ("tools: ynl-gen: use temporary file for rendering")
f534f6581ec0 ("net: validate veth and vxcan peer ifindexes")
649bde9004ac ("tools: ynl: allow passing binary data")
a149a3a13bbc ("tools: ynl-gen: set length of binary fields")
dc2ef94d8926 ("tools: ynl-gen: fix collecting global policy attrs")
4c8c24e801e6 ("tools: ynl-gen: support empty attribute lists")
e83d4e9b2d0f ("netlink: specs: fix indent in fou")
a02430c06f56 ("tools: ynl-gen: fix uAPI generation after tempfile changes")
52d08fda3516 ("doc/netlink: Add delete operation to ovs_vport spec")
ed68c58c0eb4 ("doc/netlink: Add a schema for netlink-raw families")
294f37fc8772 ("doc/netlink: Update genetlink-legacy documentation")
2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions")
88901b967958 ("tools/ynl: Add mcast-group schema parsing to ynl")
fb0a06d455d6 ("tools/net/ynl: Fix extack parsing with fixed header genlmsg")
e46dd903efe3 ("tools/net/ynl: Add support for netlink-raw families")
0493e56d021d ("tools/net/ynl: Implement nlattr array-nest decoding in ynl")
1768d8a767f8 ("tools/net/ynl: Add support for create flags")
dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
b2f63d904e72 ("doc/netlink: Add spec for rt link messages")
023289b4f582 ("doc/netlink: Add spec for rt route messages")
56e65312830e ("devlink: push object register/unregister notifications into separate helpers")
eec1e5ea1d71 ("devlink: push port related code into separate file")
2b4d8bb08889 ("devlink: push shared buffer related code into separate file")
2475ed158c47 ("devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper")
a9fd44b15fc5 ("devlink: push dpipe related code into separate file")
a9f960074ecd ("devlink: push resource related code into separate file")
830c41e1e987 ("devlink: push param related code into separate file")
1aa47ca1f52e ("devlink: push region related code into separate file")
85facf94fd80 ("devlink: use tracepoint_enabled() helper")
4bbdec80ff27 ("devlink: push trap related code into separate file")
7cc7194e85ca ("devlink: push rate related code into separate file")
9edbe6f36c5f ("devlink: push linecard related code into separate file")
890c55667437 ("devlink: move tracepoint definitions into core.c")
29a390d17748 ("devlink: move small_ops definition into netlink.c")
71179ac5c211 ("devlink: move devlink_notify_register/unregister() to dev.c")
ee940b57a929 ("doc/netlink: Fix missing classic_netlink doc reference")
d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
0f4d44f6ee04 ("netlink: specs: devlink: fix reply command values")
69844e335d8c ("selftests/bpf: Fix sockopt_sk selftest")
e4fe082c38cd ("tools: ynl: make sure we always pass yarg to mnl_cb_run")
5d78b73e8514 ("tools: ynl: don't leak mcast_groups on init error")
b6c65eb20ffa ("tools: ynl: fix handling of multiple mcast groups")
ceaac91dcd06 ("net: make sure we never create ifindex = 0")
0e0939c0adf9 ("net-procfs: use xarray iterator to implement /proc/net/dev")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3918
JIRA: https://issues.redhat.com/browse/RHEL-30344
Tested: LNST
Commits:
```
cfdf0d9ae75b ("rtnetlink: use nlmsg_notify() in rtnetlink_send()")
fef773fc8110 ("netlink: Deal with ESRCH error in nlmsg_notify()")
f9b282b36dfa ("net: netlink: add the case when nlh is NULL")
bc830525615d ("net: netlink: Remove unused function")
d3432bf10f17 ("net: Support filtering interfaces on no master")
4fc29989835a ("net: rtnetlink: convert rcu_assign_pointer to RCU_INIT_POINTER")
7707a4d01a64 ("netlink: annotate data races around nlk->bound")
549017aa1bb7 ("netlink: remove netlink_broadcast_filtered")
50af5969bb22 ("net/core: Remove unused assignment operations and variable")
efd38f75bb04 ("net: rtnetlink: use __dev_addr_set()")
f123cffdd8fe ("net: netlink: af_netlink: Prevent empty skb by adding a check on len.")
d59a67f2f3f3 ("netlink: remove nl_set_extack_cookie_u32()")
ede6c39c4f90 ("net: make net->dev_unreg_count atomic")
7b8135f4df98 ("rtnetlink: add new rtm tunnel api for tunnel id filtering")
5d26cff5bdbe ("net: account alternate interface name memory")
155fb43b70b5 ("net: limit altnames to 64k total")
0caf6d992219 ("af_netlink: Fix shift out of bounds in group mask calculation")
0b5c21bbc01e ("net: ensure net_todo_list is processed quickly")
ef2a7c9065ce ("rtnetlink: return ENODEV when ifname does not exist and group is given")
5ea08b5286f6 ("rtnetlink: enable alt_ifname for setlink/newlink")
dee04163e9f2 ("rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink")
b6177d3240a4 ("rtnetlink: return EINVAL when request cannot succeed")
99c07327ae11 ("netlink: reset network and mac headers in netlink_dump()")
6f37c9f9dfbf ("Revert "rtnetlink: return EINVAL when request cannot succeed"")
c92bf26ccebc ("rtnl: allocate more attr tables on the heap")
63105e83987a ("rtnl: split __rtnl_newlink() into two functions")
02839cc8d72b ("rtnl: move rtnl_newlink_create()")
d5076fe4049c ("netlink: do not reset transport header in netlink_recvmsg()")
f329a0ebeaba ("genetlink: correct uAPI defines")
5c221f0af68c ("net: add missing kdoc for struct genl_multicast_group::flags")
30b6055428a9 ("net: improve and fix netlink kdoc")
0bf73255d3a3 ("netlink: fix some kernel-doc comments")
8f1948bdcf2f ("genetlink: hold read cb_lock during iteration of genl_fam_idr in genl_bind()")
abbc79280abc ("net: rtnetlink: use netif_oper_up instead of open code")
710d21fdff9a ("netlink: Bounds-check struct nlmsgerr creation")
08724ef69907 ("netlink: introduce NLA_POLICY_MAX_BE")
e7af210e6dd0 ("netfilter: nft_payload: reject out-of-range attributes via policy")
a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up")
5493a2ad0d20 ("docs: netlink: clarify the historical baggage of Netlink flags")
7354c9024f28 ("netlink: hide validation union fields from kdoc")
738136a0e375 ("netlink: split up copies in the ack construction")
1d997f101307 ("rtnetlink: pass netlink message header and portid to rtnl_configure_link()")
77f4aa9a2a17 ("net: add new helper unregister_netdevice_many_notify")
d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
ecaf75ffd5f5 ("netlink: introduce bigendian integer types")
e69761483361 ("netlink: Fix potential skb memleak in netlink_ack")
8e18be7610ae ("lib: Fix some kernel-doc comments")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
c73a72f4cbb4 ("netlink: remove the flex array from struct nlmsghdr")
f0950402e8c7 ("netlink: prevent potential spectre v1 gadgets")
c1bb9484e3b0 ("netlink: annotate data races around nlk->portid")
004db64d185a ("netlink: annotate data races around dst_portid and dst_group")
9b663b5cbb15 ("netlink: annotate data races around sk_state")
9d6a65079c98 ("docs: add more netlink docs (incl. spec docs)")
e616c07ca518 ("netlink: add schemas for YAML specs")
be5bea1cc0bf ("net: add basic C code generators for Netlink")
4eb77b4ecd3c ("netlink: add a proto specification for FOU")
3a330496baa8 ("net: fou: regenerate the uAPI from the spec")
08d323234d10 ("net: fou: rename the source for linking")
1d562c32e439 ("net: fou: use policy and operation tables generated from the spec")
e4b48ed460d3 ("tools: ynl: add a completely generic client")
66fa34b9c2a5 ("tools: ynl: support kdocs for flags in code generation")
b49c34e217c6 ("tools: ynl: rename ops_list -> msg_list")
3a43ded081f8 ("tools: ynl: store ops in ordered dict to avoid random ordering")
70eb3911d80f ("net: netlink: recommend policy range validation")
eaf317e7d2bb ("tools: ynl-gen: prevent do / dump reordering")
4e4480e89c47 ("tools: ynl: move the cli and netlink code around")
3aacf8281336 ("tools: ynl: add an object hierarchy to represent parsed spec")
30a5c6c8104f ("tools: ynl: use the common YAML loading and validation code")
19b64b48a33e ("tools: ynl: add support for types needed by ethtool")
fd0616d34274 ("tools: ynl: support directional enum-model in CLI")
90256f3f8093 ("tools: ynl: support multi-attr")
4cd2796f3f8d ("tools: ynl: support pretty printing bad attribute names")
8dfec0a88868 ("tools: ynl: use operation names from spec on the CLI")
5c6674f6eb52 ("tools: ynl: load jsonschema on demand")
8403bf044530 ("netlink: specs: finish up operation enum-models")
01e47a372268 ("docs: netlink: add a starting guide for working with specs")
981cbcb030d9 ("tools: net: use python3 explicitly")
f1db99c07b4f ("string_helpers: Move string_is_valid() to the header")
d4545bf9c33b ("genetlink: Use string_is_terminated() helper")
f7cf644796fc ("tools: ynl-gen: fix single attribute structs with attr 0 only")
b9d3a3e4ae0c ("tools: ynl-gen: re-raise the exception instead of printing")
d77e7eceeac9 ("tools: net: add __pycache__ to gitignore")
7cf93538e087 ("tools: ynl: fully inherit attrs in subsets")
ad4fafcde5bc ("tools: ynl: use 1 as the default for first entry in attrs/ops")
bcec7171eba9 ("netlink: specs: update for codegen enumerating from 1")
37d9df224d1e ("ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause")
6517a60b0307 ("tools: ynl: move the enum classes to shared code")
c311aaa74ca1 ("tools: ynl: fix enum-as-flags in the generic CLI")
8f76a4f80fba ("tools: ynl: fix render-max for flags definition")
bf51d27704c9 ("tools: ynl: fix get_mask utility routine")
054abb515f34 ("tools: ynl: make definitions optional again")
4e16b6a748df ("ynl: broaden the license even more")
cfab77c0b545 ("ynl: make the tooling check the license")
758d29fb3a8b ("tools: ynl: Fix genlmsg header encoding formats")
a1865f2e7d10 ("netlink: annotate lockless accesses to nlk->max_recvmsg_len")
59d3efd27c11 ("rtnetlink: Restore RTM_NEW/DELLINK notification behavior")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3968
JIRA: https://issues.redhat.com/browse/RHEL-28590
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3841
Tested: bpf tc selftests pass, manual tests that the tcx hooks work as
expected.
Add the new tcx hook for bpf. It attaches at a similar place as the tc
hook but has several advantages: it is based on the new multi prog
infrastructure in the kernel to allow adding multiple bpf programs at
the same hook; it follows the link semantics most other bpf hooks use
which gives applications better control over the lifecycle of the bpf
program; and tcx does not require a qdisc, making the setup simpler.
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-24101
Upstream Status: v6.5-rc1
This doesn't backport the namespace chunk that isn't
in RHEL yet.
Conflicts:
net/core/net_namespace.c
commit b6d7c0eb2dcbd238fa233a3a1737654e380e784a
Author: Andrzej Hajda <andrzej.hajda@intel.com>
AuthorDate: Fri Jun 2 12:21:34 2023 +0200
Commit: Jakub Kicinski <kuba@kernel.org>
CommitDate: Mon Jun 5 15:28:42 2023 -0700
In case the library is tracking a busy subsystem, simply
printing the stack for every active reference will spam the log
with long, hard to read, redundant stack traces. To improve
readability, the following changes have been made:
- reports are printed per stack_handle - log is more compact,
- added display name for ref_tracker_dir - it will differentiate
multiple subsystems,
- stack trace is printed indented, in the same printk call,
- info about dropped references is printed as well.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit ceaac91dcd065db781d1ed5dfaef0686b8ec44dc
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Jul 31 10:11:58 2023 -0700
net: make sure we never create ifindex = 0
Instead of allocating from 1, use the proper xa_init flag
to protect ourselves from IDs wrapping back to 0.
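
To illustrate why starting the cursor at 1 is not enough, here is a small user-space model of a cyclic ID allocator. It is a sketch under assumptions, not the xarray code: `struct id_alloc`, `id_alloc_cyclic()` and `id_after_wrap()` are hypothetical names, and the ID space is shrunk to 0..7 so the wrap is easy to see. The commit message does not name the flag; presumably it is the xarray flag that reserves index 0 (XA_FLAGS_ALLOC1), which corresponds to `min = 1` below.

```c
#include <assert.h>
#include <stdbool.h>

#define ID_MAX 7 /* tiny ID space so wrap-around is easy to observe */

struct id_alloc {
	bool used[ID_MAX + 1];
	int min;  /* lowest ID the allocator may hand out */
	int next; /* where the next cyclic search starts */
};

/* Cyclic allocation: search from ->next up to ID_MAX, then wrap to
 * ->min. Returns the allocated ID, or -1 if the space is exhausted. */
static int id_alloc_cyclic(struct id_alloc *a)
{
	int span = ID_MAX - a->min + 1;

	for (int i = 0; i < span; i++) {
		int id = a->min + (a->next - a->min + i) % span;

		if (!a->used[id]) {
			a->used[id] = true;
			a->next = (id == ID_MAX) ? a->min : id + 1;
			return id;
		}
	}
	return -1;
}

static void id_free(struct id_alloc *a, int id)
{
	a->used[id] = false;
}

/* Start the cursor at 1, allocate 1..ID_MAX so the cursor wraps,
 * free ID 3, and return what the next allocation hands out. */
static int id_after_wrap(int min)
{
	struct id_alloc a = { .min = min, .next = 1 };

	for (int i = 1; i <= ID_MAX; i++)
		id_alloc_cyclic(&a);
	id_free(&a, 3);
	return id_alloc_cyclic(&a);
}
```

With `min = 0` the wrapped cursor lands on 0 and hands it out; with `min = 1` the allocator skips 0 structurally and reuses the freed ID instead.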
Fixes: 759ab1edb56c ("net: store netdevs in an xarray")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit 956db0a13b47df7f3d6d624394e602e8bf9b057e
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Aug 14 13:56:25 2023 -0700
net: warn about attempts to register negative ifindex
Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().
OvS is susceptible to this problem and lacks validation (fix posted
separately for net).
Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
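
The ambiguity can be shown with a small user-space model. This is an illustrative sketch, not the kernel's dev_index_reserve(): `struct index_table`, `reserve_unchecked()` and `reserve_checked()` are hypothetical, the `inserted` counter stands in for the xarray insert, and the warning the commit adds is not modeled.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the xarray: only counts successful inserts. */
struct index_table {
	int inserted;
};

/* Before the fix (model): any requested index is inserted and echoed
 * back, even though valid indexes and negative errnos share the int. */
static int reserve_unchecked(struct index_table *t, int ifindex)
{
	t->inserted++;
	return ifindex;
}

/* After the fix (model): negative requests are rejected up front, so a
 * negative return value always means "nothing was inserted". */
static int reserve_checked(struct index_table *t, int ifindex)
{
	if (ifindex < 0)
		return -EINVAL;
	t->inserted++;
	return ifindex;
}
```

In the unchecked variant, a request for index `-ENOMEM` returns a value the caller reads as an allocation failure while the table already holds an entry; the checked variant keeps the return value and the table state consistent.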
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit 759ab1edb56c88906830fd6b2e7b12514dd32758
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Jul 26 11:55:29 2023 -0700
net: store netdevs in an xarray
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.
Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..
To illustrate let's look at an example:
System state:
start: # [A, B, C]
del: B # [A, C]
With the hash table we may dump [A, B], missing C completely even
though it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:
System state:
start: # [A, B]
add: C # [A, C, B]
del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).
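
The difference between the two resume strategies can be replayed in a small user-space model. This is a sketch under assumptions, not the kernel code: a sorted array stands in for both the hash bucket and the xarray, devices A, B and C are given ifindexes 1, 2 and 3, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8

/* A device list kept sorted by ifindex (0 is never a valid ifindex). */
struct dev_list {
	int ifindex[MAX_DEVS];
	size_t n;
};

static void dev_del(struct dev_list *l, int ifindex)
{
	size_t j = 0;

	for (size_t i = 0; i < l->n; i++)
		if (l->ifindex[i] != ifindex)
			l->ifindex[j++] = l->ifindex[i];
	l->n = j;
}

/* Hash-bucket style resume: the cursor is "how many entries were
 * already dumped", so a deletion shifts later entries under it. */
static size_t dump_by_position(const struct dev_list *l, size_t *pos,
			       size_t chunk, int *out)
{
	size_t k = 0;

	while (k < chunk && *pos < l->n)
		out[k++] = l->ifindex[(*pos)++];
	return k;
}

/* xarray style resume: the cursor is the last ifindex dumped, so the
 * next chunk starts at the first device with a larger ifindex. */
static size_t dump_by_ifindex(const struct dev_list *l, int *last,
			      size_t chunk, int *out)
{
	size_t k = 0;

	for (size_t i = 0; i < l->n && k < chunk; i++) {
		if (l->ifindex[i] > *last) {
			out[k++] = l->ifindex[i];
			*last = l->ifindex[i];
		}
	}
	return k;
}

/* Replay the first example: start with [A, B, C], dump a chunk of two,
 * delete B, dump the next chunk. Returns 1 if C was dumped, else 0. */
static int dump_sees_c(int use_ifindex_cursor)
{
	struct dev_list l = { .ifindex = { 1, 2, 3 }, .n = 3 };
	int out[MAX_DEVS];
	size_t pos = 0, got;
	int last = 0;

	if (use_ifindex_cursor)
		dump_by_ifindex(&l, &last, 2, out);
	else
		dump_by_position(&l, &pos, 2, out);

	dev_del(&l, 2); /* "del: B" between the two chunks */

	got = use_ifindex_cursor ? dump_by_ifindex(&l, &last, 2, out)
				 : dump_by_position(&l, &pos, 2, out);
	for (size_t i = 0; i < got; i++)
		if (out[i] == 3)
			return 1;
	return 0;
}
```

The position cursor ends up past the end of the shortened list and never reports C, while the ifindex cursor resumes at the first index greater than 2 and finds it.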
To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (loopback is not counted in
the number of devs):
 #devs |  hash |   xa | delta
     2 |  18.3 | 20.1 | + 9.8%
    16 |  18.3 | 20.1 | + 9.5%
    64 |  18.3 | 26.3 | +43.8%
   128 |  20.4 | 26.3 | +28.6%
   256 |  20.0 | 26.4 | +32.1%
  1024 |  26.6 | 26.7 | + 0.2%
  8192 | 541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order; if the pattern is more
random, xa starts to win at 512 devices already. But that's a lot
of devices, in practice.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
Conflicts:
* net/core/dev.c
context conflict due to missing commit 2b0cfa6e49566 ("net: add
generic percpu page_pool allocator")
* net/core/sysctl_net_core.c
context conflict due to missing commit 2658b5a8a4eee ("net: introduce
struct net_hotdata")
commit 490a79faf95e705ba0ffd9ebf04a624b379e53c9
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Mar 6 16:00:30 2024 +0000
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
Conflicts:
* include/linux/netdevice.h
Adjusted due to KABI reservations made by RHEL
commit 3b3a52715a ("net: exclude BPF/XDP from kABI")
commit 49e47a5b6145d86c30022fe0e949bbb24bae28ba
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Aug 2 18:02:29 2023 -0700
net: move struct netdev_rx_queue out of netdevice.h
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.
In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.
We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
commit 5c3b74a92aa285a3df722bf6329ba7ccf70346d6
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Jun 6 07:41:15 2023 +0000
rfs: annotate lockless accesses to RFS sock flow table
Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.
This also prevents a (smart?) compiler from removing the condition in:
if (table->ents[index] != newval)
table->ents[index] = newval;
We need the condition to avoid dirtying a shared cache line.
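A minimal userspace sketch of the annotated pattern, assuming GNU C
(__typeof__). The READ_ONCE()/WRITE_ONCE() definitions here are simplified
stand-ins for the kernel macros, and toy_flow_table, toy_record_flow and
TOY_ENTS are hypothetical names, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros: force one volatile access. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

#define TOY_ENTS 8 /* hypothetical table size, must be a power of two */

struct toy_flow_table {
	uint32_t mask;           /* TOY_ENTS - 1 */
	uint32_t ents[TOY_ENTS];
};

/*
 * Record newval for the flow identified by hash. The store is skipped when
 * the entry already holds newval, so the shared cache line is not dirtied.
 * Returns true if a store was actually performed.
 */
static bool toy_record_flow(struct toy_flow_table *t, uint32_t hash,
			    uint32_t newval)
{
	uint32_t index = hash & t->mask;

	assert(t->mask == TOY_ENTS - 1);
	if (READ_ONCE(t->ents[index]) != newval) {
		WRITE_ONCE(t->ents[index], newval);
		return true;
	}
	return false;
}
```

Because both accesses are volatile, the compiler has to keep the load and
the comparison instead of turning them into an unconditional store.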
Fixes: fec5e652e5 ("rfs: Receive Flow Steering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28590
commit 28d18b673ffa2d13112ddb6e4c32c60d9b0cda50
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri Aug 25 15:49:45 2023 +0200
net: Fix skb consume leak in sch_handle_egress
Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
[...]
unreferenced object 0xffff88818bcb4f00 (size 232):
comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff ..pa.....A1.....
backtrace:
[<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
[<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
[<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
[<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
[<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
[<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
[<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
[<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
[<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
[<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
[<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
[<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
[<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
[<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
[<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
[<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
[...]
I was able to reproduce this via:
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
ping 1.1.1.1
<stolen>
After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- MAINTAINERS: The file has been restructured upstream, but this is not
relevant for us. All paths are already covered.
- include/linux/netdevice.h: We have excluded TC from kABI with
845ad79d11 ("net: exclude TC from kABI"). Keep this exclusion.
- include/linux/skbuff.h: The order of the fields has been changed upstream
in c0ba861117c3 ("net: skbuff: move the fields BPF cares about directly
next to the offset marker"). The actual change is just changing config
options. Do this instead of picking the field reordering to make
backporting easier.
- include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h: The changes
to these files were already backported through 1d5bff6a09 ("bpf: Add
fd-based tcx multi-prog infra with link support") to keep UAPI close to
upstream.
- kernel/bpf/syscall.c: Already backported 58ff9f1ec9 ("bpf: Add
attach_type checks under bpf_prog_attach_check_attach_type") moves one
switch block around. The case BPF_PROG_TYPE_SCHED_CLS was added during
that backport, therefore this hunk is missing now. This also causes
context differences.
- kernel/bpf/syscall.c: Already backported 81b5cf0a11 ("bpf: Fix
BPF_PROG_QUERY last field check") fixed the QUERY_LAST_FIELD.
commit e420bed025071a623d2720a92bc2245c84757ecb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed Jul 19 16:08:52 2023 +0200
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
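A toy model of these return-code semantics, as a sketch only: the enum
values, toy_tcx_run and the sample programs are hypothetical and do not
mirror the kernel's definitions. What happens when every program returns
"next" is an assumption here (the packet is passed on).

```c
#include <stddef.h>

/* Hypothetical action codes; the numeric values are not the uapi ones. */
enum toy_tcx_action {
	TOY_TCX_NEXT,     /* non-terminating: continue with the next program */
	TOY_TCX_PASS,     /* terminating */
	TOY_TCX_DROP,     /* terminating */
	TOY_TCX_REDIRECT, /* terminating */
};

typedef enum toy_tcx_action (*toy_prog_fn)(void *pkt);

/* Run programs in order until one returns a terminating action. */
static enum toy_tcx_action toy_tcx_run(const toy_prog_fn *progs, size_t n,
				       void *pkt)
{
	for (size_t i = 0; i < n; i++) {
		enum toy_tcx_action ret = progs[i](pkt);

		if (ret != TOY_TCX_NEXT)
			return ret;
	}
	/* Assumption for this sketch: nothing terminated, so pass the packet. */
	return TOY_TCX_PASS;
}

/* Sample programs: each counts its invocation through the pkt pointer. */
static enum toy_tcx_action toy_prog_next(void *pkt)
{
	++*(int *)pkt;
	return TOY_TCX_NEXT;
}

static enum toy_tcx_action toy_prog_drop(void *pkt)
{
	++*(int *)pkt;
	return TOY_TCX_DROP;
}
```

With the chain {next, drop, next}, the third program is never entered
because the second one terminates the run.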
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 59d3efd27c11c59b32291e5ebc307bed2edb65ee
Author: Martin Willi <martin@strongswan.org>
Date: Tue Apr 11 09:43:19 2023 +0200
rtnetlink: Restore RTM_NEW/DELLINK notification behavior
The commits referenced below allow userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.
For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.
To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.
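The restored rule can be sketched as follows. This is a simplified model
rather than the rtnetlink code: toy_request, toy_notification,
toy_build_notification and TOY_NLM_F_ECHO are hypothetical names, and the
flag value is illustrative.

```c
#include <stdint.h>

#define TOY_NLM_F_ECHO 0x08 /* illustrative flag bit for this sketch */

/* Hypothetical view of the fields of a request that matter here. */
struct toy_request {
	uint16_t flags;
	uint32_t portid;
	uint32_t seq;
};

/* Hypothetical view of the header fields of the resulting notification. */
struct toy_notification {
	uint32_t portid;
	uint32_t seq;
};

/*
 * Copy portid and sequence number into the notification only when the
 * requester asked for an echo; otherwise keep the old zero values.
 */
static struct toy_notification
toy_build_notification(const struct toy_request *req)
{
	struct toy_notification n = { 0, 0 };

	if (req->flags & TOY_NLM_F_ECHO) {
		n.portid = req->portid;
		n.seq = req->seq;
	}
	return n;
}
```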
Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 77f4aa9a2a1766a0b9343fd812b71f18d05178da
Author: Hangbin Liu <liuhangbin@gmail.com>
Date: Fri Oct 28 04:42:22 2022 -0400
net: add new helper unregister_netdevice_many_notify
Add new helper unregister_netdevice_many_notify(), pass netlink message
header and portid, which could be used to notify userspace when flag
NLM_F_ECHO is set.
Make the unregister_netdevice_many() as a wrapper of new function
unregister_netdevice_many_notify().
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 1d997f1013079c05b642c739901e3584a3ae558d
Author: Hangbin Liu <liuhangbin@gmail.com>
Date: Fri Oct 28 04:42:21 2022 -0400
rtnetlink: pass netlink message header and portid to rtnl_configure_link()
This patch passes netlink message header and portid to rtnl_configure_link().
All the functions in this call chain need to add the parameters so we can
use them in the last call rtnl_notify(), and notify the userspace about
the new link info if NLM_F_ECHO flag is set.
- rtnl_configure_link()
- __dev_notify_flags()
- rtmsg_ifinfo()
- rtmsg_ifinfo_event()
- rtmsg_ifinfo_build_skb()
- rtmsg_ifinfo_send()
- rtnl_notify()
Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub
suggested.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
Conflicts:
- we have already backported 6264f58ca0e54 ("net: extract a few
internals from netdevice.h") so the net_todo_list has to be placed in
net/core/dev.h instead of include/linux/netdevice.h
commit 0b5c21bbc01e92745ca1ca4f6fd87d878fa3ea5e
Author: Johannes Berg <johannes.berg@intel.com>
Date: Mon Apr 4 11:38:47 2022 +0200
net: ensure net_todo_list is processed quickly
In [1], Will raised a potential issue that the cfg80211 code,
which does (from a locking perspective)
rtnl_lock()
wiphy_lock()
rtnl_unlock()
might be susceptible to ABBA deadlocks, because rtnl_unlock()
calls netdev_run_todo(), which might end up calling rtnl_lock()
again, which could then deadlock (see the comment in the code
added here for the scenario).
Some back and forth and thinking ensued, but clearly this can't
happen if the net_todo_list is empty at the rtnl_unlock() here.
Clearly, the code here cannot actually put an entry on it, and
all other users of rtnl_unlock() will empty it since that will
always go through netdev_run_todo(), emptying the list.
So the only other way to get there would be to add to the list
and then unlock the RTNL without going through rtnl_unlock(),
which is only possible through __rtnl_unlock(). However, this
isn't exported and not used in many places, and none of them
seem to be able to unregister before using it.
Therefore, add a WARN_ON() in the code to ensure this invariant
won't be broken, so that the cfg80211 (or any similar) code
stays safe.
[1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit ede6c39c4f9068cbeb4036448c45fff5393e0432
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Feb 9 18:59:32 2022 -0800
net: make net->dev_unreg_count atomic
Having to acquire rtnl from netdev_run_todo() for every dismantled
device is not desirable when/if rtnl is under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3584
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: mostly RHEL-only patches
This series adds reserved fields to networking structs, and excludes
some areas of networking from the kABI guarantee. These reserved
fields are only needed during backports to z-stream.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21447
Tested: LNST, Tier1
Upstream commit:
commit 24ab059d2ebd62fdccc43794796f6ffbabe49ebc
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Dec 19 12:53:31 2023 +0000
net: check dev->gso_max_size in gso_features_check()
Some drivers might misbehave if TSO packets get too big.
GVE for instance uses a 16bit field in its TX descriptor,
and will do bad things if a packet is bigger than 2^16 bytes.
Linux TCP stack honors dev->gso_max_size, but there are
other ways for too big packets to reach an ndo_start_xmit()
handler : virtio_net, af_packet, GRO...
Add a generic check in gso_features_check() and fallback
to GSO when needed.
gso_max_size was added in the blamed commit.
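A sketch of the kind of generic check described above, under stated
assumptions: the feature bits, TOY_GSO_MASK and toy_gso_features_check are
hypothetical, and the exact comparison used by the kernel is not quoted in
this message, so ">=" is a guess for illustration.

```c
#include <stdint.h>

/* Hypothetical feature bits for this sketch. */
#define TOY_F_SG     (1u << 0)
#define TOY_F_TSO    (1u << 1)
#define TOY_F_TSO6   (1u << 2)
#define TOY_GSO_MASK (TOY_F_TSO | TOY_F_TSO6)

/*
 * If the packet is too big for the device, clear the hardware segmentation
 * features so the caller falls back to software GSO before the packet
 * reaches the driver's ndo_start_xmit() handler.
 */
static uint32_t toy_gso_features_check(uint32_t features, uint32_t pkt_len,
				       uint32_t gso_max_size)
{
	if (pkt_len >= gso_max_size)
		return features & ~TOY_GSO_MASK;
	return features;
}
```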
Fixes: 82cc1a7a56 ("[NET]: Add per-connection option to set max TSO frame size")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231219125331.4127498-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: RHEL-only
rtnl_link_stats and rtnl_link_stats64 are protected by kABI, add 4
reserved fields. We need to use a custom mechanism here, because those
structures are part of uapi.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3460
JIRA: https://issues.redhat.com/browse/RHEL-18147
Tested: Just built... No way to test the new interface as no driver was converted yet.
Commits:
```
00d521b39307 ("net: don't abuse "default" case for unknown ioctl in dev_ifsioc()")
1193db2a55b6 ("net: simplify handling of dsa_ndo_eth_ioctl() return code")
4ee58e1e5680 ("net: promote SIOCSHWTSTAMP and SIOCGHWTSTAMP ioctls to dedicated handlers")
d5d5fd8f2552 ("net: move copy_from_user() out of net_hwtstamp_validate()")
c4bffeaa8d50 ("net: add struct kernel_hwtstamp_config and make net_hwtstamp_validate() use it")
88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master")
5a17818682cf ("net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub")
66f7223039c0 ("net: add NDOs for configuring hardware timestamping")
e47d01fea663 ("net: add hwtstamping helpers for stackable net devices")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-18147
Conflicts:
- DSA stuff removed except dsa_stubs.h that provides inline function
dsa_master_hwtstamp_validate()
commit 5a17818682cf43ad0fdd6035945f3b7a8c9dc5e9
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Thu Apr 6 14:42:46 2023 +0300
net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a
netdev notifier for DSA to reject PTP on DSA master"), due to a desire
to convert DSA's attempt to deny TX timestamping on a DSA master to
something that doesn't block the kernel-wide API conversion from
ndo_eth_ioctl() to ndo_hwtstamp_set().
What was required was a mechanism that did not depend on ndo_eth_ioctl(),
and what was provided was a mechanism that did not depend on
ndo_eth_ioctl(), while at the same time introducing something that
wasn't absolutely necessary - a new netdev notifier.
There have been objections from Jakub Kicinski that using notifiers in
general when they are not absolutely necessary creates complications to
the control flow and difficulties to maintainers who look at the code.
So there is a desire to not use notifiers.
In addition to that, the notifier chain gets called even if there is no
DSA in the system and no one is interested in applying any restriction.
Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
through which net/core/dev_ioctl.c can call into DSA even when
CONFIG_NET_DSA=m.
Compared to the code that existed prior to the notifier conversion, aka
what was added in commits:
- 4cfab35667 ("net: dsa: Add wrappers for overloaded ndo_ops")
- 3369afba1e ("net: Call into DSA netdevice_ops wrappers")
this is different because we are not overloading any struct
net_device_ops of the DSA master anymore, but rather, we are exposing a
rather specific functionality which is orthogonal to which API is used
to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().
Also, what is similar is that both approaches use function pointers to
get from built-in code to DSA.
There is no point in replicating the function pointers towards
__dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
Instead, it is sufficient to introduce a singleton struct dsa_stubs,
built into the kernel, which contains a single function pointer to
__dsa_master_hwtstamp_validate().
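The singleton-stub idea might look roughly like this in isolation. All
names are hypothetical (toy_dsa_stubs, toy_hwtstamp_validate,
toy_module_validate), and the real stub has a different signature; this
only shows built-in code reaching optional module code through one pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical singleton holding the single function pointer. */
struct toy_dsa_stubs {
	int (*master_hwtstamp_validate)(int ifindex);
};

/* Built-in side: stays NULL until the (modular) provider registers. */
static const struct toy_dsa_stubs *toy_dsa_stubs;

/* Built-in caller: no restriction applies when no provider is loaded. */
static int toy_hwtstamp_validate(int ifindex)
{
	if (!toy_dsa_stubs)
		return 0;
	return toy_dsa_stubs->master_hwtstamp_validate(ifindex);
}

/* Module side (example): reject timestamping on ifindex 3 only. */
static int toy_module_validate(int ifindex)
{
	return ifindex == 3 ? -EOPNOTSUPP : 0;
}

static const struct toy_dsa_stubs toy_module_stubs = {
	.master_hwtstamp_validate = toy_module_validate,
};
```

The module would publish toy_module_stubs by assigning its address to
toy_dsa_stubs at load time and clear the pointer again on unload.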
I find this approach preferable to what we had originally, because
dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
attempts to add any data encapsulation and hide DSA data structures from
the outside world.
Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-18147
Conflicts:
- Omitted DSA changes as they are not applicable. Note that DSA is disabled
in RHEL.
commit 88c0a6b503b7f4fffb68a8d49c3987870c5b1d6b
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Sun Apr 2 15:37:55 2023 +0300
net: create a netdev notifier for DSA to reject PTP on DSA master
The fact that PTP 2-step TX timestamping is broken on DSA switches if
the master also timestamps the same packets is documented by commit
f685e609a3 ("net: dsa: Deny PTP on master if switch supports it").
We attempt to help the users avoid shooting themselves in the foot by
making DSA reject the timestamping ioctls on an interface that is a DSA
master, and the switch tree beneath it contains switches which are aware
of PTP.
The only problem is that there isn't an established way of intercepting
ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
stack by creating a struct dsa_netdevice_ops with overlaid function
pointers that are manually checked from the relevant call sites. There
used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
one left.
There is an ongoing effort to migrate driver-visible hardware timestamping
control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
model, but DSA actively prevents that migration, since dsa_master_ioctl()
is currently coded to manually call the master's legacy ndo_eth_ioctl(),
and so, whenever a network device driver would be converted to the new
API, DSA's restrictions would be circumvented, because any device could
be used as a DSA master.
The established way for unrelated modules to react on a net device event
is via netdevice notifiers. So we create a new notifier which gets
called whenever there is an attempt to change hardware timestamping
settings on a device.
Finally, there is another reason why a netdev notifier will be a good
idea, besides strictly DSA, and this has to do with PHY timestamping.
With ndo_eth_ioctl(), all MAC drivers must manually call
phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
otherwise they must pass this ioctl to the PHY driver via
phy_mii_ioctl().
With the new ndo_hwtstamp_set() API, it will be desirable to simply not
make any calls into the MAC device driver when timestamping should be
performed at the PHY level.
But there exist drivers, such as the lan966x switch, which need to
install packet traps for PTP regardless of whether they are the layer
that provides the hardware timestamps, or the PHY is. That would be
impossible to support with the new API.
The proposal there, too, is to introduce a netdev notifier which acts as
a better cue for switching drivers to add or remove PTP packet traps,
than ndo_hwtstamp_set(). The one introduced here "almost" works there as
well, except for the fact that packet traps should only be installed if
the PHY driver succeeded to enable hardware timestamping, whereas here,
we need to deny hardware timestamping on the DSA master before it
actually gets enabled. This is why this notifier is called "PRE_", and
the notifier that would get used for PHY timestamping and packet traps
would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.
In expectation of future netlink UAPI, we also pass a non-NULL extack
pointer to the netdev notifier, and we make DSA populate it with an
informative reason for the rejection. To avoid making it go to waste, we
make the ioctl-based dev_set_hwtstamp() create a fake extack and print
the message to the kernel log.
Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
commit bf4ea1d0b2cb2251f9e5619c81daa98591087c33
Author: Leon Hwang <hffilwlqm@gmail.com>
Date: Tue Aug 1 22:26:20 2023 +0800
bpf, xdp: Add tracepoint to xdp attaching failure
When an error happens in dev_xdp_attach(), there should be a way to report
the error message to users, as the netlink approach does.
To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
an appropriate way to notify users of the error message.
Hence, bpf libraries are able to retrieve the error message via this
tracepoint, and then report it to users.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-862
commit 9b55d3f0a69af649c62cbc2633e6d695bb3cc583
Author: Felix Riemann <felix.riemann@sma.de>
Date: Fri Feb 10 13:36:44 2023 +0100
net: Fix unwanted sign extension in netdev_stats_to_stats64()
When converting net_device_stats to rtnl_link_stats64 sign extension
is triggered on ILP32 machines as 6c1c509778 changed the previous
"ulong -> u64" conversion to "long -> u64" by accessing the
net_device_stats fields through a (signed) atomic_long_t.
This causes, for example, the received bytes counter to jump to 16EiB after
having received 2^31 bytes. Casting the atomic value to "unsigned long"
before converting it into u64 avoids this.
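The effect can be reproduced on any machine by modelling the 32-bit "long"
of an ILP32 target with int32_t. The helper names below are made up for
this illustration:

```c
#include <stdint.h>

/* "long -> u64" on ILP32: the sign bit is extended into the upper half. */
static uint64_t toy_widen_signed(int32_t v)
{
	return (uint64_t)v;
}

/* "ulong -> u64" on ILP32: cast to unsigned first, upper half stays zero. */
static uint64_t toy_widen_unsigned(int32_t v)
{
	return (uint64_t)(uint32_t)v;
}
```

For a counter that has just reached 2^31 (bit pattern 0x80000000), the
signed path yields 0xffffffff80000000, just under 16EiB, while the unsigned
path yields 0x80000000 as intended.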
Fixes: 6c1c5097781f ("net: add atomic_long_t to net_device_stats fields")
Signed-off-by: Felix Riemann <felix.riemann@sma.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-862
commit 6c1c5097781f563b70a81683ea6fdac21637573b
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Nov 15 08:53:55 2022 +0000
net: add atomic_long_t to net_device_stats fields
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds a union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
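The per-field union idea can be illustrated with C11 atomics in user space. This is only a sketch under stated assumptions: `struct mini_stats` and its helpers are hypothetical, `atomic_long` from `<stdatomic.h>` stands in for the kernel's atomic_long_t, and the read helper already includes the cast through `unsigned long` from the follow-up sign-extension fix.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: both views of the counter have the same size on this platform. */
static_assert(sizeof(atomic_long) == sizeof(unsigned long),
	      "plain and atomic views must overlap exactly");

/* Hypothetical one-field miniature of a stats structure. */
struct mini_stats {
	union {
		unsigned long rx_packets;        /* plain view for already-locked paths */
		atomic_long   rx_packets_atomic; /* atomic view for unlocked slow paths */
	};
};

/* Slow-path increment that needs no spinlock or mutex. */
static void mini_stats_inc_rx(struct mini_stats *s)
{
	atomic_fetch_add_explicit(&s->rx_packets_atomic, 1, memory_order_relaxed);
}

/* A single atomic load cannot tear; the unsigned cast avoids sign extension. */
static uint64_t mini_stats_rx_to_u64(struct mini_stats *s)
{
	long v = atomic_load_explicit(&s->rx_packets_atomic, memory_order_relaxed);

	return (uint64_t)(unsigned long)v;
}
```

Keeping the plain field name in the union means code that already updates the counter under a lock does not have to change.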
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3310
JIRA: https://issues.redhat.com/browse/RHEL-15250
Tested: Using attached self-tests [Results in JIRA]
The series adds support for multi-buffer to XSK. It is based on upstream series `3226e3139dfe ("Merge branch 'xsk-multi-buffer-support'")` and also contains commits from upstream series `34e78bab67c5 ("Merge branch 'seltests/xsk: prepare for AF_XDP multi-buffer testing'")` to make the attached self-tests applicable.
Commits:
```
0c5f48599bed ("xsk: Simplify xp_aligned_validate_desc implementation")
f2f167583601 ("xsk: Remove unused xsk_buff_discard")
e2fa5c2068fb ("xsk: Remove unused inline function xsk_buff_discard()")
63a64a56bc3f ("xsk: prepare 'options' in xdp_desc for multi-buffer use")
81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket")
556444c4e683 ("xsk: prepare both copy and zero-copy modes to co-exist")
faa91b839b09 ("xsk: move xdp_buff's data length check to xsk_rcv_check")
804627751b42 ("xsk: add support for AF_XDP multi-buffer on Rx path")
b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
1b725b0c8163 ("xsk: allow core/drivers to test EOP bit")
cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
07428da9e25a ("xsk: discard zero length descriptors in Tx path")
13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags")
24ea50127ecf ("xsk: support mbuf on ZC RX")
d5581966040f ("xsk: support ZC Tx multi-buffer in batch API")
49ca37d0d825 ("xsk: add multi-buffer documentation")
9a321fd3308e ("selftests/xsk: add xdp populate metadata test")
68e7322142f5 ("selftests: xsk: Deflakify STATS_RX_DROPPED test")
7a2050df244e ("selftests: xsk: Use correct UMEM size in testapp_invalid_desc")
ccd1b2933f8c ("selftests: xsk: Add test case for packets at end of UMEM")
c0801598e543 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE")
d2e541494935 ("selftests/xsk: do not change XDP program when not necessary")
df82d2e89c41 ("selftests/xsk: generate simpler packets with variable length")
feb973a9094f ("selftests/xsk: add varying payload pattern within packet")
7a8a6762822a ("selftests/xsk: dump packet at error")
69fc03d220a3 ("selftests/xsk: add packet iterator for tx to packet stream")
d9f6d9709f87 ("selftests/xsk: store offset in pkt instead of addr")
041b68f688a3 ("selftests/xsx: test for huge pages only once")
86e41755b432 ("selftests/xsk: populate fill ring based on frags needed")
2f6eae0df1a8 ("selftests/xsk: generate data for multi-buffer packets")
7cd6df4f5ec2 ("selftests/xsk: adjust packet pacing for multi-buffer support")
17f1034dd76d ("selftests/xsk: transmit and receive multi-buffer packets")
f540d44e05cf ("selftests/xsk: add basic multi-buffer test")
1005a226da9a ("selftests/xsk: add unaligned mode test for multi-buffer")
697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer")
f80ddbec4762 ("selftests/xsk: add metadata copy test for multi-buff")
807bf4da2049 ("selftests/xsk: add test for too many frags")
3666bccab43a ("selftests/xsk: reset NIC settings to default after running test suite")
d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem")
9d0a67b9d42c ("xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()")
a097627dcadd ("net: add missing net_device::xdp_zc_max_segs description")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3398
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368
This prevents network drivers' .ndo_set_mac_address method from being called when the MAC address is already the current one. There are drivers that more or less assume that this is how the network core already behaves. For example, iavf will send a virtchnl message to the PF requesting to add the new address and then a message to remove the old address. This logic is broken if old and new are the same address.
Tested: I used the reproducer steps from RHEL-6368, with VFs on Intel E810.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368
commit 0ec92a8f56ff07237dbe8af7c7a72aba7f957baf
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date: Wed Jun 21 15:21:06 2023 +0200
net: fix net device address assign type
Commit ad72c4a06acc introduced an optimization to return from the function
quickly if the MAC address is not changing at all. It was reported
that this change prevents dev->addr_assign_type from changing
to NET_ADDR_SET from _PERM or _RANDOM.
Restore the old behavior and skip only the call to ndo_set_mac_address.
Fixes: ad72c4a06acc ("net: add check for current MAC address in dev_set_mac_address")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230621132106.991342-1-piotrx.gardocki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
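The behavior after this fix (skip only the driver call, still update the assign type) can be sketched in user space as follows. Everything here is hypothetical and for illustration only: `struct mini_dev`, `enum mini_addr_assign` and the helper names are invented, and `mini_driver_set_mac()` merely plays the role of a driver's .ndo_set_mac_address method.

```c
#include <assert.h>
#include <string.h>

#define MINI_ETH_ALEN 6

/* Hypothetical mirror of the NET_ADDR_* assign types named in the commit. */
enum mini_addr_assign { MINI_ADDR_PERM, MINI_ADDR_RANDOM, MINI_ADDR_SET };

struct mini_dev {
	unsigned char addr[MINI_ETH_ALEN];
	enum mini_addr_assign assign_type;
	int driver_calls; /* how often the driver hook was invoked */
};

/* Stand-in for the driver's .ndo_set_mac_address method. */
static int mini_driver_set_mac(struct mini_dev *dev, const unsigned char *addr)
{
	dev->driver_calls++;
	memcpy(dev->addr, addr, MINI_ETH_ALEN);
	return 0;
}

static int mini_set_mac_address(struct mini_dev *dev, const unsigned char *addr)
{
	/* Only the driver call is skipped when the address is unchanged. */
	if (memcmp(dev->addr, addr, MINI_ETH_ALEN) != 0) {
		int err = mini_driver_set_mac(dev, addr);

		if (err)
			return err;
	}
	/* The assign type is updated in both cases. */
	dev->assign_type = MINI_ADDR_SET;
	return 0;
}
```

Setting the current address again therefore leaves the driver untouched but still moves the device from a permanent or random address to an explicitly set one.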
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368
commit ad72c4a06acc6762e84994ac2f722da7a07df34e
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date: Wed Jun 14 16:53:00 2023 +0200
net: add check for current MAC address in dev_set_mac_address
In some cases it is possible for the kernel to issue a request
to change the primary MAC address to the address that is already
set on the given interface.
Add a check to return quickly from the function in these cases.
An example of such a case is adding an interface to a bonding
channel in balance-alb mode:
modprobe bonding mode=balance-alb miimon=100 max_bonds=1
ip link set bond0 up
ifenslave bond0 <eth>
Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git
commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 19 19:00:05 2023 -0700
net: skbuff: update and rename __kfree_skb_defer()
__kfree_skb_defer() uses the old naming where "defer" meant
slab bulk free/alloc APIs. In the meantime we also made
__kfree_skb_defer() feed the per-NAPI skb cache, which
implies bulk APIs. So take away the 'defer' and add 'napi'.
While at it add a drop reason. This only matters on the
tx_action path, if the skb has a frag_list. But getting
rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
benefit so why not.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196
JIRA: https://issues.redhat.com/browse/RHEL-12613
Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644]
Commits:
```
4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c")
eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list")
9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list")
b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths")
8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI")
dd64b232deb8 ("page_pool: unlink from napi during destroy")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-15250
commit 13ce2daa259a3bfbc9a5aeeee8b9a87058703731
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Wed Jul 19 15:24:07 2023 +0200
xsk: add new netlink attribute dedicated for ZC max frags
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
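How a consumer might interpret the new attribute can be sketched as below. This is a hypothetical helper, not part of any library: the function name and both parameters are assumptions, with `attr_present` standing for "the netlink response carried NETDEV_A_DEV_XDP_ZC_MAX_SEGS".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The attribute is only included when the driver supports ZC, and a value
 * higher than 1 implies multi-buffer ZC support on the device.
 */
static bool zc_multi_buffer_supported(bool attr_present, uint32_t zc_max_segs)
{
	return attr_present && zc_max_segs > 1;
}
```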
JIRA: https://issues.redhat.com/browse/RHEL-12613
Conflicts:
- simple context conflict in net/core/dev.c due to absence of commit
8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra
NET_RX_SOFTIRQ") that is out of scope of this series
commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 12 21:26:04 2023 -0700
page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
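The recycling decision described in this commit can be modeled with a small user-space sketch. The types and names below are hypothetical simplifications, not the kernel's structures: `owner_cpu` stands for the CPU on which the NAPI instance is currently scheduled, and the real code additionally relies on running in softirq context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a NAPI instance. */
struct mini_napi {
	bool scheduled; /* NAPI is currently scheduled */
	int  owner_cpu; /* CPU it is scheduled on */
};

/* Hypothetical miniature of a page pool that a driver may link to a NAPI. */
struct mini_pool {
	const struct mini_napi *napi; /* NULL when the driver linked nothing */
};

/*
 * Returns true when a freed page may go into the lockless direct cache,
 * false when it has to take the ptr_ring path with its spin lock.
 */
static bool mini_can_recycle_direct(const struct mini_pool *pool, int this_cpu)
{
	const struct mini_napi *napi = pool->napi;

	return napi != NULL && napi->scheduled && napi->owner_cpu == this_cpu;
}
```

A pool without a linked NAPI, or one whose NAPI is idle or scheduled on another CPU, keeps using the locked path, which matches the conservative rule stated in the commit message.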
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3067
JIRA: https://issues.redhat.com/browse/RHEL-1773
Depends: https://issues.redhat.com/browse/RHEL-860
Depends: https://issues.redhat.com/browse/RHEL-3646
Update TC (net/sched) to the upstream v6.5
Omitted-fix: cad7526f33ce ("net: dsa: ocelot: unlock on error in vsc9959_qos_port_tas_set()")
Not needed, DSA as well as ocelot driver is not enabled/supported in RHEL
Commits:
```
1b808993e194 ("flow_dissector: fix false-positive __read_overflow2_field() warning")
f743f16c548b ("treewide: use get_random_{u8,u16}() when possible, part 2")
7e3cf0843fe5 ("treewide: use get_random_{u8,u16}() when possible, part 1")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
62423bd2d2e2 ("net: sched: remove qdisc_watchdog->last_expires")
c66b2111c9c9 ("selftests: tc-testing: add tests for action binding")
f5fca219ad45 ("net: do not use skb_mac_header() in qdisc_pkt_len_init()")
e495a9673caf ("sch_cake: do not use skb_mac_header() in cake_overhead()")
b3be94885af4 ("net/sched: remove two skb_mac_header() uses")
fcb3a4653bc5 ("net/sched: act_api: use the correct TCA_ACT attributes in dump")
4170f0ef582c ("fix typos in net/sched/)
8b0f256530d9 ("net/sched: sch_mqprio: use netlink payload helpers")
3dd0c16ec93e ("net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS")
57f21bf85400 ("net/sched: mqprio: add extack to mqprio_parse_nlattr()")
ab277d2084ba ("net/sched: mqprio: add an extack message to mqprio_parse_opt()")
c54876cd5961 ("net/sched: pass netlink extack to mqprio and taprio offload")
f62af20bed2d ("net/sched: mqprio: allow per-TC user input of FP adminStatus")
a721c3e54b80 ("net/sched: taprio: allow per-TC user input of FP adminStatus")
8c966a10eb84 ("flow_dissector: Address kdoc warnings")
54e906f1639e ("selftests: forwarding: sch_tbf_*: Add a pre-run hook")
2f0f9465ad9f ("net: sched: Print msecs when transmit queue time out")
5036034572b7 ("net/sched: act_pedit: use NLA_POLICY for parsing 'ex' keys")
0c83c5210e18 ("net/sched: act_pedit: use extack in 'ex' parsing errors")
e1201bc781c2 ("net/sched: act_pedit: check static offsets a priori")
577140180ba2 ("net/sched: act_pedit: remove extra check for key type")
e3c9673e2f6e ("net/sched: act_pedit: rate limit datapath messages")
807cfded92b0 ("net/sched: sch_htb: use extack on errors messages")
c69a9b023f65 ("net/sched: sch_qfq: use extack on errors messages")
25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters")
7eb060a51a3b ("selftests: tc-testing: add more tests for sch_qfq")
1b483d9f5805 ("net/sched: act_pedit: free pedit keys on bail from offset check")
526f28bd0fbd ("net/sched: act_mirred: Add carrier check")
12e7789ad5b4 ("sch_htb: Allow HTB priority parameter in offload mode")
c7cfbd115001 ("net/sched: sch_ingress: Only create under TC_H_INGRESS")
5eeebfe6c493 ("net/sched: sch_clsact: Only create under TC_H_CLSACT")
f85fa45d4a94 ("net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs")
9de95df5d15b ("net/sched: Prohibit regrafting ingress or clsact Qdiscs")
7b4858df3bf7 ("skbuff: bridge: Add layer 2 miss indication")
d5ccfd90df7f ("flow_dissector: Dissect layer 2 miss from tc skb extension")
1a432018c0cd ("net/sched: flower: Allow matching on layer 2 miss")
f4356947f029 ("flow_offload: Reject matching on layer 2 miss")
8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases")
dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")
2d800bc500fb ("net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum")
6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters")
a395b8d1c7c3 ("selftests/tc-testing: replace mq with invalid parent ID")
8cde87b007da ("net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINE")
cd2b8113c2e8 ("net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values")
d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
886bc7d6ed33 ("net: sched: move rtm_tca_policy declaration to include file")
682881ee45c8 ("net: sched: act_police: fix sparse errors in tcf_police_dump()")
6c02568fd1ae ("net/sched: act_pedit: Parse L3 Header for L4 offset")
26e35370b976 ("net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy")
2b84960fc5dd ("net/sched: taprio: report class offload stats per TXQ, not per TC")
d7ad70b5ef5a ("net: flow_dissector: add support for cfm packets")
7cfffd5fed3e ("net: flower: add support for matching cfm fields")
1668a55a73f5 ("selftests: net: add tc flower cfm test")
c29e012eae29 ("selftests: forwarding: Fix layer 2 miss test syntax")
aef6e908b542 ("selftests/tc-testing: Fix Error: Specified qdisc kind is unknown.")
b849c566ee9c ("selftests/tc-testing: Fix Error: failed to find target LOG")
b39d8c41c7a8 ("selftests/tc-testing: Fix SFB db test")
11b8b2e70a9b ("selftests/tc-testing: Remove configs that no longer exist")
41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
2d5f6a8d7aef ("net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs")
84ad0af0bccd ("net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting")
e16ad981e2a1 ("net: sched: Remove unused qdisc_l2t()")
ca4fa8743537 ("selftests: tc-testing: add one test for flushing explicitly created chain")
b4ee93380b3c ("net/sched: act_ipt: add sanity checks on table name and hook locations")
b2dc32dcba08 ("net/sched: act_ipt: add sanity checks on skb before calling target")
93d75d475c5d ("net/sched: act_ipt: zero skb->cb before calling target")
30c45b5361d3 ("net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX")
989b52cdc849 ("net: sched: Replace strlcpy with strscpy")
d3f87278bcb8 ("net/sched: flower: Ensure both minimum and maximum ports are specified")
150e33e62c1f ("net/sched: make psched_mtu() RTNL-less safe")
158810b261d0 ("net/sched: sch_qfq: reintroduce lmax bound check for MTU")
c5a06fdc618d ("selftests: tc-testing: add tests for qfq mtu sanity check")
3e337087c3b5 ("net/sched: sch_qfq: account for stab overhead in qfq_enqueue")
137f6219da59 ("selftests: tc-testing: add test for qfq with stab overhead")
d1cca974548d ("pie: fix kernel-doc notation warning")
b3d0e0489430 ("net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms")
9cb36faedeaf ("net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode")
e8d3d78c19be ("net: sched: cls_u32: Undo refcount decrement in case update failed")
26a22194927e ("net: sched: cls_bpf: Undo tcf_bind_filter in case of an error")
ac177a330077 ("net: sched: cls_flower: Undo tcf_bind_filter in case of an error")
fda05798c22a ("selftests: tc: set timeout to 15 minutes")
719b4774a8cb ("selftests: tc: add 'ct' action kconfig dep")
031c99e71fed ("selftests: tc: add ConnTrack procfs kconfig")
4914109a8e1e ("netfilter: allow exp not to be removed in nf_ct_find_expectation")
76622ced50a1 ("net: sched: set IPS_CONFIRMED in tmpl status only when commit is set in act_ct")
8c8b73320805 ("openvswitch: set IPS_CONFIRMED in tmpl status only when commit is set in conntrack")
9fe63d5f1da9 ("sch_htb: Allow HTB quantum parameter in offload mode")
6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64")
4d50e50045aa ("net: flower: fix stack-out-of-bounds in fl_set_key_cfm()")
e68409db9953 ("net: sched: cls_u32: Fix match key mis-addressing")
e739718444f7 ("net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.")
21a72166abb9 ("selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1773
commit f5fca219ad4548bc45f0221f9857ad22cb8136a1
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Mar 21 16:45:17 2023 +0000
net: do not use skb_mac_header() in qdisc_pkt_len_init()
We want to remove our use of skb_mac_header() in tx paths,
eg remove skb_reset_mac_header() from __dev_queue_xmit().
Idea is that ndo_start_xmit() can get the mac header
simply looking at skb->data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12679
commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jun 8 19:17:37 2023 +0000
net: move gso declarations and functions to their own files
Move declarations into include/net/gso.h and code into net/core/gso.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515
Upstream commit(s):
commit 5f18426928800c59fb0f9bc8fb0c182bb6f5ee24
Author: Jiri Pirko <jiri@nvidia.com>
Date: Wed Sep 13 21:49:39 2023 +0100
netdev: expose DPLL pin handle for netdevice
In case netdevice represents a SyncE port, the user needs to understand
the connection between netdevice and associated DPLL pin. There might be
multiple netdevices pointing to the same pin, in case of VF/SF
implementation.
Add an IFLA Netlink attribute to nest the DPLL pin handle, similar to
how it is implemented for devlink port. Add a struct dpll_pin pointer
to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set()
and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear
the DPLL pin relationship to netdev.
Note that during the lifetime of struct dpll_pin the pin handle does not
change. Therefore it is safe to access it lockless. It is the driver's
responsibility to call netdev_dpll_pin_clear() before dpll_pin_put().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3646
commit d4150779e60fb6c49be25572596b2cdfc5d46a09
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date: Wed May 11 16:11:29 2022 +0200
random32: use real rng for non-deterministic randomness
random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.
The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.
Stop👏making👏bespoke👏random👏number👏generators👏.
Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further if need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.
The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583
Rebase bpf and xdp to 6.3.
Bugzilla: https://bugzilla.redhat.com/2178930
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2627
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Did not include commit 071c0fc6fb91 ("net: extend drop reasons for multiple subsystems")
as it would be appropriate to backport it in its own MR, it would have no user for now,
and it's not clear to me how trace_kfree_skb deals with non-core free reasons once applied.
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- include/linux/netdevice.h: Context difference in includes due to missing
406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with
IPv6 addresses, performance of changing link state, attaching a VRF,
changing an IPv6 address, etc. go down dramtically.")
- net/core/Makefile: Context difference due to missing 2c193f2cb110 ("net:
kunit: add a test for dev_addr_lists")
commit d3d854fd6a1d97157f790604e07f6386e8df8fe4
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Feb 1 11:24:17 2023 +0100
netdev-genl: create a simple family for netdev stuff
Add a Netlink spec-compatible family for netdevs.
This is a very simple implementation without much
thought going into it.
It allows us to reap all the benefits of Netlink specs,
one can use the generic client to issue the commands:
$ ./cli.py --spec netdev.yaml --dump dev_get
[{'ifindex': 1, 'xdp-features': set()},
{'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
{'ifindex': 3, 'xdp-features': {'rx-sg'}}]
the generic python library does not have flags-by-name
support, yet, but we also don't have to carry strings
in the messages, as user space can get the names from
the spec.
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date: Thu Jan 19 14:15:24 2023 -0800
bpf: Introduce device-bound XDP programs
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.
netdevsim checks are dropped in favor of generic check in dev_xdp_attach.
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date: Thu Jan 19 14:15:21 2023 -0800
bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170
Conflicts:
* net/netfilter/ipvs/ip_vs_ctl.c
- the change was already applied by RHEL commit 914c1e31d9 ("ipvs:
use u64_stats_t for the per-cpu counters")
* net/core/devlink.c
- hunk was applied in different file (net/devlink/leftover.c)
commit d120d1a63b2c484d6175873d8ee736a633f74b70
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Oct 26 15:22:15 2022 +0200
net: Remove the obsolte u64_stats_fetch_*_irq() users (net).
Now that the 32bit UP oddity is gone and 32bit always uses a sequence
count, there is no need for the fetch_irq() variants anymore.
Convert to the regular interface.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
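The sequence-count protection that 32bit relies on can be sketched in user space as follows. This is an illustrative model only, not the u64_stats implementation: all names are hypothetical, a single writer is assumed, and default (sequentially consistent) C11 atomics are used for simplicity rather than the kernel's primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit counter stored as two 32-bit halves. */
struct mini_u64_stat {
	atomic_uint      seq; /* even: stable, odd: update in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

/* Writer side; assumes one writer at a time. */
static void mini_stat_add(struct mini_u64_stat *s, uint64_t n)
{
	uint64_t v = ((uint64_t)atomic_load(&s->hi) << 32) | atomic_load(&s->lo);

	v += n;
	atomic_fetch_add(&s->seq, 1); /* begin update: seq becomes odd */
	atomic_store(&s->lo, (uint32_t)v);
	atomic_store(&s->hi, (uint32_t)(v >> 32));
	atomic_fetch_add(&s->seq, 1); /* end update: seq becomes even */
}

/* Reader side; retries until both halves belong to one stable update. */
static uint64_t mini_stat_read(struct mini_u64_stat *s)
{
	unsigned int start;
	uint32_t lo, hi;

	do {
		start = atomic_load(&s->seq);
		lo = atomic_load(&s->lo);
		hi = atomic_load(&s->hi);
	} while ((start & 1) || atomic_load(&s->seq) != start);

	return ((uint64_t)hi << 32) | lo;
}
```

The reader never blocks the writer; it simply repeats the read when the sequence number was odd or changed while the two halves were being loaded.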
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170
commit 9962acefbcb92736c268aafe5f52200948f60f3e
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jun 8 08:46:37 2022 -0700
net: adopt u64_stats_t in struct pcpu_sw_netstats
As explained in commit 316580b69d ("u64_stats: provide u64_stats_t type")
we should use u64_stats_t and related accessors to avoid load/store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git
commit 40bbae583ec38ea31e728bf42a4ea72bded22ab6
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Mar 6 20:43:13 2023 +0000
net: remove enum skb_free_reason
enum skb_drop_reason is more generic, we can adopt it instead.
Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason().
This means drivers can use more precise drop reasons if they want to.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375
# Merge Request Required Information
## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.
## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
This is actually just an optimization, and it has non-trivial conflicts
which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
This fix is incorrectly tagged. The code that it applies to is not present in our tree.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1
Upstream commit:
commit ac3ad19584b26fae9ac86e4faebe790becc74491
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 23 08:38:45 2023 +0000
net: fix __dev_kfree_skb_any() vs drop monitor
dev_kfree_skb() is aliased to consume_skb().
When a driver is dropping a packet by calling dev_kfree_skb_any()
we should propagate the drop reason instead of pretending
the packet was consumed.
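A minimal userspace model of the dispatch described above, for illustration only. The enum and function names are hypothetical, and the IRQ branch is an assumption about how the helper defers work in interrupt context. The point is only that a dropped packet outside IRQ context now takes the drop path instead of the consume path.

```c
/* Hypothetical model of the free-path decision; not kernel code. */
enum model_free_reason {
	MODEL_REASON_CONSUMED,
	MODEL_REASON_DROPPED,
};

enum model_free_path {
	MODEL_PATH_IRQ_DEFERRED,	/* assumption: freeing is deferred in IRQ context */
	MODEL_PATH_KFREE_SKB,		/* drop path, visible to drop monitor */
	MODEL_PATH_CONSUME_SKB,		/* normal consumption, not reported as a drop */
};

static enum model_free_path
model_kfree_skb_any(int in_irq_or_irqs_off, enum model_free_reason reason)
{
	if (in_irq_or_irqs_off)
		return MODEL_PATH_IRQ_DEFERRED;
	if (reason == MODEL_REASON_DROPPED)
		return MODEL_PATH_KFREE_SKB;	/* propagate the drop */
	return MODEL_PATH_CONSUME_SKB;
}
```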
Note: Now that we have enum skb_drop_reason we could remove
enum skb_free_reason (for linux-6.4)
v2: added an unlikely(), suggested by Yunsheng Lin.
Fixes: e6247027e5 ("net: introduce dev_consume_skb_any()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: compile only
Conflicts:
- move netif_set_gro_max_size() from include/linux/netdevice.h to
net/core/dev.h, then make the change, as commit 744d49daf8bd was
backported earlier than eac1b93c14d6. netif_set_gro_max_size()
missed the opportunity to be moved to net/core/dev.h.
- different context in net/core/dev.h, rps_cpumask_housekeeping()
is added due to 370ca718fd5e already in RHEL-9.
commit 9eefedd58ae1daece2ba907849a44db2941fb4b0
Author: Xin Long <lucien.xin@gmail.com>
Date: Sat Jan 28 10:58:38 2023 -0500
net: add gso_ipv4_max_size and gro_ipv4_max_size per device
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
per device and adds netlink attributes for them, so that IPV4
BIG TCP can be guarded by a separate tunable in the next patch.
To not break the old application using "gso/gro_max_size" for
IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
in netif_set_gso/gro_max_size() if the new size isn't greater
than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
userspace doesn't realize the new netlink attributes.
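The compatibility rule above can be sketched as a small userspace model. The struct and function names are hypothetical, and the legacy limit of 65536 is an assumption made for this sketch rather than something this commit message states.

```c
/* Assumed value of the legacy GSO limit, for this sketch only. */
#define MODEL_GSO_LEGACY_MAX_SIZE 65536u

/* Hypothetical stand-in for the two per-device tunables. */
struct netdev_model {
	unsigned int gso_max_size;
	unsigned int gso_ipv4_max_size;
};

/* Setting the generic size also updates the IPv4 size, but only while
 * the new value stays within the legacy limit. */
static void model_set_gso_max_size(struct netdev_model *dev, unsigned int size)
{
	dev->gso_max_size = size;
	if (size <= MODEL_GSO_LEGACY_MAX_SIZE)
		dev->gso_ipv4_max_size = size;
}
```

With this rule, an old application that lowers "gso_max_size" to 32768 still limits IPv4 GSO packets, while raising it above the legacy limit leaves the IPv4 value untouched.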
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
commit c526fd8f9f4f21cb83c0b1c9a1ee9c0ac9be9e2e
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Thu Apr 28 11:58:46 2022 +0100
net: inline dev_queue_xmit()
Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small
proxy functions doing nothing but redirecting the control flow to
__dev_queue_xmit().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207
Upstream Status: linux.git
commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8
Author: Felix Huettner <felix.huettner@mail.schwarz>
Date: Wed Apr 5 07:53:41 2023 +0000
net: openvswitch: fix race on port output
assume the following setup on a single machine:
1. An openvswitch instance with one bridge and default flows
2. two network namespaces "server" and "client"
3. two ovs interfaces "server" and "client" on the bridge
4. for each ovs interface a veth pair with a matching name and 32 rx and
tx queues
5. move the ends of the veth pairs to the respective network namespaces
6. assign ip addresses to each of the veth ends in the namespaces (needs
to be the same subnet)
7. start some http server on the server network namespace
8. test if a client in the client namespace can reach the http server
when following the actions below the host has a chance of getting a cpu
stuck in an infinite loop:
1. send a large amount of parallel requests to the http server (around
3000 curls should work)
2. in parallel delete the network namespace (do not delete interfaces or
stop the server, just kill the namespace)
there is a low chance that this will cause the below kernel cpu stuck
message. If this does not happen just retry.
Below there is also the output of bpftrace for the functions mentioned
in the output.
The series of events happening here is:
1. the network namespace is deleted calling
`unregister_netdevice_many_notify` somewhere in the process
2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
then runs `synchronize_net`
3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
4. this is then handled by `dp_device_event` which calls
`ovs_netdev_detach_dev` (if a vport is found, which is the case for
the veth interface attached to ovs)
5. this removes the rx_handlers of the device but does not prevent
packets from being sent to the device
6. `dp_device_event` then queues the vport deletion to work in
background as an ovs_lock is needed that we do not hold in the
unregistration path
7. `unregister_netdevice_many_notify` continues to call
`netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
8. port deletion continues (but details are not relevant for this issue)
9. at some future point the background task deletes the vport
If after 7. but before 9. a packet is sent to the ovs vport (which is
not deleted at this point in time), the vport forwards it to the
`dev_queue_xmit` flow even though the device is unregistering.
In `skb_tx_hash` (which is called in the `dev_queue_xmit` path) there is
a while loop (if the packet has a rx_queue recorded) that is infinite if
`dev->real_num_tx_queues` is zero.
To prevent this from happening we update `do_output` to handle devices
without carrier the same as if the device is not found (which would
be the code path after 9. is done).
Additionally we now produce a warning in `skb_tx_hash` if we will hit
the infinite loop.
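To see why a zero queue count never terminates, the reduction loop can be modelled in userspace as below. This is a sketch with hypothetical names, not the kernel's `skb_tx_hash`. Unlike the patch, which emits a warning, the model returns -1 for a zero count so that the case can be exercised without hanging.

```c
#include <stdint.h>

/* Hypothetical model of mapping a recorded rx queue onto qcount tx queues.
 * Returns -1 when qcount is zero; without that guard the loop below would
 * subtract zero forever, because rx_queue >= 0 always holds. */
static int model_tx_hash(uint16_t rx_queue, uint16_t qcount)
{
	if (qcount == 0)
		return -1;

	while (rx_queue >= qcount)
		rx_queue -= qcount;

	return rx_queue;
}
```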
bpftrace (first word is function name):
__dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
__dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604
stuck message:
watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:netdev_pick_tx+0xf1/0x320
Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
Call Trace:
<IRQ>
netdev_core_pick_tx+0xa4/0xb0
__dev_queue_xmit+0xf8/0x510
? __bpf_prog_exit+0x1e/0x30
dev_queue_xmit+0x10/0x20
ovs_vport_send+0xad/0x170 [openvswitch]
do_output+0x59/0x180 [openvswitch]
do_execute_actions+0xa80/0xaa0 [openvswitch]
? kfree+0x1/0x250
? kfree+0x1/0x250
? kprobe_perf_func+0x4f/0x2b0
? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
ovs_execute_actions+0x4c/0x120 [openvswitch]
ovs_dp_process_packet+0xa1/0x200 [openvswitch]
? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
ovs_vport_receive+0x77/0xd0 [openvswitch]
? __htab_map_lookup_elem+0x4e/0x60
? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
? trace_call_bpf+0xc8/0x150
? kfree+0x1/0x250
? kfree+0x1/0x250
? kprobe_perf_func+0x4f/0x2b0
? kprobe_perf_func+0x4f/0x2b0
? __mod_memcg_lruvec_state+0x63/0xe0
netdev_port_receive+0xc4/0x180 [openvswitch]
? netdev_port_receive+0x180/0x180 [openvswitch]
netdev_frame_hook+0x1f/0x40 [openvswitch]
__netif_receive_skb_core.constprop.0+0x23d/0xf00
__netif_receive_skb_one_core+0x3f/0xa0
__netif_receive_skb+0x15/0x60
process_backlog+0x9e/0x170
__napi_poll+0x33/0x180
net_rx_action+0x126/0x280
? ttwu_do_activate+0x72/0xf0
__do_softirq+0xd9/0x2e7
? rcu_report_exp_cpu_mult+0x1b0/0x1b0
do_softirq+0x7d/0xb0
</IRQ>
<TASK>
__local_bh_enable_ip+0x54/0x60
ip_finish_output2+0x191/0x460
__ip_finish_output+0xb7/0x180
ip_finish_output+0x2e/0xc0
ip_output+0x78/0x100
? __ip_finish_output+0x180/0x180
ip_local_out+0x5e/0x70
__ip_queue_xmit+0x184/0x440
? tcp_syn_options+0x1f9/0x300
ip_queue_xmit+0x15/0x20
__tcp_transmit_skb+0x910/0x9c0
? __mod_memcg_state+0x44/0xa0
tcp_connect+0x437/0x4e0
? ktime_get_with_offset+0x60/0xf0
tcp_v4_connect+0x436/0x530
__inet_stream_connect+0xd4/0x3a0
? kprobe_perf_func+0x4f/0x2b0
? aa_sk_perm+0x43/0x1c0
inet_stream_connect+0x3b/0x60
__sys_connect_file+0x63/0x70
__sys_connect+0xa6/0xd0
? setfl+0x108/0x170
? do_fcntl+0xe8/0x5a0
__x64_sys_connect+0x18/0x20
do_syscall_64+0x5c/0xc0
? __x64_sys_fcntl+0xa9/0xd0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? do_syscall_64+0x69/0xc0
? __sys_setsockopt+0xea/0x1e0
? exit_to_user_mode_prepare+0x37/0xb0
? syscall_exit_to_user_mode+0x27/0x50
? __x64_sys_setsockopt+0x1f/0x30
? do_syscall_64+0x69/0xc0
? irqentry_exit+0x1d/0x30
? exc_page_fault+0x89/0x170
entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f7b8101c6a7
Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
</TASK>
Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406
Upstream Status: net.git commit bd039b5ea2a9
commit bd039b5ea2a91ea707ee8539df26456bd5be80af
Author: Andy Ren <andy.ren@getcruise.com>
Date: Mon Nov 7 09:42:42 2022 -0800
net/core: Allow live renaming when an interface is up
Allow a network interface to be renamed when the interface
is up.
As described in the netconsole documentation [1], when netconsole is
used as a built-in, it will bring up the specified interface as soon as
possible. As a result, user space will not be able to rename the
interface since the kernel disallows renaming of interfaces that are
administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
by the kernel.
The original solution [2] to this problem was to add a new parameter to
the netconsole configuration parameters that allows renaming of
the interface used by netconsole while it is administratively up.
However, during the discussion that followed, it became apparent that we
have no reason to keep the current restriction and instead we should
allow user space to rename interfaces regardless of their administrative
state:
1. The restriction was put in place over 20 years ago when renaming was
only possible via IOCTL and before rtnetlink started notifying user
space about such changes like it does today.
2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
5.2 and no regressions were reported.
3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
the administrative state of interface.
Therefore, allow user space to rename running interfaces by removing the
restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
possible triage by emitting a message to the kernel log that an
interface was renamed while UP.
[1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
[2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/
Signed-off-by: Andy Ren <andy.ren@getcruise.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hangbin Liu <haliu@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273
Conflicts:
- adjusted upstream merge conflict which was resolved in 675f176b4dcc2b
("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")
Upstream commit(s):
commit b20b8aec6ffc07bb547966b356780cd344f20f5b
Author: Ido Schimmel <idosch@nvidia.com>
Date: Wed Feb 15 09:31:39 2023 +0200
devlink: Fix netdev notifier chain corruption
Cited commit changed devlink to register its netdev notifier block on
the global netdev notifier chain instead of on the per network namespace
one.
However, when changing the network namespace of the devlink instance,
devlink still tries to unregister its notifier block from the chain of
the old namespace and register it on the chain of the new namespace.
This results in corruption of the notifier chains, as the same notifier
block is registered on two different chains: The global one and the per
network namespace one. In turn, this causes other problems such as the
inability to dismantle namespaces due to netdev reference count issues.
Fix by preventing devlink from moving its notifier block between
namespaces.
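The corruption can be illustrated with a simplified userspace model of a chain whose entries carry a single next pointer. All names are hypothetical and the real notifier chains are more involved (priorities, RCU). The sketch only shows what happens when one block is linked into two chains.

```c
#include <stddef.h>

/* Hypothetical notifier block: one link, so it can sit on one chain only. */
struct nb_model {
	struct nb_model *next;
};

struct chain_model {
	struct nb_model *head;
};

/* Push a block at the head of a chain, overwriting its link. */
static void chain_register(struct chain_model *c, struct nb_model *nb)
{
	nb->next = c->head;
	c->head = nb;
}

/* Walk the chain and report whether a given block is reachable. */
static int chain_contains(const struct chain_model *c, const struct nb_model *nb)
{
	const struct nb_model *it;

	for (it = c->head; it; it = it->next)
		if (it == nb)
			return 1;
	return 0;
}
```

Registering the same block on a second chain rewrites its next pointer, so entries that were behind it on the first chain become unreachable and entries of the second chain appear on the first.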
Reproducer:
# echo "10 1" > /sys/bus/netdevsim/new_device
# ip netns add test123
# devlink dev reload netdevsim/netdevsim10 netns test123
# ip netns del test123
[ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 71.938348] leaked reference.
Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273
Upstream commit(s):
commit 3e52fba03a20234abc65a656cef063a1045d9723
Author: Jiri Pirko <jiri@nvidia.com>
Date: Tue Nov 8 14:22:06 2022 +0100
net: introduce a helper to move notifier block to different namespace
Currently, net_dev() netdev notifier variant follows the netdev with
per-net notifier from namespace to namespace. This is implemented
by move_netdevice_notifiers_dev_net() helper.
For devlink it is needed to re-register per-net notifier during
devlink reload. Introduce a new helper called
move_netdevice_notifier_net() and share the unregister/register code
with existing move_netdevice_notifiers_dev_net() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273
Upstream commit(s):
commit 02a68a47eadedf95748facfca6ced31fb0181d52
Author: Jiri Pirko <jiri@nvidia.com>
Date: Wed Nov 2 17:02:03 2022 +0100
net: devlink: track netdev with devlink_port assigned
Currently, ethernet drivers are using devlink_port_type_eth_set() and
devlink_port_type_clear() to set devlink port type and link to related
netdev.
Instead of calling them directly, let the driver use
SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let
devlink track it. Note the devlink port pointer is static during
the time netdevice is registered.
In devlink code, use per-namespace netdev notifier to track
the netdevices with devlink_port assigned and change the internal
devlink_port type and related type pointer accordingly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2175258
Conflicts:
- Removed chunks of unsupported protocol AX.25
- Renamed the functions also in ipvlan. Commit 40b9d1ab63f5 ("ipvlan: hold lower
dev to avoid possible use-after-free") was backported out of order so it had
to use the old function names.
commit d62607c3fe45911b2331fac073355a8c914bbde2
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Jun 7 21:39:55 2022 -0700
net: rename reference+tracking helpers
Netdev reference helpers have a dev_ prefix for historic
reasons. Renaming the old helpers would be too much churn
but we can rename the tracking ones which are relatively
recent and should be the default for new code.
Rename:
dev_hold_track() -> netdev_hold()
dev_put_track() -> netdev_put()
dev_replace_track() -> netdev_ref_replace()
Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only
Conflicts:
- context difference due to cc26c2661fef already in RHEL-9.
commit 86213f80da1b1d007721cc22e04b5f5d0da33127
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 17 22:54:30 2022 -0800
net: avoid quadratic behavior in netdev_wait_allrefs_any()
If the list of devices has N elements, netdev_wait_allrefs_any()
is called N times, and linkwatch_forget_dev() is called N*(N-1)/2 times.
Fix this by calling linkwatch_forget_dev() only once per device.
Fixes: faab39f63c1f ("net: allow out-of-order netdev unregistration")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220218065430.2613262-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only
Conflicts:
- context difference due to 05e49cfc89e4 already in RHEL-9.
commit faab39f63c1fc4bcdf135690f03bd596b578c67e
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Feb 15 14:53:10 2022 -0800
net: allow out-of-order netdev unregistration
Sprinkle for each loops to allow netdevices to be unregistered
out of order, as their refs are released.
This prevents problems caused by dependencies between netdevs
which want to release references in their ->priv_destructor.
See commit d6ff94afd90b ("vlan: move dev_put into vlan_dev_uninit")
for example.
Eric has removed the only known ordering requirement in
commit c002496babfd ("Merge branch 'ipv6-loopback'")
so let's try this and see if anything explodes...
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20220215225310.3679266-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only
Conflicts:
- context difference due to cc26c2661fef already in RHEL-9.
commit ae68db14b6164ce46beffaf35eb7c9bb2f92fee3
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Feb 15 14:53:09 2022 -0800
net: transition netdev reg state earlier in run_todo
In prep for unregistering netdevs out of order move the netdev
state validation and change outside of the loop.
While at it modernize this code and use WARN() instead of
pr_err() + dump_stack().
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20220215225310.3679266-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lxin@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1970
Bugzilla: https://bugzilla.redhat.com/2161921
commit d93607082e982223cf92750f2d9039ff365b9d24
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date: Wed Nov 30 23:28:26 2022 +0100
net: add netdev_sw_irq_coalesce_default_on()
Add a helper for drivers wanting to set SW IRQ coalescing
by default. The related sysfs attributes can be used to
override the default values.
Follow Jakub's suggestion and put this functionality into
net core so that drivers wanting to use software interrupt
coalescing per default don't have to open-code it.
Note that this function needs to be called before the
netdevice is registered.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Campbell <dacampbe@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2162711
Tested: vs bz reproducer
Upstream commit:
commit 672e97ef689a38cb20c2cc6a1814298fea34461e
Author: Paul Blakey <paulb@nvidia.com>
Date: Tue Oct 18 10:34:38 2022 +0300
net: Fix return value of qdisc ingress handling on success
Currently qdisc ingress handling (sch_handle_ingress()) doesn't
set a return value and it is left to the old return value of
the caller (__netif_receive_skb_core()) which is RX drop, so if
the packet is consumed, caller will stop and return this value
as if the packet was dropped.
This causes a problem in the kernel tcp stack when having an
egress tc rule forwarding to an ingress tc rule.
The tcp stack sending packets on the device having the egress rule
will see the packets as not successfully transmitted (although they
actually were), will not advance its internal state of sent data,
and packets returning on such tcp stream will be dropped by the tcp
stack with reason ack-of-unsent-data. See reproduction in [0] below.
Fix that by setting the return value to RX success if
the packet was handled successfully.
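The stale return value can be modelled in userspace as follows. This is a sketch with hypothetical names and a `fixed` switch added purely to compare both behaviours. It is not the kernel's receive path.

```c
/* Model return codes; names are hypothetical. */
enum {
	MODEL_RX_SUCCESS = 0,
	MODEL_RX_DROP = 1,
};

/* Returns 1 when the ingress hook consumed the packet. With 'fixed' set,
 * it also records success in the caller's return value. */
static int model_handle_ingress(int consumed, int fixed, int *ret)
{
	if (!consumed)
		return 0;
	if (fixed)
		*ret = MODEL_RX_SUCCESS;
	return 1;
}

/* Caller starts from "drop" and returns early if the packet was consumed,
 * so an unset value is reported as a drop. */
static int model_receive(int consumed, int fixed)
{
	int ret = MODEL_RX_DROP;

	if (model_handle_ingress(consumed, fixed, &ret))
		return ret;

	/* Packet continues through normal delivery. */
	ret = MODEL_RX_SUCCESS;
	return ret;
}
```

Without the fix, a consumed packet is reported to the sender as dropped, which is what confuses the tcp stack in the scenario above.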
[0] Reproduction steps:
$ ip link add veth1 type veth peer name peer1
$ ip link add veth2 type veth peer name peer2
$ ifconfig peer1 5.5.5.6/24 up
$ ip netns add ns0
$ ip link set dev peer2 netns ns0
$ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
$ ifconfig veth2 0 up
$ ifconfig veth1 0 up
#ingress forwarding veth1 <-> veth2
$ tc qdisc add dev veth2 ingress
$ tc qdisc add dev veth1 ingress
$ tc filter add dev veth2 ingress prio 1 proto all flower \
action mirred egress redirect dev veth1
$ tc filter add dev veth1 ingress prio 1 proto all flower \
action mirred egress redirect dev veth2
#steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
$ tc qdisc add dev peer1 clsact
$ tc filter add dev peer1 egress prio 20 proto ip flower \
action mirred ingress redirect dev veth1
#run iperf and see connection not running
$ iperf3 -s&
$ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
#delete egress rule, and run again, now should work
$ tc filter del dev peer1 egress
$ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
Fixes: f697c3e8b3 ("[NET]: Avoid unnecessary cloning for ingress filtering")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test
This series improves UDP protocol RX throughput, to keep it on equal footing with the rhel-8 one.
Patches 1,3,4 are there just to reduce the conflicts, and patch 4 is a very partial
backport, to avoid pulling unrelated features.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120968
commit 1fd6e5675336daf4747940b4285e84b0c114ae32
Author: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Date: Tue Jul 5 10:23:45 2022 +0200
xdp: Fix spurious packet loss in generic XDP TX path
The byte queue limits (BQL) mechanism is intended to move queuing from
the driver to the network stack in order to reduce latency caused by
excessive queuing in hardware. However, when transmitting or redirecting
a packet using generic XDP, the qdisc layer is bypassed and there are no
additional queues. Since netif_xmit_stopped() also takes BQL limits into
account, but without having any alternative queuing, packets are
silently dropped.
This patch modifies the drop condition to only consider cases when the
driver itself cannot accept any more packets. This is analogous to the
condition in __dev_direct_xmit(). Dropped packets are also counted on
the device.
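The difference between the two drop conditions can be sketched with a small userspace model. The flag and function names are hypothetical and the real queue state has more bits. The sketch assumes only that a BQL stop and a driver stop are tracked separately.

```c
/* Hypothetical queue state bits for this sketch. */
enum {
	MODEL_DRV_XOFF   = 1u << 0,	/* driver cannot accept more packets */
	MODEL_STACK_XOFF = 1u << 1,	/* byte queue limit (BQL) reached */
};

/* Old condition: any stop reason, including BQL, drops the packet. */
static int model_old_should_drop(unsigned int state)
{
	return (state & (MODEL_DRV_XOFF | MODEL_STACK_XOFF)) != 0;
}

/* New condition: only a driver-side stop drops the packet. */
static int model_new_should_drop(unsigned int state)
{
	return (state & MODEL_DRV_XOFF) != 0;
}
```

A queue stopped only by BQL was dropped under the old condition and is transmitted under the new one, since generic XDP has no qdisc in which the packet could wait.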
Bypassing the qdisc layer in the generic XDP TX path means that XDP
packets are able to starve other packets going through a qdisc, and
DDOS attacks will be more effective. In-driver-XDP use dedicated TX
queues, so they do not have this starvation issue.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850
commit 6510ea973d8d9d4a0cb2fb557b36bd1ab3eb49f6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Apr 25 18:39:46 2022 +0200
net: Use this_cpu_inc() to increment net->core_stats
The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
netdev_core_stats_alloc() to return a per-CPU pointer.
netdev_core_stats_alloc() will allocate memory on its first invocation
which breaks on PREEMPT_RT because it requires non-atomic context for
memory allocation.
This can be avoided by enabling preemption in netdev_core_stats_alloc()
assuming the caller always disables preemption.
It might be better to replace local_inc() with this_cpu_inc() now that
dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
not rely on already disabled preemption. This results in fewer
instructions on x86-64:
local_inc:
| incl %gs:__preempt_count(%rip) # __preempt_count
| movq 488(%rdi), %rax # _1->core_stats, _22
| testq %rax, %rax # _22
| je .L585 #,
| add %gs:this_cpu_off(%rip), %rax # this_cpu_off, tcp_ptr__
| .L586:
| testq %rax, %rax # _27
| je .L587 #,
| incq (%rax) # _6->a.counter
| .L587:
| decl %gs:__preempt_count(%rip) # __preempt_count
this_cpu_inc(), this patch:
| movq 488(%rdi), %rax # _1->core_stats, _5
| testq %rax, %rax # _5
| je .L591 #,
| .L585:
| incq %gs:(%rax) # _18->rx_dropped
Use unsigned long as type for the counter. Use this_cpu_inc() to
increment the counter. Use a plain read of the counter.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850
Conflicts:
- drivers/net/vxlan.c: file is not moved to drivers/net/vxlan/vxlan_core.c
due to missing 6765393614ea8 ("vxlan: move to its own directory");
context difference due to missing 4095e0e1328a3 ("drivers: vxlan:
vnifilter: per vni stats")
- net/core/dev.c: code difference in __netif_receive_skb_core due to
already applied 9f8ed577c2881 ("net: skb: rename
SKB_DROP_REASON_PTYPE_ABSENT"). Result is like upstream now.
- net/core/gro_cells.c: context difference due to already applied
5dcd08cd1991 ("net: Fix data-races around netdev_max_backlog.")
commit 625788b5844511cf4c30cffa7fa0bc3a69cebc82
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Mar 10 21:14:20 2022 -0800
net: add per-cpu storage and net->core_stats
Before adding yet another possibly contended atomic_long_t,
it is time to add per-cpu storage for existing ones:
dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler
Because many devices do not have to increment such counters,
allocate the per-cpu storage on demand, so that dev_get_stats()
does not have to spend considerable time folding zero counters.
Note that some drivers have abused these counters which
were supposed to be only used by core networking stack.
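A minimal userspace sketch of the idea, assuming a fixed CPU count and a plain array in place of real per-CPU storage. All `model_*` names are hypothetical, and preemption and concurrency are ignored.

```c
#include <stdlib.h>

#define MODEL_NR_CPUS 4	/* hypothetical fixed CPU count for the sketch */

struct model_core_stats {
	unsigned long rx_dropped;
	unsigned long tx_dropped;
	unsigned long rx_nohandler;
};

struct model_dev {
	struct model_core_stats *core_stats;	/* NULL until the first increment */
};

/* Allocate the per-CPU array on first use; returns this CPU's slot or NULL. */
static inline struct model_core_stats *
model_core_stats_get(struct model_dev *dev, int cpu)
{
	if (!dev->core_stats)
		dev->core_stats = calloc(MODEL_NR_CPUS, sizeof(*dev->core_stats));
	return dev->core_stats ? &dev->core_stats[cpu] : NULL;
}

static inline void model_rx_dropped_inc(struct model_dev *dev, int cpu)
{
	struct model_core_stats *s = model_core_stats_get(dev, cpu);

	if (s)
		s->rx_dropped++;
}

/* Fold all per-CPU slots; devices that never counted skip the loop entirely. */
static inline unsigned long model_get_rx_dropped(const struct model_dev *dev)
{
	unsigned long sum = 0;
	int cpu;

	if (!dev->core_stats)
		return 0;
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += dev->core_stats[cpu].rx_dropped;
	return sum;
}
```

A device that never drops a packet keeps a NULL pointer, so reading its statistics costs one pointer check instead of a fold over every CPU.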
v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
Change in netdev_core_stats_alloc() (Jakub)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: jeffreyji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1567
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
Tested: Using self-tests, results present in the BZ
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2133511
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128185
Commits:
```
b20dc3c68458 ("gtp: Allow to create GTP device without FDs")
9af41cc33471 ("gtp: Implement GTP echo response")
d33bd757d362 ("gtp: Implement GTP echo request")
e3acda7ade0a ("net/sched: Allow flower to match on GTP options")
81dd9849fa49 ("gtp: Add support for checking GTP device type")
02f393381d14 ("gtp: Fix inconsistent indenting")
4c096ea2d67c ("net/sched: matchall: Take verbose flag into account when logging error messages")
11c95317bc1a ("net/sched: flower: Take verbose flag into account when logging error messages")
c2ccf84ecb71 ("net/sched: act_api: Add extack to offload_act_setup() callback")
69642c2ab2f5 ("net/sched: act_gact: Add extack messages for offload failure")
4dcaa50d0292 ("net/sched: act_mirred: Add extack message for offload failure")
bca3821d19d9 ("net/sched: act_mpls: Add extack messages for offload failure")
bf3b99e4f9ce ("net/sched: act_pedit: Add extack message for offload failure")
b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure")
a9c64939b669 ("net/sched: act_skbedit: Add extack messages for offload failure")
ee367d44b936 ("net/sched: act_tunnel_key: Add extack message for offload failure")
f8fab3169464 ("net/sched: act_vlan: Add extack message for offload failure")
c440615ffbcb ("net/sched: cls_api: Add extack message for unsupported action offload")
0cba5c34b8f4 ("net/sched: matchall: Avoid overwriting error messages")
fd23e0e250c6 ("net/sched: flower: Avoid overwriting error messages")
c9a40d1c87e9 ("net_sched: make qdisc_reset() smaller")
7463acfbe52a ("netfilter: Rename ingress hook include file")
17d20784223d ("netfilter: Generalize ingress hook include file")
42df6e1d221d ("netfilter: Introduce egress hook")
2f1e85b1aee4 ("net: sched: use queue_mapping to pick tx queue")
38a6f0865796 ("net: sched: support hash selecting tx queue")
285ba06b0edb ("net/sched: flower: Helper function for vlan ethtype checks")
6ee59e554d33 ("net/sched: flower: Reduce identation after is_key_vlan refactoring")
b40003128226 ("net/sched: flower: Add number of vlan tags filter")
99fdb22bc5e9 ("net/sched: flower: Consider the number of tags for vlan filters")
b57c7e8b76c6 ("selftests: forwarding: tc_actions: allow mirred egress test to run on non-offloaded h2")
70f87de9fa0d ("net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()")
a2b1a5d40bd1 ("net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms")
1da9e27415bf ("tc-testing: gitignore, delete plugins directory")
6deb209dc6b0 ("net: Print hashed skb addresses for all net and qdisc events")
76b39b94382f ("net/sched: act_api: Notify user space if any actions were flushed before error")
88153e29c1e0 ("selftests: tc-testing: Add testcases to test new flush behaviour")
837ced3a1a5d ("time64.h: consolidate uses of PSEC_PER_NSEC")
d7be266adbfd ("net: sched: provide shim definitions for taprio_offload_{get,free}")
fc54d9065f90 ("net/sched: act_ct: set 'net' pointer when creating new nf_flow_table")
b038177636f8 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
b06ada6df9cf ("netfilter: flowtable: fix incorrect Kconfig dependencies")
83d85bb06915 ("net: extract port range fields from fl_flow_key")
bc5c8260f411 ("net/sched: remove return value of unregister_tcf_proto_ops")
88b3822cdf2f ("net/sched: sch_cbq: Delete unused delay_timer")
ca0cab119288 ("net/sched: remove qdisc_root_lock() helper")
c0f47c2822aa ("net/sched: cls_api: Fix flow action initialization")
5008750eff5d ("net/sched: flower: Add PPPoE filter")
a482d47d33ac ("net/sched: sch_cbq: change the type of cbq_set_lss to void")
06799a9085e1 ("net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS")
4873a1b2024d ("net/sched: remove hacks added to dev_trans_start() for bonding to work")
9ad36309e271 ("net_sched: cls_route: remove from list when handle is 0")
02799571714d ("net_sched: cls_route: disallow handle of 0")
b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
f612466ebecb ("net/sched: fix netdevice reference leaks in attach_default_qdiscs()")
9efd23297cca ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
2f09707d0c97 ("sch_sfb: Also store skb len before calling child enqueue")
db46e3a88a09 ("net/sched: taprio: avoid disabling offload when it was never enabled")
1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
c2e1cfefcac3 ("net: sched: fix possible refcount leak in tc_new_tfilter()")
6e23ec0ba92d ("net: sched: act_ct: fix possible refcount leak in tcf_ct_init()")
ffdd33dd9c12 ("netfilter: core: Fix clang warnings about unused static inlines")
6316136ec6e3 ("netfilter: egress: avoid a lockdep splat")
d645552e9bd9 ("netfilter: egress: Report interface as outgoing")
af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"")
8bdc2acd420c ("net: sched: Fix use after free in red_enqueue()")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
- bpf_arch_text_poke()
HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
Resolved in favour of !1464, but keep the return statement from !1477
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477
Bugzilla: https://bugzilla.redhat.com/2120966
Rebase BPF and XDP to the upstream kernel version 5.18
Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1577
Bugzilla: https://bugzilla.redhat.com/2139498
Tested: build, boot
Change the API of the netif_napi_add function family so that `netif_napi_add` and `netif_napi_add_tx` use weight = NAPI_POLL_WEIGHT by default (as most drivers were already doing in one way or another), and add `netif_napi_add_weight` and `netif_napi_add_tx_weight` for drivers that want to specify a custom NAPI weight.
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
commit 2f1e85b1aee459b7d0fd981839042c6a38ffaf0c
Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date: Sat Apr 16 00:40:45 2022 +0800
net: sched: use queue_mapping to pick tx queue
This patch fixes the following issue:
* tc filters with act_skbedit installed in the clsact hook do not
work, because netdev_core_pick_tx() overwrites queue_mapping.
$ tc filter ... action skbedit queue_mapping 1
And this patch is useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
are picked by XPS, by the ndo_select_queue of the netdev driver, or by
the skb hash in netdev_core_pick_tx(). In fact, the netdev driver and the
skb hash are _not_ under our control. XPS uses the CPU map to select Tx
queues, but in most cases we can't figure out which pod/container
task_struct is running on a given CPU. We can use clsact filters to
classify the traffic of one pod/container to one Tx queue. Why?
In a container networking environment, there are two kinds of pods/
containers/net namespaces. For the first kind (e.g. P1, P2), high throughput
is key. To avoid running out of network resources, the outbound traffic
of these pods is limited, and they use or share dedicated Tx queues with
an HTB/TBF/FQ Qdisc attached. For the other kind of pods (e.g. Pn), low
latency of data access is key, and the traffic is not limited. These pods
use or share other dedicated Tx queues with a FIFO Qdisc attached.
This choice provides two benefits. First, contention on the HTB/FQ Qdisc
lock is significantly reduced since fewer CPUs contend for the same queue.
More importantly, Qdisc contention can be eliminated completely if each
CPU has its own FIFO Qdisc for the second kind of pods.
There must be a mechanism in place to support classifying traffic based on
pods/container to different Tx queues. Note that clsact is outside of Qdisc
while Qdisc can run a classifier to select a sub-queue under the lock.
In general recording the decision in the skb seems a little heavy handed.
This patch introduces a per-CPU variable, suggested by Eric.
The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
- A Tx Qdisc may install skbedit actions; the xmit.skip_txqueue flag is
then set in qdisc->enqueue(), even though the Tx queue has already been
selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). Clearing
the flag first in __dev_queue_xmit() is useful to:
- Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next netdev
in a setup such as: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy.
For example, eth0 (a macvlan in the pod), whose root Qdisc installs skbedit
queue_mapping, sends packets to eth0.3 (a vlan in the host). __dev_queue_xmit()
of eth0.3 clears the flag and does not select the Tx queue according to
skb->queue_mapping, because there are no filters in clsact or in the Tx Qdisc
of this netdev. The same action is taken for eth0 (ixgbe in the host).
- Avoid picking the Tx queue for the next packet. If we set xmit.skip_txqueue
in the Tx Qdisc (qdisc->enqueue()), the proper way to clear it is
in __dev_queue_xmit() when processing the next packet.
For performance reasons, use a static key. If the user does not configure
NET_EGRESS, the patch will not be compiled.
+----+ +----+ +----+
| P1 | | P2 | | Pn |
+----+ +----+ +----+
| | |
+-----------+-----------+
|
| clsact/skbedit
| MQ
v
+-----------+-----------+
| q0 | q1 | qn
v v v
HTB/FQ HTB/FQ ... FIFO
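The flag handling described above can be sketched in userspace C. This is a simplified model, not the kernel code: a single global stands in for the per-CPU xmit.skip_txqueue flag, and the `model_*` names and the hash-based default pick are assumptions made for illustration.

```c
#include <stdbool.h>

#define MODEL_NR_TXQ 4	/* hypothetical number of Tx queues */

struct model_pkt {
	unsigned int hash;
	unsigned int queue_mapping;	/* chosen Tx queue */
};

/* Stand-in for the per-CPU xmit.skip_txqueue flag (one "CPU" only). */
static bool model_skip_txqueue;

/* Models an skbedit queue_mapping action run from an egress filter. */
static inline void model_skbedit(struct model_pkt *pkt, unsigned int queue)
{
	pkt->queue_mapping = queue;
	model_skip_txqueue = true;	/* tell the core not to overwrite it */
}

/* Default selection when no filter made a decision (hash-based here). */
static inline unsigned int model_pick_tx(const struct model_pkt *pkt)
{
	return pkt->hash % MODEL_NR_TXQ;
}

/* Models one pass through the xmit path of a single device. */
static inline unsigned int model_dev_queue_xmit(struct model_pkt *pkt,
						bool has_skbedit_filter,
						unsigned int filter_queue)
{
	model_skip_txqueue = false;	/* cleared first, per packet and per device */
	if (has_skbedit_filter)
		model_skbedit(pkt, filter_queue);
	if (!model_skip_txqueue)
		pkt->queue_mapping = model_pick_tx(pkt);
	return pkt->queue_mapping;
}
```

Calling the model twice on the same packet, first for a device with a filter and then for a stacked device without one, shows that the second device makes its own choice because the flag was cleared on entry.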
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
commit 42df6e1d221dddc0f2acf2be37e68d553ad65f96
Author: Lukas Wunner <lukas@wunner.de>
Date: Fri Oct 8 22:06:03 2021 +0200
netfilter: Introduce egress hook
Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
such as local ARP traffic generated for clustering purposes or DHCP
(Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)
The egress hook introduced herein complements the ingress hook added by
commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key"). A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.
Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc). On ingress, packets are classified first by
tc, then by netfilter. On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.
Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred). E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
In this case, netfilter egress classifying is not performed when leaving
the host namespace! That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying. That is only logical since it hasn't
passed through netfilter ingress classifying either.
Packets can alternatively be redirected at the netfilter layer using
nft fwd. Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.
Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress(). This ensures that
netfilter is applied both on the overlay and underlying network.
Interaction between tc and netfilter is possible by setting and querying
skb->mark.
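The interplay between the skip flag, tc redirects and tunnel recursion can be illustrated with a simplified userspace model. The `model_*` types and fields are hypothetical, and the sketch only counts how often each classifier would run; it is not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_skb {
	bool nf_skip_egress;
	int nf_egress_runs;	/* netfilter egress classifications of this packet */
	int tc_egress_runs;	/* tc egress classifications of this packet */
};

struct model_netdev {
	struct model_netdev *tc_redirect_to;	/* tc redirect target, or NULL */
	struct model_netdev *tunnel_lower;	/* underlay of a tunnel, or NULL */
};

static void model_dev_queue_xmit(struct model_skb *skb, struct model_netdev *dev)
{
	/* Netfilter runs first on egress, unless the packet is still on the tc layer. */
	if (!skb->nf_skip_egress)
		skb->nf_egress_runs++;

	/* While tc handles the packet, a redirect re-enters xmit with the flag set. */
	skb->nf_skip_egress = true;
	skb->tc_egress_runs++;
	if (dev->tc_redirect_to) {
		model_dev_queue_xmit(skb, dev->tc_redirect_to);
		return;
	}
	skb->nf_skip_egress = false;	/* reverted after tc egress handling */

	/* A tunnel driver transmits on the underlay, where netfilter applies again. */
	if (dev->tunnel_lower)
		model_dev_queue_xmit(skb, dev->tunnel_lower);
}
```

In this model a packet sent through a tunnel is classified by netfilter on both the overlay and the underlay, while a packet redirected by tc is classified by netfilter only once, on the device where it first reached the netfilter layer.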
If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:
Before: 1537 1538 1538 1537 1538 1537 Mb/sec
After: 1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec
After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec
When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured. That is caused by checking dev->nf_hooks_egress
against NULL.
Measurements were performed on a Core i7-3615QM. Commands to reproduce:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
Accept all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
Drop all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
commit 17d20784223d52bf1671f984c9e8d5d9b8ea171b
Author: Lukas Wunner <lukas@wunner.de>
Date: Fri Oct 8 22:06:02 2021 +0200
netfilter: Generalize ingress hook include file
Prepare for addition of a netfilter egress hook by generalizing the
ingress hook include file.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
commit 7463acfbe52ae8b7e0ea6890c1886b3f8ba8bddd
Author: Lukas Wunner <lukas@wunner.de>
Date: Fri Oct 8 22:06:01 2021 +0200
netfilter: Rename ingress hook include file
Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1419
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using iperf3 and toggling gso/tso offloading knobs
Commits:
```
2106efda785b ("net: remove .ndo_change_proto_down")
2cc6cdd44a16 ("net: unexport a handful of dev_* functions")
6264f58ca0e5 ("net: extract a few internals from netdevice.h")
6df6398f7c8b ("net: add netif_inherit_tso_max()")
14d7b8122fd5 ("net: don't allow user space to lift the device limits")
ee8b7a1156f3 ("net: make drivers set the TSO limit not the GSO limit")
744d49daf8bd ("net: move netif_set_gso_max helpers")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149
commit 9309f97aef6d8250bb484dabeac925c3a7c57716
Author: Petr Machata <petrm@nvidia.com>
Date: Wed Mar 2 18:31:20 2022 +0200
net: dev: Add hardware stats support
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
these statistics to the offloaded netdevice in question. The API was shaped
by the following considerations:
- Collection of HW statistics is not free: there may be a finite number of
counters, and the act of counting may have a performance impact. It is
therefore necessary to allow toggling whether HW counting should be done
for any particular SW netdevice.
- As the drivers are loaded and removed, a particular device may get
offloaded and unoffloaded again. At the same time, the statistics values
need to stay monotonic (modulo the eventual 64-bit wraparound),
increasing only to reflect traffic measured in the device.
To that end, the netdevice keeps around a lazily-allocated copy of struct
rtnl_link_stats64. Device drivers then contribute to the values kept
therein at various points. Even as the driver goes away, the struct stays
around to maintain the statistics values.
- Different HW devices may be able to count different things. The
motivation behind this patch in particular is exposure of HW counters on
Nvidia Spectrum switches, where the only practical approach to counting
traffic on offloaded soft netdevices currently is to use router interface
counters, and count L3 traffic. Correspondingly that is the statistics
suite added in this patch.
Other devices may be able to measure different kinds of traffic, and for
that reason, the APIs are built to allow uniform access to different
statistics suites.
- Because soft netdevices and offloading drivers are only loosely bound, a
netdevice uses a notifier chain to communicate with the drivers. Several
new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
to the offloading drivers.
- Devices can have various conditions for when a particular counter is
available. As the device is configured and reconfigured, the device
offload may become or cease being suitable for counter binding. A
netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
ping offloading drivers and determine whether anyone currently implements
a given statistics suite. This information can then be propagated to user
space.
When the driver decides to unoffload a netdevice, it can use a
newly-added function, netdev_offload_xstats_report_delta(), to record
outstanding collected statistics, before destroying the HW counter.
This patch adds a helper, call_netdevice_notifiers_info_robust(), for
dispatching a notifier with the possibility of unwind when one of the
consumers bails. Given the wish to eventually get rid of the global
notifier block altogether, this helper only invokes the per-netns notifier
block.
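As a rough illustration of the lazily allocated, monotonic statistics copy, the following userspace sketch uses a two-field stand-in for struct rtnl_link_stats64. The `model_*` names are hypothetical and the notifier machinery is left out entirely.

```c
#include <stdint.h>
#include <stdlib.h>

struct model_hw_stats {	/* small stand-in for struct rtnl_link_stats64 */
	uint64_t rx_packets;
	uint64_t tx_packets;
};

struct model_dev {
	struct model_hw_stats *l3_stats;	/* NULL while HW counting is disabled */
};

/* Toggle counting on: allocate the copy lazily. Returns 0 on success, -1 on failure. */
static inline int model_hw_stats_enable(struct model_dev *dev)
{
	if (!dev->l3_stats)
		dev->l3_stats = calloc(1, sizeof(*dev->l3_stats));
	return dev->l3_stats ? 0 : -1;
}

/* A driver folds its outstanding counts in before destroying the HW counter. */
static inline void model_hw_stats_report_delta(struct model_dev *dev,
					       const struct model_hw_stats *delta)
{
	if (!dev->l3_stats)
		return;	/* counting is not enabled for this netdevice */
	dev->l3_stats->rx_packets += delta->rx_packets;
	dev->l3_stats->tx_packets += delta->tx_packets;
}
```

Because the copy lives in the netdevice rather than in the driver, the values only ever grow, even if the driver that reported them is later removed.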
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2139498
commit 58caed3dacb4354a25a1aa8d2febc3e9648ba1f4
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon May 2 16:27:03 2022 -0700
netdev: reshuffle netif_napi_add() APIs to allow dropping weight
Most drivers should not have to worry about selecting the right
weight for their NAPI instances; they simply pass NAPI_POLL_WEIGHT.
It'd be best if we didn't require the argument at all and selected
the default internally.
This change prepares the ground for such reshuffling, allowing
for a smooth transition. The following API should remain after
the next release cycle:
netif_napi_add()
netif_napi_add_weight()
netif_napi_add_tx()
netif_napi_add_tx_weight()
Where the _weight() variants take an explicit weight argument.
I opted for a _weight() suffix rather than a __ prefix, because
we use __ in places to mean that caller needs to also issue a
synchronize_net() call.
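The resulting API shape can be sketched as follows. This is a userspace model with hypothetical `model_*` names; the value 64 is assumed for the default weight, mirroring NAPI_POLL_WEIGHT.

```c
#define MODEL_NAPI_POLL_WEIGHT 64	/* assumed default weight for the sketch */

struct model_napi {
	int (*poll)(struct model_napi *napi, int budget);
	int weight;
};

/* Explicit-weight variant, for drivers that need a custom value. */
static inline void model_napi_add_weight(struct model_napi *napi,
					 int (*poll)(struct model_napi *, int),
					 int weight)
{
	napi->poll = poll;
	napi->weight = weight;
}

/* Default variant: callers no longer pass a weight at all. */
static inline void model_napi_add(struct model_napi *napi,
				  int (*poll)(struct model_napi *, int))
{
	model_napi_add_weight(napi, poll, MODEL_NAPI_POLL_WEIGHT);
}

/* Trivial poll callback: processes at most 'weight' packets per call. */
static inline int model_poll(struct model_napi *napi, int budget)
{
	return budget < napi->weight ? budget : napi->weight;
}
```

The default variant is a thin wrapper around the explicit one, so existing callers that passed the default weight can drop the argument without a behaviour change.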
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
commit 0fe79f28bfaf73b66b7b1562d2468f94aa03bd12
Author: Alexander Duyck <alexanderduyck@fb.com>
Date: Fri May 13 11:34:03 2022 -0700
net: allow gro_max_size to exceed 65536
Allow gro_max_size to exceed 65536.
There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.
[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
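A small sketch of the two limits, using hypothetical `model_*` helpers. Whether an out-of-range request is rejected or clamped is an assumption of this model; only the constants and the IPv4/non-TCP cap come from the text above.

```c
#include <stdbool.h>

#define MODEL_GRO_LEGACY_MAX_SIZE 65536u	/* cap for IPv4 and non-TCP flows */
#define MODEL_GRO_MAX_SIZE (8u * 65535u)	/* upper bound chosen to avoid overflows */

/* A requested per-device gro_max_size is acceptable only up to the upper bound. */
static inline bool model_gro_max_size_valid(unsigned int requested)
{
	return requested <= MODEL_GRO_MAX_SIZE;
}

/* Effective limit for one flow: only IPv6 TCP may exceed the legacy cap. */
static inline unsigned int model_flow_gro_limit(unsigned int gro_max_size,
						bool ipv6_tcp)
{
	if (!ipv6_tcp && gro_max_size > MODEL_GRO_LEGACY_MAX_SIZE)
		return MODEL_GRO_LEGACY_MAX_SIZE;
	return gro_max_size;
}
```

With gro_max_size set to 200000, an IPv6 TCP flow may aggregate up to 200000 bytes in this model, while every other flow stays at 65536.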
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d
Author: Alexander Duyck <alexanderduyck@fb.com>
Date: Fri May 13 11:33:57 2022 -0700
net: allow gso_max_size to exceed 65536
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was that this was the
existing limit of the IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
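The relationship between the limits can be modelled in a few lines of userspace C. The `model_*` names are hypothetical, and the sketch assumes that lowering the TSO limit also lowers a larger current GSO limit.

```c
#include <limits.h>
#include <stdbool.h>

#define MODEL_GSO_LEGACY_MAX_SIZE 65536u	/* cap for non-TCPv6 flows */
#define MODEL_GSO_MAX_SIZE UINT_MAX		/* new upper limit */

struct model_dev {
	unsigned int tso_max_size;	/* device limit, set by the driver */
	unsigned int gso_max_size;	/* current limit, kept at or below tso_max_size */
};

/* The driver sets the device limit; the request is capped with GSO_MAX_SIZE. */
static inline void model_set_tso_max_size(struct model_dev *dev, unsigned int size)
{
	dev->tso_max_size = size < MODEL_GSO_MAX_SIZE ? size : MODEL_GSO_MAX_SIZE;
	if (dev->gso_max_size > dev->tso_max_size)
		dev->gso_max_size = dev->tso_max_size;
}

/* Per-flow limit: only TCP over IPv6 may use more than the legacy 64K. */
static inline unsigned int model_flow_gso_limit(const struct model_dev *dev,
						bool tcp_v6)
{
	if (!tcp_v6 && dev->gso_max_size > MODEL_GSO_LEGACY_MAX_SIZE)
		return MODEL_GSO_LEGACY_MAX_SIZE;
	return dev->gso_max_size;
}
```

In this model the device limit bounds the configurable GSO size, and the legacy 64K cap is applied per flow rather than per device.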
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Conflicts:
- context due to existing backport of 14d7b8122fd5 ("net: don't allow
user space to lift the device limits")
commit eac1b93c14d645ef147b049ace0d5230df755548
Author: Coco Li <lixiaoyan@google.com>
Date: Wed Jan 5 02:48:38 2022 -0800
gro: add ability to control gro max packet size
Eric Dumazet suggested allowing users to modify the max GRO packet size.
We have seen GRO being disabled by users of appliances (such as
wifi access points) because of claimed bufferbloat issues,
or some workarounds in sch_cake, to split GRO/GSO packets.
Instead of disabling GRO completely, one can choose to limit
the maximum packet size of GRO packets, depending on their
latency constraints.
This patch adds a per device gro_max_size attribute
that can be changed with ip link command.
ip link set dev eth0 gro_max_size 16000
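The effect of the limit on aggregation can be sketched as below. This is a hypothetical userspace model that assumes a merge is refused once the combined length would reach gro_max_size; the `model_*` names are not kernel symbols.

```c
#include <stdbool.h>

struct model_gro_pkt {
	unsigned int len;	/* bytes aggregated so far */
	unsigned int segs;	/* number of segments merged into this packet */
};

/* Try to merge one more segment; false means the held packet must be flushed. */
static inline bool model_gro_merge(struct model_gro_pkt *held,
				   unsigned int seg_len,
				   unsigned int gro_max_size)
{
	if (held->len + seg_len >= gro_max_size)
		return false;
	held->len += seg_len;
	held->segs++;
	return true;
}
```

A lower gro_max_size therefore means the aggregated packet is flushed up the stack after fewer segments, which bounds the latency added by aggregation.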
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1
Upstream commit:
commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Author: Paolo Abeni <pabeni@redhat.com>
Date: Wed Sep 28 10:43:09 2022 +0200
net: skb: introduce and use a single page frag cache
After commit 3226b158e6 ("net: avoid 32 x truesize under-estimation
for tiny skbs") we are observing 10-20% regressions in performance
tests with small packets. The perf trace points to high pressure on
the slab allocator.
This change tries to improve the allocation scheme for small packets
using an idea originally suggested by Eric: a new per CPU page frag is
introduced and used in __napi_alloc_skb to cope with small allocation
requests.
To ensure that the above does not lead to excessive truesize
underestimation, the frag size for small allocation is inflated to 1K
and all the above is restricted to build with 4K page size.
Note that we need to update accordingly the run-time check introduced
with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").
Alex suggested a smart page refcount schema to reduce the number
of atomic operations and deal properly with pfmemalloc pages.
Under small packet UDP flood, I measure a 15% peak throughput increase.
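The core idea (small requests inflated to 1K and carved out of a single 4K page) can be sketched in user space as follows. This is a simplified model only: `struct demo_frag_cache` and `demo_frag_alloc()` are hypothetical names, and the page refcounting, pfmemalloc handling and per-CPU placement described above are deliberately left out.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* the change is restricted to 4K pages */
#define DEMO_FRAG_SIZE 1024u	/* small allocations are inflated to 1K */

/* Hypothetical single-page cache; the real one is per CPU and refcounted. */
struct demo_frag_cache {
	unsigned char page[DEMO_PAGE_SIZE];
	size_t offset;		/* first byte not yet handed out */
};

/*
 * Hand out one 1K fragment for a small request. Returns NULL when the
 * request does not qualify as "small" or when the page is used up, in
 * which case a real implementation would fall back or refill the page.
 */
void *demo_frag_alloc(struct demo_frag_cache *c, size_t len)
{
	void *p;

	if (len > DEMO_FRAG_SIZE ||
	    c->offset + DEMO_FRAG_SIZE > DEMO_PAGE_SIZE)
		return NULL;

	p = c->page + c->offset;
	c->offset += DEMO_FRAG_SIZE;
	return p;
}
```

Each page yields exactly four fragments in this model, so the size accounted to a tiny packet is at most 1K rather than a full page.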
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1
Upstream commit:
commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jun 8 09:04:38 2022 -0700
net: add napi_get_frags_check() helper
This is a follow up of commit 3226b158e6
("net: avoid 32 x truesize under-estimation for tiny skbs")
When/if we increase MAX_SKB_FRAGS, we better make sure
the old bug will not come back.
Adding a check in napi_get_frags() would be costly,
even if using DEBUG_NET_WARN_ON_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120966
Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
42df6e1d221d ("netfilter: Introduce egress hook")
commit cd14e9b7b8d312dfbf75ce1f78552902e51b9045
Author: Martin KaFai Lau <kafai@fb.com>
Date: Wed Mar 2 11:56:22 2022 -0800
net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
The previous patches handled the delivery_time in the ingress path
before the routing decision is made. This patch can postpone clearing
delivery_time in a skb until knowing it is delivered locally and also
set the (rcv) timestamp if needed. This patch moves the
skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
and ip6_input_finish().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120966
Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
42df6e1d221d ("netfilter: Introduce egress hook")
commit d98d58a002619b5c165f1eedcd731e2fe2c19088
Author: Martin KaFai Lau <kafai@fb.com>
Date: Wed Mar 2 11:55:50 2022 -0800
net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
The previous patches handled the delivery_time before sch_handle_ingress().
This patch can now set the skb->mono_delivery_time to flag the skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
and also clear it with skb_clear_delivery_time() after
sch_handle_ingress(). This lets bpf_redirect_*() keep the mono
delivery_time so that it can be used by a qdisc (fq) of
the egress-ing interface.
A later patch will postpone the skb_clear_delivery_time() until the
stack learns that the skb is being delivered locally and that will
make other kernel forwarding paths (ip[6]_forward) able to keep
the delivery_time also. Thus, like the previous patches on using
the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
is not limited within the CONFIG_NET_INGRESS to avoid too many code
churns among this set.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120966
commit d93376f503c7a586707925957592c0f16f4db0b1
Author: Martin KaFai Lau <kafai@fb.com>
Date: Wed Mar 2 11:55:44 2022 -0800
net: Clear mono_delivery_time bit in __skb_tstamp_tx()
__skb_tstamp_tx() may clone the egress skb and queue the clone to
the sk_error_queue. The outgoing skb may have the mono delivery_time
while the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit needs to be cleared from the clone.
This patch adds the skb->mono_delivery_time clearing to the existing
__net_timestamp() and uses it in __skb_tstamp_tx().
The __net_timestamp() fast path usage in dev.c is changed to directly
call ktime_get_real() since the mono_delivery_time bit is not set at
that point.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120966
commit 27942a15209f564ed8ee2a9e126cb7b105181355
Author: Martin KaFai Lau <kafai@fb.com>
Date: Wed Mar 2 11:55:38 2022 -0800
net: Handle delivery_time in skb->tstamp during network tapping with af_packet
A later patch will set the skb->mono_delivery_time to flag the skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
skb_clear_tstamp() will then keep this delivery_time during forwarding.
This patch makes the network tapping (with af_packet) handle
the delivery_time stored in skb->tstamp.
Regardless of tapping at the ingress or egress, the tapped skb is
received by the af_packet socket, so it is ingress to the af_packet
socket and it expects the (rcv) timestamp.
When tapping at egress, dev_queue_xmit_nit() is used. It has already
expected skb->tstamp may have delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing to the af_packet sk.
This patch only adds clearing of the skb->mono_delivery_time
bit in net_timestamp_set().
When tapping at ingress, it currently expects the skb->tstamp is either 0
or the (rcv) timestamp. Meaning, the tapping at ingress path
has already expected the skb->tstamp could be 0 and it will get
the (rcv) timestamp by ktime_get_real() when needed.
There are two cases for tapping at ingress:
One case is af_packet queues the skb to its sk_receive_queue.
The skb is either not shared or a new clone is created. The newly
added skb_clear_delivery_time() is called to clear the
delivery_time (if any) and set the (rcv) timestamp if
needed before the skb is queued to the sk_receive_queue.
Another case, the ingress skb is directly copied to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
The newly added skb_tstamp() is used in tpacket_get_timestamp()
to check the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, the tapping@ingress has already expected
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and has handled this case by directly calling ktime_get_real().
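The two helpers described above can be modelled in user space roughly as follows. This is a sketch under stated assumptions: `struct demo_skb` and the `demo_*` functions are hypothetical stand-ins, and the current wall-clock time is passed in as a parameter instead of being read with ktime_get_real().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the two skb fields involved. */
struct demo_skb {
	int64_t tstamp;			/* (rcv) timestamp or delivery_time */
	bool mono_delivery_time;	/* set: tstamp is a mono delivery_time */
};

/* Return the (rcv) timestamp, or 0 if tstamp holds a delivery_time. */
int64_t demo_skb_tstamp(const struct demo_skb *skb)
{
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}

/*
 * Drop a delivery_time (if any) before the skb is queued to a socket.
 * If a receive timestamp is wanted, store now_real; otherwise store 0.
 */
void demo_skb_clear_delivery_time(struct demo_skb *skb,
				  bool want_rcv_tstamp, int64_t now_real)
{
	if (!skb->mono_delivery_time)
		return;

	skb->mono_delivery_time = false;
	skb->tstamp = want_rcv_tstamp ? now_real : 0;
}
```

A reader that finds 0 treats it like any other skb without a receive timestamp and obtains one itself, which matches the existing ingress tapping behaviour described above.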
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.
One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:
```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
swapper 0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
swapper 0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Conflicts:
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c
- small context conflicts
* drivers/net/usb/ax88179_178a.c
- hunk removed, the driver does not call netif_set_gso_max_size()
* drivers/net/usb/lan78xx.c
- modified due to absence of commits d383216a7efe ("lan78xx: Introduce
Tx URB processing improvements") and 0dd87266c133 ("lan78xx: Remove
hardware-specific header update")
commit ee8b7a1156f357613646d6c69d07ac5a087a1071
Author: Jakub Kicinski <kuba@kernel.org>
Date: Thu May 5 19:51:33 2022 -0700
net: make drivers set the TSO limit not the GSO limit
Drivers should call the TSO setting helper, GSO is controllable
by user space.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
to control gro max packet size")
commit 14d7b8122fd591693a2388b98563707ba72c6780
Author: Jakub Kicinski <kuba@kernel.org>
Date: Thu May 5 19:51:32 2022 -0700
net: don't allow user space to lift the device limits
Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
be set on device creation") the gso_max_segs and gso_max_size
of a device were not controlled from user space.
The quoted commit added the ability to control them because of
the following setup:
netns A | netns B
veth<->veth eth0
If eth0 has TSO limitations and user wants to efficiently forward
traffic between eth0 and the veths they should copy the TSO
limitations of eth0 onto the veths. This would happen automatically
for macvlans or ipvlan but veth users are not so lucky (given the
loose coupling).
Unfortunately the commit in question allowed users to also override
the limits on real HW devices.
It may be useful to control the max GSO size and someone may be using
that ability (not that I know of any user), so create a separate set
of knobs to reliably record the TSO limitations. Validate the user
requests.
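The validation step can be pictured with the following user-space sketch. The names are hypothetical (`struct demo_netdev`, `demo_set_gso_max_size()`), and the choice of -EINVAL as the error code is an assumption; the point is only that the user-controlled GSO value may never exceed the recorded TSO limitation of the device.

```c
#include <errno.h>

/* Hypothetical model: one device limit and one user-controlled knob. */
struct demo_netdev {
	unsigned int tso_max_size;	/* limitation recorded by the driver */
	unsigned int gso_max_size;	/* value user space may adjust */
};

/* Accept the request only if it stays within the device limitation. */
int demo_set_gso_max_size(struct demo_netdev *dev, unsigned int requested)
{
	if (requested > dev->tso_max_size)
		return -EINVAL;

	dev->gso_max_size = requested;
	return 0;
}
```

Lowering the value (for example to copy a physical device's limitations onto a veth) keeps working, while lifting it above what the device can do is refused.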
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
to control gro max packet size")
commit 6df6398f7c8b481ce83f28143bc08a5231616deb
Author: Jakub Kicinski <kuba@kernel.org>
Date: Thu May 5 19:51:31 2022 -0700
net: add netif_inherit_tso_max()
To make later patches smaller create a helper for inheriting
the TSO limitations of a lower device. The TSO in the name
is not an accident, subsequent patches will replace GSO
with TSO in more names.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Conflicts:
- slightly modified due to missing 0b5c21bbc01e ("net: ensure
net_todo_list is processed quickly") and d07b26f5bbea ("dev_addr:
add a modification check")
commit 6264f58ca0e54e41d63c2d00334a48bac28fbf30
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 6 14:37:54 2022 -0700
net: extract a few internals from netdevice.h
There's a number of functions and static variables used
under net/core/ but not from the outside. We currently
dump most of them into netdevice.h. That's bad for many
reasons:
- netdevice.h is very cluttered, hard to figure out
what the APIs are;
- netdevice.h is very long;
- we have to touch netdevice.h more which causes expensive
incremental builds.
Create a header under net/core/ and move some declarations.
The new header is also a bit of a catch-all but that's
fine, if we create more specific headers people will
likely over-think where their declarations fit best
and end up putting them in netdevice.h, again.
More work should be done on splitting netdevice.h into more
targeted headers, but that'd be more time consuming so small
steps.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflict:
- In __netif_receive_skb_core due to missing upstream commit
625788b58445 ("net: add per-cpu storage and net->core_stats") in c9s.
commit 9f8ed577c28813410614b418bad42285840c1a00
Author: Menglong Dong <imagedong@tencent.com>
Date: Thu Apr 7 14:20:50 2022 +0800
net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT
As David Ahern suggested, the reasons for skb drops should be more
general and not be code based.
Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to
SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no
L3 protocol handler, no L4 protocol handler, version extensions, etc.
Following the previous discussion, the aim is now to make these reasons
more abstract and user-oriented rather than tied to the code.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 6c2728b7c14164928cb7cb9c847dead101b2d503
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:46 2022 +0800
net: dev: use kfree_skb_reason() for __netif_receive_skb_core()
Add a reason for skb drops to __netif_receive_skb_core() when no
packet_type is found to handle the skb. For this purpose, the drop reason
SKB_DROP_REASON_PTYPE_ABSENT is introduced. Taking Ethernet packets as an
example, this case mainly happens when the L3 protocol is not supported.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit a568aff26ac03ee9eb1482683514914a5ec3b4c3
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:45 2022 +0800
net: dev: use kfree_skb_reason() for sch_handle_ingress()
Replace kfree_skb() used in sch_handle_ingress() with
kfree_skb_reason(). The following drop reason is introduced:
SKB_DROP_REASON_TC_INGRESS
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 7e726ed81e1ddd5fdc431e02b94fcfe2a9876d42
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:44 2022 +0800
net: dev: use kfree_skb_reason() for do_xdp_generic()
Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason().
The drop reason SKB_DROP_REASON_XDP is introduced for this case.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 44f0bd40803c0e04f1c8cd59df3c7acce783ae9c
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:43 2022 +0800
net: dev: use kfree_skb_reason() for enqueue_to_backlog()
Replace kfree_skb() used in enqueue_to_backlog() with
kfree_skb_reason(). The skb drop reason SKB_DROP_REASON_CPU_BACKLOG is
introduced for the case of failing to enqueue the skb to the per CPU
backlog queue. The underlying cause can be a full backlog queue or the RPS
flow limitation, and I think we needn't make further distinctions.
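The pattern of attaching a reason to the failure path can be sketched in user space like this. Everything here is a simplified, hypothetical model (`enum demo_drop_reason`, `struct demo_backlog`, `demo_enqueue_to_backlog()`); in particular, the RPS flow limit check is omitted and the return values are illustrative.

```c
/* Hypothetical subset of drop reasons, mirroring the naming style only. */
enum demo_drop_reason {
	DEMO_DROP_REASON_NOT_SPECIFIED,
	DEMO_DROP_REASON_CPU_BACKLOG,
};

/* Hypothetical per-CPU backlog, reduced to a counter and its limit. */
struct demo_backlog {
	unsigned int qlen;
	unsigned int max_backlog;
};

/*
 * Try to queue one packet. Returns 0 on success. On failure returns -1
 * and reports why through *reason, instead of freeing silently.
 */
int demo_enqueue_to_backlog(struct demo_backlog *q,
			    enum demo_drop_reason *reason)
{
	if (q->qlen < q->max_backlog) {
		q->qlen++;
		return 0;
	}

	*reason = DEMO_DROP_REASON_CPU_BACKLOG;
	return -1;
}
```

A tracer hooked on the free path then sees a specific reason rather than an unspecified one whenever the backlog is the cause of the drop.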
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 7faef0547f4c29031a68d058918b031a8e520d49
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:42 2022 +0800
net: dev: add skb drop reasons to __dev_xmit_skb()
Add reasons for skb drops to __dev_xmit_skb() by replacing
kfree_skb_list() with kfree_skb_list_reason(). The drop reason of
SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 98b4d7a4e7374a44c4afd9f08330e72f6ad0d644
Author: Menglong Dong <imagedong@tencent.com>
Date: Fri Mar 4 14:00:40 2022 +0800
net: dev: use kfree_skb_reason() for sch_handle_egress()
Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason().
The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering
the code path of tc egress, we make it distinct from the drop reason
SKB_DROP_REASON_QDISC_DROP introduced in the next commit.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: chunk applied into netdev_wait_allrefs() instead of \
netdev_wait_allrefs_any() and with different context as rhel-9 \
lacks the upstream commit faab39f63c1fc ("net: allow out-of-order \
netdev unregistration")
Upstream commit:
commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:59 2022 -0700
net: Fix a data-race around netdev_unregister_timeout_secs.
While reading netdev_unregister_timeout_secs, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Upstream commit:
commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:55 2022 -0700
net: Fix a data-race around netdev_budget_usecs.
While reading netdev_budget_usecs, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Upstream commit:
commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:53 2022 -0700
net: Fix a data-race around netdev_budget.
While reading netdev_budget, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 51b0bdedb8 ("[NET]: Separate two usages of netdev_max_backlog.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Upstream commit:
commit 61adf447e38664447526698872e21c04623afb8e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:47 2022 -0700
net: Fix data-races around netdev_tstamp_prequeue.
While reading netdev_tstamp_prequeue, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 3b098e2d7c ("net: Consistent skb timestamping")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Upstream commit:
commit 5dcd08cd19912892586c6082d56718333e2d19db
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:46 2022 -0700
net: Fix data-races around netdev_max_backlog.
While reading netdev_max_backlog, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
While at it, we remove the unnecessary spaces in the doc.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Upstream commit:
commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Tue Aug 23 10:46:45 2022 -0700
net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
While reading weight_p, it can be changed concurrently. Thus, we need
to add READ_ONCE() to its reader.
Also, dev_[rt]x_weight can be read/written at the same time. So, we
need to use READ_ONCE() and WRITE_ONCE() for its access. Moreover, to
use the same weight_p while changing dev_[rt]x_weight, we add a mutex
in proc_do_dev_weight().
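A user-space approximation of this scheme is sketched below. The `DEMO_READ_ONCE`/`DEMO_WRITE_ONCE` macros are simplified volatile-cast stand-ins for the kernel macros (they rely on the GCC/Clang `__typeof__` extension), and all `demo_*` names are hypothetical. The writer updates the derived weights under a mutex so both are computed from the same base weight, while readers take a single untorn load without the lock.

```c
#include <pthread.h>

/* Simplified stand-ins for the kernel macros; an assumption of this sketch. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static pthread_mutex_t demo_weight_mutex = PTHREAD_MUTEX_INITIALIZER;

int demo_weight_p = 64;		/* base weight set by the administrator */
int demo_rx_bias = 1;
int demo_tx_bias = 1;
int demo_rx_weight = 64;	/* derived values read in the fast path */
int demo_tx_weight = 64;

/* Writer side: recompute both derived weights from one base value. */
void demo_update_weights(int new_weight)
{
	int w;

	pthread_mutex_lock(&demo_weight_mutex);
	DEMO_WRITE_ONCE(demo_weight_p, new_weight);
	w = demo_weight_p;	/* stable while the mutex is held */
	DEMO_WRITE_ONCE(demo_rx_weight, w * demo_rx_bias);
	DEMO_WRITE_ONCE(demo_tx_weight, w * demo_tx_bias);
	pthread_mutex_unlock(&demo_weight_mutex);
}

/* Reader side: lockless, but the load cannot be torn or repeated. */
int demo_rx_budget(void)
{
	return DEMO_READ_ONCE(demo_rx_weight);
}
```

The mutex serializes concurrent writers only; it does not make the pair of derived weights visible atomically to readers, which is acceptable here because each reader uses just one of them.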
Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
commit 2cc6cdd44a1655ac5a9863529a2fd6dbed2d092c
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 6 14:37:53 2022 -0700
net: unexport a handful of dev_* functions
We have a bunch of functions which are only used under
net/core/ yet they get exported. Remove the exports.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Conflicts:
- small context conflict due to existing backport of 3b89b511ea0c ("net:
fix IFF_TX_SKB_NO_LINEAR definition")
commit 2106efda785b55a8957efed9a52dfa28ee0d7280
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Nov 22 17:24:47 2021 -0800
net: remove .ndo_change_proto_down
.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)
We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620
commit 382778edc8262b7535f00523e9eb22edba1b9816
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Fri Jan 7 23:11:13 2022 +0100
xdp: check prog type before updating BPF link
The bpf_xdp_link_update() function didn't check the program type before
updating the program, which made it possible to install any program type as
an XDP program, which is obviously not good. Syzbot managed to trigger this
by swapping in an LWT program on the XDP hook which would crash in a helper
call.
Fix this by adding a check and bailing out if the types don't match.
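In spirit, the added check looks like the following user-space sketch. The types and the function are hypothetical (`struct demo_prog`, `struct demo_link`, `demo_link_update()`), and returning -EINVAL is an assumption of this sketch rather than something stated above.

```c
#include <errno.h>

/* Hypothetical program types; only two are needed for the illustration. */
enum demo_prog_type {
	DEMO_PROG_TYPE_XDP,
	DEMO_PROG_TYPE_LWT,
};

struct demo_prog {
	enum demo_prog_type type;
};

struct demo_link {
	const struct demo_prog *prog;	/* program currently attached */
};

/* Swap the attached program only if the replacement has the same type. */
int demo_link_update(struct demo_link *link, const struct demo_prog *new_prog)
{
	if (new_prog->type != link->prog->type)
		return -EINVAL;		/* bail out: types don't match */

	link->prog = new_prog;
	return 0;
}
```

With the check in place, an attempt to swap an LWT program onto an XDP link fails up front and leaves the original program attached.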
Fixes: 026a4c28e1 ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link")
Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1066
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789
Tested: Just built - there is no functional change
The series moves GRO-related definitions, declarations and code from core
files into net/core/gro.h and include/net/gro.h, and shrinks the overly
large files include/linux/netdevice.h and net/core/dev.c. Backporting this
series provides <net/gro.h> for NIC drivers and avoids conflicts in future
GRO-related backports and fixes.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Conflicts:
- include/linux/netdevice.h: fuzz.
Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454
Conflicts:
- N/A hunk for unsupported octeontx2 driver omitted
commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca
Author: Paolo Abeni <pabeni@redhat.com>
Date: Tue Nov 30 11:08:07 2021 +0100
bpf: Let bpf_warn_invalid_xdp_action() report more info
In non trivial scenarios, the action id alone is not sufficient to
identify the program causing the warning. Before the previous patch,
the generated stack-trace pointed out at least the involved device
driver.
Let's additionally include the program name and id, and the relevant
device name.
If the user needs additional information, they can fetch it via a kernel
probe, leveraging the arguments added here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com
Signed-off-by: Ivan Vecera <ivecera@redhat.com>