2210 Commits

Author SHA1 Message Date
Felix Maurer 471c2b7e9a net: move dev->state into net_device_read_txrx group
JIRA: https://issues.redhat.com/browse/RHEL-30902

commit f6e0a4984c2e7244689ea87b62b433bed9d07e94
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 14 20:08:45 2024 +0000

    net: move dev->state into net_device_read_txrx group
    
    dev->state can be read in rx and tx fast paths.
    
    netif_running() which needs dev->state is called from
    - enqueue_to_backlog() [RX path]
    - __dev_direct_xmit()  [TX path]
    
    Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:20 +02:00
Felix Maurer 74175b4d44 net-device: move lstats in net_device_read_txrx
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Context difference due to missing 34d21de99cea
  ("net: Move {l,t,d}stats allocation to core and convert veth & vrf");
  this doesn't affect that the stats pointer union itself is read in the rx
  and tx fast paths.

commit c353c7b7ffb7ae6ed8f3339906fe33c8be6cf344
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 8 14:43:23 2024 +0000

    net-device: move lstats in net_device_read_txrx

    dev->lstats is notably used from loopback ndo_start_xmit()
    and other virtual drivers.

    Per cpu stats updates are dirtying per-cpu data,
    but the pointer itself is read-only.

    Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Cc: Simon Horman <horms@kernel.org>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:19 +02:00
Felix Maurer 9d446c94ae net-device: move xdp_prog to net_device_read_rx
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: code difference because we are maintaining kABI
  exclusions.

commit d3d344a1ca69d8fb2413e29e6400f3ad58a05c06
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jan 2 16:22:20 2024 +0000

    net-device: move xdp_prog to net_device_read_rx

    xdp_prog is used in receive path, both from XDP enabled drivers
    and from netif_elide_gro().

    This patch also removes two 4-byte holes.

    Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Cc: Simon Horman <horms@kernel.org>
    Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:19 +02:00
Felix Maurer 00b76bfb2e net-device: move gso_partial_features to net_device_read_tx
JIRA: https://issues.redhat.com/browse/RHEL-30902

commit 993498e537af9260e697219ce41b41b22b6199cc
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Dec 21 14:07:47 2023 +0000

    net-device: move gso_partial_features to net_device_read_tx
    
    dev->gso_partial_features is read from tx fast path for GSO packets.
    
    Move it to appropriate section to avoid a cache line miss.
    
    Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Cc: David Ahern <dsahern@kernel.org>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:18 +02:00
Felix Maurer dcdc5896bb net-device: reorganize net_device fast path variables
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Conflicts due to kABI exclusions in the
  struct. Reordering kABI excluded fields maintains the kABI exclusion.
- include/linux/netdevice.h: Context differences due to missing patches
  from upstream.

commit 43a71cd66b9c0a4af3d15d8644359fde35bdbed0
Author: Coco Li <lixiaoyan@google.com>
Date:   Mon Dec 4 20:12:30 2023 +0000

    net-device: reorganize net_device fast path variables

    Reorganize fast path variables in tx-txrx-rx order.
    Fastpath variables end after npinfo.

    Below data generated with pahole on x86 architecture.

    Fast path variables span cache lines before change: 12
    Fast path variables span cache lines after change: 4

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Coco Li <lixiaoyan@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:17 +02:00
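Note on the reorganization above: the effect of grouping fast path fields can be illustrated with a small userspace sketch. It assumes 64-byte cache lines and a cache-line-aligned struct, and both structs are hypothetical stand-ins, not the real `net_device` layout. All `demo_*` names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Hypothetical layout: hot fields separated by cold data. */
struct demo_dev_scattered {
    uint64_t hot_tx;
    char     cold_1[120];
    uint64_t hot_txrx;
    char     cold_2[120];
    uint64_t hot_rx;
};

/* Hypothetical layout: the same fields with the hot ones grouped first. */
struct demo_dev_grouped {
    uint64_t hot_tx;
    uint64_t hot_txrx;
    uint64_t hot_rx;
    char     cold_1[120];
    char     cold_2[120];
};

/*
 * Count the distinct cache lines in which the given field offsets start.
 * Only the start offset is considered, which is sufficient here because
 * the hot fields are 8 bytes wide and never straddle a line.
 */
static unsigned demo_lines_touched(const size_t *offsets, size_t n)
{
    unsigned distinct = 0;
    size_t i, j;

    assert(offsets != NULL && n > 0);
    for (i = 0; i < n; i++) {
        int seen = 0;

        for (j = 0; j < i; j++)
            if (offsets[j] / DEMO_CACHE_LINE == offsets[i] / DEMO_CACHE_LINE)
                seen = 1;
        if (!seen)
            distinct++;
    }
    return distinct;
}

/* Cache lines touched by the hot fields in the scattered layout. */
static unsigned demo_scattered_lines(void)
{
    const size_t offs[] = {
        offsetof(struct demo_dev_scattered, hot_tx),
        offsetof(struct demo_dev_scattered, hot_txrx),
        offsetof(struct demo_dev_scattered, hot_rx),
    };
    return demo_lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}

/* Cache lines touched by the hot fields in the grouped layout. */
static unsigned demo_grouped_lines(void)
{
    const size_t offs[] = {
        offsetof(struct demo_dev_grouped, hot_tx),
        offsetof(struct demo_dev_grouped, hot_txrx),
        offsetof(struct demo_dev_grouped, hot_rx),
    };
    return demo_lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}
```

Calling `demo_scattered_lines()` and `demo_grouped_lines()` from a test program compares the two layouts. On a real kernel struct, pahole (as used in the commit) reports the actual figures.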
cki-backport-bot 2e00a59dfd packet: annotate data-races around ignore_outgoing
JIRA: https://issues.redhat.com/browse/RHEL-33238
CVE: CVE-2024-26862

commit 6ebfad33161afacb3e1e59ed1c2feefef70f9f97
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 14 14:18:16 2024 +0000

    packet: annotate data-races around ignore_outgoing

    ignore_outgoing is read locklessly from dev_queue_xmit_nit()
    and packet_getsockopt()

    Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

    syzbot reported:

    BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt

    write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
     packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
     do_sock_setsockopt net/socket.c:2311 [inline]
     __sys_setsockopt+0x1d8/0x250 net/socket.c:2334
     __do_sys_setsockopt net/socket.c:2343 [inline]
     __se_sys_setsockopt net/socket.c:2340 [inline]
     __x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
     do_syscall_64+0xd3/0x1d0
     entry_SYSCALL_64_after_hwframe+0x6d/0x75

    read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
     dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
     xmit_one net/core/dev.c:3527 [inline]
     dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
     __dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
     dev_queue_xmit include/linux/netdevice.h:3091 [inline]
     batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
     batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
     batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
     batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
     batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
     process_one_work kernel/workqueue.c:3254 [inline]
     process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
     worker_thread+0x526/0x730 kernel/workqueue.c:3416
     kthread+0x1d1/0x210 kernel/kthread.c:388
     ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243

    value changed: 0x00 -> 0x01

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G        W          6.8.0-syzkaller-08073-g480e035fc4c7 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
    Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet

    Fixes: fa788d986a ("packet: add sockopt to ignore outgoing packets")
    Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: cki-backport-bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-06-25 17:47:49 +00:00
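Note on the annotations above: a minimal userspace sketch of the READ_ONCE()/WRITE_ONCE() pattern. The two macros here are simplified stand-ins built on volatile accesses (the kernel's definitions are more elaborate), `__typeof__` is a GCC/Clang extension, and `struct demo_sock` with its helpers is hypothetical rather than the af_packet code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: force a single, untorn access through a volatile lvalue. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket with a flag that is read without holding a lock. */
struct demo_sock {
    uint8_t ignore_outgoing;
};

/* Writer side (setsockopt-like path): publish the new flag value. */
static void demo_set_ignore_outgoing(struct demo_sock *s, int val)
{
    assert(s != NULL);
    WRITE_ONCE(s->ignore_outgoing, !!val);
}

/* Lockless reader side (transmit tap-like path): sample the flag once. */
static int demo_skip_outgoing(const struct demo_sock *s)
{
    assert(s != NULL);
    return READ_ONCE(s->ignore_outgoing);
}
```

The annotations do not add ordering or locking. They mark the access as intentionally racy so that the compiler cannot tear, fuse, or repeat it, and so that KCSAN stops reporting it.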
Lucas Zampieri 4aba3f45f9 Merge: CNB95: ethtool: update ethtool core to upstream v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4236

JIRA: https://issues.redhat.com/browse/RHEL-36217  

Commits:
```
b534dc46c8ae ("net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP")
70f7457ad6d6 ("net: create device lookup API with reference tracking")
3515440df461 ("ipv6: also use netdev_hold() in ip6_route_check_nh()")
108a36d07c01 ("ethtool: Fix mod state of verbose no_mask bitset")
524515020f25 ("Revert "ethtool: Fix mod state of verbose no_mask bitset"")
f55d8e60f109 ("net: ethtool: Fix documentation of ethtool_sprintf()")
65c9fde15a65 ("net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
0bca3f7f9acd ("net: macvlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
c0dabeb4c666 ("net: bonding: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
ef5eb9c5ce45 ("net: fec: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
547b006d1922 ("net: fec: delete fec_ptp_disable_hwts()")
fd770e856e22 ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
c35e927cbe09 ("net: omit ndo_hwtstamp_get() call when possible in dev_set_hwtstamp_phylib()")
446e2305827b ("net: Convert PHYs hwtstamp callback to use kernel_hwtstamp_config")
430dc3256d57 ("net: phy: Remove the call to phy_mii_ioctl in phy_hwstamp_get/set")
b8768dc40777 ("net: ethtool: Refactor identical get_ts_info implementations.")
202cb220026e ("net: macb: Convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
011dd3b3f83f ("net: Make dev_set_hwtstamp_phylib accessible")
915d25a9d69b ("net: phy: micrel: fix ts_info value in case of no phc")
acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
289354f21b2c ("net: partial revert of the "Make timestamping selectable: series")
cc124ad39288 ("Documentation: networking: add missing PLCA messages from the message list")
d0c3891db2d2 ("ethtool: reformat kerneldoc for struct ethtool_link_settings")
1271ca00aa7f ("ethtool: reformat kerneldoc for struct ethtool_fec_stats")
f1172f3ee3a9 ("ethtool: netlink: Add missing ethnl_ops_begin/complete")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-06 19:18:41 +00:00
Ivan Vecera c7df211c47 net: Add NAPI IRQ support
JIRA: https://issues.redhat.com/browse/RHEL-30139

Conflicts:
- context conflict due to RH KABI reservations for z-stream

commit 26793bfb5d6072326d1465343e7cbf6156abca4f
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date:   Fri Dec 1 15:29:07 2023 -0800

    net: Add NAPI IRQ support

    Add support to associate the interrupt vector number for a
    NAPI instance.

    Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
    Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
    Link: https://lore.kernel.org/r/170147334728.5260.13221803396905901904.stgit@anambiarhost.jf.intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-06-05 17:57:52 +02:00
Ivan Vecera daa10dac98 netdev-genl: Add netlink framework functions for napi
JIRA: https://issues.redhat.com/browse/RHEL-30139

Conflicts:
- context conflict due to missing 9a675ba55a96 ("net, bpf: Add
  a warning if NAPI cb missed xdp_do_flush().")

commit 27f91aaf49b3a50e5a02ad5fa27b7c453d029a72
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date:   Fri Dec 1 15:28:56 2023 -0800

    netdev-genl: Add netlink framework functions for napi

    Implement the netdev netlink framework functions for
    napi support. The netdev structure tracks all the napi
    instances and napi fields. The napi instances and associated
    parameters can be retrieved this way.

    Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
    Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
    Link: https://lore.kernel.org/r/170147333637.5260.14807433239805550815.stgit@anambiarhost.jf.intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-06-05 17:57:52 +02:00
Ivan Vecera eb61cb17fb net: Add queue and napi association
JIRA: https://issues.redhat.com/browse/RHEL-30139

Conflicts:
- context conflict due to RH KABI reservations for z-stream

commit 2a502ff0c4e42a739b5aa550c901bf3852795532
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date:   Fri Dec 1 15:28:34 2023 -0800

    net: Add queue and napi association

    Add the napi pointer in netdev queue for tracking the napi
    instance for each queue. This achieves the queue<->napi mapping.

    Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
    Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
    Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-06-05 17:57:52 +02:00
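Note on the queue<->napi mapping above: a minimal sketch of the idea, assuming a fixed number of queues. Every type and function name below is hypothetical and does not reflect the kernel's actual structures or helper signatures.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_QUEUES 4 /* assumption: small fixed queue count */

/* Hypothetical stand-in for a NAPI instance. */
struct demo_napi {
    int id;
};

/* Each queue stores a pointer to the NAPI instance that services it. */
struct demo_queue {
    struct demo_napi *napi;
};

struct demo_dev {
    struct demo_queue rx[DEMO_NUM_QUEUES];
};

/* Associate queue idx with a NAPI instance, or clear it with napi == NULL. */
static void demo_queue_set_napi(struct demo_dev *dev, size_t idx,
                                struct demo_napi *napi)
{
    assert(dev != NULL && idx < DEMO_NUM_QUEUES);
    dev->rx[idx].napi = napi;
}

/* Return the NAPI id for queue idx, or -1 when nothing is associated. */
static int demo_queue_napi_id(const struct demo_dev *dev, size_t idx)
{
    assert(dev != NULL && idx < DEMO_NUM_QUEUES);
    return dev->rx[idx].napi ? dev->rx[idx].napi->id : -1;
}
```

Storing the pointer in the queue keeps the lookup from queue to NAPI instance a single dereference, which is what lets a later netlink dump report the association per queue.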
Scott Weaver ef1d77c61f Merge: CNB95: net/sched: update TC core to upstream v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4235

JIRA: https://issues.redhat.com/browse/RHEL-36218

Note that patch 2 is needed for patch 3 to avoid compiler warnings and patch 1 is a dependency for patch 2.

Commits:
```
4eb6bd55cfb2 ("compiler.h: drop fallback overflow checkers")
d219d2a9a92e ("overflow: Allow mixed type arguments")
8798481b667f ("net/sched: wrap open coded Qdics class filter counter")
daf8d9181b9b ("net/sched: sch_drr: warn about class in use while deleting")
e20e75017c5a ("net/sched: sch_qfq: warn about class in use while deleting")
a57c34a80cbe ("net: flow_dissector: Add IPSEC dissector")
4c13eda757e3 ("tc: flower: support for SPI")
c8915d7329d6 ("tc: flower: Enable offload support IPSEC SPI field.")
992b47851be9 ("net: pkt_cls: Remove unused inline helpers")
09e0c3bbde90 ("net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()")
25b0d4e4e41f ("net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode")
98766add2d55 ("net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()")
6e0ec800c174 ("net/sched: taprio: delete misleading comment about preallocating child qdiscs")
665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
40b0425f8ba1 ("net: ptp: create a mock-up PTP Hardware Clock driver")
b63e78fca889 ("net: netdevsim: use mock PHC driver")
35da47fe1c47 ("net: netdevsim: mimic tc-taprio offload")
355adce3010b ("selftests/tc-testing: add ptp_mock Kconfig dependency")
1890cf08bd99 ("selftests/tc-testing: test that taprio can only be attached as root")
29c298d2bc82 ("selftests/tc-testing: verify that a qdisc can be grafted onto a taprio class")
4072d97ddc44 ("netem: add prng attribute to netem_sched_data")
9c87b2aeccf1 ("netem: use a seeded PRNG for generating random losses")
3cad70bc74ef ("netem: use seeded PRNG for correlated loss events")
8c21ab1bae94 ("net/sched: fq_pie: avoid stalls in fq_pie_timer()")
8fc134fee27f ("net: sched: sch_qfq: Fix UAF in qfq_dequeue()")
a5e2151ff9d5 ("net/ipv6: SKB symmetric hash should incorporate transport ports")
70ad43333cbe ("selftests/tc-testing: cls_fw: add tests for classid")
7c339083616c ("selftests/tc-testing: cls_route: add tests for classid")
e2f2fb3c352d ("selftests/tc-testing: cls_u32: add tests for classid")
ef765c258759 ("net/sched: cls_route: make netlink errors meaningful")
98cfbe4234a4 ("selftests/tc-testing: localize test resources")
d227cc0b1ee1 ("selftests/tc-testing: update test definitions for local resources")
ac9b82930964 ("selftests/tc-testing: implement tdc parallel test run")
d3fc4eea9742 ("selftests/tc-testing: update tdc documentation")
1add90738cf5 ("net_sched: constify qdisc_priv()")
54ff8ad69c6e ("net_sched: sch_fq: struct sched_data reorg")
ee9af4e14d16 ("net_sched: sch_fq: change how @inactive is tracked")
076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
8f6c4ff9e052 ("net_sched: sch_fq: always garbage collect")
2ae45136a938 ("net_sched: sch_fq: remove q->ktime_cache")
5579ee462dfe ("net_sched: export pfifo_fast prio2band[]")
29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
49e7265fd098 ("net_sched: sch_fq: add TCA_FQ_WEIGHTS attribute")
0fef0907d6fa ("netem: Annotate struct disttable with __counted_by")
c4d49196ceec ("net: sched: cls_u32: Fix allocation size in u32_init()")
54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
39d08b91646d ("net, sched: Add tcf_set_drop_reason for {__,}tcf_classify")
f157b73d5114 ("selftests: tc-testing: add missing Kconfig options to 'config'")
35027c790970 ("selftests: tc-testing: move auxiliary scripts to a dedicated folder")
ee3d12285471 ("selftests: tc-testing: add test for 'rt' upgrade on hfsc")
06e4dd18f868 ("net_sched: sch_fq: fix off-by-one error in fq_dequeue()")
81a416985698 ("net_sched: sch_fq: fastpath needs to take care of sk->sk_pacing_status")
6d25d1dc76bf ("net: sched: sch_qfq: Use non-work-conserving warning handler")
70f06c115bcc ("sched: act_ct: switch to per-action label counting")
49b02a19c23a ("net: sched: Fill in MODULE_DESCRIPTION for act_gate")
a9c92771fa23 ("net: sched: Fill in missing MODULE_DESCRIPTION for classifiers")
f96118c5d86f ("net: sched: Fill in missing MODULE_DESCRIPTION for qdiscs")
40cb2fdfed34 ("net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config")
f1a3b283f852 ("net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP")
e316dd1cf135 ("net: don't dump stack on queue timeout")
9ffa01cab069 ("selftests: tc-testing: drop '-N' argument from nsPlugin")
fa63d353ddfb ("selftests: tc-testing: rework namespaces and devices setup")
bb9623c337f5 ("selftests: tc-testing: preload all modules in kselftests")
04fd47bf70f9 ("selftests: tc-testing: use parallel tdc in kselftests")
6b78debe1c07 ("net/sched: cls_u32: replace int refcounts with proper refcounts")
54293e4d6a62 ("selftests/tc-testing: add hashtable tests for u32")
025de7b6a6dd ("selftests: tc-testing: cap parallel tdc to 4 cores")
50a5988a7a54 ("selftests: tc-testing: move back to per test ns setup")
3d5026fc5adb ("selftests: tc-testing: use netns delete from pyroute2")
3f2d94a4ff48 ("selftests: tc-testing: leverage -all in suite ns teardown")
4b480cfb1066 ("selftests: tc-testing: timeout on unbounded loops")
4968afa0143d ("selftests: tc-testing: report number of workers in use")
a79d8ba734bd ("selftests: tc-testing: remove buildebpf plugin")
8059e68b9928 ("selftests: tc-testing: remove unnecessary time.sleep")
56e16bc69bb7 ("selftests: tc-testing: prefix iproute2 functions with "ipr2"")
501679f5d4a4 ("selftests: tc-testing: cleanup on Ctrl-C")
ed346fccfc40 ("selftests: tc-testing: remove unused import")
000db9e9ad42 ("net/sched: cbs: Use units.h instead of the copy of a definition")
f7580f00cc6e ("selftests: tc-testing: remove spurious nsPlugin usage")
74f7e7eeb1d2 ("selftests: tc-testing: remove spurious './' from Makefile")
7de8b2efafeb ("selftests: tc-testing: rename concurrency.json to flower.json")
0fbb5a54f941 ("selftests: tc-testing: remove filters/tests.json")
3872347e0a16 ("net/sched: act_api: use tcf_act_for_each_action")
a0e947c9ccff ("net/sched: act_api: avoid non-contiguous action array")
e09ac779f736 ("net/sched: act_api: stop loop over ops array on NULL in tcf_action_init")
f9bfc8eb1342 ("net/sched: act_api: use tcf_act_for_each_action in tcf_idr_insert_many")
c5e2a973448d ("rtnl: add helper to check if rtnl group has listeners")
8439109b76a3 ("rtnl: add helper to check if a notification is needed")
ddb6b284bdc3 ("rtnl: add helper to send if skb is not null")
c73724bfde09 ("net/sched: act_api: don't open code max()")
8d4390f51920 ("net/sched: act_api: conditional notification of events")
e522755520ef ("net/sched: cls_api: remove 'unicast' argument from delete notification")
93775590b1ee ("net/sched: cls_api: conditional notification of events")
4b55e86736d5 ("net/sched: act_api: rely on rcu in tcf_idr_check_alloc")
1dd7f18fc0ed ("net/sched: act_api: skip idr replace on bound actions")
fb2780721ca5 ("net: sched: Move drop_reason to struct tc_skb_cb")
b6a3c6066afc ("net: sched: Make tc-related drop reason more flexible for remaining qdiscs")
2f57dd94bdef ("packet: add a generic drop reason for receive")
4cf24dc89340 ("net: sched: Add initial TC error skb drop reasons")
913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
a7042cf8f231 ("net/sched: cls_api: Expose tc block to the datapath")
415e38bf1d8d ("net/sched: act_mirred: Add helper function tcf_mirred_replace_dev")
42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
8fcb0382af6f ("net: sched: em_text: fix possible memory leak in em_text_destroy()")
ba24ea129126 ("net/sched: Retire ipt action")
6d6d80e4f6bc ("net/sched: Remove CONFIG_NET_ACT_IPT from default configs")
41bc3e8fc1f7 ("net/sched: Remove uapi support for rsvp classifier")
82b2545ed9a4 ("net/sched: Remove uapi support for tcindex classifier")
fe3b739a5472 ("net/sched: Remove uapi support for dsmark qdisc")
26cc8714fc7f ("net/sched: Remove uapi support for ATM qdisc")
33241dca4862 ("net/sched: Remove uapi support for CBQ qdisc")
2ab1efad60ad ("net/sched: cls_api: complement tcf_tfilter_dump_policy")
c2a67de9bb54 ("net/sched: introduce ACT_P_BOUND return code")
530496985cea ("net/sched: sch_api: conditional netlink notifications")
94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
405cd9fc6f44 ("net/sched: simplify tc_action_load_ops parameters")
2ffca83aa39c ("net/sched: Remove ipt action tests")
e18405d0be80 ("net: sched: track device in tcf_block_get/put_ext() only for clsact binder types")
ea937f772083 ("net: netdevsim: don't try to destroy PHC on VFs")
93590849a05e ("selftests: forwarding: Fix layer 2 miss test flakiness")
aae09a6c7783 ("net/sched: act_mirred: Don't zero blockid when net device is being deleted")
a46c31bf2744 ("net: fill in MODULE_DESCRIPTION()s for net/sched")
86fe596b588f ("net: sched: Remove NET_ACT_IPT from Kconfig")
eb2c11b27c58 ("net: bql: fix building with BQL disabled")
51270d573a8d ("tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Scott Weaver <scweaver@redhat.com>
2024-05-30 09:32:38 -04:00
Antoine Tenart b0488605b0 net: remove default_device_exit()
JIRA: https://issues.redhat.com/browse/RHEL-29681
Upstream Status: linux.git

commit ee403248fa6db5ca23031fc51b06284d6855cd02
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 7 20:50:38 2022 -0800

    net: remove default_device_exit()

    For some reason default_device_ops kept two exit methods:

    1) default_device_exit() is called for each netns being dismantled in
    a cleanup_net() round. This acquires rtnl for each invocation.

    2) default_device_exit_batch() is called once with the list of all netns
    in the batch, allowing for a single rtnl invocation.

    Get rid of the .exit() method to handle the logic from
    default_device_exit_batch(), to decrease the number of rtnl acquisitions
    to one.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-05-28 15:24:05 +02:00
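Note on the batching above: a minimal sketch of why folding the per-netns step into the batch step reduces lock acquisitions. The lock is simulated with a counter, and all `demo_*` names are hypothetical; this is not the code in net/core/dev.c.

```c
#include <assert.h>
#include <stddef.h>

/* Counts simulated rtnl_lock() calls. */
static unsigned long demo_lock_acquisitions;

static void demo_rtnl_lock(void)   { demo_lock_acquisitions++; }
static void demo_rtnl_unlock(void) { }

/* Hypothetical per-netns state touched during cleanup. */
struct demo_netns {
    int cleaned;
};

/* Per-netns work; the caller must hold the simulated lock. */
static void demo_exit_one(struct demo_netns *ns)
{
    ns->cleaned = 1;
}

/* Old shape: one lock round trip per netns, then one more for the batch step. */
static void demo_cleanup_old(struct demo_netns *list, size_t n)
{
    size_t i;

    assert(list != NULL || n == 0);
    for (i = 0; i < n; i++) {
        demo_rtnl_lock();
        demo_exit_one(&list[i]);
        demo_rtnl_unlock();
    }
    demo_rtnl_lock();
    /* batch-only work would run here */
    demo_rtnl_unlock();
}

/* New shape: the batch step does the per-netns work under a single lock hold. */
static void demo_cleanup_batched(struct demo_netns *list, size_t n)
{
    size_t i;

    assert(list != NULL || n == 0);
    demo_rtnl_lock();
    for (i = 0; i < n; i++)
        demo_exit_one(&list[i]);
    /* batch-only work would run here */
    demo_rtnl_unlock();
}
```

For a batch of n namespaces the old shape takes the lock n + 1 times and the batched shape takes it once, which matters because rtnl is a global, heavily contended mutex.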
Davide Caratti c3c09e38bc net/sched: Fix mirred deadlock on device recursion
JIRA: https://issues.redhat.com/browse/RHEL-35058
CVE: CVE-2024-27010
Upstream Status: net.git commit 0f022d32c3eca477fbf79a205243a6123ed0fe11

commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Apr 15 18:07:28 2024 -0300

    net/sched: Fix mirred deadlock on device recursion

    When the mirred action is used on a classful egress qdisc and a packet is
    mirrored or redirected to self we hit a qdisc lock deadlock.
    See trace below.

    [..... other info removed for brevity....]
    [   82.890906]
    [   82.890906] ============================================
    [   82.890906] WARNING: possible recursive locking detected
    [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
    [   82.890906] --------------------------------------------
    [   82.890906] ping/418 is trying to acquire lock:
    [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
    __dev_queue_xmit+0x1778/0x3550
    [   82.890906]
    [   82.890906] but task is already holding lock:
    [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
    __dev_queue_xmit+0x1778/0x3550
    [   82.890906]
    [   82.890906] other info that might help us debug this:
    [   82.890906]  Possible unsafe locking scenario:
    [   82.890906]
    [   82.890906]        CPU0
    [   82.890906]        ----
    [   82.890906]   lock(&sch->q.lock);
    [   82.890906]   lock(&sch->q.lock);
    [   82.890906]
    [   82.890906]  *** DEADLOCK ***
    [   82.890906]
    [..... other info removed for brevity....]

    Example setup (eth0->eth0) to recreate
    tc qdisc add dev eth0 root handle 1: htb default 30
    tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
         action mirred egress redirect dev eth0

    Another example (eth0->eth1->eth0) to recreate
    tc qdisc add dev eth0 root handle 1: htb default 30
    tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
         action mirred egress redirect dev eth1

    tc qdisc add dev eth1 root handle 1: htb default 30
    tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
         action mirred egress redirect dev eth0

    We fix this by adding an owner field (CPU id) to struct Qdisc set after
    root qdisc is entered. When the softirq enters it a second time, if the
    qdisc owner is the same CPU, the packet is dropped to break the loop.

    Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
    Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
    Fixes: 3bcb846ca4 ("net: get rid of spin_trylock() in net_tx_action()")
    Fixes: e578d9c025 ("net: sched: use counter to break reclassify loops")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Victor Nogueira <victor@mojatatu.com>
    Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
    Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-05-21 11:32:04 +02:00
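Note on the owner field above: a minimal single-threaded sketch of the loop-breaking check. The redirect is simulated by a direct recursive call, the qdisc lock is omitted, and `struct demo_qdisc` and `demo_xmit()` are hypothetical names, not the kernel's `struct Qdisc` or `__dev_queue_xmit()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical root qdisc with the owner field described in the commit. */
struct demo_qdisc {
    int owner;                      /* CPU id currently inside, or -1 */
    struct demo_qdisc *redirect_to; /* simulated mirred redirect target, or NULL */
    unsigned sent;                  /* packets that left through this qdisc */
    unsigned dropped;               /* packets dropped to break a loop */
};

/*
 * Transmit one packet through q on the given CPU.
 * Returns 0 when the packet is sent and -1 when it is dropped because the
 * same CPU is already inside q, which in the real code would mean trying to
 * take a lock this CPU already holds.
 */
static int demo_xmit(struct demo_qdisc *q, int cpu)
{
    int ret;

    assert(q != NULL);
    if (q->owner == cpu) {
        q->dropped++;
        return -1;
    }

    q->owner = cpu; /* the real code sets this after taking the qdisc lock */
    if (q->redirect_to) {
        ret = demo_xmit(q->redirect_to, cpu);
    } else {
        q->sent++;
        ret = 0;
    }
    q->owner = -1;  /* cleared before the lock would be released */
    return ret;
}
```

Pointing `redirect_to` back at the same qdisc models the eth0->eth0 setup, and two qdiscs pointing at each other model eth0->eth1->eth0. In both cases the second entry on the same CPU is dropped instead of deadlocking.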
Petr Oros 149ecfe407 dpll: move all dpll<>netdev helpers to dpll code
JIRA: https://issues.redhat.com/browse/RHEL-32098

Conflicts:
- drivers/net/ethernet/mellanox/mlx5/core/dpll.c: chunk omitted due
  to missing 496fd0a26bbf73 ("mlx5: Implement SyncE support using DPLL
  infrastructure")

Upstream commit(s):
commit 289e922582af5b4721ba02e86bde4d9ba918158a
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Mar 4 17:35:32 2024 -0800

    dpll: move all dpll<>netdev helpers to dpll code

    Older versions of GCC really want to know the full definition
    of the type involved in rcu_assign_pointer().

    struct dpll_pin is defined in a local header, net/core can't
    reach it. Move all the netdev <> dpll code into dpll, where
    the type is known. Otherwise we'd need multiple function calls
    to jump between the compilation units.

    This is the same problem the commit under fixes was trying to address,
    but with rcu_assign_pointer() not rcu_dereference().

    Some of the exports are not needed, networking core can't
    be a module, we only need exports for the helpers used by
    drivers.

    Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
    Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 20:47:06 +02:00
Petr Oros 23888a39e4 dpll: rely on rcu for netdev_dpll_pin()
JIRA: https://issues.redhat.com/browse/RHEL-32098

Upstream commit(s):
commit 0d60d8df6f493bb46bf5db40d39dd60a1bafdd4e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 23 12:32:08 2024 +0000

    dpll: rely on rcu for netdev_dpll_pin()

    This fixes a possible UAF in if_nlmsg_size(),
    which can run without RTNL.

    Add rcu protection to "struct dpll_pin"

    Move netdev_dpll_pin() from netdevice.h to dpll.h to
    decrease name pollution.

    Note: It looks possible to stop acquiring RTNL in
    netdev_dpll_pin_assign() later in net-next.

    v2: do not force rcu_read_lock() in rtnl_dpll_pin_size() (Jiri Pirko)

    Fixes: 5f1842692880 ("netdev: expose DPLL pin handle for netdevice")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
    Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240223123208.3543319-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 20:47:06 +02:00
Ivan Vecera b128d4b635 net: partial revert of the "Make timestamping selectable: series
JIRA: https://issues.redhat.com/browse/RHEL-36217

Conflicts:
- hunk for lan966x removed as it does not exist in RHEL
- context conflict caused by presence of RH_KABI macros

commit 289354f21b2c3fac93e956efd45f256a88a4d997
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Sat Nov 18 18:38:05 2023 -0800

    net: partial revert of the "Make timestamping selectable: series

    Revert following commits:

    commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
    commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
    commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
    commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
    commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
    commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
    commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
    commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
    commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
    commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")

    They need more time for reviews.

    Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-16 18:34:23 +02:00
Ivan Vecera 5a1bdf9297 net: Change the API of PHY default timestamp to MAC
JIRA: https://issues.redhat.com/browse/RHEL-36217

Conflicts:
- context conflict caused by presence of RH_KABI macros

commit 0f7f463d4821a4f52fa5c0a961389e651d50c384
Author: Kory Maincent <kory.maincent@bootlin.com>
Date:   Tue Nov 14 12:28:41 2023 +0100

    net: Change the API of PHY default timestamp to MAC

    Change the API to select MAC time stamping by default instead of PHY
    time stamping. The PHY is closer to the wire, so in theory it adds less
    delay than MAC timestamping, but the reality is different. Due to a lower
    time stamping clock frequency, latency on the MDIO bus, and the lack of PHC
    hardware synchronization between different PHYs, PHY PTP is often less
    precise than the MAC. The exception is PHYs designed specifically for PTP,
    but these devices are not widespread. To avoid breaking compatibility, I
    introduce a default_timestamp flag in phy_device that the PHY driver sets
    to indicate that the old API behavior is in use.

    The phy_set_timestamp function is called at each call of phy_attach_direct.
    In case of MAC driver using phylink this function is called when the
    interface is turned up. Then if the interface goes down and up again the
    last choice of timestamp will be overwritten by the default choice.
    A solution could be to cache the timestamp status but it can bring other
    issues. In case of SFP, if we change the module, it doesn't make sense to
    blindly re-set the timestamp back to PHY, if the new module has a PHY with
    mediocre timestamping capabilities.
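
The selection rule and the link-flap behavior described above can be sketched as a small userspace model. This is not the kernel implementation: every `model_*` name, the `has_tsinfo` field, and the final "fall back to the PHY when the MAC cannot timestamp" branch are assumptions made for illustration. Only the `default_timestamp` flag comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel's phy_device or net_device. */
enum model_ts_layer { MODEL_TS_NONE, MODEL_TS_MAC, MODEL_TS_PHY };

struct model_phy {
    bool has_tsinfo;         /* assumed: the PHY can timestamp at all */
    bool default_timestamp;  /* driver opts in to the old PHY-first behavior */
};

struct model_netdev {
    enum model_ts_layer layer;  /* currently selected timestamping layer */
};

/* Pick the default layer: MAC first, unless the PHY driver asked otherwise. */
static enum model_ts_layer model_default_layer(const struct model_phy *phy,
                                               bool mac_can_timestamp)
{
    if (phy != NULL && phy->has_tsinfo && phy->default_timestamp)
        return MODEL_TS_PHY;
    if (mac_can_timestamp)
        return MODEL_TS_MAC;
    if (phy != NULL && phy->has_tsinfo)
        return MODEL_TS_PHY;  /* assumption: PHY is better than nothing */
    return MODEL_TS_NONE;
}

/* Every attach re-applies the default, discarding any earlier user choice. */
static void model_attach(struct model_netdev *dev, const struct model_phy *phy,
                         bool mac_can_timestamp)
{
    dev->layer = model_default_layer(phy, mac_can_timestamp);
}
```

In this model, a user who switches the layer to the PHY and then takes the interface down and up again ends up back on the MAC, which is the overwrite the commit message describes.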

    Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-16 18:34:23 +02:00
Ivan Vecera b12b3268cd net: create device lookup API with reference tracking
JIRA: https://issues.redhat.com/browse/RHEL-36217

commit 70f7457ad6d655e65f1b93cbba2a519e4b11c946
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Jun 12 14:49:43 2023 -0700

    net: create device lookup API with reference tracking

    New users of dev_get_by_index() and dev_get_by_name() keep
    getting added and it would be nice to steer them towards
    the APIs with reference tracking.

    Add variants of those calls which allocate the reference
    tracker and use them in a couple of places.
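
The difference between a plain lookup and a lookup with reference tracking can be illustrated with a minimal userspace model. All types and function names below are hypothetical stand-ins and are not the kernel API; the sketch only shows the idea that the tracked variant fills in a caller-owned tracker which must be presented again on release.

```c
#include <stddef.h>

/* Userspace model only: these names are hypothetical, not kernel API. */
struct model_tracker {
    const char *owner;  /* who holds the reference; NULL when unused */
};

struct model_dev {
    int ifindex;
    int refcnt;   /* total references held */
    int tracked;  /* references that have a tracker attached */
};

static struct model_dev model_devs[] = {
    { .ifindex = 1 },
    { .ifindex = 2 },
};

/* Plain lookup: bumps the refcount, but nothing records who took it. */
static struct model_dev *model_get_by_index(int ifindex)
{
    for (size_t i = 0; i < sizeof(model_devs) / sizeof(model_devs[0]); i++) {
        if (model_devs[i].ifindex == ifindex) {
            model_devs[i].refcnt++;
            return &model_devs[i];
        }
    }
    return NULL;
}

/* Tracked variant: same lookup, plus a tracker filled in for the caller. */
static struct model_dev *model_get_by_index_tracked(int ifindex,
                                                    struct model_tracker *tracker,
                                                    const char *owner)
{
    struct model_dev *dev = model_get_by_index(ifindex);

    if (dev != NULL) {
        tracker->owner = owner;
        dev->tracked++;
    }
    return dev;
}

/* Release must present the same tracker that the lookup filled in. */
static int model_put_tracked(struct model_dev *dev, struct model_tracker *tracker)
{
    if (dev == NULL || tracker->owner == NULL)
        return -1;  /* unbalanced or untracked put */
    tracker->owner = NULL;
    dev->tracked--;
    dev->refcnt--;
    return 0;
}
```

Because the tracker records its owner, an unbalanced release is detectable in this model, which is the kind of leak debugging that steering callers toward tracked lookups is meant to enable.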

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-16 18:34:21 +02:00
Lucas Zampieri 30b4d81286 Merge: CNB95: netlink/devlink: update devlink & netlink to the v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4000

JIRA: https://issues.redhat.com/browse/RHEL-30145  
Depends: !3939  

The series updates netlink and devlink core to upstream version v6.8.  
Both have to be updated at once due to circular dependencies.


Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-05-16 13:34:03 +00:00
Ivan Vecera 1602ceb6b7 net: sched: Make tc-related drop reason more flexible for remaining qdiscs
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit b6a3c6066afc2cb7b92f45c67ab0b12ded81cb11
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Sat Dec 16 17:44:35 2023 -0300

    net: sched: Make tc-related drop reason more flexible for remaining qdiscs

    Incrementing on Daniel's patch[1], make tc-related drop reason more
    flexible for remaining qdiscs - that is, all qdiscs aside from clsact.
    In essence, the drop reason will be set by cls_api and act_api in case
    any error occurred in the data path. With that, we can give the user more
    detailed information so that they can distinguish between a policy drop
    or an error drop.

    [1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net

    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 0c3f908699 net: sched: Move drop_reason to struct tc_skb_cb
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit fb2780721ca5e9f78bbe4544b819b929a982df9c
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Sat Dec 16 17:44:34 2023 -0300

    net: sched: Move drop_reason to struct tc_skb_cb

    Move drop_reason from struct tcf_result to skb cb - more specifically to
    struct tc_skb_cb. With that, we'll be able to also set the drop reason for
    the remaining qdiscs (aside from clsact) that do not have access to
    tcf_result when time comes to set the skb drop reason.
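
As a rough illustration of why keeping the reason in the packet's control block helps, the following userspace sketch stores a drop reason in a per-packet scratch area so that any code holding the packet can read it back. The `model_*` names and the 48-byte scratch size are assumptions for this sketch; they are not the kernel's struct sk_buff or struct tc_skb_cb definitions.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_CB_SIZE 48  /* assumed size of the per-packet scratch area */

/* Hypothetical stand-ins; not the kernel's sk_buff or tc_skb_cb. */
struct model_skb {
    unsigned char cb[MODEL_CB_SIZE];
};

struct model_tc_cb {
    uint32_t drop_reason;
};

_Static_assert(sizeof(struct model_tc_cb) <= MODEL_CB_SIZE,
               "control block layout must fit in the scratch area");

/* Store the reason in the packet itself; no classifier result object needed. */
static void model_set_drop_reason(struct model_skb *skb, uint32_t reason)
{
    struct model_tc_cb cb;

    memcpy(&cb, skb->cb, sizeof(cb));
    cb.drop_reason = reason;
    memcpy(skb->cb, &cb, sizeof(cb));
}

/* Any code path that holds the packet can read the reason back. */
static uint32_t model_get_drop_reason(const struct model_skb *skb)
{
    struct model_tc_cb cb;

    memcpy(&cb, skb->cb, sizeof(cb));
    return cb.drop_reason;
}
```

The model copies through `memcpy` rather than casting the byte array, which keeps the sketch valid under strict aliasing rules in portable C.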

    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 779204ab78 net, sched: Make tc-related drop reason more flexible
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 54a59aed395ce0f4177b5212e5746a6462de3ad9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Mon Oct 9 11:26:54 2023 +0200

    net, sched: Make tc-related drop reason more flexible

    Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
    express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.

    Victor kicked-off an initial proposal to make this more flexible by disambiguating
    verdict from return code by moving the verdict into struct tcf_result and
    letting tcf_classify() return a negative error. If hit, then two new drop
    reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
    as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
    error codes would have required to attach to tcf_classify via kprobe/kretprobe
    to more deeply debug skb and the returned error.

    In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
    extensible, it can be addressed in a more straightforward way, that is: Instead
    of placing the verdict into struct tcf_result, we can just put the drop reason
    in there, which does not require changes throughout various classful schedulers
    given the existing verdict logic can stay as is.

    Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
    to disambiguate between an error or an intentional drop. New drop reason error
    codes can be added successively to the tc code base.

    For internal error locations which have not yet been annotated with a
    SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
    SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
    SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
    if it is found that they would be useful for troubleshooting.
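
The fallback rule in the paragraph above can be sketched in a few lines of standalone C. The enum and function here are hypothetical models, including the explicit "unset" marker; the real kernel enum values and the way the default is applied may differ.

```c
/* Hypothetical model of the reasons named in the text; not the kernel enum. */
enum model_drop_reason {
    MODEL_REASON_UNSET = 0,  /* assumption: marks "no annotation yet" */
    MODEL_TC_INGRESS,
    MODEL_TC_EGRESS,
    MODEL_TC_ERROR,
};

/*
 * Use the annotated reason when an error location set one; otherwise fall
 * back to the generic per-direction reason.
 */
static enum model_drop_reason model_pick_drop_reason(enum model_drop_reason annotated,
                                                     int is_ingress)
{
    if (annotated != MODEL_REASON_UNSET)
        return annotated;
    return is_ingress ? MODEL_TC_INGRESS : MODEL_TC_EGRESS;
}
```

With this shape, existing error paths keep reporting the generic ingress or egress reason until someone annotates them, so more specific codes can be introduced one location at a time.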

    While drop reasons have infrastructure for subsystem specific error codes which
    are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
    for tc to use the enum skb_drop_reason core codes given it is a better fit and
    currently the tooling support is better, too.

    With regards to the latter:

      [...] I think Alastair (bpftrace) is working on auto-prettifying enums when
      bpftrace outputs maps. So we can do something like:

      $ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
      Attaching 1 probe...
      ^C

      @[SKB_DROP_REASON_TC_INGRESS]: 2
      @[SKB_CONSUMED]: 34

      ^^^^^^^^^^^^ names!!

      Auto-magically. [...]

    Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
    into the tcf_result.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Victor Nogueira <victor@mojatatu.com>
    Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:19 +02:00
Patrick Talbert e6aa44fcbe Merge: rhel 9.5 drm dependencies
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3966

# Merge Request Required Information

This is the first pass at drm dependencies for backporting 6.8 or 6.9 into RHEL 9.5

Marked as draft as I think there will be a few more patches needed, and maybe some other teams might be in the same area (e.g. kunit).

JIRA: https://issues.redhat.com/browse/RHEL-24101

Signed-off-by: Dave Airlie <airlied@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-05-03 12:43:29 +02:00
Petr Oros 866233764b net: add rcu safety to rtnl_prop_list_size()
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 9f30831390ede02d9fcd54fd9ea5a585ab649f4a
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 9 18:12:48 2024 +0000

    net: add rcu safety to rtnl_prop_list_size()

    rtnl_prop_list_size() can be called while alternative names
    are added or removed concurrently.

    if_nlmsg_size() / rtnl_calcit() can indeed be called
    without RTNL held.

    Use explicit RCU protection to avoid UAF.

    Fixes: 88f4fb0c74 ("net: rtnetlink: put alternative names to getlink message")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240209181248.96637-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:11 +02:00
Lucas Zampieri f112e4de2c Merge: CNB95: netlink/devlink: update devlink & netlink to the v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3939

JIRA: https://issues.redhat.com/browse/RHEL-30656  
Tested: LNST  
Depends: !3918 

The series updates netlink and devlink core to upstream version v6.6. Both have to be updated at once due to circular dependencies.

Omitted-fix: 83f2df9d66bc
            The fix needs additional devlink dependencies and will be applied in the next rebase series covered by RHEL-30145

Commits:
```
6978052448f9 ("netlink: remove unused 'compare' function")
74bf6477c18b ("netlink-specs: add partial specification for devlink")
82b3297009b6 ("netlink: specs: allow uapi-header in genetlink")
56c874f7dbca ("tools: ynl: skip the explicit op array size when not needed")
8da3a5598f75 ("ynl: allow to encode u8 attr")
bc77f7318da8 ("tools: ynl: add the Python requirements.txt file")
dd3a7d58dcc2 ("tools: ynl: Add missing types to encode/decode")
4c6170d1ae2c ("tools: ynl: default to treating enums as flags for mask generation")
bec0b7a2db35 ("tools: ynl: Add struct parsing to nlspec")
b423c3c86325 ("tools: ynl: Add C array attribute decoding to ynl")
2607191395bd ("tools: ynl: Add struct attr decoding to ynl")
f036d936ca57 ("tools: ynl: Add fixed-header support to ynl")
643ef4a676e3 ("netlink: specs: add partial specification for openvswitch")
88e288968412 ("docs: netlink: document struct support for genetlink-legacy")
04eac39361d3 ("docs: netlink: document the sub-type attribute property")
9f7cc57fe550 ("tools: ynl: support byte-order in cli")
a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
48993e22d23a ("tools: ynl: replace print with NlError")
f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
ebe3bdc4359e ("tools: ynl: throw a more meaningful exception if family not supported")
3ea31e66644b ("tools: ynl: Remove absolute paths to yaml files from ethtool testing tool")
85a4abed1554 ("tools: ynl: Rename ethtool to ethtool.py")
d913d32cc270 ("netlink: Use copy_to_user() for optval in netlink_getsockopt().")
a939d14919b7 ("netlink: annotate accesses to nlk->cb_running")
7c2435ef76e5 ("tools: ynl: Use dict of predefined Structs to decode scalar types")
bddd2e561b0a ("tools: ynl: Handle byte-order in struct members")
081e8df68199 ("tools: ynl: avoid dict errors on older Python versions")
9b66ee06e5ca ("net: ynl: prefix uAPI header include with uapi/")
0684f29a89e5 ("netlink: specs: correct types of legacy arrays")
6d6bae63053d ("doc: ynl: Add doc attr to struct members in genetlink-legacy spec")
5ac18889bde0 ("tools: ynl: Initialise fixed headers to 0 in genetlink-legacy")
313a7a808ca8 ("tools: ynl: Support enums in struct members in genetlink-legacy")
93b230b549bc ("netlink: specs: add ynl spec for ovs_flow")
f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
91dfaef243cd ("tools: ynl-gen: add extra headers for user space")
6ad49839ba9b ("tools: ynl-gen: fix unused / pad attribute handling")
67c65ce762ad ("tools: ynl-gen: don't override pure nested struct")
5605f102378f ("tools: ynl-gen: loosen type consistency check for events")
eef9b794eac8 ("tools: ynl-gen: add error checking for nested structs")
21b6e302789c ("tools: ynl-gen: generate enum-to-string helpers")
dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
5d58f911c755 ("tools: ynl-gen: generate alloc and free helpers for req")
8cb6afb33541 ("tools: ynl-gen: switch to family struct")
59d814f0f285 ("tools: ynl-gen: generate static descriptions of notifications")
a99bfdf64795 ("tools: ynl-gen: clean up stray new lines at the end of reply-less requests")
86878f14d71a ("tools: ynl: user space helpers")
d75fdfbc6f26 ("tools: ynl: support fou and netdev in C")
ee0202e2e731 ("tools: ynl: add sample for netdev")
f6ca5baf2a86 ("netlink: specs: ethtool: fix random typos")
2cc9671a82e3 ("tools: ynl-gen: fill in support for MultiAttr scalars")
58da455b31ba ("tools: ynl-gen: improve unwind on parsing errors")
7a11f70ce882 ("tools: ynl: generate code for the handshake family")
8947e5037371 ("netlink: specs: devlink: fill in some details important for C")
9858bfc271de ("tools: ynl-gen: use enum names in op strmap more carefully")
6f115d4575ab ("tools: ynl-gen: refactor strmap helper generation")
ff6db4b58c93 ("tools: ynl-gen: enable code gen for directional specs")
6afaa0ef9b0e ("tools: ynl-gen: try to sort the types more intelligently")
37487f93b125 ("tools: ynl-gen: inherit struct use info")
eae7af21bdb9 ("tools: ynl-gen: walk nested types in depth")
168dea20ecef ("tools: ynl-gen: don't generate forward declarations for policies")
0a9471219672 ("tools: ynl-gen: don't generate forward declarations for policies - regen")
5d1a30eb989a ("tools: ynl: generate code for the devlink family")
fff8660b5425 ("tools: ynl: add sample for devlink")
30b5c720e1a9 ("tools: ynl-gen: cleanup user space header includes")
9b52fd4b6305 ("tools: ynl: regen: cleanup user space header includes")
820343ccbb2e ("tools: ynl-gen: complete the C keyword list")
2c0f1466867c ("tools: ynl-gen: combine else with closing bracket")
e4ea3cc68472 ("tools: ynl-gen: get attr type outside of if()")
7234415b8f86 ("tools: ynl: regen: regenerate the if ladders")
f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
d0915d64c3a6 ("tools: ynl: regen: stop generating common notification handlers")
ced1568862bd ("tools: ynl-gen: sanitize notification tracking")
6da3424fd629 ("tools: ynl-gen: support code gen for events")
6f96ec73cb5a ("tools: ynl-gen: don't pass op_name to RenderInfo")
76abff37f0d7 ("tools: ynl-gen: support / skip pads on the way to kernel")
008bcd6835a2 ("tools: ynl-gen: support excluding tricky ops")
33eedb0071c8 ("tools: ynl-gen: record extra args for regen")
ed2042cc77f1 ("netlink: specs: support setting prefix-name per attribute")
d4813b11d679 ("netlink: specs: ethtool: add C render hints")
dddc9f53da3e ("tools: ynl-gen: don't generate enum types if unnamed")
2c9d47a095f7 ("tools: ynl-gen: resolve enum vs struct name conflicts")
180ad455273a ("netlink: specs: ethtool: add empty enum stringset")
37c852222712 ("netlink: specs: ethtool: untangle UDP tunnels and cable test a bit")
709d0c3b3d4c ("netlink: specs: ethtool: untangle stats-get")
68335713d2ea ("netlink: specs: ethtool: mark pads as pads")
2d7be507d65e ("tools: ynl: generate code for the ethtool family")
f561ff232a6b ("tools: ynl: add sample for ethtool")
10c4d2a7b88d ("tools: ynl-gen: correct enum policies")
be093a80dff0 ("tools: ynl-gen: inherit policy in multi-attr")
fa0e21fa4443 ("rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFO")
89da780aa4c7 ("rtnetlink: move validate_linkmsg out of do_setlink")
f0ec58d557d6 ("tools: ynl: work around stale system headers")
6907217a8054 ("netlink: specs: fixup openvswitch specs for code generation")
8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
0c3d6fd4b89c ("tools: ynl: improve the direct-include header guard logic")
737eab775d36 ("netlink: specs: add display-hint to schema definitions")
d8eea68d913c ("tools: ynl: add display-hint support to ynl")
334f39ce17ef ("netlink: specs: add display hints to ovs_flow")
25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
b8e39b38487e ("netlink: Make use of __assign_bit() API")
633d76ad01ad ("devlink: remove reload failed checks in params get/set callbacks")
4a59cdfd6699 ("rtnetlink: Move nesting cancellation rollback to proper function")
5766946ea511 ("genetlink: add explicit ordering break check for split ops")
a3377386b564 ("netlink: Reverse the patch which removed filtering")
a4c9a56e6a2c ("netlink: Add new netlink_release function")
d7ddf5f4269f ("tools: ynl-gen: fix enum index in _decode_enum(..)")
df15c15e6c98 ("tools: ynl-gen: fix parse multi-attr enum attribute")
5fac9b7c16c5 ("netlink: allow be16 and be32 types in all uint policy checks")
e5c157f081ab ("ynl: expose xdp-zc-max-segs")
37844828d290 ("ynl: mark max/mask as private for kdoc")
25b5a2a1905f ("ynl: regenerate all headers")
26fdb67e8b4a ("ynl: print xdp-zc-max-segs in the sample")
759ab1edb56c ("net: store netdevs in an xarray")
84e00d9bd4e4 ("net: convert some netlink netdev iterators to depend on the xarray")
2628d40899d1 ("devlink: Remove unused extern declaration devlink_port_region_destroy()")
78c96d7b7c9a ("netlink: specs: add dump-strict flag for dont-validate property")
dc7b81a828db ("ynl-gen-c.py: filter rendering of validate field values for split ops")
eab7be688b44 ("ynl-gen-c.py: allow directional model for kernel mode")
fa8ba3502ade ("ynl-gen-c.py: render netlink policies static for split ops")
ba0f66c95fa6 ("devlink: rename devlink_nl_ops to devlink_nl_small_ops")
d61aedcf628e ("devlink: rename couple of doit netlink callbacks to match generated names")
491a24872a64 ("devlink: introduce couple of dumpit callbacks for split ops")
8300dce542e4 ("devlink: un-static devlink_nl_pre/post_doit()")
759f661012d1 ("netlink: specs: devlink: add info-get dump op")
6b7c486cae81 ("devlink: add split ops generated according to spec")
b2551b1517d8 ("devlink: include the generated netlink header")
6e067d0cab68 ("devlink: use generated split ops and remove duplicated commands from small ops")
b876b71a6ac2 ("devlink: Remove unused devlink_dpipe_table_resource_set() declaration")
2c0e9f3806c4 ("tools: ynl-gen: avoid rendering empty validate field")
832140804e3b ("devlink: clear flag on port register error path")
cd3112ebbaf4 ("tools: ynl-gen: add missing empty line between policies")
8fe08d70a2b6 ("netlink: convert nlk->flags to atomic flags")
63618463cb94 ("devlink: parse linecard attr in doit() callbacks")
41a1d4d1399a ("devlink: parse rate attrs in doit() callbacks")
ee6d78ac28c7 ("devlink: introduce devlink_nl_pre_doit_port*() helper functions")
8fa995ad1f7f ("devlink: rename doit callbacks for per-instance dump commands")
24c8e56d4f98 ("devlink: introduce dumpit callbacks for split ops")
7d3c6fec6135 ("devlink: pass flags as an arg of dump_one() callback")
7199c86247e9 ("netlink: specs: devlink: add commands that do per-instance dump")
ddff283280ba ("devlink: remove duplicate temporary netlink callback prototypes")
833e479d330c ("devlink: remove converted commands from small ops")
4a1b5aa8b5c7 ("devlink: allow user to narrow per-instance dumps by passing handle attrs")
34493336e7d3 ("netlink: specs: devlink: extend per-instance dump commands to accept instance attributes")
b03f13cb67a5 ("devlink: extend health reporter dump selector by port index")
0149bca17262 ("netlink: specs: devlink: extend health reporter dump attributes by port index")
84817d8c6042 ("genetlink: push conditional locking into dumpit/done")
fde9bd4a4d41 ("genetlink: make genl_info->nlhdr const")
bffcc6882a1b ("genetlink: remove userhdr from struct genl_info")
9272af109fe6 ("genetlink: add struct genl_info to struct genl_dumpit_info")
7288dd2fd488 ("genetlink: use attrs from struct genl_info")
5c670a010de4 ("genetlink: add a family pointer to struct genl_info")
5aa51d9f889c ("genetlink: add genlmsg_iput() API")
0e19d3108aea ("netdev-genl: use struct genl_info for reply construction")
ec0e5b09b834 ("ethtool: netlink: simplify arguments to ethnl_default_parse()")
f946270d05c2 ("ethtool: netlink: always pass genl_info to .prepare_data")
956db0a13b47 ("net: warn about attempts to register negative ifindex")
ded67d90815a ("netlink: specs: add ovs_vport new command")
7582113c6917 ("tools: ynl: add more info to KeyErrors on missing attrs")
d56b699d76d1 ("Documentation: Fix typos")
f65f305ae008 ("tools: ynl-gen: use temporary file for rendering")
f534f6581ec0 ("net: validate veth and vxcan peer ifindexes")
649bde9004ac ("tools: ynl: allow passing binary data")
a149a3a13bbc ("tools: ynl-gen: set length of binary fields")
dc2ef94d8926 ("tools: ynl-gen: fix collecting global policy attrs")
4c8c24e801e6 ("tools: ynl-gen: support empty attribute lists")
e83d4e9b2d0f ("netlink: specs: fix indent in fou")
a02430c06f56 ("tools: ynl-gen: fix uAPI generation after tempfile changes")
52d08fda3516 ("doc/netlink: Add delete operation to ovs_vport spec")
ed68c58c0eb4 ("doc/netlink: Add a schema for netlink-raw families")
294f37fc8772 ("doc/netlink: Update genetlink-legacy documentation")
2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions")
88901b967958 ("tools/ynl: Add mcast-group schema parsing to ynl")
fb0a06d455d6 ("tools/net/ynl: Fix extack parsing with fixed header genlmsg")
e46dd903efe3 ("tools/net/ynl: Add support for netlink-raw families")
0493e56d021d ("tools/net/ynl: Implement nlattr array-nest decoding in ynl")
1768d8a767f8 ("tools/net/ynl: Add support for create flags")
dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
b2f63d904e72 ("doc/netlink: Add spec for rt link messages")
023289b4f582 ("doc/netlink: Add spec for rt route messages")
56e65312830e ("devlink: push object register/unregister notifications into separate helpers")
eec1e5ea1d71 ("devlink: push port related code into separate file")
2b4d8bb08889 ("devlink: push shared buffer related code into separate file")
2475ed158c47 ("devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper")
a9fd44b15fc5 ("devlink: push dpipe related code into separate file")
a9f960074ecd ("devlink: push resource related code into separate file")
830c41e1e987 ("devlink: push param related code into separate file")
1aa47ca1f52e ("devlink: push region related code into separate file")
85facf94fd80 ("devlink: use tracepoint_enabled() helper")
4bbdec80ff27 ("devlink: push trap related code into separate file")
7cc7194e85ca ("devlink: push rate related code into separate file")
9edbe6f36c5f ("devlink: push linecard related code into separate file")
890c55667437 ("devlink: move tracepoint definitions into core.c")
29a390d17748 ("devlink: move small_ops definition into netlink.c")
71179ac5c211 ("devlink: move devlink_notify_register/unregister() to dev.c")
ee940b57a929 ("doc/netlink: Fix missing classic_netlink doc reference")
d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
0f4d44f6ee04 ("netlink: specs: devlink: fix reply command values")
69844e335d8c ("selftests/bpf: Fix sockopt_sk selftest")
e4fe082c38cd ("tools: ynl: make sure we always pass yarg to mnl_cb_run")
5d78b73e8514 ("tools: ynl: don't leak mcast_groups on init error")
b6c65eb20ffa ("tools: ynl: fix handling of multiple mcast groups")
ceaac91dcd06 ("net: make sure we never create ifindex = 0")
0e0939c0adf9 ("net-procfs: use xarray iterator to implement /proc/net/dev")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-26 12:33:53 +00:00
Lucas Zampieri 15f4a7a740 Merge: CNB95: netlink: update netlink core to the upstream v6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3918

JIRA: https://issues.redhat.com/browse/RHEL-30344  
Tested: LNST  

Commits:
```
cfdf0d9ae75b ("rtnetlink: use nlmsg_notify() in rtnetlink_send()")
fef773fc8110 ("netlink: Deal with ESRCH error in nlmsg_notify()")
f9b282b36dfa ("net: netlink: add the case when nlh is NULL")
bc830525615d ("net: netlink: Remove unused function")
d3432bf10f17 ("net: Support filtering interfaces on no master")
4fc29989835a ("net: rtnetlink: convert rcu_assign_pointer to RCU_INIT_POINTER")
7707a4d01a64 ("netlink: annotate data races around nlk->bound")
549017aa1bb7 ("netlink: remove netlink_broadcast_filtered")
50af5969bb22 ("net/core: Remove unused assignment operations and variable")
efd38f75bb04 ("net: rtnetlink: use __dev_addr_set()")
f123cffdd8fe ("net: netlink: af_netlink: Prevent empty skb by adding a check on len.")
d59a67f2f3f3 ("netlink: remove nl_set_extack_cookie_u32()")
ede6c39c4f90 ("net: make net->dev_unreg_count atomic")
7b8135f4df98 ("rtnetlink: add new rtm tunnel api for tunnel id filtering")
5d26cff5bdbe ("net: account alternate interface name memory")
155fb43b70b5 ("net: limit altnames to 64k total")
0caf6d992219 ("af_netlink: Fix shift out of bounds in group mask calculation")
0b5c21bbc01e ("net: ensure net_todo_list is processed quickly")
ef2a7c9065ce ("rtnetlink: return ENODEV when ifname does not exist and group is given")
5ea08b5286f6 ("rtnetlink: enable alt_ifname for setlink/newlink")
dee04163e9f2 ("rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink")
b6177d3240a4 ("rtnetlink: return EINVAL when request cannot succeed")
99c07327ae11 ("netlink: reset network and mac headers in netlink_dump()")
6f37c9f9dfbf ("Revert "rtnetlink: return EINVAL when request cannot succeed"")
c92bf26ccebc ("rtnl: allocate more attr tables on the heap")
63105e83987a ("rtnl: split __rtnl_newlink() into two functions")
02839cc8d72b ("rtnl: move rtnl_newlink_create()")
d5076fe4049c ("netlink: do not reset transport header in netlink_recvmsg()")
f329a0ebeaba ("genetlink: correct uAPI defines")
5c221f0af68c ("net: add missing kdoc for struct genl_multicast_group::flags")
30b6055428a9 ("net: improve and fix netlink kdoc")
0bf73255d3a3 ("netlink: fix some kernel-doc comments")
8f1948bdcf2f ("genetlink: hold read cb_lock during iteration of genl_fam_idr in genl_bind()")
abbc79280abc ("net: rtnetlink: use netif_oper_up instead of open code")
710d21fdff9a ("netlink: Bounds-check struct nlmsgerr creation")
08724ef69907 ("netlink: introduce NLA_POLICY_MAX_BE")
e7af210e6dd0 ("netfilter: nft_payload: reject out-of-range attributes via policy")
a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up")
5493a2ad0d20 ("docs: netlink: clarify the historical baggage of Netlink flags")
7354c9024f28 ("netlink: hide validation union fields from kdoc")
738136a0e375 ("netlink: split up copies in the ack construction")
1d997f101307 ("rtnetlink: pass netlink message header and portid to rtnl_configure_link()")
77f4aa9a2a17 ("net: add new helper unregister_netdevice_many_notify")
d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
ecaf75ffd5f5 ("netlink: introduce bigendian integer types")
e69761483361 ("netlink: Fix potential skb memleak in netlink_ack")
8e18be7610ae ("lib: Fix some kernel-doc comments")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
c73a72f4cbb4 ("netlink: remove the flex array from struct nlmsghdr")
f0950402e8c7 ("netlink: prevent potential spectre v1 gadgets")
c1bb9484e3b0 ("netlink: annotate data races around nlk->portid")
004db64d185a ("netlink: annotate data races around dst_portid and dst_group")
9b663b5cbb15 ("netlink: annotate data races around sk_state")
9d6a65079c98 ("docs: add more netlink docs (incl. spec docs)")
e616c07ca518 ("netlink: add schemas for YAML specs")
be5bea1cc0bf ("net: add basic C code generators for Netlink")
4eb77b4ecd3c ("netlink: add a proto specification for FOU")
3a330496baa8 ("net: fou: regenerate the uAPI from the spec")
08d323234d10 ("net: fou: rename the source for linking")
1d562c32e439 ("net: fou: use policy and operation tables generated from the spec")
e4b48ed460d3 ("tools: ynl: add a completely generic client")
66fa34b9c2a5 ("tools: ynl: support kdocs for flags in code generation")
b49c34e217c6 ("tools: ynl: rename ops_list -> msg_list")
3a43ded081f8 ("tools: ynl: store ops in ordered dict to avoid random ordering")
70eb3911d80f ("net: netlink: recommend policy range validation")
eaf317e7d2bb ("tools: ynl-gen: prevent do / dump reordering")
4e4480e89c47 ("tools: ynl: move the cli and netlink code around")
3aacf8281336 ("tools: ynl: add an object hierarchy to represent parsed spec")
30a5c6c8104f ("tools: ynl: use the common YAML loading and validation code")
19b64b48a33e ("tools: ynl: add support for types needed by ethtool")
fd0616d34274 ("tools: ynl: support directional enum-model in CLI")
90256f3f8093 ("tools: ynl: support multi-attr")
4cd2796f3f8d ("tools: ynl: support pretty printing bad attribute names")
8dfec0a88868 ("tools: ynl: use operation names from spec on the CLI")
5c6674f6eb52 ("tools: ynl: load jsonschema on demand")
8403bf044530 ("netlink: specs: finish up operation enum-models")
01e47a372268 ("docs: netlink: add a starting guide for working with specs")
981cbcb030d9 ("tools: net: use python3 explicitly")
f1db99c07b4f ("string_helpers: Move string_is_valid() to the header")
d4545bf9c33b ("genetlink: Use string_is_terminated() helper")
f7cf644796fc ("tools: ynl-gen: fix single attribute structs with attr 0 only")
b9d3a3e4ae0c ("tools: ynl-gen: re-raise the exception instead of printing")
d77e7eceeac9 ("tools: net: add __pycache__ to gitignore")
7cf93538e087 ("tools: ynl: fully inherit attrs in subsets")
ad4fafcde5bc ("tools: ynl: use 1 as the default for first entry in attrs/ops")
bcec7171eba9 ("netlink: specs: update for codegen enumerating from 1")
37d9df224d1e ("ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause")
6517a60b0307 ("tools: ynl: move the enum classes to shared code")
c311aaa74ca1 ("tools: ynl: fix enum-as-flags in the generic CLI")
8f76a4f80fba ("tools: ynl: fix render-max for flags definition")
bf51d27704c9 ("tools: ynl: fix get_mask utility routine")
054abb515f34 ("tools: ynl: make definitions optional again")
4e16b6a748df ("ynl: broaden the license even more")
cfab77c0b545 ("ynl: make the tooling check the license")
758d29fb3a8b ("tools: ynl: Fix genlmsg header encoding formats")
a1865f2e7d10 ("netlink: annotate lockless accesses to nlk->max_recvmsg_len")
59d3efd27c11 ("rtnetlink: Restore RTM_NEW/DELLINK notification behavior")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-24 19:47:18 +00:00
Lucas Zampieri 386e4aff26 Merge: bpf: add tcx support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3968

JIRA: https://issues.redhat.com/browse/RHEL-28590  
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3841  
Tested: bpf tc selftests pass, manual tests that the tcx hooks work as  
expected.  
  
Add the new tcx hook for bpf. It attaches at a similar place as the tc  
hook but has several advantages: it is based on the new multi prog  
infrastructure in the kernel to allow adding multiple bpf programs at  
the same hook; it follows the link semantics most other bpf hooks use  
which gives applications better control over the lifecycle of the bpf  
program; and tcx does not require a qdisc making the setup simpler.  
  
Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-22 12:42:51 +00:00
Dave Airlie 3b0a87ad0e lib/ref_tracker: improve printing stats
JIRA: https://issues.redhat.com/browse/RHEL-24101
Upstream Status: v6.5-rc1

This doesn't backport the namespace chunk that isn't
in RHEL yet.

Conflicts:
        net/core/net_namespace.c

commit b6d7c0eb2dcbd238fa233a3a1737654e380e784a
Author:     Andrzej Hajda <andrzej.hajda@intel.com>
AuthorDate: Fri Jun  2 12:21:34 2023 +0200
Commit:     Jakub Kicinski <kuba@kernel.org>
CommitDate: Mon Jun  5 15:28:42 2023 -0700

    In case the library is tracking a busy subsystem, simply
    printing the stack for every active reference will spam the log
    with long, hard to read, redundant stack traces. To improve
    readability, the following changes have been made:
    - reports are printed per stack_handle - log is more compact,
    - added display name for ref_tracker_dir - it will differentiate
      multiple subsystems,
    - stack trace is printed indented, in the same printk call,
    - info about dropped references is printed as well.

    Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
    Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-04-17 10:46:57 +10:00
Ivan Vecera ea0f31b62f net: make sure we never create ifindex = 0
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit ceaac91dcd065db781d1ed5dfaef0686b8ec44dc
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Jul 31 10:11:58 2023 -0700

    net: make sure we never create ifindex = 0

    Instead of allocating from 1 use proper xa_init flag,
    to protect ourselves from IDs wrapping back to 0.

    Fixes: 759ab1edb56c ("net: store netdevs in an xarray")
    Reported-by: Stephen Hemminger <stephen@networkplumber.org>
    Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:34 +02:00
Ivan Vecera 7f08107cfd net: warn about attempts to register negative ifindex
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 956db0a13b47df7f3d6d624394e602e8bf9b057e
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Aug 14 13:56:25 2023 -0700

    net: warn about attempts to register negative ifindex

    Since the xarray changes we mix returning valid ifindex and negative
    errno in a single int returned from dev_index_reserve(). This depends
    on the fact that ifindexes can't be negative. Otherwise we may insert
    into the xarray and return a very large negative value. This in turn
    may break ERR_PTR().

    OvS is susceptible to this problem and lacking validation (fix posted
    separately for net).

    Reject negative ifindex explicitly. Add a warning because the input
    validation is better handled by the caller.

    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:31 +02:00
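The problem described in the commit above is that a single int carries either a valid ifindex or a negative errno. It can be modeled in a few lines. This is a minimal Python sketch, not the kernel code: `reserve_unchecked`, `reserve_checked` and `is_err` are hypothetical names, a dict stands in for the xarray, and returning `-EINVAL` for a negative request is an assumption made for illustration.

```python
import errno

def is_err(value):
    """Mimic the kernel convention: a negative return value is an errno."""
    return value < 0

def reserve_unchecked(table, ifindex):
    """Store the caller-supplied index as-is and return it."""
    table[ifindex] = True
    return ifindex

def reserve_checked(table, ifindex):
    """Reject negative indexes before touching the table (assumed -EINVAL)."""
    if ifindex < 0:
        return -errno.EINVAL
    table[ifindex] = True
    return ifindex

unchecked, checked = {}, {}
bad = reserve_unchecked(unchecked, -5)   # inserted, yet the caller sees an "error"
good = reserve_checked(checked, -5)      # refused, nothing inserted
```

In the unchecked variant the entry is stored while the return value looks like a failure, so the caller and the table disagree. The explicit check keeps the two consistent.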
Ivan Vecera ab71363b2b net: store netdevs in an xarray
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 759ab1edb56c88906830fd6b2e7b12514dd32758
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Jul 26 11:55:29 2023 -0700

    net: store netdevs in an xarray

    Iterating over the netdev hash table for netlink dumps is hard.
    Dumps are done in "chunks" so we need to save the position
    after each chunk, so we know where to restart from. Because
    netdevs are stored in a hash table we remember which bucket
    we were in and how many devices we dumped.

    Since we don't hold any locks across the "chunks" - devices may
    come and go while we're dumping. If that happens we may miss
    a device (if device is deleted from the bucket we were in).
    We indicate to user space that this may have happened by setting
    NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
    if it sees that. Somehow I doubt most user space gets this right..

    To illustrate let's look at an example:

                   System state:
      start:       # [A, B, C]
      del:  B      # [A, C]

    with the hash table we may dump [A, B], missing C completely even
    though it existed both before and after the "del B".

    Add an xarray and use it to allocate ifindexes. This way we
    can iterate ifindexes in order, without the worry that we'll
    skip one. We may still generate a dump of a state which "never
    existed", for example for a set of values and sequence of ops:

                   System state:
      start:       # [A, B]
      add:  C      # [A, C, B]
      del:  B      # [A, C]

    we may generate a dump of [A], if C got an index between A and B.
    System has never been in such state. But I'm 90% sure that's perfectly
    fine, important part is that we can't _miss_ devices which exist before
    and after. User space which wants to mirror kernel's state subscribes
    to notifications and does periodic dumps so it will know that C exists
    from the notification about its creation or from the next dump
    (next dump is _guaranteed_ to include C, if it doesn't get removed).

    To avoid any perf regressions keep the hash table for now. Most
    net namespaces have very few devices and microbenchmarking 1M lookups
    on Skylake I get the following results (not counting loopback
    to number of devs):

     #devs | hash |  xa  | delta
        2  | 18.3 | 20.1 | + 9.8%
       16  | 18.3 | 20.1 | + 9.5%
       64  | 18.3 | 26.3 | +43.8%
      128  | 20.4 | 26.3 | +28.6%
      256  | 20.0 | 26.4 | +32.1%
     1024  | 26.6 | 26.7 | + 0.2%
     8192  |541.3 | 33.5 | -93.8%

    No surprises since the hash table has 256 entries.
    The microbenchmark scans indexes in order, if the pattern is more
    random xa starts to win at 512 devices already. But that's a lot
    of devices, in practice.

    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:28 +02:00
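The first scenario from the commit message above (start with [A, B, C], delete B between two dump chunks) can be replayed with a small simulation. This is a hedged Python sketch, not the kernel implementation: a plain list stands in for one hash bucket, a dict keyed by ifindex stands in for the xarray, and the function names and the `between_chunks` callback are hypothetical.

```python
def dump_by_position(bucket, chunk, between_chunks):
    """Resume by counting how many entries were already dumped."""
    dumped, offset, first = [], 0, True
    while True:
        part = bucket[offset:offset + chunk]
        if not part:
            return dumped
        dumped.extend(part)
        offset += len(part)
        if first:
            between_chunks(bucket)  # devices may come and go between chunks
            first = False

def dump_by_ifindex(devs, chunk, between_chunks):
    """Resume from the last ifindex dumped, iterating in index order."""
    dumped, next_index, first = [], 1, True
    while True:
        part = [i for i in sorted(devs) if i >= next_index][:chunk]
        if not part:
            return dumped
        dumped.extend(devs[i] for i in part)
        next_index = part[-1] + 1
        if first:
            between_chunks(devs)
            first = False

bucket = ["A", "B", "C"]
by_position = dump_by_position(bucket, 2, lambda b: b.remove("B"))

devs = {1: "A", 2: "B", 3: "C"}
by_ifindex = dump_by_ifindex(devs, 2, lambda d: d.pop(2))
```

With the position-based resume the second chunk starts past the end of the shrunken bucket, so C is never reported. With the index-based resume the walk continues from the next ifindex and C is included.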
Ivan Vecera 4ee448db07 net: introduce include/net/rps.h
JIRA: https://issues.redhat.com/browse/RHEL-31916

Conflicts:
* net/core/dev.c
  context conflict due to missing commit 2b0cfa6e49566 ("net: add
  generic percpu page_pool allocator")
* net/core/sysctl_net_core.c
  context conflict due to missing commit 2658b5a8a4eee ("net: introduce
  struct net_hotdata")

commit 490a79faf95e705ba0ffd9ebf04a624b379e53c9
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Mar 6 16:00:30 2024 +0000

    net: introduce include/net/rps.h

    Move RPS related structures and helpers from include/linux/netdevice.h
    and include/net/sock.h to a new include file.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-05 16:03:32 +02:00
Ivan Vecera 139012e61c net: move struct netdev_rx_queue out of netdevice.h
JIRA: https://issues.redhat.com/browse/RHEL-31916

Conflicts:
* include/linux/netdevice.h
  Adjusted due to KABI reservations made by RHEL
  commit 3b3a52715a ("net: exclude BPF/XDP from kABI")

commit 49e47a5b6145d86c30022fe0e949bbb24bae28ba
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Aug 2 18:02:29 2023 -0700

    net: move struct netdev_rx_queue out of netdevice.h

    struct netdev_rx_queue is touched in only a few places
    and having it defined in netdevice.h brings in the dependency
    on xdp.h, because struct xdp_rxq_info gets embedded in
    struct netdev_rx_queue.

    In prep for removal of xdp.h from netdevice.h move all
    the netdev_rx_queue stuff to a new header.

    We could technically break the new header up to avoid
    the sysfs.h include but it's so rarely included it
    doesn't seem to be worth it at this point.

    Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-05 16:03:26 +02:00
Ivan Vecera ff64b212d1 rfs: annotate lockless accesses to RFS sock flow table
JIRA: https://issues.redhat.com/browse/RHEL-31916

commit 5c3b74a92aa285a3df722bf6329ba7ccf70346d6
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 6 07:41:15 2023 +0000

    rfs: annotate lockless accesses to RFS sock flow table

    Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.

    This also prevents a (smart?) compiler from removing the condition in:

    if (table->ents[index] != newval)
            table->ents[index] = newval;

    We need the condition to avoid dirtying a shared cache line.

    Fixes: fec5e652e5 ("rfs: Receive Flow Steering")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-05 15:05:45 +02:00
Felix Maurer 09a80de356 net: Fix skb consume leak in sch_handle_egress
JIRA: https://issues.redhat.com/browse/RHEL-28590

commit 28d18b673ffa2d13112ddb6e4c32c60d9b0cda50
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri Aug 25 15:49:45 2023 +0200

    net: Fix skb consume leak in sch_handle_egress
    
    Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
    
      [...]
      unreferenced object 0xffff88818bcb4f00 (size 232):
      comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff  ..pa.....A1.....
      backtrace:
        [<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
        [<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
        [<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
        [<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
        [<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
        [<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
        [<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
        [<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
        [<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
        [<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
        [<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
        [<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
        [<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
        [<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
        [<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
        [<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
      [...]
    
    I was able to reproduce this via:
    
      ip link add dev dummy0 type dummy
      ip link set dev dummy0 up
      tc qdisc add dev eth0 clsact
      tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
      ping 1.1.1.1
      <stolen>
    
    After the fix, there are no kmemleak reports with the reproducer. This is
    in line with what is also done on the ingress side, and from debugging the
    skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
    that these are two different skbs with both skb_unref(skb) as true. The two
    seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
    is false in tcf_mirred_act() for egress. This was initially reported by Gal.
    
    Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
    Reported-by: Gal Pressman <gal@nvidia.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:30:18 +02:00
Felix Maurer cbdf6d7deb bpf: Add fd-based tcx multi-prog infra with link support
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- MAINTAINERS: The file has been restructured upstream, but this is not
  relevant for us. All paths are already covered.
- include/linux/netdevice.h: We have excluded TC from kABI with
  845ad79d11 ("net: exclude TC from kABI"). Keep this exclusion.
- include/linux/skbuff.h: The order of the fields has been changed upstream
  in c0ba861117c3 ("net: skbuff: move the fields BPF cares about directly
  next to the offset marker"). The actual change is just changing config
  options. Do this instead of picking the field reordering to make
  backporting easier.
- include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h: The changes
  to these files were already backported through 1d5bff6a09 ("bpf: Add
  fd-based tcx multi-prog infra with link support") to keep UAPI close to
  upstream.
- kernel/bpf/syscall.c: Already backported 58ff9f1ec9 ("bpf: Add
  attach_type checks under bpf_prog_attach_check_attach_type") moves one
  switch block around. The case BPF_PROG_TYPE_SCHED_CLS was added during
  that backport, therefore this hunk is missing now. This also causes
  context differences.
- kernel/bpf/syscall.c: Already backported 81b5cf0a11 ("bpf: Fix
  BPF_PROG_QUERY last field check") fixed the QUERY_LAST_FIELD.

commit e420bed025071a623d2720a92bc2245c84757ecb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Wed Jul 19 16:08:52 2023 +0200

    bpf: Add fd-based tcx multi-prog infra with link support

    This work refactors and adds a lightweight extension ("tcx") to the tc BPF
    ingress and egress data path side for allowing BPF program management based
    on fds via bpf() syscall through the newly added generic multi-prog API.
    The main goal behind this work which we also presented at LPC [0] last year
    and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
    BPF link functionality for tc BPF programs, which allows for a model of safe
    ownership and program detachment.

    Given the rise in tc BPF users in cloud native environments, this becomes
    necessary to avoid hard to debug incidents either through stale leftover
    programs or 3rd party applications accidentally stepping on each other's toes.
    As a recap, a BPF link represents the attachment of a BPF program to a BPF
    hook point. The BPF link holds a single reference to keep BPF program alive.
    Moreover, hook points do not reference a BPF link, only the application's
    fd or pinning does. A BPF link holds meta-data specific to attachment and
    implements operations for link creation, (atomic) BPF program update,
    detachment and introspection. The motivation for BPF links for tc BPF programs
    is multi-fold, for example:

      - From Meta: "It's especially important for applications that are deployed
        fleet-wide and that don't "control" hosts they are deployed to. If such
        application crashes and no one notices and does anything about that, BPF
        program will keep running draining resources or even just, say, dropping
        packets. We at FB had outages due to such permanent BPF attachment
        semantics. With fd-based BPF link we are getting a framework, which allows
        safe, auto-detachable behavior by default, unless application explicitly
        opts in by pinning the BPF link." [1]

      - From Cilium-side the tc BPF programs we attach to host-facing veth devices
        and phys devices build the core datapath for Kubernetes Pods, and they
        implement forwarding, load-balancing, policy, EDT-management, etc, within
        BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
        experienced hard-to-debug issues in a user's staging environment where
        another Kubernetes application using tc BPF attached to the same prio/handle
        of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
        it. The goal is to establish a clear/safe ownership model via links which
        cannot accidentally be overridden. [0,2]

    BPF links for tc can co-exist with non-link attachments, and the semantics are
    in line also with XDP links: BPF links cannot replace other BPF links, BPF
    links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
    lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
    would solve mentioned issue of safe ownership model as 3rd party applications
    would not be able to accidentally wipe Cilium programs, even if they are not
    BPF link aware.

    Earlier attempts [4] have tried to integrate BPF links into core tc machinery
    to solve cls_bpf, which has been intrusive to the generic tc kernel API with
    extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
    be wiped from the qdisc also. Locking a tc BPF program in place this way, is
    getting into layering hacks given the two object models are vastly different.

    We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
    attach API, so that the BPF link implementation blends in naturally similar to
    other link types which are fd-based and without the need for changing core tc
    internal APIs. BPF programs for tc can then be successively migrated from classic
    cls_bpf to the new tc BPF link without needing to change the program's source
    code, just the BPF loader mechanics for attaching is sufficient.

    For the current tc framework, there is no change in behavior with this change
    and neither does this change touch on tc core kernel APIs. The gist of this
    patch is that the ingress and egress hook have a lightweight, qdisc-less
    extension for BPF to attach its tc BPF programs, in other words, a minimal
    entry point for tc BPF. The name tcx has been suggested from discussion of
    earlier revisions of this work as a good fit, and to more easily differ between
    the classic cls_bpf attachment and the fd-based one.

    For the ingress and egress tcx points, the device holds a cache-friendly array
    with program pointers which is separated from control plane (slow-path) data.
    Earlier versions of this work used priority to determine ordering and expression
    of dependencies similar as with classic tc, but it was challenged that for
    something more future-proof a better user experience is required. Hence this
    resulted in the design and development of the generic attach/detach/query API
    for multi-progs. See prior patch with its discussion on the API design. tcx is
    the first user and later we plan to integrate also others, for example, one
    candidate is multi-prog support for XDP which would benefit and have the same
    'look and feel' from API perspective.

    The goal with tcx is to have maximum compatibility to existing tc BPF programs,
    so they don't need to be rewritten specifically. Compatibility to call into
    classic tcf_classify() is also provided in order to allow successive migration
    or both to cleanly co-exist where needed given it's all one logical tc layer and
    the tcx plus classic tc cls/act build one logical overall processing pipeline.

    tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
    to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
    The fd-based API is behind a static key, so that when unused the code is also
    not entered. The struct tcx_entry's program array is currently static, but
    could be made dynamic if necessary at a point in future. The a/b pair swap
    design has been chosen so that for detachment there are no allocations which
    otherwise could fail.

    The work has been tested with tc-testing selftest suite which all passes, as
    well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

    Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
    of this work.

      [0] https://lpc.events/event/16/contributions/1353/
      [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
      [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
      [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
      [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-04-04 16:21:22 +02:00
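The tcx return-code semantics described in the commit above (TCX_NEXT continues with the next program, the other codes terminate) can be sketched as a loop over a program array. This is an illustrative Python model only: `run_programs` and the sample programs are hypothetical, the numeric values are assumptions chosen for the sketch, and returning TCX_NEXT when every program defers is likewise an assumption about how the caller continues with the rest of the pipeline.

```python
# Assumed numeric values, for illustration only.
TCX_NEXT, TCX_PASS, TCX_DROP, TCX_REDIRECT = -1, 0, 2, 7

def run_programs(programs, packet):
    """Run programs in order; the first terminating verdict wins."""
    for prog in programs:
        verdict = prog(packet)
        if verdict != TCX_NEXT:
            return verdict
    # Nobody decided: hand the packet on to whatever comes next.
    return TCX_NEXT

def observe_only(packet):
    """A program that inspects the packet but never decides."""
    return TCX_NEXT

def drop_icmp(packet):
    """A program that drops ICMP and defers on everything else."""
    return TCX_DROP if packet["proto"] == "icmp" else TCX_NEXT

def pass_all(packet):
    return TCX_PASS

chain = [observe_only, drop_icmp, pass_all]
icmp_verdict = run_programs(chain, {"proto": "icmp"})
tcp_verdict = run_programs(chain, {"proto": "tcp"})
```

Here the ICMP packet is dropped by the second program and never reaches the third, while the TCP packet falls through to the final program and is passed.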
Ivan Vecera 6acc282759 rtnetlink: Restore RTM_NEW/DELLINK notification behavior
JIRA: https://issues.redhat.com/browse/RHEL-30344

commit 59d3efd27c11c59b32291e5ebc307bed2edb65ee
Author: Martin Willi <martin@strongswan.org>
Date:   Tue Apr 11 09:43:19 2023 +0200

    rtnetlink: Restore RTM_NEW/DELLINK notification behavior

    The commits referenced below allow userspace to use the NLM_F_ECHO flag
    for RTM_NEW/DELLINK operations to receive unicast notifications for the
    affected link. Prior to these changes, applications may have relied on
    multicast notifications to learn the same information without specifying
    the NLM_F_ECHO flag.

    For such applications, the mentioned commits changed the behavior for
    requests not using NLM_F_ECHO. Multicast notifications are still received,
    but now use the portid of the requester and the sequence number of the
    request instead of zero values used previously. For the application, this
    message may be unexpected and likely handled as a response to the
    NLM_F_ACKed request, especially if it uses the same socket to handle
    requests and notifications.

    To fix existing applications relying on the old notification behavior,
    set the portid and sequence number in the notification only if the
    request included the NLM_F_ECHO flag. This restores the old behavior
    for applications not using it, but allows unicasted notifications for
    others.

    Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
    Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
    Signed-off-by: Martin Willi <martin@strongswan.org>
    Acked-by: Guillaume Nault <gnault@redhat.com>
    Acked-by: Hangbin Liu <liuhangbin@gmail.com>
    Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-02 11:15:41 +02:00
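The rule restored by the commit above is to fill in the requester's portid and sequence number only when the request carried NLM_F_ECHO. It reduces to a single conditional. The following is a minimal Python sketch under stated assumptions: `notification_ids` is a hypothetical helper and not a kernel function, and the flag value is the one from the netlink uAPI header.

```python
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04
NLM_F_ECHO = 0x08

def notification_ids(request_flags, portid, seq):
    """Return the (portid, seq) pair to put into the link notification."""
    if request_flags & NLM_F_ECHO:
        # The requester asked for an echo: identify the notification with its request.
        return portid, seq
    # Old behavior: notifications carry zero values.
    return 0, 0

echoed = notification_ids(NLM_F_REQUEST | NLM_F_ECHO, 1234, 7)
plain = notification_ids(NLM_F_REQUEST | NLM_F_ACK, 1234, 7)
```

An application that only sets NLM_F_ACK therefore sees notifications with zero portid and sequence number again, and cannot mistake them for the response to its own request.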
Ivan Vecera e2cfbf12c4 net: add new helper unregister_netdevice_many_notify
JIRA: https://issues.redhat.com/browse/RHEL-30344

commit 77f4aa9a2a1766a0b9343fd812b71f18d05178da
Author: Hangbin Liu <liuhangbin@gmail.com>
Date:   Fri Oct 28 04:42:22 2022 -0400

    net: add new helper unregister_netdevice_many_notify

    Add new helper unregister_netdevice_many_notify(), pass netlink message
    header and portid, which could be used to notify userspace when flag
    NLM_F_ECHO is set.

    Make unregister_netdevice_many() a wrapper of the new function
    unregister_netdevice_many_notify().

    Suggested-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Reviewed-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-02 11:15:37 +02:00
Ivan Vecera e3ad0f633f rtnetlink: pass netlink message header and portid to rtnl_configure_link()
JIRA: https://issues.redhat.com/browse/RHEL-30344

commit 1d997f1013079c05b642c739901e3584a3ae558d
Author: Hangbin Liu <liuhangbin@gmail.com>
Date:   Fri Oct 28 04:42:21 2022 -0400

    rtnetlink: pass netlink message header and portid to rtnl_configure_link()

    This patch passes the netlink message header and portid to rtnl_configure_link().
    All the functions in this call chain need to add the parameters so we can
    use them in the last call rtnl_notify(), and notify the userspace about
    the new link info if NLM_F_ECHO flag is set.

    - rtnl_configure_link()
      - __dev_notify_flags()
        - rtmsg_ifinfo()
          - rtmsg_ifinfo_event()
            - rtmsg_ifinfo_build_skb()
            - rtmsg_ifinfo_send()
              - rtnl_notify()

    Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub
    suggested.

    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Reviewed-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-02 11:15:37 +02:00
Ivan Vecera 8a9913f68e net: ensure net_todo_list is processed quickly
JIRA: https://issues.redhat.com/browse/RHEL-30344

Conflicts:
- we have already backported 6264f58ca0e54 ("net: extract a few
  internals from netdevice.h") so the net_todo_list has to be placed in
  net/core/dev.h instead of include/linux/netdevice.h

commit 0b5c21bbc01e92745ca1ca4f6fd87d878fa3ea5e
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Mon Apr 4 11:38:47 2022 +0200

    net: ensure net_todo_list is processed quickly

    In [1], Will raised a potential issue that the cfg80211 code,
    which does (from a locking perspective)

      rtnl_lock()
      wiphy_lock()
      rtnl_unlock()

    might be susceptible to ABBA deadlocks, because rtnl_unlock()
    calls netdev_run_todo(), which might end up calling rtnl_lock()
    again, which could then deadlock (see the comment in the code
    added here for the scenario).

    Some back and forth and thinking ensued, but clearly this can't
    happen if the net_todo_list is empty at the rtnl_unlock() here.
    Clearly, the code here cannot actually put an entry on it, and
    all other users of rtnl_unlock() will empty it since that will
    always go through netdev_run_todo(), emptying the list.

    So the only other way to get there would be to add to the list
    and then unlock the RTNL without going through rtnl_unlock(),
    which is only possible through __rtnl_unlock(). However, this
    isn't exported and not used in many places, and none of them
    seem to be able to unregister before using it.

    Therefore, add a WARN_ON() in the code to ensure this invariant
    won't be broken, so that the cfg80211 (or any similar) code
    stays safe.

    [1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-03-26 15:41:49 +01:00
Ivan Vecera 3961299c81 net: make net->dev_unreg_count atomic
JIRA: https://issues.redhat.com/browse/RHEL-30344

commit ede6c39c4f9068cbeb4036448c45fff5393e0432
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Feb 9 18:59:32 2022 -0800

    net: make net->dev_unreg_count atomic

    Having to acquire rtnl from netdev_run_todo() for every dismantled
    device is not desirable when/if rtnl is under stress.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-03-26 14:45:13 +01:00
Scott Weaver 68fc749fdd Merge: add kabi reserved fields and excludes to networking
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3584

JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: mostly RHEL-only patches

This series adds reserved fields to networking structs, and excludes
some areas of networking from the kABI guarantee. These reserved
fields are only needed during backports to z-stream.

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-22 12:01:19 -05:00
Paolo Abeni f31d9dce62 net: check dev->gso_max_size in gso_features_check()
JIRA: https://issues.redhat.com/browse/RHEL-21447
Tested: LNST, Tier1

Upstream commit:
commit 24ab059d2ebd62fdccc43794796f6ffbabe49ebc
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Dec 19 12:53:31 2023 +0000

    net: check dev->gso_max_size in gso_features_check()

    Some drivers might misbehave if TSO packets get too big.

    GVE for instance uses a 16bit field in its TX descriptor,
    and will do bad things if a packet is bigger than 2^16 bytes.

    Linux TCP stack honors dev->gso_max_size, but there are
    other ways for too big packets to reach an ndo_start_xmit()
    handler : virtio_net, af_packet, GRO...

    Add a generic check in gso_features_check() and fall back
    to GSO when needed.

    gso_max_size was added in the blamed commit.

    Fixes: 82cc1a7a56 ("[NET]: Add per-connection option to set max TSO frame size")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20231219125331.4127498-1-edumazet@google.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 17:13:23 +01:00
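The generic check described in f31d9dce62 above can be sketched in user space as follows. This is a simplified model, not the kernel code: the `toy_*` types, the `TOY_FEAT_*` bit values and the use of `>=` for the size comparison are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; the values are illustrative only. */
#define TOY_FEAT_CSUM     (1u << 0)
#define TOY_FEAT_TSO      (1u << 1)
#define TOY_FEAT_GSO_MASK (TOY_FEAT_TSO)

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct toy_gso_dev { uint32_t gso_max_size; };
struct toy_gso_skb { uint32_t len; };

/*
 * Return the feature set the driver may rely on for this packet.
 * If the packet is too big for the device, drop the GSO offload bits so
 * that the stack segments the packet in software before transmission.
 */
static uint32_t toy_gso_features_check(const struct toy_gso_skb *skb,
                                       const struct toy_gso_dev *dev,
                                       uint32_t features)
{
    assert(skb && dev);

    if (skb->len >= dev->gso_max_size)   /* assumed comparison */
        return features & ~TOY_FEAT_GSO_MASK;

    return features;
}
```

With `gso_max_size` set to 65536, a 70000-byte packet loses `TOY_FEAT_TSO` while a 1500-byte packet keeps its features unchanged, whichever path (virtio_net, af_packet, GRO) produced it.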
Sabrina Dubroca 1a10ee13b5 net: add reserved fields to rtnl_link_stats*
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: RHEL-only

rtnl_link_stats and rtnl_link_stats64 are protected by kABI, add 4
reserved fields. We need to use a custom mechanism here, because those
structures are part of uapi.

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-01-12 14:27:38 +01:00
Scott Weaver 2f27d6aa76 Merge: CNB94: bridge: update bridge and switchdev to the latest upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3455

JIRA: https://issues.redhat.com/browse/RHEL-862
Tested: Using self-tests infrastructure

Commits:
```
6fb4825e492b ("docs: net: add an explanation of VF (and other) Representors")
8c44fa12c8fa ("net: Add MDB net device operations")
c009de1061b5 ("bridge: mcast: Implement MDB net device operations")
cc7f5022f810 ("rtnetlink: bridge: mcast: Move MDB handlers out of bridge driver")
da654c80a0eb ("rtnetlink: bridge: mcast: Relax group address validation in common code")
013a7ce81dd8 ("bridge: Reorder neighbor suppression check when flooding")
e408336a693e ("bridge: Pass VLAN ID to br_flood()")
a714e3ec2308 ("bridge: Add internal flags for per-{Port, VLAN} neighbor suppression")
6be42ed0a5f4 ("bridge: Take per-{Port, VLAN} neighbor suppression into account")
3aca683e0654 ("bridge: Encapsulate data path neighbor suppression logic")
412614b1457a ("bridge: Add per-{Port, VLAN} neighbor suppression data path support")
83f6d600796c ("bridge: vlan: Allow setting VLAN neighbor suppression state")
160656d7201d ("bridge: Allow setting per-{Port, VLAN} neighbor suppression state")
7648ac72dcd7 ("selftests: net: Add bridge neighbor suppression test")
89dcd87ce534 ("bridge: always declare tunnel functions")
812de4dfab98 ("selftests: router_bridge_vlan: Add a diagram")
f5136877f421 ("selftests: router_bridge_vlan: Set vlan_default_pvid 0 on the bridge")
8c3736ce595b ("selftests: forwarding: q_in_vni: Disable IPv6 autogen on bridges")
c801533304ca ("selftests: forwarding: dual_vxlan_bridge: Disable IPv6 autogen on bridges")
d7442b7d288e ("selftests: forwarding: skbedit_priority: Disable IPv6 autogen on a bridge")
f61018dc3e21 ("selftests: forwarding: pedit_dsfield: Disable IPv6 autogen on a bridge")
92c3bb5393db ("selftests: forwarding: mirror_gre_*: Disable IPv6 autogen on bridges")
8fd32576e650 ("selftests: forwarding: mirror_gre_*: Use port MAC for bridge address")
5e71bf50c2e2 ("selftests: forwarding: router_bridge: Use port MAC for bridge address")
6ca3c005d060 ("net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode")
5f44a7144cc5 ("selftests: forwarding: lib: Add ping6_, ping_test_fails()")
c7203a2981dc ("selftests: router_bridge: Add tests to remove and add PVID")
d4172a93b279 ("selftests: router_bridge_vlan: Add PVID change test")
b0307b77265b ("selftests: router_bridge_vlan_upper_pvid: Add a new selftest")
9cbb3da4f4f7 ("selftests: router_bridge_pvid_vlan_upper: Add a new selftest")
989280d6ea70 ("net: bridge: br_switchdev: Tolerate -EOPNOTSUPP when replaying MDB")
f2e2857b3522 ("net: switchdev: Add a helper to replay objects on a bridge port")
4d66f235c790 ("bridge: Remove unused declaration br_multicast_set_hash_max()")
eb1388553ef4 ("selftests: router_bridge: Add remastering tests")
0a06e0c1af97 ("selftests: router_bridge_1d: Add a new selftest")
49e15dec8b90 ("selftests: router_bridge_vlan_upper: Add a new selftest")
3f0c4e70a9ef ("selftests: router_bridge_lag: Add a new selftest")
24e84656e432 ("selftests: router_bridge_1d_lag: Add a new selftest")
f85b1c7da776 ("net: switchdev: Remove unused typedef switchdev_obj_dump_cb_t()")
a76728719c85 ("net: switchdev: Remove unused declaration switchdev_port_fwd_mark_set()")
38c43a1ce758 ("selftests: forwarding: Add test case for traffic redirection from a locked port")
6c1c5097781f ("net: add atomic_long_t to net_device_stats fields")
44bdb313da57 ("net: bridge: use DEV_STATS_INC()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-10 12:31:27 -05:00
Scott Weaver 96360329fb Merge: CNB94: Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3460

JIRA: https://issues.redhat.com/browse/RHEL-18147
Tested: Just built... No way to test the new interface as no driver was converted yet.

Commits:
```
00d521b39307 ("net: don't abuse "default" case for unknown ioctl in dev_ifsioc()")
1193db2a55b6 ("net: simplify handling of dsa_ndo_eth_ioctl() return code")
4ee58e1e5680 ("net: promote SIOCSHWTSTAMP and SIOCGHWTSTAMP ioctls to dedicated handlers")
d5d5fd8f2552 ("net: move copy_from_user() out of net_hwtstamp_validate()")
c4bffeaa8d50 ("net: add struct kernel_hwtstamp_config and make net_hwtstamp_validate() use it")
88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master")
5a17818682cf ("net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub")
66f7223039c0 ("net: add NDOs for configuring hardware timestamping")
e47d01fea663 ("net: add hwtstamping helpers for stackable net devices")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2024-01-08 12:05:43 -05:00
Ivan Vecera 8051b046de net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub
JIRA: https://issues.redhat.com/browse/RHEL-18147

Conflicts:
- DSA stuff removed except dsa_stubs.h that provides inline function
  dsa_master_hwtstamp_validate()

commit 5a17818682cf43ad0fdd6035945f3b7a8c9dc5e9
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Thu Apr 6 14:42:46 2023 +0300

    net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub

    There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a
    netdev notifier for DSA to reject PTP on DSA master"), due to a desire
    to convert DSA's attempt to deny TX timestamping on a DSA master to
    something that doesn't block the kernel-wide API conversion from
    ndo_eth_ioctl() to ndo_hwtstamp_set().

    What was required was a mechanism that did not depend on ndo_eth_ioctl(),
    and what was provided was a mechanism that did not depend on
    ndo_eth_ioctl(), while at the same time introducing something that
    wasn't absolutely necessary - a new netdev notifier.

    There have been objections from Jakub Kicinski that using notifiers in
    general when they are not absolutely necessary creates complications to
    the control flow and difficulties to maintainers who look at the code.
    So there is a desire to not use notifiers.

    In addition to that, the notifier chain gets called even if there is no
    DSA in the system and no one is interested in applying any restriction.

    Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
    through which net/core/dev_ioctl.c can call into DSA even when
    CONFIG_NET_DSA=m.

    Compared to the code that existed prior to the notifier conversion, aka
    what was added in commits:
    - 4cfab35667 ("net: dsa: Add wrappers for overloaded ndo_ops")
    - 3369afba1e ("net: Call into DSA netdevice_ops wrappers")

    this is different because we are not overloading any struct
    net_device_ops of the DSA master anymore, but rather, we are exposing a
    rather specific functionality which is orthogonal to which API is used
    to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().

    Also, what is similar is that both approaches use function pointers to
    get from built-in code to DSA.

    There is no point in replicating the function pointers towards
    __dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
    Instead, it is sufficient to introduce a singleton struct dsa_stubs,
    built into the kernel, which contains a single function pointer to
    __dsa_master_hwtstamp_validate().

    I find this approach preferable to what we had originally, because
    dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
    struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
    attempts to add any data encapsulation and hide DSA data structures from
    the outside world.

    Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-12 15:38:59 +01:00
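The stub mechanism described in 8051b046de above (a singleton table of function pointers that built-in code can call even when the provider is a module) can be sketched as below. This is a user-space model under stated assumptions: every `toy_*` name, the two device flags and the choice of `-EOPNOTSUPP` as the rejection code are hypothetical and not taken from the kernel sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device model: only the two facts the validator needs. */
struct toy_netdev {
    int is_dsa_master;
    int switch_is_ptp_aware;
};

/* Singleton stub table: one function pointer, shared by all ports. */
struct toy_dsa_stubs {
    int (*master_hwtstamp_validate)(const struct toy_netdev *dev);
};

/* "Built-in" side: NULL means the optional module is not loaded. */
static const struct toy_dsa_stubs *toy_dsa_stubs;

static int toy_dsa_master_hwtstamp_validate(const struct toy_netdev *dev)
{
    assert(dev != NULL);

    if (!toy_dsa_stubs)
        return 0;               /* nobody registered: no restriction */

    return toy_dsa_stubs->master_hwtstamp_validate(dev);
}

/* "Module" side: the real check, reachable only through the stub table. */
static int toy_module_validate(const struct toy_netdev *dev)
{
    if (dev->is_dsa_master && dev->switch_is_ptp_aware)
        return -EOPNOTSUPP;     /* assumed error code */
    return 0;
}

static const struct toy_dsa_stubs toy_module_stubs = {
    .master_hwtstamp_validate = toy_module_validate,
};

static void toy_module_load(void)   { toy_dsa_stubs = &toy_module_stubs; }
static void toy_module_unload(void) { toy_dsa_stubs = NULL; }
```

Unlike a notifier chain, nothing runs when the table is empty, so a system without the module pays only for one pointer test per call.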
Ivan Vecera cbac5c0d4e net: create a netdev notifier for DSA to reject PTP on DSA master
JIRA: https://issues.redhat.com/browse/RHEL-18147

Conflicts:
- Omitted DSA changes as they are not applicable. Note that DSA is disabled
  in RHEL.

commit 88c0a6b503b7f4fffb68a8d49c3987870c5b1d6b
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Sun Apr 2 15:37:55 2023 +0300

    net: create a netdev notifier for DSA to reject PTP on DSA master

    The fact that PTP 2-step TX timestamping is broken on DSA switches if
    the master also timestamps the same packets is documented by commit
    f685e609a3 ("net: dsa: Deny PTP on master if switch supports it").
    We attempt to help the users avoid shooting themselves in the foot by
    making DSA reject the timestamping ioctls on an interface that is a DSA
    master, and the switch tree beneath it contains switches which are aware
    of PTP.

    The only problem is that there isn't an established way of intercepting
    ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
    stack by creating a struct dsa_netdevice_ops with overlaid function
    pointers that are manually checked from the relevant call sites. There
    used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
    one left.

    There is an ongoing effort to migrate driver-visible hardware timestamping
    control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
    model, but DSA actively prevents that migration, since dsa_master_ioctl()
    is currently coded to manually call the master's legacy ndo_eth_ioctl(),
    and so, whenever a network device driver would be converted to the new
    API, DSA's restrictions would be circumvented, because any device could
    be used as a DSA master.

    The established way for unrelated modules to react on a net device event
    is via netdevice notifiers. So we create a new notifier which gets
    called whenever there is an attempt to change hardware timestamping
    settings on a device.

    Finally, there is another reason why a netdev notifier will be a good
    idea, besides strictly DSA, and this has to do with PHY timestamping.

    With ndo_eth_ioctl(), all MAC drivers must manually call
    phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
    otherwise they must pass this ioctl to the PHY driver via
    phy_mii_ioctl().

    With the new ndo_hwtstamp_set() API, it will be desirable to simply not
    make any calls into the MAC device driver when timestamping should be
    performed at the PHY level.

    But there exist drivers, such as the lan966x switch, which need to
    install packet traps for PTP regardless of whether they are the layer
    that provides the hardware timestamps, or the PHY is. That would be
    impossible to support with the new API.

    The proposal there, too, is to introduce a netdev notifier which acts as
    a better cue for switching drivers to add or remove PTP packet traps,
    than ndo_hwtstamp_set(). The one introduced here "almost" works there as
    well, except for the fact that packet traps should only be installed if
    the PHY driver succeeded to enable hardware timestamping, whereas here,
    we need to deny hardware timestamping on the DSA master before it
    actually gets enabled. This is why this notifier is called "PRE_", and
    the notifier that would get used for PHY timestamping and packet traps
    would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
    example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.

    In expectation of future netlink UAPI, we also pass a non-NULL extack
    pointer to the netdev notifier, and we make DSA populate it with an
    informative reason for the rejection. To avoid making it go to waste, we
    make the ioctl-based dev_set_hwtstamp() create a fake extack and print
    the message to the kernel log.

    Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
    Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-12 15:38:55 +01:00
Antoine Tenart bacd38afc2 bpf, xdp: Add tracepoint to xdp attaching failure
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git

commit bf4ea1d0b2cb2251f9e5619c81daa98591087c33
Author: Leon Hwang <hffilwlqm@gmail.com>
Date:   Tue Aug 1 22:26:20 2023 +0800

    bpf, xdp: Add tracepoint to xdp attaching failure

    When an error happens in dev_xdp_attach(), there should be a way to
    report the error message to users, as the netlink approach does.

    To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
    an appropriate way to notify users of the error message.

    Hence, bpf libraries are able to retrieve the error message through
    this tracepoint, and then report it to users.

    Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
    Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:15:48 +01:00
Ivan Vecera 93fed84669 net: Fix unwanted sign extension in netdev_stats_to_stats64()
JIRA: https://issues.redhat.com/browse/RHEL-862

commit 9b55d3f0a69af649c62cbc2633e6d695bb3cc583
Author: Felix Riemann <felix.riemann@sma.de>
Date:   Fri Feb 10 13:36:44 2023 +0100

    net: Fix unwanted sign extension in netdev_stats_to_stats64()

    When converting net_device_stats to rtnl_link_stats64 sign extension
    is triggered on ILP32 machines as 6c1c509778 changed the previous
    "ulong -> u64" conversion to "long -> u64" by accessing the
    net_device_stats fields through a (signed) atomic_long_t.

    This causes for example the received bytes counter to jump to 16EiB after
    having received 2^31 bytes. Casting the atomic value to "unsigned long"
    before converting it into u64 avoids this.

    Fixes: 6c1c5097781f ("net: add atomic_long_t to net_device_stats fields")
    Signed-off-by: Felix Riemann <felix.riemann@sma.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-05 15:35:32 +01:00
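The sign extension fixed by 93fed84669 above can be reproduced in user space. In this sketch `int32_t` and `uint32_t` stand in for `long` and `unsigned long` on an ILP32 machine; the function names are hypothetical and the code only models the integer conversions, not the kernel's atomic accessors.

```c
#include <assert.h>   /* used by the checks in the note below */
#include <stdint.h>

/*
 * Buggy widening: the 32-bit counter is read as a signed value and then
 * converted to 64 bits, so bit 31 is copied into the upper 32 bits.
 */
static uint64_t toy_widen_signed(int32_t counter)
{
    return (uint64_t)counter;
}

/*
 * Fixed widening: cast to the unsigned 32-bit type first, so the
 * conversion to 64 bits zero-extends instead of sign-extending.
 */
static uint64_t toy_widen_unsigned(int32_t counter)
{
    return (uint64_t)(uint32_t)counter;
}
```

A counter that has just reached 2^31 has the bit pattern of `INT32_MIN`. `toy_widen_signed(INT32_MIN)` yields 0xFFFFFFFF80000000 (just under 16 EiB), while `toy_widen_unsigned(INT32_MIN)` yields 0x80000000, the expected 2^31. For values below 2^31 both functions agree.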
Ivan Vecera d5968be5cd net: add atomic_long_t to net_device_stats fields
JIRA: https://issues.redhat.com/browse/RHEL-862

commit 6c1c5097781f563b70a81683ea6fdac21637573b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 15 08:53:55 2022 +0000

    net: add atomic_long_t to net_device_stats fields

    Long standing KCSAN issues are caused by data-race around
    some dev->stats changes.

    Most performance critical paths already use per-cpu
    variables, or per-queue ones.

    It is reasonable (and more correct) to use atomic operations
    for the slow paths.

    This patch adds a union for each field of net_device_stats,
    so that we can convert paths that are not yet protected
    by a spinlock or a mutex.

    netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64

    Note that the memcpy() we were using on 64bit arches
    had no provision to avoid load-tearing,
    while atomic_long_read() is providing the needed protection
    at no cost.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-05 15:29:59 +01:00
Scott Weaver f51e07d91d Merge: CNB94: xsk: Multi-buffer support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3310

JIRA: https://issues.redhat.com/browse/RHEL-15250
Tested: Using attached self-tests [Results in JIRA]

The series adds multi-buffer support to XSK. It is based on upstream series `3226e3139dfe ("Merge branch 'xsk-multi-buffer-support'")` and also contains commits from upstream series `34e78bab67c5 ("Merge branch 'seltests/xsk: prepare for AF_XDP multi-buffer testing'")` to make the attached self-tests applicable.

Commits:
```
0c5f48599bed ("xsk: Simplify xp_aligned_validate_desc implementation")
f2f167583601 ("xsk: Remove unused xsk_buff_discard")
e2fa5c2068fb ("xsk: Remove unused inline function xsk_buff_discard()")
63a64a56bc3f ("xsk: prepare 'options' in xdp_desc for multi-buffer use")
81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket")
556444c4e683 ("xsk: prepare both copy and zero-copy modes to co-exist")
faa91b839b09 ("xsk: move xdp_buff's data length check to xsk_rcv_check")
804627751b42 ("xsk: add support for AF_XDP multi-buffer on Rx path")
b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
1b725b0c8163 ("xsk: allow core/drivers to test EOP bit")
cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
07428da9e25a ("xsk: discard zero length descriptors in Tx path")
13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags")
24ea50127ecf ("xsk: support mbuf on ZC RX")
d5581966040f ("xsk: support ZC Tx multi-buffer in batch API")
49ca37d0d825 ("xsk: add multi-buffer documentation")
9a321fd3308e ("selftests/xsk: add xdp populate metadata test")
68e7322142f5 ("selftests: xsk: Deflakify STATS_RX_DROPPED test")
7a2050df244e ("selftests: xsk: Use correct UMEM size in testapp_invalid_desc")
ccd1b2933f8c ("selftests: xsk: Add test case for packets at end of UMEM")
c0801598e543 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE")
d2e541494935 ("selftests/xsk: do not change XDP program when not necessary")
df82d2e89c41 ("selftests/xsk: generate simpler packets with variable length")
feb973a9094f ("selftests/xsk: add varying payload pattern within packet")
7a8a6762822a ("selftests/xsk: dump packet at error")
69fc03d220a3 ("selftests/xsk: add packet iterator for tx to packet stream")
d9f6d9709f87 ("selftests/xsk: store offset in pkt instead of addr")
041b68f688a3 ("selftests/xsx: test for huge pages only once")
86e41755b432 ("selftests/xsk: populate fill ring based on frags needed")
2f6eae0df1a8 ("selftests/xsk: generate data for multi-buffer packets")
7cd6df4f5ec2 ("selftests/xsk: adjust packet pacing for multi-buffer support")
17f1034dd76d ("selftests/xsk: transmit and receive multi-buffer packets")
f540d44e05cf ("selftests/xsk: add basic multi-buffer test")
1005a226da9a ("selftests/xsk: add unaligned mode test for multi-buffer")
697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer")
f80ddbec4762 ("selftests/xsk: add metadata copy test for multi-buff")
807bf4da2049 ("selftests/xsk: add test for too many frags")
3666bccab43a ("selftests/xsk: reset NIC settings to default after running test suite")
d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem")
9d0a67b9d42c ("xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()")
a097627dcadd ("net: add missing net_device::xdp_zc_max_segs description")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-29 14:08:06 -05:00
Scott Weaver 971351c941 Merge: CNB94: net: add check for current MAC address in dev_set_mac_address
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3398

JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

This prevents network drivers' .ndo_set_mac_address method from being called when the MAC address is already the current one. There are drivers that more or less assume that this is how the network core already behaves. For example, iavf will send a virtchnl message to the PF requesting to add the new address and then a message to remove the old address. This logic is broken if old and new are the same address.

Tested: I used the reproducer steps from RHEL-6368, with VFs on Intel E810.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-28 10:54:46 -05:00
Michal Schmidt 2c3e95de5a net: fix net device address assign type
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

commit 0ec92a8f56ff07237dbe8af7c7a72aba7f957baf
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date:   Wed Jun 21 15:21:06 2023 +0200

    net: fix net device address assign type

    Commit ad72c4a06acc introduced an optimization to return from the
    function quickly if the MAC address is not changing at all. It was
    reported that this change prevents dev->addr_assign_type from changing
    to NET_ADDR_SET from _PERM or _RANDOM.
    Restore the old behavior and skip only the call to ndo_set_mac_address.

    Fixes: ad72c4a06acc ("net: add check for current MAC address in dev_set_mac_address")
    Reported-by: Gal Pressman <gal@nvidia.com>
    Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20230621132106.991342-1-piotrx.gardocki@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2023-11-21 23:16:06 +01:00
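The behavior after 2c3e95de5a above (skip only the driver call when the address is unchanged, but still record that the address was set explicitly) can be sketched as follows. This is a user-space model: the `toy_*` types and functions are hypothetical, and `driver_calls` exists only so the effect is observable.

```c
#include <assert.h>
#include <string.h>

#define TOY_ETH_ALEN 6

/* Hypothetical mirror of the NET_ADDR_* assign types named in the commit. */
enum toy_assign_type { TOY_ADDR_PERM, TOY_ADDR_RANDOM, TOY_ADDR_SET };

struct toy_mac_dev {
    unsigned char addr[TOY_ETH_ALEN];
    enum toy_assign_type addr_assign_type;
    int driver_calls;   /* counts invocations of the driver hook */
};

/* Stand-in for a driver's .ndo_set_mac_address method. */
static int toy_ndo_set_mac_address(struct toy_mac_dev *dev,
                                   const unsigned char *addr)
{
    memcpy(dev->addr, addr, TOY_ETH_ALEN);
    dev->driver_calls++;
    return 0;
}

static int toy_dev_set_mac_address(struct toy_mac_dev *dev,
                                   const unsigned char *addr)
{
    assert(dev && addr);

    /* Only bother the driver when the address really changes. */
    if (memcmp(dev->addr, addr, TOY_ETH_ALEN) != 0) {
        int err = toy_ndo_set_mac_address(dev, addr);

        if (err)
            return err;
    }

    /* Always record that the address was explicitly set. */
    dev->addr_assign_type = TOY_ADDR_SET;
    return 0;
}
```

Setting a device to the address it already has leaves `driver_calls` at 0 but still moves `addr_assign_type` from `TOY_ADDR_RANDOM` to `TOY_ADDR_SET`, which is the part the earlier early-return optimization lost.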
Michal Schmidt 37100466e2 net: add check for current MAC address in dev_set_mac_address
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

commit ad72c4a06acc6762e84994ac2f722da7a07df34e
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date:   Wed Jun 14 16:53:00 2023 +0200

    net: add check for current MAC address in dev_set_mac_address

    In some cases the kernel may request changing the primary MAC
    address to the address that is already set on the given interface.

    Add a check to return quickly from the function in these cases.

    An example of such case is adding an interface to bonding
    channel in balance-alb mode:
    modprobe bonding mode=balance-alb miimon=100 max_bonds=1
    ip link set bond0 up
    ifenslave bond0 <eth>

    Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2023-11-21 23:16:06 +01:00
Antoine Tenart b2c4833a40 net: skbuff: update and rename __kfree_skb_defer()
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 19 19:00:05 2023 -0700

    net: skbuff: update and rename __kfree_skb_defer()

    __kfree_skb_defer() uses the old naming where "defer" meant
    slab bulk free/alloc APIs. In the meantime we also made
    __kfree_skb_defer() feed the per-NAPI skb cache, which
    implies bulk APIs. So take away the 'defer' and add 'napi'.

    While at it add a drop reason. This only matters on the
    tx_action path, if the skb has a frag_list. But getting
    rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
    benefit so why not.

    Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:29 +01:00
Scott Weaver 6cf5659031 Merge: CNB94: page_pool: allow caching from safely localized NAPI
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196

JIRA: https://issues.redhat.com/browse/RHEL-12613
Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644]

Commits:
```
4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c")
eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list")
9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list")
b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths")
8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI")
dd64b232deb8 ("page_pool: unlink from napi during destroy")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-09 07:22:35 -05:00
Ivan Vecera 96ba8afe11 xsk: add new netlink attribute dedicated for ZC max frags
JIRA: https://issues.redhat.com/browse/RHEL-15250

commit 13ce2daa259a3bfbc9a5aeeee8b9a87058703731
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date:   Wed Jul 19 15:24:07 2023 +0200

    xsk: add new netlink attribute dedicated for ZC max frags

    Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
    carry maximum fragments that underlying ZC driver is able to handle on
    TX side. It is going to be included in netlink response only when driver
    supports ZC. Any value higher than 1 implies multi-buffer ZC support on
    underlying device.

    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-11-01 14:56:57 +01:00
Ivan Vecera d80ce17d20 page_pool: allow caching from safely localized NAPI
JIRA: https://issues.redhat.com/browse/RHEL-12613

Conflicts:
- simple context conflict in net/core/dev.c due to absence of commit
  8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra
  NET_RX_SOFTIRQ") that is out of scope of this series

commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 12 21:26:04 2023 -0700

    page_pool: allow caching from safely localized NAPI

    Recent patches to mlx5 mentioned a regression when moving from
    driver local page pool to only using the generic page pool code.
    Page pool has two recycling paths (1) direct one, which runs in
    safe NAPI context (basically consumer context, so producing
    can be lockless); and (2) via a ptr_ring, which takes a spin
    lock because the freeing can happen from any CPU; producer
    and consumer may run concurrently.

    Since the page pool code was added, Eric introduced a revised version
    of deferred skb freeing. TCP skbs are now usually returned to the CPU
    which allocated them, and freed in softirq context. This places the
    freeing (producing of pages back to the pool) enticingly close to
    the allocation (consumer).

    If we can prove that we're freeing in the same softirq context in which
    the consumer NAPI will run - lockless use of the cache is perfectly fine,
    no need for the lock.

    Let drivers link the page pool to a NAPI instance. If the NAPI instance
    is scheduled on the same CPU on which we're freeing - place the pages
    in the direct cache.

    With that and patched bnxt (XDP enabled to engage the page pool, sigh,
    bnxt really needs page pool work :() I see a 2.6% perf boost with
    a TCP stream test (app on a different physical core than softirq).

    The CPU use of relevant functions decreases as expected:

      page_pool_refill_alloc_cache   1.17% -> 0%
      _raw_spin_lock                 2.41% -> 0.98%

    Only consider lockless path to be safe when NAPI is scheduled
    - in practice this should cover majority if not all of steady state
    workloads. It's usually the NAPI kicking in that causes the skb flush.

    The main case we'll miss out on is when application runs on the same
    CPU as NAPI. In that case we don't use the deferred skb free path.

    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-31 15:09:26 +01:00
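The recycling decision described in d80ce17d20 above (use the lockless direct cache only when the page is freed in the softirq context of the linked, scheduled NAPI on the same CPU) can be sketched as below. This is a simplified model: the `toy_*` structures, the `scheduled` and `owner_cpu` fields and the explicit `in_softirq` parameter are assumptions for illustration, not the kernel's data layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical NAPI model: is it scheduled, and on which CPU. */
struct toy_napi {
    bool scheduled;
    int owner_cpu;
};

/* Hypothetical page pool model: optionally linked to one NAPI instance. */
struct toy_page_pool {
    const struct toy_napi *napi;   /* NULL if the driver linked none */
};

enum toy_recycle_path {
    TOY_RECYCLE_DIRECT,     /* lockless per-NAPI cache */
    TOY_RECYCLE_PTR_RING,   /* ptr_ring, takes a spin lock */
};

static enum toy_recycle_path
toy_pick_recycle_path(const struct toy_page_pool *pool,
                      bool in_softirq, int freeing_cpu)
{
    assert(pool != NULL);

    const struct toy_napi *napi = pool->napi;

    /* The lockless path is safe only if producer and consumer cannot race. */
    if (in_softirq && napi && napi->scheduled &&
        napi->owner_cpu == freeing_cpu)
        return TOY_RECYCLE_DIRECT;

    return TOY_RECYCLE_PTR_RING;
}
```

Any case that cannot be proven safe (no linked NAPI, NAPI not scheduled, a different CPU, or a free outside softirq context) falls back to the ptr_ring, so the sketch errs on the side of taking the lock.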
Scott Weaver d05495aca0 Merge: CNB94: tc: update tc subsystem to the upstream v6.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3067

JIRA: https://issues.redhat.com/browse/RHEL-1773
Depends: https://issues.redhat.com/browse/RHEL-860
Depends: https://issues.redhat.com/browse/RHEL-3646

Update TC (net/sched) to the upstream v6.5

Omitted-fix: cad7526f33ce ("net: dsa: ocelot: unlock on error in vsc9959_qos_port_tas_set()")
Not needed, DSA as well as ocelot driver is not enabled/supported in RHEL

Commits:
```
1b808993e194 ("flow_dissector: fix false-positive __read_overflow2_field() warning")
f743f16c548b ("treewide: use get_random_{u8,u16}() when possible, part 2")
7e3cf0843fe5 ("treewide: use get_random_{u8,u16}() when possible, part 1")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
62423bd2d2e2 ("net: sched: remove qdisc_watchdog->last_expires")
c66b2111c9c9 ("selftests: tc-testing: add tests for action binding")
f5fca219ad45 ("net: do not use skb_mac_header() in qdisc_pkt_len_init()")
e495a9673caf ("sch_cake: do not use skb_mac_header() in cake_overhead()")
b3be94885af4 ("net/sched: remove two skb_mac_header() uses")
fcb3a4653bc5 ("net/sched: act_api: use the correct TCA_ACT attributes in dump")
4170f0ef582c ("fix typos in net/sched/)
8b0f256530d9 ("net/sched: sch_mqprio: use netlink payload helpers")
3dd0c16ec93e ("net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS")
57f21bf85400 ("net/sched: mqprio: add extack to mqprio_parse_nlattr()")
ab277d2084ba ("net/sched: mqprio: add an extack message to mqprio_parse_opt()")
c54876cd5961 ("net/sched: pass netlink extack to mqprio and taprio offload")
f62af20bed2d ("net/sched: mqprio: allow per-TC user input of FP adminStatus")
a721c3e54b80 ("net/sched: taprio: allow per-TC user input of FP adminStatus")
8c966a10eb84 ("flow_dissector: Address kdoc warnings")
54e906f1639e ("selftests: forwarding: sch_tbf_*: Add a pre-run hook")
2f0f9465ad9f ("net: sched: Print msecs when transmit queue time out")
5036034572b7 ("net/sched: act_pedit: use NLA_POLICY for parsing 'ex' keys")
0c83c5210e18 ("net/sched: act_pedit: use extack in 'ex' parsing errors")
e1201bc781c2 ("net/sched: act_pedit: check static offsets a priori")
577140180ba2 ("net/sched: act_pedit: remove extra check for key type")
e3c9673e2f6e ("net/sched: act_pedit: rate limit datapath messages")
807cfded92b0 ("net/sched: sch_htb: use extack on errors messages")
c69a9b023f65 ("net/sched: sch_qfq: use extack on errors messages")
25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters")
7eb060a51a3b ("selftests: tc-testing: add more tests for sch_qfq")
1b483d9f5805 ("net/sched: act_pedit: free pedit keys on bail from offset check")
526f28bd0fbd ("net/sched: act_mirred: Add carrier check")
12e7789ad5b4 ("sch_htb: Allow HTB priority parameter in offload mode")
c7cfbd115001 ("net/sched: sch_ingress: Only create under TC_H_INGRESS")
5eeebfe6c493 ("net/sched: sch_clsact: Only create under TC_H_CLSACT")
f85fa45d4a94 ("net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs")
9de95df5d15b ("net/sched: Prohibit regrafting ingress or clsact Qdiscs")
7b4858df3bf7 ("skbuff: bridge: Add layer 2 miss indication")
d5ccfd90df7f ("flow_dissector: Dissect layer 2 miss from tc skb extension")
1a432018c0cd ("net/sched: flower: Allow matching on layer 2 miss")
f4356947f029 ("flow_offload: Reject matching on layer 2 miss")
8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases")
dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")
2d800bc500fb ("net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum")
6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters")
a395b8d1c7c3 ("selftests/tc-testing: replace mq with invalid parent ID")
8cde87b007da ("net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINE")
cd2b8113c2e8 ("net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values")
d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
886bc7d6ed33 ("net: sched: move rtm_tca_policy declaration to include file")
682881ee45c8 ("net: sched: act_police: fix sparse errors in tcf_police_dump()")
6c02568fd1ae ("net/sched: act_pedit: Parse L3 Header for L4 offset")
26e35370b976 ("net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy")
2b84960fc5dd ("net/sched: taprio: report class offload stats per TXQ, not per TC")
d7ad70b5ef5a ("net: flow_dissector: add support for cfm packets")
7cfffd5fed3e ("net: flower: add support for matching cfm fields")
1668a55a73f5 ("selftests: net: add tc flower cfm test")
c29e012eae29 ("selftests: forwarding: Fix layer 2 miss test syntax")
aef6e908b542 ("selftests/tc-testing: Fix Error: Specified qdisc kind is unknown.")
b849c566ee9c ("selftests/tc-testing: Fix Error: failed to find target LOG")
b39d8c41c7a8 ("selftests/tc-testing: Fix SFB db test")
11b8b2e70a9b ("selftests/tc-testing: Remove configs that no longer exist")
41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
2d5f6a8d7aef ("net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs")
84ad0af0bccd ("net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting")
e16ad981e2a1 ("net: sched: Remove unused qdisc_l2t()")
ca4fa8743537 ("selftests: tc-testing: add one test for flushing explicitly created chain")
b4ee93380b3c ("net/sched: act_ipt: add sanity checks on table name and hook locations")
b2dc32dcba08 ("net/sched: act_ipt: add sanity checks on skb before calling target")
93d75d475c5d ("net/sched: act_ipt: zero skb->cb before calling target")
30c45b5361d3 ("net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX")
989b52cdc849 ("net: sched: Replace strlcpy with strscpy")
d3f87278bcb8 ("net/sched: flower: Ensure both minimum and maximum ports are specified")
150e33e62c1f ("net/sched: make psched_mtu() RTNL-less safe")
158810b261d0 ("net/sched: sch_qfq: reintroduce lmax bound check for MTU")
c5a06fdc618d ("selftests: tc-testing: add tests for qfq mtu sanity check")
3e337087c3b5 ("net/sched: sch_qfq: account for stab overhead in qfq_enqueue")
137f6219da59 ("selftests: tc-testing: add test for qfq with stab overhead")
d1cca974548d ("pie: fix kernel-doc notation warning")
b3d0e0489430 ("net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms")
9cb36faedeaf ("net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode")
e8d3d78c19be ("net: sched: cls_u32: Undo refcount decrement in case update failed")
26a22194927e ("net: sched: cls_bpf: Undo tcf_bind_filter in case of an error")
ac177a330077 ("net: sched: cls_flower: Undo tcf_bind_filter in case of an error")
fda05798c22a ("selftests: tc: set timeout to 15 minutes")
719b4774a8cb ("selftests: tc: add 'ct' action kconfig dep")
031c99e71fed ("selftests: tc: add ConnTrack procfs kconfig")
4914109a8e1e ("netfilter: allow exp not to be removed in nf_ct_find_expectation")
76622ced50a1 ("net: sched: set IPS_CONFIRMED in tmpl status only when commit is set in act_ct")
8c8b73320805 ("openvswitch: set IPS_CONFIRMED in tmpl status only when commit is set in conntrack")
9fe63d5f1da9 ("sch_htb: Allow HTB quantum parameter in offload mode")
6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64")
4d50e50045aa ("net: flower: fix stack-out-of-bounds in fl_set_key_cfm()")
e68409db9953 ("net: sched: cls_u32: Fix match key mis-addressing")
e739718444f7 ("net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.")
21a72166abb9 ("selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-24 13:29:05 -04:00
Scott Weaver 03206d751a Merge: CNB94: net: move gso declarations and functions to their own files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3198

JIRA: https://issues.redhat.com/browse/RHEL-12679
Tested: Just built... no functional change

Commits:
```
d457a0e329b0 ("net: move gso declarations and functions to their own files")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-19 10:36:22 -04:00
Scott Weaver ec70982f69 Merge: ice: Enable DPLL support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2961

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515

This feature request is to add and enable the DPLL subsystem and DPLL support in the ice driver.

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-19 10:36:20 -04:00
Ivan Vecera 92e020fb45 net: sched: add rcu annotations around qdisc->qdisc_sleeping
JIRA: https://issues.redhat.com/browse/RHEL-1773

Conflicts:
- resolved conflict in net/sched/sch_taprio.c the same way as in
  449f6bc17a51 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")

commit d636fc5dd692c8f4e00ae6e0359c0eceeb5d9bdb
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 6 11:19:29 2023 +0000

    net: sched: add rcu annotations around qdisc->qdisc_sleeping

    syzbot reported a race around qdisc->qdisc_sleeping [1]

    It is time we add proper annotations to reads and writes to/from
    qdisc->qdisc_sleeping.

    [1]
    BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

    read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
    qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
    __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
    tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
    rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
    netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
    netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
    netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
    netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
    sock_sendmsg_nosec net/socket.c:724 [inline]
    sock_sendmsg net/socket.c:747 [inline]
    ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
    ___sys_sendmsg net/socket.c:2557 [inline]
    __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
    __do_sys_sendmsg net/socket.c:2595 [inline]
    __se_sys_sendmsg net/socket.c:2593 [inline]
    __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
    dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
    qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
    tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
    rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
    netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
    netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
    netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
    netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
    sock_sendmsg_nosec net/socket.c:724 [inline]
    sock_sendmsg net/socket.c:747 [inline]
    ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
    ___sys_sendmsg net/socket.c:2557 [inline]
    __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
    __do_sys_sendmsg net/socket.c:2595 [inline]
    __se_sys_sendmsg net/socket.c:2593 [inline]
    __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

    Fixes: 3a7d0d07a3 ("net: sched: extend Qdisc with rcu")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Vlad Buslov <vladbu@nvidia.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-13 09:03:10 +02:00
Ivan Vecera f43c4f5429 net: do not use skb_mac_header() in qdisc_pkt_len_init()
JIRA: https://issues.redhat.com/browse/RHEL-1773

commit f5fca219ad4548bc45f0221f9857ad22cb8136a1
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 21 16:45:17 2023 +0000

    net: do not use skb_mac_header() in qdisc_pkt_len_init()

    We want to remove our use of skb_mac_header() in tx paths,
    eg remove skb_reset_mac_header() from __dev_queue_xmit().

    Idea is that ndo_start_xmit() can get the mac header
    simply looking at skb->data.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-13 09:03:06 +02:00
Ivan Vecera 497f645693 net: move gso declarations and functions to their own files
JIRA: https://issues.redhat.com/browse/RHEL-12679

commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 8 19:17:37 2023 +0000

    net: move gso declarations and functions to their own files

    Move declarations into include/net/gso.h and code into net/core/gso.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 13:35:27 +02:00
Petr Oros 104234d3d2 netdev: expose DPLL pin handle for netdevice
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515

Upstream commit(s):
commit 5f18426928800c59fb0f9bc8fb0c182bb6f5ee24
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Wed Sep 13 21:49:39 2023 +0100

    netdev: expose DPLL pin handle for netdevice

    In case netdevice represents a SyncE port, the user needs to understand
    the connection between netdevice and associated DPLL pin. There might be
    multiple netdevices pointing to the same pin, in case of VF/SF
    implementation.

    Add an IFLA Netlink attribute to nest the DPLL pin handle, similar to
    how it is implemented for devlink port. Add a struct dpll_pin pointer
    to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set()
    and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear
    the DPLL pin relationship to netdev.

    Note that during the lifetime of struct dpll_pin the pin handle does not
    change. Therefore it is safe to access it lockless. It is the driver's
    responsibility to call netdev_dpll_pin_clear() before dpll_pin_put().

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
    Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-09-18 15:13:24 +02:00
Ivan Vecera 3cc9e8b28b random32: use real rng for non-deterministic randomness
JIRA: https://issues.redhat.com/browse/RHEL-3646

commit d4150779e60fb6c49be25572596b2cdfc5d46a09
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed May 11 16:11:29 2022 +0200

    random32: use real rng for non-deterministic randomness

    random32.c has two random number generators in it: one that is meant to
    be used deterministically, with some predefined seed, and one that does
    the same exact thing as random.c, except does it poorly. The first one
    has some use cases. The second one no longer does and can be replaced
    with calls to random.c's proper random number generator.

    The relatively recent siphash-based bad random32.c code was added in
    response to concerns that the prior random32.c was too deterministic.
    Out of fears that random.c was (at the time) too slow, this code was
    anonymously contributed. Then out of that emerged a kind of shadow
    entropy gathering system, with its own tentacles throughout various net
    code, added willy nilly.

    Stop👏making👏bespoke👏random👏number👏generators👏.

    Fortunately, recent advances in random.c mean that we can stop playing
    with this sketchiness, and just use get_random_u32(), which is now fast
    enough. In micro benchmarks using RDPMC, I'm seeing the same median
    cycle count between the two functions, with the mean being _slightly_
    higher due to batches refilling (which we can optimize further if need be).
    However, when doing *real* benchmarks of the net functions that actually
    use these random numbers, the mean cycles actually *decreased* slightly
    (with the median still staying the same), likely because the additional
    prandom code means icache misses and complexity, whereas random.c is
    generally already being used by something else nearby.

    The biggest benefit of this is that there are many users of prandom who
    probably should be using cryptographically secure random numbers. This
    makes all of those accidental cases become secure by just flipping a
    switch. Later on, we can do a tree-wide cleanup to remove the static
    inline wrapper functions that this commit adds.

    There are also some low-ish hanging fruits for making this even faster
    in the future: a get_random_u16() function for use in the networking
    stack will give a 2x performance boost there, using SIMD for ChaCha20
    will let us compute 4 or 8 or 16 blocks of output in parallel, instead
    of just one, giving us large buffers for cheap, and introducing a
    get_random_*_bh() function that assumes irqs are already disabled will
    shave off a few cycles for ordinary calls. These are things we can chip
    away at down the road.

    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-09-13 18:39:29 +02:00
Jan Stancek 645597c064 Merge: net: core: stable backport from upstream for 9.3 phase 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2731

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

A bunch of fixes for relevant issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-07-07 07:38:20 +02:00
Jan Stancek e341c7e709 Merge: bpf, xdp: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583

Rebase bpf and xdp to 6.3.

Bugzilla: https://bugzilla.redhat.com/2178930

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:45 +02:00
Paolo Abeni e4256bf256 net: add vlan_get_protocol_and_depth() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

Upstream commit:
commit 4063384ef762cc5946fc7a3f89879e76c6ec51e2
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 9 13:18:57 2023 +0000

    net: add vlan_get_protocol_and_depth() helper

    Before blamed commit, pskb_may_pull() was used instead
    of skb_header_pointer() in __vlan_get_protocol() and friends.

    A few callers depended on skb->head being populated with the MAC header;
    syzbot caught one of them (skb_mac_gso_segment()).

    Add vlan_get_protocol_and_depth() to make the intent clearer
    and use it where sensible.

    This is a more generic fix than commit e9d3f80935b6
    ("net/af_packet: make sure to pull mac header") which was
    dealing with a similar issue.
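
To illustrate what "protocol and depth" means here, the following is a minimal userspace sketch in Python, not the kernel helper itself. It walks the VLAN tags of a raw Ethernet frame and reports the encapsulated EtherType together with the total header length that a caller would need to have available before reading further. The function name, the constants' names, and the cap of 8 nested tags are assumptions made for this example.

```python
import struct

ETH_HLEN = 14           # destination MAC + source MAC + EtherType
VLAN_HLEN = 4           # TCI (2 bytes) + encapsulated EtherType (2 bytes)
ETH_P_8021Q = 0x8100    # 802.1Q VLAN tag
ETH_P_8021AD = 0x88A8   # 802.1ad (QinQ) service tag
MAX_VLAN_DEPTH = 8      # assumed cap on nested tags for this sketch


def vlan_protocol_and_depth(frame: bytes):
    """Return (ethertype, header_length) after skipping VLAN tags.

    Returns None when the frame is too short to hold the headers it
    announces, or when it nests more tags than MAX_VLAN_DEPTH.
    """
    if len(frame) < ETH_HLEN:
        return None
    # The outer EtherType sits in the last two bytes of the Ethernet header.
    (proto,) = struct.unpack_from("!H", frame, ETH_HLEN - 2)
    depth = ETH_HLEN
    tags = 0
    while proto in (ETH_P_8021Q, ETH_P_8021AD):
        tags += 1
        if tags > MAX_VLAN_DEPTH or len(frame) < depth + VLAN_HLEN:
            return None
        # Skip the 2-byte TCI, then read the encapsulated EtherType.
        (proto,) = struct.unpack_from("!H", frame, depth + 2)
        depth += VLAN_HLEN
    return proto, depth
```

The point of returning the depth alongside the protocol is that the caller learns how many header bytes it must be able to read linearly, instead of assuming the MAC header is already present.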

    kernel BUG at include/linux/skbuff.h:2655 !
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
    RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline]
    RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136
    Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
    RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286
    RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9
    R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012
    R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7
    FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    [<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419
    [<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline]
    [<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725
    [<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313
    [<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029
    [<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline]
    [<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142
    [<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline]
    [<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline]
    [<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139
    [<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline]
    [<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline]
    [<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
    [<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
    [<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: 469aceddfa ("vlan: consolidate VLAN parsing code and limit max parsing depth")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Toke Høiland-Jørgensen <toke@redhat.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 16:58:59 +02:00
Jan Stancek 9d37206873 Merge: net: sync skb free reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2627

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073

Did not include commit 071c0fc6fb91 ("net: extend drop reasons for multiple subsystems"):
it would be more appropriate to backport it in its own MR, it would have no user for now,
and it's not clear to me how trace_kfree_skb deals with non-core free reasons once applied.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-14 13:27:31 +02:00
Felix Maurer b576afd91a netdev-genl: create a simple family for netdev stuff
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- include/linux/netdevice.h: Context difference in includes due to missing
  406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with
  IPv6 addresses, performance of changing link state, attaching a VRF,
  changing an IPv6 address, etc. go down dramtically.")
- net/core/Makefile: Context difference due to missing 2c193f2cb110 ("net:
  kunit: add a test for dev_addr_lists")

commit d3d854fd6a1d97157f790604e07f6386e8df8fe4
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Feb 1 11:24:17 2023 +0100

    netdev-genl: create a simple family for netdev stuff

    Add a Netlink spec-compatible family for netdevs.
    This is a very simple implementation without much
    thought going into it.

    It allows us to reap all the benefits of Netlink specs,
    one can use the generic client to issue the commands:

      $ ./cli.py --spec netdev.yaml --dump dev_get
      [{'ifindex': 1, 'xdp-features': set()},
       {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
       {'ifindex': 3, 'xdp-features': {'rx-sg'}}]

    the generic python library does not have flags-by-name
    support, yet, but we also don't have to carry strings
    in the messages, as user space can get the names from
    the spec.
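
As a rough sketch of the "names from the spec" idea above: the message can carry a plain bitmask, and user space maps set bits to names using a table it loaded from the spec. The Python below is only an illustration. The function name and the bit order in `XDP_FEATURE_NAMES` are hypothetical and are not taken from netdev.yaml; only the flag names themselves appear in the sample output.

```python
# Hypothetical name table; in practice this would be loaded from the spec.
# The bit positions here are illustrative only.
XDP_FEATURE_NAMES = ["basic", "redirect", "ndo-xmit", "rx-sg"]


def decode_flags(value, names):
    """Split a bitmask into a set of known flag names and leftover bits."""
    known = {name for bit, name in enumerate(names) if value & (1 << bit)}
    # Bits beyond the name table are reported rather than silently dropped.
    unknown = value & ~((1 << len(names)) - 1)
    return known, unknown
```

Reporting the unknown bits separately lets an older client notice flags that a newer kernel sets but that its copy of the spec does not name.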

    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Co-developed-by: Marek Majtyka <alardam@gmail.com>
    Signed-off-by: Marek Majtyka <alardam@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:50 +02:00
Felix Maurer e630642b6b bpf: Introduce device-bound XDP programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:24 2023 -0800

    bpf: Introduce device-bound XDP programs

    New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
    to associate a netdev with a BPF program at load time.

    netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:13 +02:00
Felix Maurer c0febc32b2 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:21 2023 -0800

    bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded

    BPF offloading infra will be reused to implement
    bound-but-not-offloaded bpf programs. Rename existing
    helpers for clarity. No functional changes.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:12 +02:00
Ivan Vecera 1cb324e3cc net: Remove the obsolte u64_stats_fetch_*_irq() users (net).
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

Conflicts:
* net/netfilter/ipvs/ip_vs_ctl.c
  - the change was already applied by RHEL commit 914c1e31d9 ("ipvs:
    use u64_stats_t for the per-cpu counters")
* net/core/devlink.c
  - hunk was applied in different file (net/devlink/leftover.c)

commit d120d1a63b2c484d6175873d8ee736a633f74b70
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Oct 26 15:22:15 2022 +0200

    net: Remove the obsolte u64_stats_fetch_*_irq() users (net).

    Now that the 32bit UP oddity is gone and 32bit always uses a sequence
    count, there is no need for the fetch_irq() variants anymore.

    Convert to the regular interface.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:38:11 +02:00
Ivan Vecera 41bf85273b net: adopt u64_stats_t in struct pcpu_sw_netstats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

commit 9962acefbcb92736c268aafe5f52200948f60f3e
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 08:46:37 2022 -0700

    net: adopt u64_stats_t in struct pcpu_sw_netstats

    As explained in commit 316580b69d ("u64_stats: provide u64_stats_t type")
    we should use u64_stats_t and related accessors to avoid load/store tearing.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:37:00 +02:00
Antoine Tenart f2ed106175 net: remove enum skb_free_reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit 40bbae583ec38ea31e728bf42a4ea72bded22ab6
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Mar 6 20:43:13 2023 +0000

    net: remove enum skb_free_reason

    enum skb_drop_reason is more generic, we can adopt it instead.

    Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason().

    This means drivers can use more precise drop reasons if they want to.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
    Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:26 +02:00
Antoine Tenart d48044618a net: add location to trace_consume_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit dd1b527831a3ed659afa01b672d8e1f7e6ca95a5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 16 15:47:18 2023 +0000

    net: add location to trace_consume_skb()

    kfree_skb() includes the location, it makes sense
    to add it to consume_skb() as well.

    After patch:

     taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
             swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
          discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
             swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
             borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
             swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
                wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process
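
Lines in the format shown above are easy to post-process. The following Python sketch is one possible way to do it: it parses each line into fields and counts events per `location`. The regular expression is written against the sample lines only and assumes the command name contains no spaces; the function names are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0x... location=...
LINE_RE = re.compile(
    r"^\s*(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+skb:consume_skb:\s+"
    r"skbaddr=(?P<skbaddr>0x[0-9a-f]+)\s+location=(?P<location>\S+)\s*$"
)


def parse_consume_skb(line):
    """Parse one trace line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "timestamp": float(m.group("ts")),
        "skbaddr": int(m.group("skbaddr"), 16),
        "location": m.group("location"),
    }


def count_locations(lines):
    """Count consume_skb events per reported location, skipping other lines."""
    events = (parse_consume_skb(line) for line in lines)
    return Counter(e["location"] for e in events if e is not None)
```

Grouping by `location` gives a quick view of which call sites consume the most skbs, which is the information the new field makes available.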

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:26 +02:00
Jan Stancek 6318ae37c7 Merge: ovs: stable backports for 9.3 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2438

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-01 07:25:53 +02:00
Jan Stancek 91e631150d Merge: Bonding: rebase to linux v6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2419

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406

Depends: !2418

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:13 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve.  Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer 2595bc4d80 net: fix kdoc on __dev_queue_xmit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit be76955dea93fe7ee9e0a6f961a7185290a2417f
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon May 9 10:04:12 2022 -0700

    net: fix kdoc on __dev_queue_xmit()
    
    Commit c526fd8f9f4f21 ("net: inline dev_queue_xmit()") exported
    __dev_queue_xmit(), now it's being rendered in html docs, triggering:
    
    Documentation/networking/kapi:92: net/core/dev.c:4101: WARNING: Missing matching underline for section title overline.
    
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Link: https://lore.kernel.org/linux-next/20220503073420.6d3f135d@canb.auug.org.au/
    Fixes: c526fd8f9f4f21 ("net: inline dev_queue_xmit()")
    Link: https://lore.kernel.org/r/20220509170412.1069190-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:23:02 -04:00
Paolo Abeni d0ff450947 net: fix __dev_kfree_skb_any() vs drop monitor
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit ac3ad19584b26fae9ac86e4faebe790becc74491
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 23 08:38:45 2023 +0000

    net: fix __dev_kfree_skb_any() vs drop monitor

    dev_kfree_skb() is aliased to consume_skb().

    When a driver is dropping a packet by calling dev_kfree_skb_any()
    we should propagate the drop reason instead of pretending
    the packet was consumed.

    Note: Now we have enum skb_drop_reason we could remove
    enum skb_free_reason (for linux-6.4)

    v2: added an unlikely(), suggested by Yunsheng Lin.

    Fixes: e6247027e5 ("net: introduce dev_consume_skb_any()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yunsheng Lin <linyunsheng@huawei.com>
    Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
Xin Long 2db946b2f7 net: add gso_ipv4_max_size and gro_ipv4_max_size per device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: compile only

Conflicts:
  - move netif_set_gro_max_size() from include/linux/netdevice.h to
    net/core/dev.h, then make the change, as commit 744d49daf8bd was
    backported earlier than eac1b93c14d6, so netif_set_gro_max_size()
    missed the opportunity to be moved to net/core/dev.h.

  - different context in net/core/dev.h, rps_cpumask_housekeeping()
    is added due to 370ca718fd5e already in RHEL-9.

commit 9eefedd58ae1daece2ba907849a44db2941fb4b0
Author: Xin Long <lucien.xin@gmail.com>
Date:   Sat Jan 28 10:58:38 2023 -0500

    net: add gso_ipv4_max_size and gro_ipv4_max_size per device

    This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
    per device and adds netlink attributes for them, so that IPV4
    BIG TCP can be guarded by a separate tunable in the next patch.

    To not break the old application using "gso/gro_max_size" for
    IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
    in netif_set_gso/gro_max_size() if the new size isn't greater
    than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
    userspace doesn't realize the new netlink attributes.

    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-05-02 10:36:11 -04:00
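The backward-compatibility rule described in the gso_ipv4_max_size commit above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct demo_dev` and `demo_set_gso_max_size()` are hypothetical names rather than kernel symbols, and the value given to `GSO_LEGACY_MAX_SIZE` here is assumed for the example.

```c
#include <assert.h>

/* Assumed value for illustration; the kernel defines its own constant. */
#define GSO_LEGACY_MAX_SIZE 65536u

/* Hypothetical stand-in for the two per-device limits named in the commit. */
struct demo_dev {
    unsigned int gso_max_size;
    unsigned int gso_ipv4_max_size;
};

/*
 * Model of the rule in the commit text: setting the generic limit also
 * updates the IPv4 limit, but only when the new size is not greater than
 * the legacy maximum. Larger sizes leave the IPv4 limit untouched.
 */
static void demo_set_gso_max_size(struct demo_dev *dev, unsigned int size)
{
    assert(dev);

    dev->gso_max_size = size;
    if (size <= GSO_LEGACY_MAX_SIZE)
        dev->gso_ipv4_max_size = size;
}
```

With this rule, an application that only knows the older "gso_max_size" attribute still constrains IPv4 GSO packets as long as it stays at or below the legacy limit.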
Jeff Moyer 82f65d6ce4 net: inline dev_queue_xmit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit c526fd8f9f4f21cb83c0b1c9a1ee9c0ac9be9e2e
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Apr 28 11:58:46 2022 +0100

    net: inline dev_queue_xmit()
    
    Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small
    proxy functions doing nothing but redirecting the control flow to
    __dev_queue_xmit().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:56:02 -04:00
Antoine Tenart af98894a33 net: openvswitch: fix race on port output
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207
Upstream Status: linux.git

commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8
Author: Felix Huettner <felix.huettner@mail.schwarz>
Date:   Wed Apr 5 07:53:41 2023 +0000

    net: openvswitch: fix race on port output

    assume the following setup on a single machine:
    1. An openvswitch instance with one bridge and default flows
    2. two network namespaces "server" and "client"
    3. two ovs interfaces "server" and "client" on the bridge
    4. for each ovs interface a veth pair with a matching name and 32 rx and
       tx queues
    5. move the ends of the veth pairs to the respective network namespaces
    6. assign ip addresses to each of the veth ends in the namespaces (needs
       to be the same subnet)
    7. start some http server on the server network namespace
    8. test if a client in the client namespace can reach the http server

    when following the actions below the host has a chance of getting a cpu
    stuck in an infinite loop:
    1. send a large amount of parallel requests to the http server (around
       3000 curls should work)
    2. in parallel delete the network namespace (do not delete interfaces or
       stop the server, just kill the namespace)

    there is a low chance that this will cause the below kernel cpu stuck
    message. If this does not happen just retry.
    Below there is also the output of bpftrace for the functions mentioned
    in the output.

    The series of events happening here is:
    1. the network namespace is deleted calling
       `unregister_netdevice_many_notify` somewhere in the process
    2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
       then runs `synchronize_net`
    3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
    4. this is then handled by `dp_device_event` which calls
       `ovs_netdev_detach_dev` (if a vport is found, which is the case for
       the veth interface attached to ovs)
    5. this removes the rx_handlers of the device but does not prevent
       packets from being sent to the device
    6. `dp_device_event` then queues the vport deletion to work in
       background as a ovs_lock is needed that we do not hold in the
       unregistration path
    7. `unregister_netdevice_many_notify` continues to call
       `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
    8. port deletion continues (but details are not relevant for this issue)
    9. at some future point the background task deletes the vport

    If, after 7. but before 9., a packet is sent to the ovs vport (which is
    not deleted at this point in time), the vport forwards it to the
    `dev_queue_xmit` flow even though the device is unregistering.
    In `skb_tx_hash` (which is called in the `dev_queue_xmit` path) there is
    a while loop (if the packet has a rx_queue recorded) that is infinite if
    `dev->real_num_tx_queues` is zero.

    To prevent this from happening we update `do_output` to handle devices
    without carrier the same as if the device is not found (which would
    be the code path after 9. is done).

    Additionally we now produce a warning in `skb_tx_hash` if we will hit
    the infinite loop.

    bpftrace (first word is function name):

    __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
    netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
    ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
    netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
    netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
    ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
    synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604

    stuck message:

    watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
    Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
    CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
    Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:netdev_pick_tx+0xf1/0x320
    Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
    RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
    RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
    R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
    FS:  00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
    Call Trace:
     <IRQ>
     netdev_core_pick_tx+0xa4/0xb0
     __dev_queue_xmit+0xf8/0x510
     ? __bpf_prog_exit+0x1e/0x30
     dev_queue_xmit+0x10/0x20
     ovs_vport_send+0xad/0x170 [openvswitch]
     do_output+0x59/0x180 [openvswitch]
     do_execute_actions+0xa80/0xaa0 [openvswitch]
     ? kfree+0x1/0x250
     ? kfree+0x1/0x250
     ? kprobe_perf_func+0x4f/0x2b0
     ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
     ovs_execute_actions+0x4c/0x120 [openvswitch]
     ovs_dp_process_packet+0xa1/0x200 [openvswitch]
     ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
     ? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
     ? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
     ovs_vport_receive+0x77/0xd0 [openvswitch]
     ? __htab_map_lookup_elem+0x4e/0x60
     ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
     ? trace_call_bpf+0xc8/0x150
     ? kfree+0x1/0x250
     ? kfree+0x1/0x250
     ? kprobe_perf_func+0x4f/0x2b0
     ? kprobe_perf_func+0x4f/0x2b0
     ? __mod_memcg_lruvec_state+0x63/0xe0
     netdev_port_receive+0xc4/0x180 [openvswitch]
     ? netdev_port_receive+0x180/0x180 [openvswitch]
     netdev_frame_hook+0x1f/0x40 [openvswitch]
     __netif_receive_skb_core.constprop.0+0x23d/0xf00
     __netif_receive_skb_one_core+0x3f/0xa0
     __netif_receive_skb+0x15/0x60
     process_backlog+0x9e/0x170
     __napi_poll+0x33/0x180
     net_rx_action+0x126/0x280
     ? ttwu_do_activate+0x72/0xf0
     __do_softirq+0xd9/0x2e7
     ? rcu_report_exp_cpu_mult+0x1b0/0x1b0
     do_softirq+0x7d/0xb0
     </IRQ>
     <TASK>
     __local_bh_enable_ip+0x54/0x60
     ip_finish_output2+0x191/0x460
     __ip_finish_output+0xb7/0x180
     ip_finish_output+0x2e/0xc0
     ip_output+0x78/0x100
     ? __ip_finish_output+0x180/0x180
     ip_local_out+0x5e/0x70
     __ip_queue_xmit+0x184/0x440
     ? tcp_syn_options+0x1f9/0x300
     ip_queue_xmit+0x15/0x20
     __tcp_transmit_skb+0x910/0x9c0
     ? __mod_memcg_state+0x44/0xa0
     tcp_connect+0x437/0x4e0
     ? ktime_get_with_offset+0x60/0xf0
     tcp_v4_connect+0x436/0x530
     __inet_stream_connect+0xd4/0x3a0
     ? kprobe_perf_func+0x4f/0x2b0
     ? aa_sk_perm+0x43/0x1c0
     inet_stream_connect+0x3b/0x60
     __sys_connect_file+0x63/0x70
     __sys_connect+0xa6/0xd0
     ? setfl+0x108/0x170
     ? do_fcntl+0xe8/0x5a0
     __x64_sys_connect+0x18/0x20
     do_syscall_64+0x5c/0xc0
     ? __x64_sys_fcntl+0xa9/0xd0
     ? exit_to_user_mode_prepare+0x37/0xb0
     ? syscall_exit_to_user_mode+0x27/0x50
     ? do_syscall_64+0x69/0xc0
     ? __sys_setsockopt+0xea/0x1e0
     ? exit_to_user_mode_prepare+0x37/0xb0
     ? syscall_exit_to_user_mode+0x27/0x50
     ? __x64_sys_setsockopt+0x1f/0x30
     ? do_syscall_64+0x69/0xc0
     ? irqentry_exit+0x1d/0x30
     ? exc_page_fault+0x89/0x170
     entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x7f7b8101c6a7
    Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
    RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
    RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
    RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
    R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
    R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
     </TASK>

    Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
    Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
    Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
    Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-04-27 16:30:10 +02:00
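The infinite loop described in the openvswitch commit above can be modelled in user space. This is a simplified sketch, not the kernel's `skb_tx_hash`: `map_rx_queue_to_tx()` is a hypothetical name, and returning -1 for a zero queue count is an assumption made here to show a guard.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model: fold a recorded rx queue index into the
 * range [0, qcount) by repeated subtraction.
 *
 * If qcount were 0, "hash -= qcount" would never change hash and the loop
 * condition "hash >= 0" would always hold, so the loop would spin forever.
 * The guard below returns -1 in that case (an assumption for this sketch).
 */
static int map_rx_queue_to_tx(uint16_t rx_queue, uint16_t qcount)
{
    uint16_t hash = rx_queue;

    if (qcount == 0)
        return -1;

    while (hash >= qcount)
        hash -= qcount;

    assert(hash < qcount);
    return hash;
}
```

For example, a recorded rx queue of 5 on a device with 4 tx queues maps to queue 1, while a device whose queue count has already dropped to 0 is rejected instead of being looped on.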
Jan Stancek 8e94775eed Merge: CNB: rebase/update devlink for RHEL 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2191

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273
Tested: selftests, basic devlink features on ice and mlx5
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175249
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175250
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2176150

Update devlink up to v6.3.

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Herbert Xu <zxu@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-27 07:47:22 +02:00
Hangbin Liu a149ec5e7d net/core: Allow live renaming when an interface is up
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406
Upstream Status: net.git commit bd039b5ea2a9

commit bd039b5ea2a91ea707ee8539df26456bd5be80af
Author: Andy Ren <andy.ren@getcruise.com>
Date:   Mon Nov 7 09:42:42 2022 -0800

    net/core: Allow live renaming when an interface is up

    Allow a network interface to be renamed when the interface
    is up.

    As described in the netconsole documentation [1], when netconsole is
    used as a built-in, it will bring up the specified interface as soon as
    possible. As a result, user space will not be able to rename the
    interface since the kernel disallows renaming of interfaces that are
    administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
    by the kernel.

    The original solution [2] to this problem was to add a new parameter to
    the netconsole configuration parameters that allows renaming of
    the interface used by netconsole while it is administratively up.
    However, during the discussion that followed, it became apparent that we
    have no reason to keep the current restriction and instead we should
    allow user space to rename interfaces regardless of their administrative
    state:

    1. The restriction was put in place over 20 years ago when renaming was
    only possible via IOCTL and before rtnetlink started notifying user
    space about such changes like it does today.

    2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
    5.2 and no regressions were reported.

    3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
    the administrative state of interface.

    Therefore, allow user space to rename running interfaces by removing the
    restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
    possible triage by emitting a message to the kernel log that an
    interface was renamed while UP.

    [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
    [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/

    Signed-off-by: Andy Ren <andy.ren@getcruise.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2023-04-25 15:26:55 +08:00
Petr Oros 59e7861deb devlink: Fix netdev notifier chain corruption
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Conflicts:
-  adjusted upstream merge conflict which was resolved in 675f176b4dcc2b
   ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")

Upstream commit(s):
commit b20b8aec6ffc07bb547966b356780cd344f20f5b
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Wed Feb 15 09:31:39 2023 +0200

    devlink: Fix netdev notifier chain corruption

    Cited commit changed devlink to register its netdev notifier block on
    the global netdev notifier chain instead of on the per network namespace
    one.

    However, when changing the network namespace of the devlink instance,
    devlink still tries to unregister its notifier block from the chain of
    the old namespace and register it on the chain of the new namespace.
    This results in corruption of the notifier chains, as the same notifier
    block is registered on two different chains: The global one and the per
    network namespace one. In turn, this causes other problems such as the
    inability to dismantle namespaces due to netdev reference count issues.

    Fix by preventing devlink from moving its notifier block between
    namespaces.

    Reproducer:

     # echo "10 1" > /sys/bus/netdevsim/new_device
     # ip netns add test123
     # devlink dev reload netdevsim/netdevsim10 netns test123
     # ip netns del test123
     [   71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
     [   71.938348] leaked reference.

    Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-04 11:12:28 +02:00
Petr Oros 8df3e0fd3b net: introduce a helper to move notifier block to different namespace
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Upstream commit(s):
commit 3e52fba03a20234abc65a656cef063a1045d9723
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Tue Nov 8 14:22:06 2022 +0100

    net: introduce a helper to move notifier block to different namespace

    Currently, net_dev() netdev notifier variant follows the netdev with
    per-net notifier from namespace to namespace. This is implemented
    by move_netdevice_notifiers_dev_net() helper.

    For devlink it is needed to re-register per-net notifier during
    devlink reload. Introduce a new helper called
    move_netdevice_notifier_net() and share the unregister/register code
    with existing move_netdevice_notifiers_dev_net() helper.

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-03 14:05:59 +02:00
Petr Oros afc2a59634 net: devlink: track netdev with devlink_port assigned
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Upstream commit(s):
commit 02a68a47eadedf95748facfca6ced31fb0181d52
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Wed Nov 2 17:02:03 2022 +0100

    net: devlink: track netdev with devlink_port assigned

    Currently, ethernet drivers are using devlink_port_type_eth_set() and
    devlink_port_type_clear() to set devlink port type and link to related
    netdev.

    Instead of calling them directly, let the driver use
    SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let
    devlink track it. Note the devlink port pointer is static during
    the time netdevice is registered.

    In devlink code, use per-namespace netdev notifier to track
    the netdevices with devlink_port assigned and change the internal
    devlink_port type and related type pointer accordingly.

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-03 10:57:13 +02:00
Íñigo Huguet 3a91b473a8 net: rename reference+tracking helpers
Bugzilla: https://bugzilla.redhat.com/2175258

Conflicts:
 - Removed chunks of unsupported protocol AX.25
 - Renamed the functions also in ipvlan. Commit 40b9d1ab63f5 ("ipvlan: hold lower
   dev to avoid possible use-after-free") was backported out of order so it had
   to use the old function names.

commit d62607c3fe45911b2331fac073355a8c914bbde2
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Jun 7 21:39:55 2022 -0700

    net: rename reference+tracking helpers

    Netdev reference helpers have a dev_ prefix for historic
    reasons. Renaming the old helpers would be too much churn
    but we can rename the tracking ones which are relatively
    recent and should be the default for new code.

    Rename:
     dev_hold_track()    -> netdev_hold()
     dev_put_track()     -> netdev_put()
     dev_replace_track() -> netdev_ref_replace()

    Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-03-23 16:19:21 +01:00
Xin Long 3a75ec1506 net: avoid quadratic behavior in netdev_wait_allrefs_any()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to cc26c2661fef already in RHEL-9.

commit 86213f80da1b1d007721cc22e04b5f5d0da33127
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 17 22:54:30 2022 -0800

    net: avoid quadratic behavior in netdev_wait_allrefs_any()

    If the list of devices has N elements, netdev_wait_allrefs_any()
    is called N times, and linkwatch_forget_dev() is called N*(N-1)/2 times.

    Fix this by calling linkwatch_forget_dev() only once per device.

    Fixes: faab39f63c1f ("net: allow out-of-order netdev unregistration")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220218065430.2613262-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:39:40 -04:00
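To give a feel for the call counts stated in the netdev_wait_allrefs_any() commit above, the arithmetic can be worked out directly. The helper names below are hypothetical and the functions only evaluate the counts quoted in the commit message (N*(N-1)/2 before the fix, once per device after it). They do not reproduce the kernel's loop structure.

```c
#include <assert.h>

/*
 * Number of linkwatch_forget_dev() calls stated in the commit message for
 * a list of n devices before the fix: n * (n - 1) / 2.
 */
static unsigned long long forget_calls_before_fix(unsigned long long n)
{
    /* Keep n small enough that n * (n - 1) fits in 64 bits. */
    assert(n <= 0xFFFFFFFFULL);

    return n * (n - 1) / 2;
}

/* After the fix the commit calls it only once per device. */
static unsigned long long forget_calls_after_fix(unsigned long long n)
{
    return n;
}
```

For a batch of 1000 devices this gives 499500 calls before the fix and 1000 after it.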
Xin Long b1a4490d48 net: allow out-of-order netdev unregistration
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to 05e49cfc89e4 already in RHEL-9.

commit faab39f63c1fc4bcdf135690f03bd596b578c67e
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Feb 15 14:53:10 2022 -0800

    net: allow out-of-order netdev unregistration

    Sprinkle for each loops to allow netdevices to be unregistered
    out of order, as their refs are released.

    This prevents problems caused by dependencies between netdevs
    which want to release references in their ->priv_destructor.
    See commit d6ff94afd90b ("vlan: move dev_put into vlan_dev_uninit")
    for example.

    Eric has removed the only known ordering requirement in
    commit c002496babfd ("Merge branch 'ipv6-loopback'")
    so let's try this and see if anything explodes...

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Xin Long <lucien.xin@gmail.com>
    Link: https://lore.kernel.org/r/20220215225310.3679266-2-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:39:26 -04:00
Xin Long bfdcece7f8 net: transition netdev reg state earlier in run_todo
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to cc26c2661fef already in RHEL-9.

commit ae68db14b6164ce46beffaf35eb7c9bb2f92fee3
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Feb 15 14:53:09 2022 -0800

    net: transition netdev reg state earlier in run_todo

    In prep for unregistering netdevs out of order move the netdev
    state validation and change outside of the loop.

    While at it modernize this code and use WARN() instead of
    pr_err() + dump_stack().

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Xin Long <lucien.xin@gmail.com>
    Link: https://lore.kernel.org/r/20220215225310.3679266-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:38:42 -04:00
Herton R. Krzesinski 05d2a7216e Merge: CNB: net: add netdev_sw_irq_coalesce_default_on()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1970

Bugzilla: https://bugzilla.redhat.com/2161921

commit d93607082e982223cf92750f2d9039ff365b9d24
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Wed Nov 30 23:28:26 2022 +0100

    net: add netdev_sw_irq_coalesce_default_on()

    Add a helper for drivers wanting to set SW IRQ coalescing
    by default. The related sysfs attributes can be used to
    override the default values.

    Follow Jakub's suggestion and put this functionality into
    net core so that drivers wanting to use software interrupt
    coalescing per default don't have to open-code it.

    Note that this function needs to be called before the
    netdevice is registered.

    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Dan Campbell <dacampbe@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-02-08 01:41:42 +00:00
Dan Campbell bee4544aab net: add netdev_sw_irq_coalesce_default_on()
Bugzilla: https://bugzilla.redhat.com/2161921

commit d93607082e982223cf92750f2d9039ff365b9d24
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Wed Nov 30 23:28:26 2022 +0100

    net: add netdev_sw_irq_coalesce_default_on()

    Add a helper for drivers wanting to set SW IRQ coalescing
    by default. The related sysfs attributes can be used to
    override the default values.

    Follow Jakub's suggestion and put this functionality into
    net core so that drivers wanting to use software interrupt
    coalescing per default don't have to open-code it.

    Note that this function needs to be called before the
    netdevice is registered.

    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Dan Campbell <dacampbe@redhat.com>
2023-01-27 12:28:55 -06:00
Paolo Abeni af86e36c42 net: Fix return value of qdisc ingress handling on success
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2162711
Tested: vs bz reproducer

Upstream commit:
commit 672e97ef689a38cb20c2cc6a1814298fea34461e
Author: Paul Blakey <paulb@nvidia.com>
Date:   Tue Oct 18 10:34:38 2022 +0300

    net: Fix return value of qdisc ingress handling on success

    Currently qdisc ingress handling (sch_handle_ingress()) doesn't
    set a return value and it is left to the old return value of
    the caller (__netif_receive_skb_core()) which is RX drop, so if
    the packet is consumed, caller will stop and return this value
    as if the packet was dropped.

    This causes a problem in the kernel tcp stack when having an
    egress tc rule forwarding to an ingress tc rule.
    The tcp stack sending packets on the device having the egress rule
    will see the packets as not successfully transmitted (although they
    actually were), will not advance its internal state of sent data,
    and packets returning on such tcp stream will be dropped by the tcp
    stack with reason ack-of-unsent-data. See reproduction in [0] below.

    Fix that by setting the return value to RX success if
    the packet was handled successfully.
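
The effect can be sketched in userspace as follows. The `toy_` functions are hypothetical stand-ins for sch_handle_ingress() and __netif_receive_skb_core(), and the `fix_applied` switch exists only to contrast the old and new behaviour; the real functions are considerably more involved.

```c
#include <stdbool.h>

/* Mirrors the kernel's NET_RX_SUCCESS / NET_RX_DROP values. */
enum { TOY_RX_SUCCESS = 0, TOY_RX_DROP = 1 };

/*
 * Hypothetical ingress hook: returns true when tc consumed the packet
 * (for example by redirecting it). With the fix applied it also
 * reports success through *ret.
 */
bool toy_handle_ingress(bool consumed, bool fix_applied, int *ret)
{
    if (consumed && fix_applied)
        *ret = TOY_RX_SUCCESS;
    return consumed;
}

/* Hypothetical caller: starts out with "drop" as its return value. */
int toy_receive_core(bool consumed, bool fix_applied)
{
    int ret = TOY_RX_DROP;

    if (toy_handle_ingress(consumed, fix_applied, &ret))
        return ret;            /* packet consumed: stop and return ret */

    return TOY_RX_SUCCESS;     /* simplified: the rest of the stack delivers it */
}
```

Without the fix, a consumed packet returns TOY_RX_DROP to the sender even though it was handled; with the fix it returns TOY_RX_SUCCESS.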

    [0] Reproduction steps:
     $ ip link add veth1 type veth peer name peer1
     $ ip link add veth2 type veth peer name peer2
     $ ifconfig peer1 5.5.5.6/24 up
     $ ip netns add ns0
     $ ip link set dev peer2 netns ns0
     $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
     $ ifconfig veth2 0 up
     $ ifconfig veth1 0 up

     #ingress forwarding veth1 <-> veth2
     $ tc qdisc add dev veth2 ingress
     $ tc qdisc add dev veth1 ingress
     $ tc filter add dev veth2 ingress prio 1 proto all flower \
       action mirred egress redirect dev veth1
     $ tc filter add dev veth1 ingress prio 1 proto all flower \
       action mirred egress redirect dev veth2

     #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
     $ tc qdisc add dev peer1 clsact
     $ tc filter add dev peer1 egress prio 20 proto ip flower \
       action mirred ingress redirect dev veth1

     #run iperf and see connection not running
     $ iperf3 -s&
     $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

     #delete egress rule, and run again, now should work
     $ tc filter del dev peer1 egress
     $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

    Fixes: f697c3e8b3 ("[NET]: Avoid unnecessary cloning for ingress filtering")
    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-20 16:33:01 +01:00
Herton R. Krzesinski 19ce0cbd76 Merge: bpf, xdp: update to 5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1533

bpf, xdp: update to 5.19

Bugzilla: http://bugzilla.redhat.com/2120968
Bugzilla: http://bugzilla.redhat.com/2130850
Bugzilla: http://bugzilla.redhat.com/2140077


Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-21 20:49:27 +00:00
Herton R. Krzesinski 09736a3a30 Merge: udp: some performance optimizations
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test

This series improves UDP RX throughput, to keep it on an equal footing with RHEL 8.

Patches 1, 3 and 4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling in unrelated features.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-13 17:35:03 +00:00
Felix Maurer 1e3ab14088 xdp: Fix spurious packet loss in generic XDP TX path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120968

commit 1fd6e5675336daf4747940b4285e84b0c114ae32
Author: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Date:   Tue Jul 5 10:23:45 2022 +0200

    xdp: Fix spurious packet loss in generic XDP TX path

    The byte queue limits (BQL) mechanism is intended to move queuing from
    the driver to the network stack in order to reduce latency caused by
    excessive queuing in hardware. However, when transmitting or redirecting
    a packet using generic XDP, the qdisc layer is bypassed and there are no
    additional queues. Since netif_xmit_stopped() also takes BQL limits into
    account, but without having any alternative queuing, packets are
    silently dropped.

    This patch modifies the drop condition to only consider cases when the
    driver itself cannot accept any more packets. This is analogous to the
    condition in __dev_direct_xmit(). Dropped packets are also counted on
    the device.
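
A rough userspace model of the changed drop condition. The state bits and `toy_` helpers are hypothetical simplifications: one bit stands for "the driver stopped the queue", the other for "the BQL limit was reached".

```c
#include <stdbool.h>

/* Hypothetical queue state bits. */
enum {
    TOY_DRV_XOFF   = 1u << 0,  /* driver cannot accept more packets */
    TOY_STACK_XOFF = 1u << 1,  /* byte queue limit (BQL) reached */
};

struct toy_txq {
    unsigned int state;
    unsigned long tx_dropped;  /* drops counted on the device */
};

/* Old condition: BQL pressure alone was enough to drop the packet. */
bool toy_drop_before_patch(const struct toy_txq *txq)
{
    return (txq->state & (TOY_DRV_XOFF | TOY_STACK_XOFF)) != 0;
}

/* New condition: drop only when the driver itself is stopped. */
bool toy_drop_after_patch(const struct toy_txq *txq)
{
    return (txq->state & TOY_DRV_XOFF) != 0;
}

/* Returns true if the packet was handed to the driver. */
bool toy_generic_xdp_tx(struct toy_txq *txq)
{
    if (toy_drop_after_patch(txq)) {
        txq->tx_dropped++;     /* dropped packets are now counted */
        return false;
    }
    return true;
}
```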

    Bypassing the qdisc layer in the generic XDP TX path means that XDP
    packets are able to starve other packets going through a qdisc, and
    DDoS attacks will be more effective. In-driver XDP uses dedicated TX
    queues, so it does not have this starvation issue.

    Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:11 +02:00
Felix Maurer b06bbd83be net: Use this_cpu_inc() to increment net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850

commit 6510ea973d8d9d4a0cb2fb557b36bd1ab3eb49f6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Apr 25 18:39:46 2022 +0200

    net: Use this_cpu_inc() to increment net->core_stats

    The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
    netdev_core_stats_alloc() to return a per-CPU pointer.
    netdev_core_stats_alloc() will allocate memory on its first invocation
    which breaks on PREEMPT_RT because it requires non-atomic context for
    memory allocation.

    This can be avoided by enabling preemption in netdev_core_stats_alloc()
    assuming the caller always disables preemption.

    It might be better to replace local_inc() with this_cpu_inc() now that
    dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
    not rely on already disabled preemption. This results in fewer
    instructions on x86-64:
    local_inc:
    |          incl %gs:__preempt_count(%rip)  # __preempt_count
    |          movq    488(%rdi), %rax # _1->core_stats, _22
    |          testq   %rax, %rax      # _22
    |          je      .L585   #,
    |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
    |  .L586:
    |          testq   %rax, %rax      # _27
    |          je      .L587   #,
    |          incq (%rax)            # _6->a.counter
    |  .L587:
    |          decl %gs:__preempt_count(%rip)  # __preempt_count

    this_cpu_inc(), this patch:
    |         movq    488(%rdi), %rax # _1->core_stats, _5
    |         testq   %rax, %rax      # _5
    |         je      .L591   #,
    | .L585:
    |         incq %gs:(%rax) # _18->rx_dropped

    Use unsigned long as type for the counter. Use this_cpu_inc() to
    increment the counter. Use a plain read of the counter.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
Felix Maurer a320271336 net: add per-cpu storage and net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850
Conflicts:
- drivers/net/vxlan.c: file is not moved to drivers/net/vxlan/vxlan_core.c
  due to missing 6765393614ea8 ("vxlan: move to its own directory");
  context difference due to missing 4095e0e1328a3 ("drivers: vxlan:
  vnifilter: per vni stats")
- net/core/dev.c: code difference in __netif_receive_skb_core due to
  already applied 9f8ed577c2881 ("net: skb: rename
  SKB_DROP_REASON_PTYPE_ABSENT"). Result is like upstream now.
- net/core/gro_cells.c: context difference due to already applied
  5dcd08cd1991 ("net: Fix data-races around netdev_max_backlog.")

commit 625788b5844511cf4c30cffa7fa0bc3a69cebc82
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 10 21:14:20 2022 -0800

    net: add per-cpu storage and net->core_stats

    Before adding yet another possibly contended atomic_long_t,
    it is time to add per-cpu storage for existing ones:
     dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler

    Because many devices do not have to increment such counters,
    allocate the per-cpu storage on demand, so that dev_get_stats()
    does not have to spend considerable time folding zero counters.
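
The on-demand allocation can be sketched in userspace like this. The fixed CPU count, the explicit `cpu` argument, and the `toy_` names are assumptions for illustration; the kernel uses real per-CPU allocations and has to handle concurrent first use, which this sketch ignores.

```c
#include <stdlib.h>

#define TOY_NR_CPUS 4  /* assumed CPU count for the sketch */

struct toy_core_stats {
    unsigned long rx_dropped;
    unsigned long tx_dropped;
    unsigned long rx_nohandler;
};

struct toy_netdev {
    struct toy_core_stats *core_stats;  /* NULL until the first increment */
};

/* Count one RX drop on the given CPU, allocating storage on first use. */
void toy_rx_dropped_inc(struct toy_netdev *dev, int cpu)
{
    if (!dev->core_stats) {
        dev->core_stats = calloc(TOY_NR_CPUS, sizeof(*dev->core_stats));
        if (!dev->core_stats)
            return;  /* allocation failed: this drop goes uncounted */
    }
    dev->core_stats[cpu].rx_dropped++;
}

/* Fold the per-CPU counters; devices that never counted skip the loop. */
unsigned long toy_get_rx_dropped(const struct toy_netdev *dev)
{
    unsigned long sum = 0;

    if (!dev->core_stats)
        return 0;
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += dev->core_stats[cpu].rx_dropped;
    return sum;
}
```

The early return in the fold is the point of allocating lazily: a device that never dropped a packet costs one pointer check when its stats are read.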

    Note that some drivers have abused these counters which
    were supposed to be only used by core networking stack.

    v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
    v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
    v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
        Change in netdev_core_stats_alloc() (Jakub)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: jeffreyji <jeffreyji@google.com>
    Reviewed-by: Brian Vazquez <brianvv@google.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
Frantisek Hrbata a03fbb1743 Merge: CNB: Update TC subsystem to upstream v6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1567

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
Tested: Using self-tests, results present in the BZ
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2133511
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128185

Commits:
```
b20dc3c68458 ("gtp: Allow to create GTP device without FDs")
9af41cc33471 ("gtp: Implement GTP echo response")
d33bd757d362 ("gtp: Implement GTP echo request")
e3acda7ade0a ("net/sched: Allow flower to match on GTP options")
81dd9849fa49 ("gtp: Add support for checking GTP device type")
02f393381d14 ("gtp: Fix inconsistent indenting")
4c096ea2d67c ("net/sched: matchall: Take verbose flag into account when logging error messages")
11c95317bc1a ("net/sched: flower: Take verbose flag into account when logging error messages")
c2ccf84ecb71 ("net/sched: act_api: Add extack to offload_act_setup() callback")
69642c2ab2f5 ("net/sched: act_gact: Add extack messages for offload failure")
4dcaa50d0292 ("net/sched: act_mirred: Add extack message for offload failure")
bca3821d19d9 ("net/sched: act_mpls: Add extack messages for offload failure")
bf3b99e4f9ce ("net/sched: act_pedit: Add extack message for offload failure")
b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure")
a9c64939b669 ("net/sched: act_skbedit: Add extack messages for offload failure")
ee367d44b936 ("net/sched: act_tunnel_key: Add extack message for offload failure")
f8fab3169464 ("net/sched: act_vlan: Add extack message for offload failure")
c440615ffbcb ("net/sched: cls_api: Add extack message for unsupported action offload")
0cba5c34b8f4 ("net/sched: matchall: Avoid overwriting error messages")
fd23e0e250c6 ("net/sched: flower: Avoid overwriting error messages")
c9a40d1c87e9 ("net_sched: make qdisc_reset() smaller")
7463acfbe52a ("netfilter: Rename ingress hook include file")
17d20784223d ("netfilter: Generalize ingress hook include file")
42df6e1d221d ("netfilter: Introduce egress hook")
2f1e85b1aee4 ("net: sched: use queue_mapping to pick tx queue")
38a6f0865796 ("net: sched: support hash selecting tx queue")
285ba06b0edb ("net/sched: flower: Helper function for vlan ethtype checks")
6ee59e554d33 ("net/sched: flower: Reduce identation after is_key_vlan refactoring")
b40003128226 ("net/sched: flower: Add number of vlan tags filter")
99fdb22bc5e9 ("net/sched: flower: Consider the number of tags for vlan filters")
b57c7e8b76c6 ("selftests: forwarding: tc_actions: allow mirred egress test to run on non-offloaded h2")
70f87de9fa0d ("net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()")
a2b1a5d40bd1 ("net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms")
1da9e27415bf ("tc-testing: gitignore, delete plugins directory")
6deb209dc6b0 ("net: Print hashed skb addresses for all net and qdisc events")
76b39b94382f ("net/sched: act_api: Notify user space if any actions were flushed before error")
88153e29c1e0 ("selftests: tc-testing: Add testcases to test new flush behaviour")
837ced3a1a5d ("time64.h: consolidate uses of PSEC_PER_NSEC")
d7be266adbfd ("net: sched: provide shim definitions for taprio_offload_{get,free}")
fc54d9065f90 ("net/sched: act_ct: set 'net' pointer when creating new nf_flow_table")
b038177636f8 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
b06ada6df9cf ("netfilter: flowtable: fix incorrect Kconfig dependencies")
83d85bb06915 ("net: extract port range fields from fl_flow_key")
bc5c8260f411 ("net/sched: remove return value of unregister_tcf_proto_ops")
88b3822cdf2f ("net/sched: sch_cbq: Delete unused delay_timer")
ca0cab119288 ("net/sched: remove qdisc_root_lock() helper")
c0f47c2822aa ("net/sched: cls_api: Fix flow action initialization")
5008750eff5d ("net/sched: flower: Add PPPoE filter")
a482d47d33ac ("net/sched: sch_cbq: change the type of cbq_set_lss to void")
06799a9085e1 ("net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS")
4873a1b2024d ("net/sched: remove hacks added to dev_trans_start() for bonding to work")
9ad36309e271 ("net_sched: cls_route: remove from list when handle is 0")
02799571714d ("net_sched: cls_route: disallow handle of 0")
b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
f612466ebecb ("net/sched: fix netdevice reference leaks in attach_default_qdiscs()")
9efd23297cca ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
2f09707d0c97 ("sch_sfb: Also store skb len before calling child enqueue")
db46e3a88a09 ("net/sched: taprio: avoid disabling offload when it was never enabled")
1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
c2e1cfefcac3 ("net: sched: fix possible refcount leak in tc_new_tfilter()")
6e23ec0ba92d ("net: sched: act_ct: fix possible refcount leak in tcf_ct_init()")
ffdd33dd9c12 ("netfilter: core: Fix clang warnings about unused static inlines")
6316136ec6e3 ("netfilter: egress: avoid a lockdep splat")
d645552e9bd9 ("netfilter: egress: Report interface as outgoing")
af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"")
8bdc2acd420c ("net: sched: Fix use after free in red_enqueue()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-23 02:46:05 -05:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Frantisek Hrbata 27a89b8946 Merge: tcp: BIG TCP implementation
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1

The series introduces support for BIG TCP.

- Patch 1-2: Preliminary dependencies
- Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16)
- Patch 15-19: Follow-ups

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:55 -05:00
Frantisek Hrbata 6fd36e2149 Merge: CNB: net: drop the weight argument from netif_napi_add
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1577

Bugzilla: https://bugzilla.redhat.com/2139498
Tested: build, boot

Change the API of the netif_napi_add family of functions so that `netif_napi_add` and `netif_napi_add_tx` use weight = NAPI_POLL_WEIGHT by default (as most drivers were already doing in one way or another), and add `netif_napi_add_weight` and `netif_napi_add_tx_weight` for drivers that want to specify a custom NAPI weight.
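
The shape of the new API can be sketched in userspace as below. The `toy_` types and functions are hypothetical, the device argument of the real functions is left out, and the value 64 for the default weight is an assumption of this sketch.

```c
#include <stddef.h>

#define TOY_NAPI_POLL_WEIGHT 64  /* assumed default weight */

struct toy_napi {
    int weight;
    int (*poll)(struct toy_napi *napi, int budget);
};

/* Variant for drivers that want a custom weight. */
void toy_napi_add_weight(struct toy_napi *napi,
                         int (*poll)(struct toy_napi *, int), int weight)
{
    napi->poll = poll;
    napi->weight = weight;
}

/* Common case: no weight argument, the default is applied. */
void toy_napi_add(struct toy_napi *napi,
                  int (*poll)(struct toy_napi *, int))
{
    toy_napi_add_weight(napi, poll, TOY_NAPI_POLL_WEIGHT);
}

/* Example poll callback that reports no work done. */
int toy_poll(struct toy_napi *napi, int budget)
{
    (void)napi;
    (void)budget;
    return 0;
}
```

Most callers would use `toy_napi_add(&napi, toy_poll)`; only a driver with a specific reason would call `toy_napi_add_weight(&napi, toy_poll, 16)`.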

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 10:28:04 -05:00
Ivan Vecera f31181025a net: sched: use queue_mapping to pick tx queue
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 2f1e85b1aee459b7d0fd981839042c6a38ffaf0c
Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date:   Sat Apr 16 00:40:45 2022 +0800

    net: sched: use queue_mapping to pick tx queue

    This patch fixes the following issue:
    * If we install tc filters with act_skbedit in the clsact hook,
      it doesn't work, because netdev_core_pick_tx() overwrites
      queue_mapping.

      $ tc filter ... action skbedit queue_mapping 1

    And this patch is useful:
    * We can use FQ + EDT to implement efficient policies. Tx queues
      are picked by xps, ndo_select_queue of the netdev driver, or the skb
      hash in netdev_core_pick_tx(). In fact, the netdev driver and the skb
      hash are _not_ under our control. xps uses the CPU map to select Tx
      queues, but in most cases we can't figure out which pod/container
      task_struct is running on a given CPU. We can use clsact filters to
      classify the traffic of one pod/container to one Tx queue. Why?

      In a container networking environment, there are two kinds of
      pod/container/net-namespace. For one kind (e.g. P1, P2), high
      throughput is key. To avoid running out of network resources,
      the outbound traffic of these pods is limited, and they use or share
      dedicated Tx queues with an HTB/TBF/FQ Qdisc assigned. For the other
      kind of pods (e.g. Pn), low latency of data access is key, and the
      traffic is not limited. These pods use or share other dedicated Tx
      queues with a FIFO Qdisc assigned.
      This choice provides two benefits. First, contention on the HTB/FQ Qdisc
      lock is significantly reduced since fewer CPUs contend for the same queue.
      More importantly, Qdisc contention can be eliminated completely if each
      CPU has its own FIFO Qdisc for the second kind of pods.

      There must be a mechanism in place to support classifying traffic based on
      pods/container to different Tx queues. Note that clsact is outside of Qdisc
      while Qdisc can run a classifier to select a sub-queue under the lock.

      In general recording the decision in the skb seems a little heavy handed.
      This patch introduces a per-CPU variable, suggested by Eric.

      The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
      - The Tx Qdisc may install skbedit actions; the xmit.skip_txqueue flag
        is then set in qdisc->enqueue() even though the tx queue has already
        been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx().
        Clearing the flag first in __dev_queue_xmit() is useful to:
      - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next
        netdev in a case such as: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
        For example, eth0, a macvlan in a pod whose root Qdisc installs skbedit
        queue_mapping, sends packets to eth0.3, a vlan in the host. In
        __dev_queue_xmit() of eth0.3, the flag is cleared, and the tx queue is
        not selected according to skb->queue_mapping because there are no
        filters in clsact or the tx Qdisc of this netdev.
        The same action is taken in eth0, ixgbe in the host.
      - Avoid picking the Tx queue for the next packet. If we set
        xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way to
        clear it is in __dev_queue_xmit() when processing the next packets.

      For performance reasons, use the static key. If the user does not
      configure NET_EGRESS, the patch will not be compiled.

      +----+      +----+      +----+
      | P1 |      | P2 |      | Pn |
      +----+      +----+      +----+
        |           |           |
        +-----------+-----------+
                    |
                    | clsact/skbedit
                    |      MQ
                    v
        +-----------+-----------+
        | q0        | q1        | qn
        v           v           v
      HTB/FQ      HTB/FQ  ...  FIFO
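
The flag handling described above can be modelled in userspace roughly as follows. A plain global stands in for the per-CPU xmit.skip_txqueue flag, and every `toy_` function is a hypothetical simplification of the kernel function named in its comment.

```c
#include <stdbool.h>

/* Stand-in for the per-CPU xmit.skip_txqueue flag. */
static bool toy_skip_txqueue;

struct toy_skb {
    int queue_mapping;
};

/* Stand-in for an skbedit queue_mapping action in the clsact hook. */
void toy_skbedit_queue_mapping(struct toy_skb *skb, int queue)
{
    skb->queue_mapping = queue;
    toy_skip_txqueue = true;   /* ask the core not to pick again */
}

/* Stand-in for the default pick (xps, ndo_select_queue or skb hash). */
int toy_core_pick_tx(const struct toy_skb *skb)
{
    (void)skb;
    return 0;
}

/* Stand-in for __dev_queue_xmit(): returns the Tx queue that was used. */
int toy_dev_queue_xmit(struct toy_skb *skb, bool has_skbedit_filter,
                       int filter_queue)
{
    toy_skip_txqueue = false;  /* cleared first, per packet and per netdev */

    if (has_skbedit_filter)
        toy_skbedit_queue_mapping(skb, filter_queue);

    if (!toy_skip_txqueue)
        skb->queue_mapping = toy_core_pick_tx(skb);

    return skb->queue_mapping;
}
```

Sending the same packet through a device with a filter and then through a stacked device without one shows why the flag is cleared on entry: the second device falls back to the default pick instead of reusing the first device's choice.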

    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: Paolo Abeni <pabeni@redhat.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Kevin Hao <haokexin@gmail.com>
    Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Cc: Antoine Tenart <atenart@kernel.org>
    Cc: Wei Wang <weiwan@google.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:02 +01:00
Ivan Vecera d545c120ec netfilter: Introduce egress hook
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 42df6e1d221dddc0f2acf2be37e68d553ad65f96
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:03 2021 +0200

    netfilter: Introduce egress hook

    Support classifying packets with netfilter on egress to satisfy user
    requirements such as:
    * outbound security policies for containers (Laura)
    * filtering and mangling intra-node Direct Server Return (DSR) traffic
      on a load balancer (Laura)
    * filtering locally generated traffic coming in through AF_PACKET,
      such as local ARP traffic generated for clustering purposes or DHCP
      (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
    * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
      and gPTP with nftables (Pablo)
    * in the future: in-kernel NAT64/NAT46 (Pablo)

    The egress hook introduced herein complements the ingress hook added by
    commit e687ad60af ("netfilter: add netfilter ingress hook after
    handle_ing() under unique static key").  A patch for nftables to hook up
    egress rules from user space has been submitted separately, so users may
    immediately take advantage of the feature.

    Alternatively or in addition to netfilter, packets can be classified
    with traffic control (tc).  On ingress, packets are classified first by
    tc, then by netfilter.  On egress, the order is reversed for symmetry.
    Conceptually, tc and netfilter can be thought of as layers, with
    netfilter layered above tc.

    Traffic control is capable of redirecting packets to another interface
    (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
    host namespace to a container via a veth connection:
    tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

    In this case, netfilter egress classifying is not performed when leaving
    the host namespace!  That's because the packet is still on the tc layer.
    If tc redirects the packet to a physical interface in the host namespace
    such that it leaves the system, the packet is never subjected to
    netfilter egress classifying.  That is only logical since it hasn't
    passed through netfilter ingress classifying either.

    Packets can alternatively be redirected at the netfilter layer using
    nft fwd.  Such a packet *is* subjected to netfilter egress classifying
    since it has reached the netfilter layer.

    Internally, the skb->nf_skip_egress flag controls whether netfilter is
    invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
    be called recursively by tunnel drivers such as vxlan, the flag is
    reverted to false after sch_handle_egress().  This ensures that
    netfilter is applied both on the overlay and underlying network.
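
A simplified userspace model of the flag's life cycle. The `toy_` names and the counter are hypothetical, and the model reduces tc to a single "redirect or not" decision; it only illustrates the ordering described above, not the actual kernel code.

```c
#include <stdbool.h>

struct toy_skb {
    bool nf_skip_egress;   /* models skb->nf_skip_egress */
    int nf_egress_runs;    /* how often netfilter egress classified the skb */
};

/* Stand-in for __dev_queue_xmit(); tc_redirects models a mirred redirect. */
void toy_dev_queue_xmit(struct toy_skb *skb, bool tc_redirects)
{
    /* On egress, netfilter runs first, unless the flag says to skip it. */
    if (!skb->nf_skip_egress)
        skb->nf_egress_runs++;

    /* While tc handles the packet, netfilter egress must be skipped. */
    skb->nf_skip_egress = true;
    if (tc_redirects) {
        /* Still on the tc layer: the target device skips netfilter. */
        toy_dev_queue_xmit(skb, false);
        return;
    }

    /* Reverted after tc, so a tunnel's underlay transmit is classified. */
    skb->nf_skip_egress = false;
}
```

In this model a packet redirected by tc is classified once, on the device it was originally sent to, while a tunnelled packet that passes through the function twice (overlay, then underlay) is classified twice.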

    Interaction between tc and netfilter is possible by setting and querying
    skb->mark.

    If netfilter egress classifying is not enabled on any interface, it is
    patched out of the data path by way of a static_key and doesn't make a
    performance difference that is discernible from noise:

    Before:             1537 1538 1538 1537 1538 1537 Mb/sec
    After:              1536 1534 1539 1539 1539 1540 Mb/sec
    Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
    After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
    Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
    After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec

    When netfilter egress classifying is enabled on at least one interface,
    a minimal performance penalty is incurred for every egress packet, even
    if the interface it's transmitted over doesn't have any netfilter egress
    rules configured.  That is caused by checking dev->nf_hooks_egress
    against NULL.

    Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
    ip link add dev foo type dummy
    ip link set dev foo up
    modprobe pktgen
    echo "add_device foo" > /proc/net/pktgen/kpktgend_3
    samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

    Accept all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

    Drop all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

    Apply this patch when measuring packet drops to avoid errors in dmesg:
    https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Cc: Laura García Liébana <nevola@gmail.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Ivan Vecera 866706749c netfilter: Generalize ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 17d20784223d52bf1671f984c9e8d5d9b8ea171b
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:02 2021 +0200

    netfilter: Generalize ingress hook include file

    Prepare for addition of a netfilter egress hook by generalizing the
    ingress hook include file.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Ivan Vecera 3ccbb377fc netfilter: Rename ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 7463acfbe52ae8b7e0ea6890c1886b3f8ba8bddd
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:01 2021 +0200

    netfilter: Rename ingress hook include file

    Prepare for addition of a netfilter egress hook by renaming
    <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

    The egress hook also necessitates a refactoring of the include file,
    but that is done in a separate commit to ease reviewing.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Frantisek Hrbata 0fe0e3e4d8 Merge: CNB: net: HW counters for soft devices
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1580

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149
Tested: Using netdevsim hw_stats_l3.sh self-test

Commits:
```
22b67d17194f ("net: rtnetlink: rtnl_stats_get(): Emit an extack for unset filter_mask")
6b524a1d012b ("net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_*")
f6e0fb812988 ("net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed")
46efc97b7306 ("net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests")
05415bccbb09 ("net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()")
216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns")
9309f97aef6d ("net: dev: Add hardware stats support")
0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
03ba35667091 ("net: rtnetlink: Add RTM_SETSTATS")
5fd0b838efac ("net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS")
ba95e7930957 ("selftests: forwarding: hw_stats_l3: Add a new test")
57d29a2935c9 ("net: rtnetlink: fix error handling in rtnl_fill_statsinfo()")
23cfe941b52e ("rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS replies")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-08 09:08:22 -05:00
Frantisek Hrbata 5ac5a1dfd0 Merge: CNB: net: disambiguate the TSO and GSO limits
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1419

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using iperf3 and toggling gso/tso offloading knobs

Commits:
```
2106efda785b ("net: remove .ndo_change_proto_down")
2cc6cdd44a16 ("net: unexport a handful of dev_* functions")
6264f58ca0e5 ("net: extract a few internals from netdevice.h")
6df6398f7c8b ("net: add netif_inherit_tso_max()")
14d7b8122fd5 ("net: don't allow user space to lift the device limits")
ee8b7a1156f3 ("net: make drivers set the TSO limit not the GSO limit")
744d49daf8bd ("net: move netif_set_gso_max helpers")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-05 02:54:07 -04:00
Ivan Vecera a5a7be252a net: dev: Add hardware stats support
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149

commit 9309f97aef6d8250bb484dabeac925c3a7c57716
Author: Petr Machata <petrm@nvidia.com>
Date:   Wed Mar 2 18:31:20 2022 +0200

    net: dev: Add hardware stats support

    Offloading switch device drivers may be able to collect statistics of the
    traffic taking place in the HW datapath that pertains to a certain soft
    netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
    these statistics to the offloaded netdevice in question. The API was shaped
    by the following considerations:

    - Collection of HW statistics is not free: there may be a finite number of
      counters, and the act of counting may have a performance impact. It is
      therefore necessary to allow toggling whether HW counting should be done
      for any particular SW netdevice.

    - As the drivers are loaded and removed, a particular device may get
      offloaded and unoffloaded again. At the same time, the statistics values
      need to stay monotonic (modulo the eventual 64-bit wraparound),
      increasing only to reflect traffic measured in the device.

      To that end, the netdevice keeps around a lazily-allocated copy of struct
      rtnl_link_stats64. Device drivers then contribute to the values kept
      therein at various points. Even as the driver goes away, the struct stays
      around to maintain the statistics values.

    - Different HW devices may be able to count different things. The
      motivation behind this patch in particular is exposure of HW counters on
      Nvidia Spectrum switches, where the only practical approach to counting
      traffic on offloaded soft netdevices currently is to use router interface
      counters, and count L3 traffic. Correspondingly that is the statistics
      suite added in this patch.

      Other devices may be able to measure different kinds of traffic, and for
      that reason, the APIs are built to allow uniform access to different
      statistics suites.

    - Because soft netdevices and offloading drivers are only loosely bound, a
      netdevice uses a notifier chain to communicate with the drivers. Several
      new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
      to the offloading drivers.

    - Devices can have various conditions for when a particular counter is
      available. As the device is configured and reconfigured, the device
      offload may become or cease being suitable for counter binding. A
      netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
      ping offloading drivers and determine whether anyone currently implements
      a given statistics suite. This information can then be propagated to user
      space.

      When the driver decides to unoffload a netdevice, it can use a
      newly-added function, netdev_offload_xstats_report_delta(), to record
      outstanding collected statistics, before destroying the HW counter.

    This patch adds a helper, call_netdevice_notifiers_info_robust(), for
    dispatching a notifier with the possibility of unwind when one of the
    consumers bails. Given the wish to eventually get rid of the global
    notifier block altogether, this helper only invokes the per-netns notifier
    block.

    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-04 17:15:40 +01:00
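The monotonic-statistics point in the hardware stats commit above (a5a7be252a) can be illustrated with a small user-space sketch. It only assumes what the message says: the netdevice keeps a lazily allocated copy of the stats, and drivers add deltas to it. All `toy_*` names and the two-field stats struct are hypothetical and are not the kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for struct rtnl_link_stats64. */
struct toy_l3_stats {
    uint64_t rx_packets;
    uint64_t tx_packets;
};

/* Hypothetical soft netdevice: the stats copy is allocated lazily. */
struct toy_stats_dev {
    struct toy_l3_stats *l3_stats; /* NULL until counting is enabled */
};

/* Enable counting; allocate zeroed storage on first use only.
 * Returns 0 on success, -1 if the allocation fails. */
int toy_stats_enable(struct toy_stats_dev *dev)
{
    assert(dev != NULL);
    if (dev->l3_stats)
        return 0; /* keep the values accumulated earlier */
    dev->l3_stats = calloc(1, sizeof(*dev->l3_stats));
    return dev->l3_stats ? 0 : -1;
}

/* Driver side: fold a delta read from a HW counter into the device copy.
 * Values only ever grow, so they stay monotonic across driver reloads. */
void toy_stats_report_delta(struct toy_stats_dev *dev,
                            const struct toy_l3_stats *delta)
{
    assert(dev != NULL && delta != NULL);
    if (!dev->l3_stats)
        return; /* counting was never enabled for this device */
    dev->l3_stats->rx_packets += delta->rx_packets;
    dev->l3_stats->tx_packets += delta->tx_packets;
}
```

Because the accumulated copy lives with the netdevice rather than with the driver, a driver that unoffloads the device can report its last delta and go away without resetting the totals.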
Íñigo Huguet 4ed32c17b9 netdev: reshuffle netif_napi_add() APIs to allow dropping weight
Bugzilla: https://bugzilla.redhat.com/2139498

commit 58caed3dacb4354a25a1aa8d2febc3e9648ba1f4
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon May 2 16:27:03 2022 -0700

    netdev: reshuffle netif_napi_add() APIs to allow dropping weight
    
    Most drivers should not have to worry about selecting the right
    weight for their NAPI instances and pass NAPI_POLL_WEIGHT.
    It'd be best if we didn't require the argument at all and selected
    the default internally.
    
    This change prepares the ground for such reshuffling, allowing
    for a smooth transition. The following API should remain after
    the next release cycle:
      netif_napi_add()
      netif_napi_add_weight()
      netif_napi_add_tx()
      netif_napi_add_tx_weight()
    Where the _weight() variants take an explicit weight argument.
    I opted for a _weight() suffix rather than a __ prefix, because
    we use __ in places to mean that caller needs to also issue a
    synchronize_net() call.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2022-11-04 16:46:33 +01:00
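The naming scheme described in the netif_napi_add() commit above (4ed32c17b9) is a plain wrapper pattern: the short name picks the default weight internally and the `_weight()` variant takes it explicitly. A minimal user-space sketch follows; the `toy_*` types and signatures are hypothetical, and only the default value 64 mirrors the kernel's NAPI_POLL_WEIGHT.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NAPI_POLL_WEIGHT 64 /* mirrors the kernel default NAPI_POLL_WEIGHT */

/* Hypothetical, reduced NAPI instance. */
struct toy_napi {
    int (*poll)(struct toy_napi *napi, int budget);
    int weight;
};

/* Explicit-weight variant: the _weight() suffix marks the extra argument. */
void toy_napi_add_weight(struct toy_napi *napi,
                         int (*poll)(struct toy_napi *, int), int weight)
{
    assert(napi != NULL && poll != NULL);
    napi->poll = poll;
    napi->weight = weight;
}

/* Common variant: the default weight is selected internally. */
void toy_napi_add(struct toy_napi *napi, int (*poll)(struct toy_napi *, int))
{
    toy_napi_add_weight(napi, poll, TOY_NAPI_POLL_WEIGHT);
}

/* Trivial poll callback used only to exercise the sketch: no work done. */
int toy_poll(struct toy_napi *napi, int budget)
{
    (void)napi;
    (void)budget;
    return 0;
}
```

With this split, most callers never mention a weight at all, and the few that need a non-default value are easy to find by grepping for the suffix.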
Ivan Vecera fccce056fa net: allow gro_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 0fe79f28bfaf73b66b7b1562d2468f94aa03bd12
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:34:03 2022 -0700

    net: allow gro_max_size to exceed 65536

    Allow gro_max_size to exceed 65536.

    There weren't really any external limitations that prevented this other
    than the fact that IPv4 only supports a 16 bit length field. Since we have
    the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
    exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
    via a constant rather than relying on gro_max_size.

    [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:56:09 +01:00
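The capping rule in the gro_max_size commit above (fccce056fa) can be sketched as follows. This is a simplified model under stated assumptions: the `TOY_*` constants take their values from the commit message (65536 for legacy flows, 8 * 65535 as the upper bound), and the helper names and the single `ipv6_tcp` flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants taken from the commit message, not from a header. */
#define TOY_GRO_LEGACY_MAX_SIZE 65536u        /* cap for IPv4 and non-TCP flows */
#define TOY_GRO_MAX_SIZE        (8u * 65535u) /* upper bound noted by edumazet */

static_assert(TOY_GRO_MAX_SIZE == 524280u, "8 * 65535 bytes");

/* Is a requested per-device gro_max_size within the global upper bound? */
bool toy_gro_max_size_ok(uint32_t requested)
{
    return requested <= TOY_GRO_MAX_SIZE;
}

/* Effective aggregation limit for one flow: only IPv6 TCP may use a
 * device limit above the legacy 64K constant. */
uint32_t toy_gro_limit(uint32_t dev_gro_max_size, bool ipv6_tcp)
{
    if (!ipv6_tcp && dev_gro_max_size > TOY_GRO_LEGACY_MAX_SIZE)
        return TOY_GRO_LEGACY_MAX_SIZE;
    return dev_gro_max_size;
}
```

For example, a device configured with a limit of 200000 bytes would aggregate IPv6 TCP up to 200000 bytes in this model, while an IPv4 flow on the same device would still stop at 65536.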
Ivan Vecera d513603ec1 net: allow gso_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:33:57 2022 -0700

    net: allow gso_max_size to exceed 65536

    The code for gso_max_size was added originally to allow for debugging and
    workaround of buggy devices that couldn't support TSO with blocks 64K in
    size. The original reason for limiting it to 64K was because that was the
    existing limits of IPv4 and non-jumbogram IPv6 length fields.

    With the addition of Big TCP we can remove this limit and allow the value
    to potentially go up to UINT_MAX and instead be limited by the tso_max_size
    value.

    So in order to support this we need to go through and clean up the
    remaining users of the gso_max_size value so that the values will cap at
    64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
    so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
    limit for GSO_MAX_SIZE.

    v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                   in a new sk_trim_gso_size() helper.
                   netif_set_tso_max_size() caps the requested TSO size
                   with GSO_MAX_SIZE.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:52 +01:00
Ivan Vecera 017d0aca36 gro: add ability to control gro max packet size
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

Conflicts:
- context due to existing backport of 14d7b8122fd5 ("net: don't allow
  user space to lift the device limits")

commit eac1b93c14d645ef147b049ace0d5230df755548
Author: Coco Li <lixiaoyan@google.com>
Date:   Wed Jan 5 02:48:38 2022 -0800

    gro: add ability to control gro max packet size

    Eric Dumazet suggested to allow users to modify max GRO packet size.

    We have seen GRO being disabled by users of appliances (such as
    wifi access points) because of claimed bufferbloat issues,
    or some workarounds in sch_cake, to split GRO/GSO packets.

    Instead of disabling GRO completely, one can choose to limit
    the maximum packet size of GRO packets, depending on their
    latency constraints.

    This patch adds a per device gro_max_size attribute
    that can be changed with ip link command.

    ip link set dev eth0 gro_max_size 16000

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Coco Li <lixiaoyan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:37 +01:00
Paolo Abeni 022665bacd net: skb: introduce and use a single page frag cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 28 10:43:09 2022 +0200

    net: skb: introduce and use a single page frag cache

    After commit 3226b158e6 ("net: avoid 32 x truesize under-estimation
    for tiny skbs") we are observing 10-20% regressions in performance
    tests with small packets. The perf trace points to high pressure on
    the slab allocator.

    This change tries to improve the allocation schema for small packets
    using an idea originally suggested by Eric: a new per CPU page frag is
    introduced and used in __napi_alloc_skb to cope with small allocation
    requests.

    To ensure that the above does not lead to excessive truesize
    underestimation, the frag size for small allocation is inflated to 1K
    and all the above is restricted to build with 4K page size.

    Note that we need to update accordingly the run-time check introduced
    with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").

    Alex suggested a smart page refcount schema to reduce the number
    of atomic operations and deal properly with pfmemalloc pages.

    Under a small-packet UDP flood, I measure a 15% peak throughput increase.

    Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:12:04 +02:00
Paolo Abeni 7822d83322 net: add napi_get_frags_check() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 09:04:38 2022 -0700

    net: add napi_get_frags_check() helper

    This is a follow up of commit 3226b158e6
    ("net: avoid 32 x truesize under-estimation for tiny skbs")

    When/if we increase MAX_SKB_FRAGS, we better make sure
    the old bug will not come back.

    Adding a check in napi_get_frags() would be costly,
    even if using DEBUG_NET_WARN_ON_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:10:48 +02:00
Jiri Benc 2da69cb317 net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit cd14e9b7b8d312dfbf75ce1f78552902e51b9045
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:56:22 2022 -0800

    net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally

    The previous patches handled the delivery_time in the ingress path
    before the routing decision is made.  This patch can postpone clearing
    delivery_time in a skb until knowing it is delivered locally and also
    set the (rcv) timestamp if needed.  This patch moves the
    skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
    and ip6_input_finish().

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e0f797236e net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit d98d58a002619b5c165f1eedcd731e2fe2c19088
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:50 2022 -0800

    net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()

    The previous patches handled the delivery_time before sch_handle_ingress().

    This patch can now set skb->mono_delivery_time to flag that skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp,
    and clear it with skb_clear_delivery_time() after
    sch_handle_ingress().  This lets bpf_redirect_*() keep the mono
    delivery_time so that it can be used by a qdisc (fq) of the
    egress interface.

    A later patch will postpone the skb_clear_delivery_time() until the
    stack learns that the skb is being delivered locally, which will
    let other kernel forwarding paths (ip[6]_forward) keep the
    delivery_time as well.  Thus, like the previous patches that use
    the skb->mono_delivery_time bit, the skb_clear_delivery_time() call
    is not confined to CONFIG_NET_INGRESS, to avoid too much code
    churn in this set.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e17e09a099 net: Clear mono_delivery_time bit in __skb_tstamp_tx()
Bugzilla: https://bugzilla.redhat.com/2120966

commit d93376f503c7a586707925957592c0f16f4db0b1
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:44 2022 -0800

    net: Clear mono_delivery_time bit in __skb_tstamp_tx()

    In __skb_tstamp_tx(), it may clone the egress skb and queue the clone to
    the sk_error_queue.  The outgoing skb may have the mono delivery_time
    while the (rcv) timestamp is expected for the clone, so the
    skb->mono_delivery_time bit needs to be cleared from the clone.

    This patch adds the skb->mono_delivery_time clearing to the existing
    __net_timestamp() and uses it in __skb_tstamp_tx().
    The __net_timestamp() fast path usage in dev.c is changed to directly
    call ktime_get_real() since the mono_delivery_time bit is not set at
    that point.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc c387356f8d net: Handle delivery_time in skb->tstamp during network tapping with af_packet
Bugzilla: https://bugzilla.redhat.com/2120966

commit 27942a15209f564ed8ee2a9e126cb7b105181355
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:38 2022 -0800

    net: Handle delivery_time in skb->tstamp during network tapping with af_packet

    A later patch will set skb->mono_delivery_time to flag that skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
    skb_clear_tstamp() will then keep this delivery_time during forwarding.

    This patch makes network tapping (with af_packet) handle
    the delivery_time stored in skb->tstamp.

    Regardless of tapping at the ingress or egress,  the tapped skb is
    received by the af_packet socket, so it is ingress to the af_packet
    socket and it expects the (rcv) timestamp.

    When tapping at egress, dev_queue_xmit_nit() is used.  It already
    expects that skb->tstamp may hold a delivery_time, so it does
    skb_clone()+net_timestamp_set() to ensure the cloned skb has
    the (rcv) timestamp before passing it to the af_packet sk.
    This patch only adds clearing of the skb->mono_delivery_time
    bit in net_timestamp_set().

    When tapping at ingress, the code currently expects skb->tstamp to be
    either 0 or the (rcv) timestamp.  That is, the ingress tapping path
    already expects that skb->tstamp could be 0 and gets
    the (rcv) timestamp from ktime_get_real() when needed.

    There are two cases for tapping at ingress:

    In the first case, af_packet queues the skb to its sk_receive_queue.
    The skb is either not shared or a new clone is created.  The newly
    added skb_clear_delivery_time() is called to clear the
    delivery_time (if any) and set the (rcv) timestamp if
    needed before the skb is queued to the sk_receive_queue.

    In the second case, the ingress skb is copied directly to the rx_ring
    and tpacket_get_timestamp() is used to get the (rcv) timestamp.
    The newly added skb_tstamp() is used in tpacket_get_timestamp()
    to check the skb->mono_delivery_time bit before returning skb->tstamp.
    As mentioned earlier, tapping at ingress already expects that
    the skb may not have the (rcv) timestamp (because no sk has asked
    for it) and handles this case by calling ktime_get_real() directly.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
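The two helpers named in the af_packet commit above (c387356f8d), skb_tstamp() and skb_clear_delivery_time(), can be modelled in user space roughly as follows. This is a sketch under assumptions: `struct toy_skb` carries only the two fields discussed, and the `want_rcv_stamp` and `now_real` parameters are hypothetical stand-ins for "a (rcv) timestamp is needed" and ktime_get_real().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced skb: only the two fields the commits talk about. */
struct toy_skb {
    int64_t tstamp;          /* (rcv) timestamp or mono delivery_time, in ns */
    bool mono_delivery_time; /* true: tstamp holds the delivery_time (EDT) */
};

/* Timestamp as seen by a receiver: a delivery_time is never reported
 * as a (rcv) timestamp, so the caller sees 0 and can take its own. */
int64_t toy_skb_tstamp(const struct toy_skb *skb)
{
    assert(skb != NULL);
    return skb->mono_delivery_time ? 0 : skb->tstamp;
}

/* Drop any delivery_time; store now_real instead only when a (rcv)
 * timestamp is wanted. A plain (rcv) timestamp is left untouched. */
void toy_skb_clear_delivery_time(struct toy_skb *skb, bool want_rcv_stamp,
                                 int64_t now_real)
{
    assert(skb != NULL);
    if (!skb->mono_delivery_time)
        return; /* tstamp is already 0 or a (rcv) timestamp */
    skb->mono_delivery_time = false;
    skb->tstamp = want_rcv_stamp ? now_real : 0;
}
```

The point of the flag is that one field can serve both roles: forwarding paths keep the delivery_time for the egress qdisc, while local receivers never mistake it for an arrival time.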
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
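The pattern behind the drop-reason series merged above (fa843be1d1) is that each free-on-drop call site names why the packet was dropped, while the plain free function maps to a "not specified" reason. A minimal user-space sketch, assuming a per-reason counter in place of the kfree_skb tracepoint; the `toy_*` names are hypothetical and the enum values are arbitrary:

```c
#include <assert.h>
#include <stdlib.h>

/* Subset of the drop reasons named in this log; values are arbitrary here. */
enum toy_drop_reason {
    TOY_DROP_NOT_SPECIFIED,
    TOY_DROP_TC_INGRESS,
    TOY_DROP_TC_EGRESS,
    TOY_DROP_XDP,
    TOY_DROP_CPU_BACKLOG,
    TOY_DROP_QDISC_DROP,
    TOY_DROP_UNHANDLED_PROTO,
    TOY_DROP_MAX /* number of reasons, not a reason itself */
};

static const char *const toy_drop_names[TOY_DROP_MAX] = {
    [TOY_DROP_NOT_SPECIFIED]   = "NOT_SPECIFIED",
    [TOY_DROP_TC_INGRESS]      = "TC_INGRESS",
    [TOY_DROP_TC_EGRESS]       = "TC_EGRESS",
    [TOY_DROP_XDP]             = "XDP",
    [TOY_DROP_CPU_BACKLOG]     = "CPU_BACKLOG",
    [TOY_DROP_QDISC_DROP]      = "QDISC_DROP",
    [TOY_DROP_UNHANDLED_PROTO] = "UNHANDLED_PROTO",
};

/* Stand-in for the tracepoint: one counter per reason. */
unsigned long toy_drop_count[TOY_DROP_MAX];

/* Human-readable name, as a trace consumer would print it. */
const char *toy_drop_reason_str(enum toy_drop_reason reason)
{
    assert((unsigned int)reason < TOY_DROP_MAX);
    return toy_drop_names[reason];
}

/* Free the buffer and record why it was dropped. */
void toy_kfree_skb_reason(void *skb, enum toy_drop_reason reason)
{
    assert((unsigned int)reason < TOY_DROP_MAX);
    toy_drop_count[reason]++;
    free(skb);
}

/* Legacy entry point: callers that give no reason are still counted. */
void toy_kfree_skb(void *skb)
{
    toy_kfree_skb_reason(skb, TOY_DROP_NOT_SPECIFIED);
}
```

Converting a call site is then a one-line change from `toy_kfree_skb(skb)` to `toy_kfree_skb_reason(skb, TOY_DROP_XDP)`, which is what makes the series easy to review commit by commit.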
Ivan Vecera 4ba4dadfe4 net: make drivers set the TSO limit not the GSO limit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c
  - small context conflicts
* drivers/net/usb/ax88179_178a.c
  - hunk removed, the driver does not call netif_set_gso_max_size()
* drivers/net/usb/lan78xx.c
  - modified due to absence of commits d383216a7efe ("lan78xx: Introduce
    Tx URB processing improvements") and 0dd87266c133 ("lan78xx: Remove
    hardware-specific header update")

commit ee8b7a1156f357613646d6c69d07ac5a087a1071
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:33 2022 -0700

    net: make drivers set the TSO limit not the GSO limit

    Drivers should call the TSO setting helper, GSO is controllable
    by user space.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 8f95afcecf net: don't allow user space to lift the device limits
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 14d7b8122fd591693a2388b98563707ba72c6780
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:32 2022 -0700

    net: don't allow user space to lift the device limits

    Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
    be set on device creation") the gso_max_segs and gso_max_size
    of a device were not controlled from user space.

    The quoted commit added the ability to control them because of
    the following setup:

     netns A  |  netns B
         veth<->veth   eth0

    If eth0 has TSO limitations and user wants to efficiently forward
    traffic between eth0 and the veths they should copy the TSO
    limitations of eth0 onto the veths. This would happen automatically
    for macvlans or ipvlan but veth users are not so lucky (given the
    loose coupling).

    Unfortunately the commit in question allowed users to also override
    the limits on real HW devices.

    It may be useful to control the max GSO size and someone may be using
    that ability (not that I know of any user), so create a separate set
    of knobs to reliably record the TSO limitations. Validate the user
    requests.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
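The split between the driver-owned TSO limit and the user-adjustable GSO limit in the commit above (8f95afcecf) can be sketched as a small model. It assumes only what the series states: drivers record the TSO limit, and user requests are validated against it. The struct and the `toy_*` helpers are hypothetical and are not the kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device with the two limits kept separately. */
struct toy_tso_dev {
    uint32_t tso_max_size; /* hardware limit, recorded by the driver */
    uint32_t gso_max_size; /* software limit, adjustable from user space */
};

/* Driver path: record the HW limit and pull the software limit down to it. */
void toy_set_tso_max_size(struct toy_tso_dev *dev, uint32_t size)
{
    assert(dev != NULL);
    dev->tso_max_size = size;
    if (dev->gso_max_size > size)
        dev->gso_max_size = size;
}

/* User-space path: may lower the limit, never lift it past the device's.
 * Returns 0 on success or -EINVAL when the request exceeds the HW limit. */
int toy_user_set_gso_max_size(struct toy_tso_dev *dev, uint32_t size)
{
    assert(dev != NULL);
    if (size > dev->tso_max_size)
        return -EINVAL;
    dev->gso_max_size = size;
    return 0;
}
```

Keeping the hardware limit in its own field means a rejected or lowered user setting can never erase what the driver reported.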
Ivan Vecera f9b471a989 net: add netif_inherit_tso_max()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 6df6398f7c8b481ce83f28143bc08a5231616deb
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:31 2022 -0700

    net: add netif_inherit_tso_max()

    To make later patches smaller create a helper for inheriting
    the TSO limitations of a lower device. The TSO in the name
    is not an accident, subsequent patches will replace GSO
    with TSO in more names.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 5a0eef8003 net: extract a few internals from netdevice.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- slightly modified due to missing 0b5c21bbc01e ("net: ensure
  net_todo_list is processed quickly") and d07b26f5bbea ("dev_addr:
  add a modification check")

commit 6264f58ca0e54e41d63c2d00334a48bac28fbf30
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:54 2022 -0700

    net: extract a few internals from netdevice.h

    There's a number of functions and static variables used
    under net/core/ but not from the outside. We currently
    dump most of them into netdevice.h. That's bad for many
    reasons:
     - netdevice.h is very cluttered, hard to figure out
       what the APIs are;
     - netdevice.h is very long;
     - we have to touch netdevice.h more which causes expensive
       incremental builds.

    Create a header under net/core/ and move some declarations.

    The new header is also a bit of a catch-all but that's
    fine: if we create more specific headers, people will
    likely overthink where their declarations fit best
    and end up putting them in netdevice.h again.

    More work should be done on splitting netdevice.h into more
    targeted headers, but that'd be more time consuming so small
    steps.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:16 +02:00
Antoine Tenart d3b8b917fb net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflict:
- In __netif_receive_skb_core due to missing upstream commit
  625788b58445 ("net: add per-cpu storage and net->core_stats") in c9s.

commit 9f8ed577c28813410614b418bad42285840c1a00
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Apr 7 14:20:50 2022 +0800

    net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT

    As David Ahern suggested, the reasons for skb drops should be more
    general and not be code based.

    Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to
    SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no
    L3 protocol handler, no L4 protocol handler, version extensions, etc.

    Following previous discussion, the aim is now to make these reasons
    more abstract and user-oriented rather than code-based.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 3f421c9474 net: dev: use kfree_skb_reason() for __netif_receive_skb_core()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 6c2728b7c14164928cb7cb9c847dead101b2d503
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:46 2022 +0800

    net: dev: use kfree_skb_reason() for __netif_receive_skb_core()

    Add a reason for skb drops in __netif_receive_skb_core() when no
    packet_type is found to handle the skb. For this purpose, the drop reason
    SKB_DROP_REASON_PTYPE_ABSENT is introduced. For Ethernet packets, for
    example, this mainly happens when the L3 protocol is not supported.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 4fa8044e89 net: dev: use kfree_skb_reason() for sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit a568aff26ac03ee9eb1482683514914a5ec3b4c3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:45 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_ingress()

    Replace kfree_skb() used in sch_handle_ingress() with
    kfree_skb_reason(). The following drop reason is introduced:

    SKB_DROP_REASON_TC_INGRESS

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 9c9aa3ee0a net: dev: use kfree_skb_reason() for do_xdp_generic()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7e726ed81e1ddd5fdc431e02b94fcfe2a9876d42
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:44 2022 +0800

    net: dev: use kfree_skb_reason() for do_xdp_generic()

    Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_XDP is introduced for this case.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart db388f3375 net: dev: use kfree_skb_reason() for enqueue_to_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 44f0bd40803c0e04f1c8cd59df3c7acce783ae9c
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:43 2022 +0800

    net: dev: use kfree_skb_reason() for enqueue_to_backlog()

    Replace kfree_skb() used in enqueue_to_backlog() with
    kfree_skb_reason(). The skb drop reason SKB_DROP_REASON_CPU_BACKLOG is
    introduced for the case of failing to enqueue the skb to the per CPU
    backlog queue. The underlying cause can be a full backlog queue or the
    RPS flow limit, and I think we needn't make further distinctions.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart b63c068d65 net: dev: add skb drop reasons to __dev_xmit_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7faef0547f4c29031a68d058918b031a8e520d49
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:42 2022 +0800

    net: dev: add skb drop reasons to __dev_xmit_skb()

    Add reasons for skb drops to __dev_xmit_skb() by replacing
    kfree_skb_list() with kfree_skb_list_reason(). The drop reason of
    SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue failures.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 694219a303 net: dev: use kfree_skb_reason() for sch_handle_egress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 98b4d7a4e7374a44c4afd9f08330e72f6ad0d644
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:40 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_egress()

    Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering
    the code path of tc egress, it is kept distinct from the drop reason
    SKB_DROP_REASON_QDISC_DROP, which is added in the next commit.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Paolo Abeni 7403d40195 net: Fix a data-race around netdev_unregister_timeout_secs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: chunk applied into netdev_wait_allrefs() instead of \
 netdev_wait_allrefs_any() and with different context as rhel-9 \
 lacks the upstream commit faab39f63c1fc ("net: allow out-of-order \
 netdev unregistration")

Upstream commit:
commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:59 2022 -0700

    net: Fix a data-race around netdev_unregister_timeout_secs.

    While reading netdev_unregister_timeout_secs, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
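
The READ_ONCE() pattern used by this fix can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel implementation: the two macros are simplified volatile-cast stand-ins that rely on GNU C `__typeof__`, and `timeout_secs`, `set_timeout_secs()` and `wait_budget_ms()` are hypothetical names.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros (assumption: the kernel
 * versions are more elaborate). The volatile access makes the compiler
 * emit exactly one load or store for the expression. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical tunable that a concurrent writer may change at any time. */
static int timeout_secs = 10;

/* Writer side: a single store. */
static void set_timeout_secs(int secs)
{
	assert(secs >= 0);
	WRITE_ONCE(timeout_secs, secs);
}

/* Reader side: take one snapshot and use only the local copy, so the
 * zero check and the multiplication see the same value even if a
 * writer changes the tunable in between. */
static long wait_budget_ms(void)
{
	int secs = READ_ONCE(timeout_secs);

	if (secs == 0)
		return -1; /* hypothetical "disabled" convention */
	return (long)secs * 1000;
}
```

Without the snapshot, the compiler would be free to reload the variable between the check and the use, which is one way a concurrent update can produce an inconsistent result.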
Paolo Abeni 48e48d197a net: Fix a data-race around netdev_budget_usecs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:55 2022 -0700

    net: Fix a data-race around netdev_budget_usecs.

    While reading netdev_budget_usecs, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 3d0c78c5c1 net: Fix a data-race around netdev_budget.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:53 2022 -0700

    net: Fix a data-race around netdev_budget.

    While reading netdev_budget, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 51b0bdedb8 ("[NET]: Separate two usages of netdev_max_backlog.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 08060d0717 net: Fix data-races around netdev_tstamp_prequeue.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 61adf447e38664447526698872e21c04623afb8e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:47 2022 -0700

    net: Fix data-races around netdev_tstamp_prequeue.

    While reading netdev_tstamp_prequeue, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 3b098e2d7c ("net: Consistent skb timestamping")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 13d50816f6 net: Fix data-races around netdev_max_backlog.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 5dcd08cd19912892586c6082d56718333e2d19db
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:46 2022 -0700

    net: Fix data-races around netdev_max_backlog.

    While reading netdev_max_backlog, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    While at it, we remove the unnecessary spaces in the doc.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 05d6206bdc net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:45 2022 -0700

    net: Fix data-races around weight_p and dev_weight_[rt]x_bias.

    While reading weight_p, it can be changed concurrently.  Thus, we need
    to add READ_ONCE() to its reader.

    Also, dev_[rt]x_weight can be read/written at the same time.  So, we
    need to use READ_ONCE() and WRITE_ONCE() for its access.  Moreover, to
    use the same weight_p while changing dev_[rt]x_weight, we add a mutex
    in proc_do_dev_weight().

    Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
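
The combination described in the commit above (a mutex to serialize updaters, WRITE_ONCE() for the derived values, READ_ONCE() for lockless readers) can be sketched in userspace. This is a hedged illustration, not proc_do_dev_weight() itself: a pthread mutex stands in for the kernel mutex, the macros are simplified volatile-cast stand-ins, the initial values are assumptions, and `update_dev_weight()`, `rx_quota()` and `tx_quota()` are hypothetical helpers.

```c
#include <assert.h>
#include <pthread.h>

/* Simplified stand-ins for the kernel macros (GNU C __typeof__). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Variable names follow the commit message; initial values are assumed. */
static int weight_p = 64;
static int dev_weight_rx_bias = 1;
static int dev_weight_tx_bias = 1;
static int dev_rx_weight = 64;
static int dev_tx_weight = 64;

/* Serializes updaters so both derived weights use the same weight_p. */
static pthread_mutex_t dev_weight_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical updater: store the inputs, then recompute both derived
 * weights from a single snapshot of weight_p. */
static void update_dev_weight(int new_weight, int rx_bias, int tx_bias)
{
	int weight;

	assert(new_weight > 0);
	pthread_mutex_lock(&dev_weight_mutex);
	WRITE_ONCE(weight_p, new_weight);
	WRITE_ONCE(dev_weight_rx_bias, rx_bias);
	WRITE_ONCE(dev_weight_tx_bias, tx_bias);
	weight = READ_ONCE(weight_p);
	WRITE_ONCE(dev_rx_weight, weight * rx_bias);
	WRITE_ONCE(dev_tx_weight, weight * tx_bias);
	pthread_mutex_unlock(&dev_weight_mutex);
}

/* Lockless readers, as a fast path would be. */
static int rx_quota(void) { return READ_ONCE(dev_rx_weight); }
static int tx_quota(void) { return READ_ONCE(dev_tx_weight); }
```

The mutex only orders the writers; readers stay lock-free and rely on the single-access guarantee of the READ_ONCE() stand-in.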
Ivan Vecera 7ca7843425 net: unexport a handful of dev_* functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

commit 2cc6cdd44a1655ac5a9863529a2fd6dbed2d092c
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:53 2022 -0700

    net: unexport a handful of dev_* functions

    We have a bunch of functions which are only used under
    net/core/ yet they get exported. Remove the exports.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:03:08 +02:00
Ivan Vecera 616826f600 net: remove .ndo_change_proto_down
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to existing backport of 3b89b511ea0c ("net:
  fix IFF_TX_SKB_NO_LINEAR definition")

commit 2106efda785b55a8957efed9a52dfa28ee0d7280
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Nov 22 17:24:47 2021 -0800

    net: remove .ndo_change_proto_down

    .ndo_change_proto_down was added seemingly to enable out-of-tree
    implementations. Over 2.5yrs later we still have no real users
    upstream. Hardwire the generic implementation for now, we can
    revert once real users materialize. (rocker is a test vehicle,
    not a user.)

    We need to drop the optimization on the sysfs side, because
    unlike ndos priv_flags will be changed at runtime, so we'd
    need READ_ONCE/WRITE_ONCE everywhere.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:02:55 +02:00
Felix Maurer 8611666ff2 xdp: check prog type before updating BPF link
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 382778edc8262b7535f00523e9eb22edba1b9816
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Fri Jan 7 23:11:13 2022 +0100

    xdp: check prog type before updating BPF link

    The bpf_xdp_link_update() function didn't check the program type before
    updating the program, which made it possible to install any program type as
    an XDP program, which is obviously not good. Syzbot managed to trigger this
    by swapping in an LWT program on the XDP hook which would crash in a helper
    call.

    Fix this by adding a check and bailing out if the types don't match.

    Fixes: 026a4c28e1 ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link")
    Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 16:56:03 +02:00
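
The check added by the commit above (refuse an update when the replacement program's type does not match) can be sketched as follows. This is a simplified illustration, not bpf_xdp_link_update(): `enum prog_type`, `struct prog`, `struct xdp_link` and `xdp_link_update()` are hypothetical, and the real function handles locking and other conditions that are left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical program types; the real enum has many more entries. */
enum prog_type {
	PROG_TYPE_XDP,
	PROG_TYPE_LWT_IN
};

struct prog {
	enum prog_type type;
	const char *name;
};

/* Hypothetical link object that currently holds one attached program. */
struct xdp_link {
	const struct prog *prog;
};

/* Swap in new_prog only if its type matches the attached program;
 * otherwise bail out with -EINVAL and leave the link untouched. */
static int xdp_link_update(struct xdp_link *link, const struct prog *new_prog)
{
	assert(link != NULL && link->prog != NULL && new_prog != NULL);

	if (new_prog->type != link->prog->type)
		return -EINVAL;

	link->prog = new_prog;
	return 0;
}
```

Rejecting the mismatch before the swap keeps a program written for a different hook from ever running with XDP context.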
Patrick Talbert 95ad1a9fa6 Merge: CNB: bpf: Let bpf_warn_invalid_xdp_action() report more info
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1070

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454
Tested: Build, boot.

This commit lets bpf_warn_invalid_xdp_action() report more information.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Mohamed Gamal Morsy <mgamal@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-15 09:40:47 +02:00
Patrick Talbert 5f85d33e47 Merge: net/core: backport fixes from upstream for 9.1 P2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1057

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278

The latest patch depends on the second-latest patch.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-14 12:07:49 +02:00
Patrick Talbert c2f72a65cf Merge: CNB: gro: get out of core files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1066

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789
Tested: Just built - there is no functional change

The series moves GRO-related definitions, declarations and code from core files into net/core/gro.h and include/net/gro.h, shrinking the oversized include/linux/netdevice.h and net/core/dev.c. Backporting this series provides <net/gro.h> for NIC drivers and avoids conflicts in future GRO-related backports and fixes.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Conflicts:
- include/linux/netdevice.h: fuzz.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-12 10:36:03 +02:00
Patrick Talbert f063b56239 Merge: net: backport netdevice and netns refcount tracking and enable them for debug kernels
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1003

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377
Tested: Basic networking tasks using namespaces, vlans, veths, macvlans etc. with kernel-debug flavor

The upstream kernel recently introduced refcount tracking infrastructure for network devices and namespaces to help avoid resource leaks and use-after-free issues. This infrastructure should help our support teams debug customers' issues.
The series backports the following commits and enables both trackers for kernel debug flavors:

```
95d1d2490c27 ("netdevice: move xdp_rxq within netdev_rx_queue")
2a12ae5d433d ("net: inline sock_prot_inuse_add()")
d477eb900484 ("net: make sock_inuse_add() available")
4199bae10c49 ("net: merge net->core.prot_inuse and net->core.sock_inuse")
b3cb764aa1d7 ("net: drop nopreempt requirement on sock_prot_inuse_add()")
4e66934eaadc ("lib: add reference counting tracking infrastructure")
914a7b5000d0 ("lib: add tests for reference tracker")
4d92b95ff2f9 ("net: add net device refcount tracker infrastructure")
80e8921b2b72 ("net: add net device refcount tracker to struct netdev_rx_queue")
0b688f24b7d6 ("net: add net device refcount tracker to struct netdev_queue")
5ae2195088d0 ("net: add net device refcount tracker to ethtool_phys_id()")
14ed029b5eb5 ("net: add net device refcount tracker to dev_ifsioc()")
4dbd24f65c60 ("drop_monitor: add net device refcount tracker")
9038c320001d ("net: dst: add net device refcount tracking to dst_entry")
fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()")
c0fd407a0666 ("sit: add net device refcount tracking to ip_tunnel")
56c1c77948ba ("ipv6: add net device refcount tracker to struct ip6_tnl")
85662c9f8cbd ("net: add net device refcount tracker to struct neighbour")
77a23b1f9543 ("net: add net device refcount tracker to struct pneigh_entry")
08d622568e5a ("net: add net device refcount tracker to struct neigh_parms")
f77159a348f2 ("net: add net device refcount tracker to struct netdev_adjacent")
8c727003c4d0 ("ipv6: add net device refcount tracker to struct inet6_dev")
c04438f58d14 ("ipv4: add net device refcount tracker to struct in_device")
606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc")
63f13937cbe9 ("net: linkwatch: add net device refcount tracker")
095e200f175f ("net: failover: add net device refcount tracker")
42120a864383 ("ipmr, ip6mr: add net device refcount tracker to struct vif_device")
5fa5ae605821 ("netpoll: add net device refcount tracker to struct netpoll")
c0e5e11af12b ("vrf: use dev_replace_track() for better tracking")
08f0b22d731f ("net: eql: add net device refcount tracker")
19c9ebf6ed70 ("vlan: add net device refcount tracker")
b2dcdc7f731d ("net: bridge: add net device refcount tracker")
f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
4fc003fe0313 ("net: switchdev: add net device refcount tracker")
e44b14ebae10 ("inet: add net device refcount tracker to struct fib_nh_common")
66ce07f7802b ("ax25: add net device refcount tracker")
615d069dcf12 ("llc: add net device refcount tracker")
035f1f2b96ae ("pktgen add net device refcount tracker")
b60645248af3 ("net/smc: add net device tracker to struct smc_pnetentry")
e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info")
e7c8ab8419d7 ("openvswitch: add net device refcount tracker to struct vport")
ada066b2e02c ("net: sched: act_mirred: add net device refcount tracker")
4177e4960594 ("xfrm: use net device refcount tracker helpers")
9ba74e6c9e9d ("net: add networking namespace refcount tracker")
ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
04a931e58d19 ("net: add netns refcount tracker to struct seq_net_private")
dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts")
285ec2fef4b8 ("l2tp: add netns refcount tracker to l2tp_dfs_seq_data")
11b311a867b6 ("ppp: add netns refcount tracker")
0976b888a150 ("ethtool: fix null-ptr-deref on ref tracker")
e1b539bd73a7 ("xfrm: add net device refcount tracker to struct xfrm_state_offload")
8b40a9d53d4f ("ipv6: use GFP_ATOMIC in rt6_probe()")
1d2f3d3c6268 ("mptcp: adjust to use netns refcount tracker")
123e495ecc25 ("net: linkwatch: be more careful about dev->linkwatch_dev_tracker")
9280ac2e6f19 ("net: dev_replace_track() cleanup")
34ac17ecbf57 ("ethtool: use ethnl_parse_header_dev_put()")
f1d9268e0618 ("net: add net device refcount tracker to struct packet_type")
3bc14ea0d12a ("ethtool: always write dev in ethnl_parse_header_dev_get")
a9382d9389a0 ("netfilter: nfnetlink: add netns refcount tracker to struct nfulnl_instance")
30db406923b9 ("netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic")
7970a19b7104 ("netfilter: nf_nat_masquerade: defer conntrack walk to work queue")
fc0d026a2fad ("netfilter: nf_nat_masquerade: add netns refcount tracker to masq_dev_work")
88248c357c2a ("net/sched: add missing tracker information in qdisc_create()")
2d6ec25539b0 ("netlink: do not allocate a device refcount tracker in ethnl_default_notify()")
bf44077c1b3a ("af_packet: fix tracking issues in packet_do_bind()")
cb963a19d99f ("net: sched: do not allocate a tracker in tcf_exts_init()")
c12837d1bb31 ("ref_tracker: use __GFP_NOFAIL more carefully")
fcfb894d5952 ("net: bridge: fix net device refcount tracking issue in error path")
7b9b1d449a7c ("net/smc: fix possible NULL deref in smc_pnet_add_eth()")
6cdef8a6ee74 ("SUNRPC: add netns refcount tracker to struct svc_xprt")
9b1831e56c7f ("SUNRPC: add netns refcount tracker to struct gss_auth")
b9a0d6d143ec ("SUNRPC: add netns refcount tracker to struct rpc_xprt")
e3ececfe668f ("ref_tracker: implement use-after-free detection")
8fd5522f44dc ("ref_tracker: add a count of untracked references")
4c6c11ea0f7b ("net: refine dev_put()/dev_hold() debugging")
28f922213886 ("net/smc: fix ref_tracker issue in smc_pnet_add()")
94fdd7c02a56 ("net/smc: use GFP_ATOMIC allocation in smc_pnet_add_eth()")
b2309a71c1f2 ("net: add dev->dev_registered_tracker")
3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()")
ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-01 09:17:32 +02:00
Ivan Vecera ca7c7d9c0c bpf: Let bpf_warn_invalid_xdp_action() report more info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454

Conflicts:
- N/A hunk for unsupported octeontx2 driver omitted

commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Nov 30 11:08:07 2021 +0100

    bpf: Let bpf_warn_invalid_xdp_action() report more info

    In non trivial scenarios, the action id alone is not sufficient to
    identify the program causing the warning. Before the previous patch,
    the generated stack-trace pointed out at least the involved device
    driver.

    Let's additionally include the program name and id, and the relevant
    device name.

    If the user needs additional information, they can fetch it via a
    kernel probe, leveraging the arguments added here.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 16:13:14 +02:00