2210 Commits

Author SHA1 Message Date
Ivan Vecera d5968be5cd net: add atomic_long_t to net_device_stats fields
JIRA: https://issues.redhat.com/browse/RHEL-862

commit 6c1c5097781f563b70a81683ea6fdac21637573b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 15 08:53:55 2022 +0000

    net: add atomic_long_t to net_device_stats fields

    Long standing KCSAN issues are caused by data-race around
    some dev->stats changes.

    Most performance critical paths already use per-cpu
    variables, or per-queue ones.

    It is reasonable (and more correct) to use atomic operations
    for the slow paths.

    This patch adds a union for each field of net_device_stats,
    so that we can convert paths that are not yet protected
    by a spinlock or a mutex.

    netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64

    Note that the memcpy() we were using on 64bit arches
    had no provision to avoid load-tearing,
    while atomic_long_read() is providing the needed protection
    at no cost.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-05 15:29:59 +01:00
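
The union approach described in the net_device_stats commit above can be sketched in userspace C. This is only an illustration under stated assumptions: C11 `<stdatomic.h>` stands in for the kernel's atomic_long_t, and `struct demo_stats` and the `demo_*` helpers are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: C11 atomic_long stands in for the kernel's atomic_long_t. */
typedef atomic_long demo_atomic_long_t;

/*
 * Hypothetical miniature of a stats structure. Each counter is a union,
 * so unconverted code can keep the plain field name while converted slow
 * paths use the atomic view of the same storage.
 */
struct demo_stats {
	union { unsigned long rx_dropped; demo_atomic_long_t __rx_dropped; };
	union { unsigned long tx_dropped; demo_atomic_long_t __tx_dropped; };
};

void demo_stats_init(struct demo_stats *s)
{
	assert(s);
	atomic_init(&s->__rx_dropped, 0);
	atomic_init(&s->__tx_dropped, 0);
}

/* Slow-path increment that is not protected by any lock. */
void demo_rx_dropped_inc(struct demo_stats *s)
{
	atomic_fetch_add_explicit(&s->__rx_dropped, 1, memory_order_relaxed);
}

/* Tear-free read, widened to 64 bits as a stats64 conversion would do. */
uint64_t demo_rx_dropped_read(struct demo_stats *s)
{
	long v = atomic_load_explicit(&s->__rx_dropped, memory_order_relaxed);

	return (uint64_t)(unsigned long)v;
}
```

The sketch only touches the atomic view of each counter; mixing plain and atomic accesses to the same field is exactly the kind of data race the commit sets out to remove.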
Scott Weaver f51e07d91d Merge: CNB94: xsk: Multi-buffer support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3310

JIRA: https://issues.redhat.com/browse/RHEL-15250
Tested: Using attached self-tests [Results in JIRA]

The series adds multi-buffer support to XSK. It is based on upstream series `3226e3139dfe ("Merge branch 'xsk-multi-buffer-support'")` and also contains commits from upstream series `34e78bab67c5 ("Merge branch 'seltests/xsk: prepare for AF_XDP multi-buffer testing'")` to make the attached self-tests applicable.

Commits:
```
0c5f48599bed ("xsk: Simplify xp_aligned_validate_desc implementation")
f2f167583601 ("xsk: Remove unused xsk_buff_discard")
e2fa5c2068fb ("xsk: Remove unused inline function xsk_buff_discard()")
63a64a56bc3f ("xsk: prepare 'options' in xdp_desc for multi-buffer use")
81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket")
556444c4e683 ("xsk: prepare both copy and zero-copy modes to co-exist")
faa91b839b09 ("xsk: move xdp_buff's data length check to xsk_rcv_check")
804627751b42 ("xsk: add support for AF_XDP multi-buffer on Rx path")
b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
1b725b0c8163 ("xsk: allow core/drivers to test EOP bit")
cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
07428da9e25a ("xsk: discard zero length descriptors in Tx path")
13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags")
24ea50127ecf ("xsk: support mbuf on ZC RX")
d5581966040f ("xsk: support ZC Tx multi-buffer in batch API")
49ca37d0d825 ("xsk: add multi-buffer documentation")
9a321fd3308e ("selftests/xsk: add xdp populate metadata test")
68e7322142f5 ("selftests: xsk: Deflakify STATS_RX_DROPPED test")
7a2050df244e ("selftests: xsk: Use correct UMEM size in testapp_invalid_desc")
ccd1b2933f8c ("selftests: xsk: Add test case for packets at end of UMEM")
c0801598e543 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE")
d2e541494935 ("selftests/xsk: do not change XDP program when not necessary")
df82d2e89c41 ("selftests/xsk: generate simpler packets with variable length")
feb973a9094f ("selftests/xsk: add varying payload pattern within packet")
7a8a6762822a ("selftests/xsk: dump packet at error")
69fc03d220a3 ("selftests/xsk: add packet iterator for tx to packet stream")
d9f6d9709f87 ("selftests/xsk: store offset in pkt instead of addr")
041b68f688a3 ("selftests/xsx: test for huge pages only once")
86e41755b432 ("selftests/xsk: populate fill ring based on frags needed")
2f6eae0df1a8 ("selftests/xsk: generate data for multi-buffer packets")
7cd6df4f5ec2 ("selftests/xsk: adjust packet pacing for multi-buffer support")
17f1034dd76d ("selftests/xsk: transmit and receive multi-buffer packets")
f540d44e05cf ("selftests/xsk: add basic multi-buffer test")
1005a226da9a ("selftests/xsk: add unaligned mode test for multi-buffer")
697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer")
f80ddbec4762 ("selftests/xsk: add metadata copy test for multi-buff")
807bf4da2049 ("selftests/xsk: add test for too many frags")
3666bccab43a ("selftests/xsk: reset NIC settings to default after running test suite")
d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem")
9d0a67b9d42c ("xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()")
a097627dcadd ("net: add missing net_device::xdp_zc_max_segs description")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-29 14:08:06 -05:00
Scott Weaver 971351c941 Merge: CNB94: net: add check for current MAC address in dev_set_mac_address
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3398

JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

This prevents network drivers' .ndo_set_mac_address method from being called when the MAC address is already the current one. There are drivers that more or less assume that this is how the network core already behaves. For example, iavf will send a virtchnl message to the PF requesting to add the new address and then a message to remove the old address. This logic is broken if old and new are the same address.

Tested: I used the reproducer steps from RHEL-6368, with VFs on Intel E810.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-28 10:54:46 -05:00
Michal Schmidt 2c3e95de5a net: fix net device address assign type
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

commit 0ec92a8f56ff07237dbe8af7c7a72aba7f957baf
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date:   Wed Jun 21 15:21:06 2023 +0200

    net: fix net device address assign type

    Commit ad72c4a06acc introduced an optimization to return from the
    function quickly if the MAC address is not changing at all. It was
    reported that this change causes dev->addr_assign_type to not change
    to NET_ADDR_SET from _PERM or _RANDOM.
    Restore the old behavior and skip only the call to ndo_set_mac_address.

    Fixes: ad72c4a06acc ("net: add check for current MAC address in dev_set_mac_address")
    Reported-by: Gal Pressman <gal@nvidia.com>
    Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20230621132106.991342-1-piotrx.gardocki@intel.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2023-11-21 23:16:06 +01:00
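
The behavior after the fix above (skip only the driver callback when the address is unchanged, but still mark the address as explicitly set) can be modeled roughly as follows. This is a userspace sketch, not the kernel's dev_set_mac_address(): `struct demo_dev`, the `DEMO_ADDR_*` values, and `demo_driver_set_mac()` are hypothetical stand-ins for the net_device, the NET_ADDR_* assign types, and a driver's .ndo_set_mac_address method.

```c
#include <assert.h>
#include <string.h>

#define DEMO_ADDR_LEN 6

/* Hypothetical stand-ins for NET_ADDR_PERM, NET_ADDR_RANDOM, NET_ADDR_SET. */
enum demo_assign_type { DEMO_ADDR_PERM, DEMO_ADDR_RANDOM, DEMO_ADDR_SET };

struct demo_dev {
	unsigned char addr[DEMO_ADDR_LEN];
	enum demo_assign_type assign_type;
	int driver_calls;	/* counts calls into the driver callback */
};

/* Stand-in for a driver's .ndo_set_mac_address implementation. */
void demo_driver_set_mac(struct demo_dev *dev, const unsigned char *addr)
{
	memcpy(dev->addr, addr, DEMO_ADDR_LEN);
	dev->driver_calls++;
}

void demo_set_mac_address(struct demo_dev *dev, const unsigned char *addr)
{
	assert(dev && addr);

	/* Only bother the driver when the address actually changes. */
	if (memcmp(dev->addr, addr, DEMO_ADDR_LEN) != 0)
		demo_driver_set_mac(dev, addr);

	/* Always record that the address was set explicitly. */
	dev->assign_type = DEMO_ADDR_SET;
}
```

Setting an interface to the address it already has therefore leaves `driver_calls` untouched while `assign_type` still becomes `DEMO_ADDR_SET`, which is the distinction between the original optimization and its fix.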
Michal Schmidt 37100466e2 net: add check for current MAC address in dev_set_mac_address
JIRA: https://issues.redhat.com/browse/RHEL-16986
JIRA: https://issues.redhat.com/browse/RHEL-6368

commit ad72c4a06acc6762e84994ac2f722da7a07df34e
Author: Piotr Gardocki <piotrx.gardocki@intel.com>
Date:   Wed Jun 14 16:53:00 2023 +0200

    net: add check for current MAC address in dev_set_mac_address

    In some cases the kernel may request a change of the primary
    MAC address to the address that is already set on the given
    interface.

    Add proper check to return fast from the function in these cases.

    An example of such case is adding an interface to bonding
    channel in balance-alb mode:
    modprobe bonding mode=balance-alb miimon=100 max_bonds=1
    ip link set bond0 up
    ifenslave bond0 <eth>

    Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2023-11-21 23:16:06 +01:00
Antoine Tenart b2c4833a40 net: skbuff: update and rename __kfree_skb_defer()
JIRA: https://issues.redhat.com/browse/RHEL-14554
Upstream Status: linux.git

commit 8fa66e4a1bdd41d55d7842928e60a40fed65715d
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 19 19:00:05 2023 -0700

    net: skbuff: update and rename __kfree_skb_defer()

    __kfree_skb_defer() uses the old naming where "defer" meant
    slab bulk free/alloc APIs. In the meantime we also made
    __kfree_skb_defer() feed the per-NAPI skb cache, which
    implies bulk APIs. So take away the 'defer' and add 'napi'.

    While at it add a drop reason. This only matters on the
    tx_action path, if the skb has a frag_list. But getting
    rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net
    benefit so why not.

    Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Link: https://lore.kernel.org/r/20230420020005.815854-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-11-10 17:40:29 +01:00
Scott Weaver 6cf5659031 Merge: CNB94: page_pool: allow caching from safely localized NAPI
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3196

JIRA: https://issues.redhat.com/browse/RHEL-12613
Tested: Using LNST net-driver test-suite on i40e, bnxt_en, ice and mlx5_core [http://dashboard.lnst.anl.lab.eng.bos.redhat.com/pipeline/3644]

Commits:
```
4727bab4e9bb ("net: skb: move skb_pp_recycle() to skbuff.c")
eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
f72ff8b81ebc ("net: fix kfree_skb_list use of skb_mark_not_on_list")
9dde0cd3b10f ("net: introduce skb_poison_list and use in kfree_skb_list")
b07a2d97ba5e ("net: skb: plumb napi state thru skb freeing paths")
8c48eea3adf3 ("page_pool: allow caching from safely localized NAPI")
dd64b232deb8 ("page_pool: unlink from napi during destroy")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-09 07:22:35 -05:00
Ivan Vecera 96ba8afe11 xsk: add new netlink attribute dedicated for ZC max frags
JIRA: https://issues.redhat.com/browse/RHEL-15250

commit 13ce2daa259a3bfbc9a5aeeee8b9a87058703731
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date:   Wed Jul 19 15:24:07 2023 +0200

    xsk: add new netlink attribute dedicated for ZC max frags

    Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
    carry maximum fragments that underlying ZC driver is able to handle on
    TX side. It is going to be included in netlink response only when driver
    supports ZC. Any value higher than 1 implies multi-buffer ZC support on
    underlying device.

    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-11-01 14:56:57 +01:00
Ivan Vecera d80ce17d20 page_pool: allow caching from safely localized NAPI
JIRA: https://issues.redhat.com/browse/RHEL-12613

Conflicts:
- simple context conflict in net/core/dev.c due to absence of commit
  8b43fd3d1d7d8 ("net: optimize ____napi_schedule() to avoid extra
  NET_RX_SOFTIRQ") that is out of scope of this series

commit 8c48eea3adf3119e0a3fc57bd31f6966f26ee784
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 12 21:26:04 2023 -0700

    page_pool: allow caching from safely localized NAPI

    Recent patches to mlx5 mentioned a regression when moving from
    driver local page pool to only using the generic page pool code.
    Page pool has two recycling paths (1) direct one, which runs in
    safe NAPI context (basically consumer context, so producing
    can be lockless); and (2) via a ptr_ring, which takes a spin
    lock because the freeing can happen from any CPU; producer
    and consumer may run concurrently.

    Since the page pool code was added, Eric introduced a revised version
    of deferred skb freeing. TCP skbs are now usually returned to the CPU
    which allocated them, and freed in softirq context. This places the
    freeing (producing of pages back to the pool) enticingly close to
    the allocation (consumer).

    If we can prove that we're freeing in the same softirq context in which
    the consumer NAPI will run - lockless use of the cache is perfectly fine,
    no need for the lock.

    Let drivers link the page pool to a NAPI instance. If the NAPI instance
    is scheduled on the same CPU on which we're freeing - place the pages
    in the direct cache.

    With that and patched bnxt (XDP enabled to engage the page pool, sigh,
    bnxt really needs page pool work :() I see a 2.6% perf boost with
    a TCP stream test (app on a different physical core than softirq).

    The CPU use of relevant functions decreases as expected:

      page_pool_refill_alloc_cache   1.17% -> 0%
      _raw_spin_lock                 2.41% -> 0.98%

    Only consider lockless path to be safe when NAPI is scheduled
    - in practice this should cover majority if not all of steady state
    workloads. It's usually the NAPI kicking in that causes the skb flush.

    The main case we'll miss out on is when application runs on the same
    CPU as NAPI. In that case we don't use the deferred skb free path.

    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-31 15:09:26 +01:00
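
The recycling decision described in the page_pool commit above can be sketched as a small userspace model. It follows the wording of the commit message only; `struct demo_napi`, `struct demo_pool`, and the `demo_*` functions are hypothetical, and the counters merely stand in for the lockless direct cache and the spinlock-protected ptr_ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a NAPI instance: is it scheduled, and on which CPU. */
struct demo_napi {
	bool scheduled;
	int owner_cpu;
};

/* Hypothetical model of a page pool optionally linked to a NAPI instance. */
struct demo_pool {
	struct demo_napi *napi;	/* NULL when the driver linked no NAPI */
	int direct_count;	/* pages placed in the lockless direct cache */
	int ring_count;		/* pages returned through the locked ring */
};

/* The lockless path is safe only if the linked NAPI runs on this CPU. */
bool demo_napi_safe(const struct demo_pool *pool, int this_cpu)
{
	const struct demo_napi *napi = pool->napi;

	return napi != NULL && napi->scheduled && napi->owner_cpu == this_cpu;
}

void demo_recycle(struct demo_pool *pool, int this_cpu)
{
	assert(pool);

	if (demo_napi_safe(pool, this_cpu))
		pool->direct_count++;	/* same softirq context, no lock needed */
	else
		pool->ring_count++;	/* any other CPU or context: take the lock */
}
```

A pool without a linked NAPI, or one whose NAPI is idle or running elsewhere, always falls back to the ring in this model, which matches the commit's note that only the scheduled case is treated as safe.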
Scott Weaver d05495aca0 Merge: CNB94: tc: update tc subsystem to the upstream v6.5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3067

JIRA: https://issues.redhat.com/browse/RHEL-1773
Depends: https://issues.redhat.com/browse/RHEL-860
Depends: https://issues.redhat.com/browse/RHEL-3646

Update TC (net/sched) to the upstream v6.5

Omitted-fix: cad7526f33ce ("net: dsa: ocelot: unlock on error in vsc9959_qos_port_tas_set()")
Not needed, DSA as well as ocelot driver is not enabled/supported in RHEL

Commits:
```
1b808993e194 ("flow_dissector: fix false-positive __read_overflow2_field() warning")
f743f16c548b ("treewide: use get_random_{u8,u16}() when possible, part 2")
7e3cf0843fe5 ("treewide: use get_random_{u8,u16}() when possible, part 1")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
62423bd2d2e2 ("net: sched: remove qdisc_watchdog->last_expires")
c66b2111c9c9 ("selftests: tc-testing: add tests for action binding")
f5fca219ad45 ("net: do not use skb_mac_header() in qdisc_pkt_len_init()")
e495a9673caf ("sch_cake: do not use skb_mac_header() in cake_overhead()")
b3be94885af4 ("net/sched: remove two skb_mac_header() uses")
fcb3a4653bc5 ("net/sched: act_api: use the correct TCA_ACT attributes in dump")
4170f0ef582c ("fix typos in net/sched/")
8b0f256530d9 ("net/sched: sch_mqprio: use netlink payload helpers")
3dd0c16ec93e ("net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS")
57f21bf85400 ("net/sched: mqprio: add extack to mqprio_parse_nlattr()")
ab277d2084ba ("net/sched: mqprio: add an extack message to mqprio_parse_opt()")
c54876cd5961 ("net/sched: pass netlink extack to mqprio and taprio offload")
f62af20bed2d ("net/sched: mqprio: allow per-TC user input of FP adminStatus")
a721c3e54b80 ("net/sched: taprio: allow per-TC user input of FP adminStatus")
8c966a10eb84 ("flow_dissector: Address kdoc warnings")
54e906f1639e ("selftests: forwarding: sch_tbf_*: Add a pre-run hook")
2f0f9465ad9f ("net: sched: Print msecs when transmit queue time out")
5036034572b7 ("net/sched: act_pedit: use NLA_POLICY for parsing 'ex' keys")
0c83c5210e18 ("net/sched: act_pedit: use extack in 'ex' parsing errors")
e1201bc781c2 ("net/sched: act_pedit: check static offsets a priori")
577140180ba2 ("net/sched: act_pedit: remove extra check for key type")
e3c9673e2f6e ("net/sched: act_pedit: rate limit datapath messages")
807cfded92b0 ("net/sched: sch_htb: use extack on errors messages")
c69a9b023f65 ("net/sched: sch_qfq: use extack on errors messages")
25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters")
7eb060a51a3b ("selftests: tc-testing: add more tests for sch_qfq")
1b483d9f5805 ("net/sched: act_pedit: free pedit keys on bail from offset check")
526f28bd0fbd ("net/sched: act_mirred: Add carrier check")
12e7789ad5b4 ("sch_htb: Allow HTB priority parameter in offload mode")
c7cfbd115001 ("net/sched: sch_ingress: Only create under TC_H_INGRESS")
5eeebfe6c493 ("net/sched: sch_clsact: Only create under TC_H_CLSACT")
f85fa45d4a94 ("net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs")
9de95df5d15b ("net/sched: Prohibit regrafting ingress or clsact Qdiscs")
7b4858df3bf7 ("skbuff: bridge: Add layer 2 miss indication")
d5ccfd90df7f ("flow_dissector: Dissect layer 2 miss from tc skb extension")
1a432018c0cd ("net/sched: flower: Allow matching on layer 2 miss")
f4356947f029 ("flow_offload: Reject matching on layer 2 miss")
8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases")
dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")
2d800bc500fb ("net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum")
6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters")
a395b8d1c7c3 ("selftests/tc-testing: replace mq with invalid parent ID")
8cde87b007da ("net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINE")
cd2b8113c2e8 ("net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values")
d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
886bc7d6ed33 ("net: sched: move rtm_tca_policy declaration to include file")
682881ee45c8 ("net: sched: act_police: fix sparse errors in tcf_police_dump()")
6c02568fd1ae ("net/sched: act_pedit: Parse L3 Header for L4 offset")
26e35370b976 ("net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy")
2b84960fc5dd ("net/sched: taprio: report class offload stats per TXQ, not per TC")
d7ad70b5ef5a ("net: flow_dissector: add support for cfm packets")
7cfffd5fed3e ("net: flower: add support for matching cfm fields")
1668a55a73f5 ("selftests: net: add tc flower cfm test")
c29e012eae29 ("selftests: forwarding: Fix layer 2 miss test syntax")
aef6e908b542 ("selftests/tc-testing: Fix Error: Specified qdisc kind is unknown.")
b849c566ee9c ("selftests/tc-testing: Fix Error: failed to find target LOG")
b39d8c41c7a8 ("selftests/tc-testing: Fix SFB db test")
11b8b2e70a9b ("selftests/tc-testing: Remove configs that no longer exist")
41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
2d5f6a8d7aef ("net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs")
84ad0af0bccd ("net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting")
e16ad981e2a1 ("net: sched: Remove unused qdisc_l2t()")
ca4fa8743537 ("selftests: tc-testing: add one test for flushing explicitly created chain")
b4ee93380b3c ("net/sched: act_ipt: add sanity checks on table name and hook locations")
b2dc32dcba08 ("net/sched: act_ipt: add sanity checks on skb before calling target")
93d75d475c5d ("net/sched: act_ipt: zero skb->cb before calling target")
30c45b5361d3 ("net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX")
989b52cdc849 ("net: sched: Replace strlcpy with strscpy")
d3f87278bcb8 ("net/sched: flower: Ensure both minimum and maximum ports are specified")
150e33e62c1f ("net/sched: make psched_mtu() RTNL-less safe")
158810b261d0 ("net/sched: sch_qfq: reintroduce lmax bound check for MTU")
c5a06fdc618d ("selftests: tc-testing: add tests for qfq mtu sanity check")
3e337087c3b5 ("net/sched: sch_qfq: account for stab overhead in qfq_enqueue")
137f6219da59 ("selftests: tc-testing: add test for qfq with stab overhead")
d1cca974548d ("pie: fix kernel-doc notation warning")
b3d0e0489430 ("net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after mall_set_parms")
9cb36faedeaf ("net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knode")
e8d3d78c19be ("net: sched: cls_u32: Undo refcount decrement in case update failed")
26a22194927e ("net: sched: cls_bpf: Undo tcf_bind_filter in case of an error")
ac177a330077 ("net: sched: cls_flower: Undo tcf_bind_filter in case of an error")
fda05798c22a ("selftests: tc: set timeout to 15 minutes")
719b4774a8cb ("selftests: tc: add 'ct' action kconfig dep")
031c99e71fed ("selftests: tc: add ConnTrack procfs kconfig")
4914109a8e1e ("netfilter: allow exp not to be removed in nf_ct_find_expectation")
76622ced50a1 ("net: sched: set IPS_CONFIRMED in tmpl status only when commit is set in act_ct")
8c8b73320805 ("openvswitch: set IPS_CONFIRMED in tmpl status only when commit is set in conntrack")
9fe63d5f1da9 ("sch_htb: Allow HTB quantum parameter in offload mode")
6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64")
4d50e50045aa ("net: flower: fix stack-out-of-bounds in fl_set_key_cfm()")
e68409db9953 ("net: sched: cls_u32: Fix match key mis-addressing")
e739718444f7 ("net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.")
21a72166abb9 ("selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-24 13:29:05 -04:00
Scott Weaver 03206d751a Merge: CNB94: net: move gso declarations and functions to their own files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3198

JIRA: https://issues.redhat.com/browse/RHEL-12679
Tested: Just built... no functional change

Commits:
```
d457a0e329b0 ("net: move gso declarations and functions to their own files")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-19 10:36:22 -04:00
Scott Weaver ec70982f69 Merge: ice: Enable DPLL support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2961

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515

This feature request is to add and enable the DPLL subsystem and DPLL support in the ice driver.

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-19 10:36:20 -04:00
Ivan Vecera 92e020fb45 net: sched: add rcu annotations around qdisc->qdisc_sleeping
JIRA: https://issues.redhat.com/browse/RHEL-1773

Conflicts:
- resolved conflict in net/sched/sch_taprio.c the same way as in
  449f6bc17a51 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")

commit d636fc5dd692c8f4e00ae6e0359c0eceeb5d9bdb
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 6 11:19:29 2023 +0000

    net: sched: add rcu annotations around qdisc->qdisc_sleeping

    syzbot reported a race around qdisc->qdisc_sleeping [1]

    It is time we add proper annotations to reads and writes to/from
    qdisc->qdisc_sleeping.

    [1]
    BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

    read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
    qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
    __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
    tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
    rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
    netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
    netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
    netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
    netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
    sock_sendmsg_nosec net/socket.c:724 [inline]
    sock_sendmsg net/socket.c:747 [inline]
    ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
    ___sys_sendmsg net/socket.c:2557 [inline]
    __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
    __do_sys_sendmsg net/socket.c:2595 [inline]
    __se_sys_sendmsg net/socket.c:2593 [inline]
    __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
    dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
    qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
    tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
    rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
    netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
    rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
    netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
    netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
    netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
    sock_sendmsg_nosec net/socket.c:724 [inline]
    sock_sendmsg net/socket.c:747 [inline]
    ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
    ___sys_sendmsg net/socket.c:2557 [inline]
    __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
    __do_sys_sendmsg net/socket.c:2595 [inline]
    __se_sys_sendmsg net/socket.c:2593 [inline]
    __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

    Fixes: 3a7d0d07a3 ("net: sched: extend Qdisc with rcu")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Vlad Buslov <vladbu@nvidia.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-13 09:03:10 +02:00
Ivan Vecera f43c4f5429 net: do not use skb_mac_header() in qdisc_pkt_len_init()
JIRA: https://issues.redhat.com/browse/RHEL-1773

commit f5fca219ad4548bc45f0221f9857ad22cb8136a1
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 21 16:45:17 2023 +0000

    net: do not use skb_mac_header() in qdisc_pkt_len_init()

    We want to remove our use of skb_mac_header() in tx paths,
    eg remove skb_reset_mac_header() from __dev_queue_xmit().

    Idea is that ndo_start_xmit() can get the mac header
    simply looking at skb->data.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-13 09:03:06 +02:00
Ivan Vecera 497f645693 net: move gso declarations and functions to their own files
JIRA: https://issues.redhat.com/browse/RHEL-12679

commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 8 19:17:37 2023 +0000

    net: move gso declarations and functions to their own files

    Move declarations into include/net/gso.h and code into net/core/gso.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 13:35:27 +02:00
Petr Oros 104234d3d2 netdev: expose DPLL pin handle for netdevice
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2232515

Upstream commit(s):
commit 5f18426928800c59fb0f9bc8fb0c182bb6f5ee24
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Wed Sep 13 21:49:39 2023 +0100

    netdev: expose DPLL pin handle for netdevice

    In case netdevice represents a SyncE port, the user needs to understand
    the connection between netdevice and associated DPLL pin. There might be
    multiple netdevices pointing to the same pin, in case of VF/SF
    implementation.

    Add an IFLA Netlink attribute to nest the DPLL pin handle, similar to
    how it is implemented for devlink port. Add a struct dpll_pin pointer
    to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set()
    and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear
    the DPLL pin relationship to netdev.

    Note that during the lifetime of struct dpll_pin the pin handle does not
    change. Therefore it is safe to access it locklessly. It is the driver's
    responsibility to call netdev_dpll_pin_clear() before dpll_pin_put().

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
    Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-09-18 15:13:24 +02:00
Ivan Vecera 3cc9e8b28b random32: use real rng for non-deterministic randomness
JIRA: https://issues.redhat.com/browse/RHEL-3646

commit d4150779e60fb6c49be25572596b2cdfc5d46a09
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed May 11 16:11:29 2022 +0200

    random32: use real rng for non-deterministic randomness

    random32.c has two random number generators in it: one that is meant to
    be used deterministically, with some predefined seed, and one that does
    the same exact thing as random.c, except does it poorly. The first one
    has some use cases. The second one no longer does and can be replaced
    with calls to random.c's proper random number generator.

    The relatively recent siphash-based bad random32.c code was added in
    response to concerns that the prior random32.c was too deterministic.
    Out of fears that random.c was (at the time) too slow, this code was
    anonymously contributed. Then out of that emerged a kind of shadow
    entropy gathering system, with its own tentacles throughout various net
    code, added willy nilly.

    Stop👏making👏bespoke👏random👏number👏generators👏.

    Fortunately, recent advances in random.c mean that we can stop playing
    with this sketchiness, and just use get_random_u32(), which is now fast
    enough. In micro benchmarks using RDPMC, I'm seeing the same median
    cycle count between the two functions, with the mean being _slightly_
    higher due to batches refilling (which we can optimize further if need be).
    However, when doing *real* benchmarks of the net functions that actually
    use these random numbers, the mean cycles actually *decreased* slightly
    (with the median still staying the same), likely because the additional
    prandom code means icache misses and complexity, whereas random.c is
    generally already being used by something else nearby.

    The biggest benefit of this is that there are many users of prandom who
    probably should be using cryptographically secure random numbers. This
    makes all of those accidental cases become secure by just flipping a
    switch. Later on, we can do a tree-wide cleanup to remove the static
    inline wrapper functions that this commit adds.

    There are also some low-ish hanging fruits for making this even faster
    in the future: a get_random_u16() function for use in the networking
    stack will give a 2x performance boost there, using SIMD for ChaCha20
    will let us compute 4 or 8 or 16 blocks of output in parallel, instead
    of just one, giving us large buffers for cheap, and introducing a
    get_random_*_bh() function that assumes irqs are already disabled will
    shave off a few cycles for ordinary calls. These are things we can chip
    away at down the road.

    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-09-13 18:39:29 +02:00
Jan Stancek 645597c064 Merge: net: core: stable backport from upstream for 9.3 phase 2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2731

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

A bunch of fixes for relevant issues.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-07-07 07:38:20 +02:00
Jan Stancek e341c7e709 Merge: bpf, xdp: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583

Rebase bpf and xdp to 6.3.

Bugzilla: https://bugzilla.redhat.com/2178930

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:45 +02:00
Paolo Abeni e4256bf256 net: add vlan_get_protocol_and_depth() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217529
Tested: LNST, Tier1

Upstream commit:
commit 4063384ef762cc5946fc7a3f89879e76c6ec51e2
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 9 13:18:57 2023 +0000

    net: add vlan_get_protocol_and_depth() helper

    Before the blamed commit, pskb_may_pull() was used instead
    of skb_header_pointer() in __vlan_get_protocol() and friends.

    A few callers depended on skb->head being populated with the MAC header;
    syzbot caught one of them (skb_mac_gso_segment()).

    Add vlan_get_protocol_and_depth() to make the intent clearer
    and use it where sensible.

    This is a more generic fix than commit e9d3f80935b6
    ("net/af_packet: make sure to pull mac header") which was
    dealing with a similar issue.
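
A rough model of what such a helper computes, written in Python over a flat byte string rather than an skb. It walks nested VLAN tags and returns the encapsulated protocol together with the header depth, or zeros when the headers are not fully present. The name `vlan_protocol_and_depth` and the `MAX_TAGS` limit are illustrative assumptions; the kernel helper works on `struct sk_buff` and additionally makes sure the headers are pulled into the linear area.

```python
import struct

ETH_HLEN = 14                        # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_HLEN = 4                        # TCI (2) + encapsulated EtherType (2)
MAX_TAGS = 8                         # illustrative stand-in for the kernel's parsing limit
VLAN_ETHERTYPES = {0x8100, 0x88A8}   # 802.1Q and 802.1ad


def vlan_protocol_and_depth(frame: bytes):
    """Return (protocol, depth); (0, 0) if the headers are truncated or too deep.

    depth is the number of bytes in front of the network header.
    """
    if len(frame) < ETH_HLEN:
        return 0, 0
    (proto,) = struct.unpack_from("!H", frame, 12)
    depth = ETH_HLEN
    tags = 0
    while proto in VLAN_ETHERTYPES:
        tags += 1
        if tags > MAX_TAGS or len(frame) < depth + VLAN_HLEN:
            return 0, 0
        # The encapsulated EtherType follows the 2-byte TCI.
        (proto,) = struct.unpack_from("!H", frame, depth + 2)
        depth += VLAN_HLEN
    return proto, depth
```

A caller that gets a zero protocol back treats the packet as unparseable instead of pulling headers that are not there, which is the situation the trace below ends in.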

    kernel BUG at include/linux/skbuff.h:2655 !
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
    RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline]
    RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136
    Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
    RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286
    RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9
    R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012
    R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7
    FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    [<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419
    [<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline]
    [<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725
    [<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313
    [<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029
    [<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline]
    [<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142
    [<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline]
    [<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline]
    [<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139
    [<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline]
    [<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline]
    [<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147
    [<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80
    [<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: 469aceddfa ("vlan: consolidate VLAN parsing code and limit max parsing depth")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Toke Høiland-Jørgensen <toke@redhat.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-26 16:58:59 +02:00
Jan Stancek 9d37206873 Merge: net: sync skb free reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2627

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073

Did not include commit 071c0fc6fb91 ("net: extend drop reasons for multiple subsystems")
as it would be appropriate to backport it in its own MR, it would have no user for now,
and it's not clear to me how trace_kfree_skb deals with non-core free reasons once applied.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-14 13:27:31 +02:00
Felix Maurer b576afd91a netdev-genl: create a simple family for netdev stuff
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- include/linux/netdevice.h: Context difference in includes due to missing
  406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs with
  IPv6 addresses, performance of changing link state, attaching a VRF,
  changing an IPv6 address, etc. go down dramtically.")
- net/core/Makefile: Context difference due to missing 2c193f2cb110 ("net:
  kunit: add a test for dev_addr_lists")

commit d3d854fd6a1d97157f790604e07f6386e8df8fe4
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Feb 1 11:24:17 2023 +0100

    netdev-genl: create a simple family for netdev stuff

    Add a Netlink spec-compatible family for netdevs.
    This is a very simple implementation without much
    thought going into it.

    It allows us to reap all the benefits of Netlink specs,
    one can use the generic client to issue the commands:

      $ ./cli.py --spec netdev.yaml --dump dev_get
      [{'ifindex': 1, 'xdp-features': set()},
       {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
       {'ifindex': 3, 'xdp-features': {'rx-sg'}}]

    the generic python library does not have flags-by-name
    support, yet, but we also don't have to carry strings
    in the messages, as user space can get the names from
    the spec.
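
To illustrate the last point: user space can turn a flags bitmask into names with nothing but the ordered name list from the spec. The sketch below is hypothetical. `decode_flags` is not part of the generic client, and the name order in `XDP_FEATURE_NAMES` is an assumption made for the example; the authoritative names and bit positions are whatever the spec file defines.

```python
# Assumed order for illustration only; the spec file is the source of truth.
XDP_FEATURE_NAMES = [
    "basic",
    "redirect",
    "ndo-xmit",
    "xsk-zerocopy",
    "hw-offload",
    "rx-sg",
    "ndo-xmit-sg",
]


def decode_flags(mask: int, names: list) -> set:
    """Map each set bit in mask to the name at the same index in names."""
    return {name for bit, name in enumerate(names) if mask & (1 << bit)}
```

Under that assumed ordering, a mask of 0 decodes to the empty set and a mask with the three lowest bits set decodes to the 'basic', 'redirect' and 'ndo-xmit' features, so no strings need to travel in the messages.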

    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
    Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Co-developed-by: Marek Majtyka <alardam@gmail.com>
    Signed-off-by: Marek Majtyka <alardam@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:50 +02:00
Felix Maurer e630642b6b bpf: Introduce device-bound XDP programs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 2b3486bc2d237ec345b3942b7be5deabf8c8fed1
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:24 2023 -0800

    bpf: Introduce device-bound XDP programs

    New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
    to associate a netdev with a BPF program at load time.

    netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:13 +02:00
Felix Maurer c0febc32b2 bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 9d03ebc71a027ca495c60f6e94d3cda81921791f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Jan 19 14:15:21 2023 -0800

    bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded

    BPF offloading infra will be reused to implement
    bound-but-not-offloaded bpf programs. Rename existing
    helpers for clarity. No functional changes.

    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: David Ahern <dsahern@gmail.com>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: Willem de Bruijn <willemb@google.com>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Anatoly Burakov <anatoly.burakov@intel.com>
    Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
    Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
    Cc: Maryam Tahhan <mtahhan@redhat.com>
    Cc: xdp-hints@xdp-project.net
    Cc: netdev@vger.kernel.org
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-13 22:45:12 +02:00
Ivan Vecera 1cb324e3cc net: Remove the obsolte u64_stats_fetch_*_irq() users (net).
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

Conflicts:
* net/netfilter/ipvs/ip_vs_ctl.c
  - the change was already applied by RHEL commit 914c1e31d9 ("ipvs:
    use u64_stats_t for the per-cpu counters")
* net/core/devlink.c
  - hunk was applied in different file (net/devlink/leftover.c)

commit d120d1a63b2c484d6175873d8ee736a633f74b70
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Oct 26 15:22:15 2022 +0200

    net: Remove the obsolte u64_stats_fetch_*_irq() users (net).

    Now that the 32bit UP oddity is gone and 32bit always uses a sequence
    count, there is no need for the fetch_irq() variants anymore.

    Convert to the regular interface.
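
For readers unfamiliar with the pattern, here is a single-threaded Python model of the sequence-count read loop that the regular u64_stats_fetch_begin()/u64_stats_fetch_retry() pair follows on 32bit. It is only a sketch: the class `SeqStats`, the function `read_stats` and the `on_attempt` hook (used to simulate a writer racing with the reader) are hypothetical and do not mirror the kernel implementation.

```python
class SeqStats:
    """Counters guarded by a sequence count: odd means an update is in progress."""

    def __init__(self):
        self.seq = 0
        self.packets = 0
        self.bytes = 0

    def add(self, packets, nbytes):
        self.seq += 1            # update begin: seq becomes odd
        self.packets += packets
        self.bytes += nbytes
        self.seq += 1            # update end: seq becomes even again

    def fetch_begin(self):
        return self.seq

    def fetch_retry(self, start):
        # Retry if an update was in flight or completed since fetch_begin().
        return bool(start & 1) or self.seq != start


def read_stats(stats, on_attempt=None):
    """Snapshot both counters, repeating until the snapshot is consistent."""
    while True:
        start = stats.fetch_begin()
        snapshot = (stats.packets, stats.bytes)
        if on_attempt is not None:
            on_attempt(stats)    # simulation hook: a writer may run here
        if not stats.fetch_retry(start):
            return snapshot
```

The reader never blocks the writer; it simply discards a snapshot taken while the sequence count moved and tries again.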

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:38:11 +02:00
Ivan Vecera 41bf85273b net: adopt u64_stats_t in struct pcpu_sw_netstats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193170

commit 9962acefbcb92736c268aafe5f52200948f60f3e
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 08:46:37 2022 -0700

    net: adopt u64_stats_t in struct pcpu_sw_netstats

    As explained in commit 316580b69d ("u64_stats: provide u64_stats_t type")
    we should use u64_stats_t and related accessors to avoid load/store tearing.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-06-08 13:37:00 +02:00
Antoine Tenart f2ed106175 net: remove enum skb_free_reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit 40bbae583ec38ea31e728bf42a4ea72bded22ab6
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Mar 6 20:43:13 2023 +0000

    net: remove enum skb_free_reason

    enum skb_drop_reason is more generic, we can adopt it instead.

    Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason().

    This means drivers can use more precise drop reasons if they want to.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
    Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:26 +02:00
Antoine Tenart d48044618a net: add location to trace_consume_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit dd1b527831a3ed659afa01b672d8e1f7e6ca95a5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 16 15:47:18 2023 +0000

    net: add location to trace_consume_skb()

    kfree_skb() includes the location, it makes sense
    to add it to consume_skb() as well.

    After patch:

     taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
             swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
          discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
             swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
             borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
             swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
                wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:26 +02:00
Jan Stancek 6318ae37c7 Merge: ovs: stable backports for 9.3 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2438

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-01 07:25:53 +02:00
Jan Stancek 91e631150d Merge: Bonding: rebase to linux v6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2419

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406

Depends: !2418

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:13 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer 2595bc4d80 net: fix kdoc on __dev_queue_xmit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit be76955dea93fe7ee9e0a6f961a7185290a2417f
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon May 9 10:04:12 2022 -0700

    net: fix kdoc on __dev_queue_xmit()
    
    Commit c526fd8f9f4f21 ("net: inline dev_queue_xmit()") exported
    __dev_queue_xmit(), now it's being rendered in html docs, triggering:
    
    Documentation/networking/kapi:92: net/core/dev.c:4101: WARNING: Missing matching underline for section title overline.
    
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Link: https://lore.kernel.org/linux-next/20220503073420.6d3f135d@canb.auug.org.au/
    Fixes: c526fd8f9f4f21 ("net: inline dev_queue_xmit()")
    Link: https://lore.kernel.org/r/20220509170412.1069190-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:23:02 -04:00
Paolo Abeni d0ff450947 net: fix __dev_kfree_skb_any() vs drop monitor
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit ac3ad19584b26fae9ac86e4faebe790becc74491
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 23 08:38:45 2023 +0000

    net: fix __dev_kfree_skb_any() vs drop monitor

    dev_kfree_skb() is aliased to consume_skb().

    When a driver is dropping a packet by calling dev_kfree_skb_any()
    we should propagate the drop reason instead of pretending
    the packet was consumed.

    Note: Now we have enum skb_drop_reason we could remove
    enum skb_free_reason (for linux-6.4)

    v2: added an unlikely(), suggested by Yunsheng Lin.

    Fixes: e6247027e5 ("net: introduce dev_consume_skb_any()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yunsheng Lin <linyunsheng@huawei.com>
    Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
Xin Long 2db946b2f7 net: add gso_ipv4_max_size and gro_ipv4_max_size per device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: compile only

Conflicts:
  - move netif_set_gro_max_size() from include/linux/netdevice.h to
    net/core/dev.h, then make the change, as commit 744d49daf8bd was
    backported earlier than eac1b93c14d6, so netif_set_gro_max_size()
    missed the opportunity to be moved to net/core/dev.h.

  - different context in net/core/dev.h, rps_cpumask_housekeeping()
    is added due to 370ca718fd5e already in RHEL-9.

commit 9eefedd58ae1daece2ba907849a44db2941fb4b0
Author: Xin Long <lucien.xin@gmail.com>
Date:   Sat Jan 28 10:58:38 2023 -0500

    net: add gso_ipv4_max_size and gro_ipv4_max_size per device

    This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
    per device and adds netlink attributes for them, so that IPV4
    BIG TCP can be guarded by a separate tunable in the next patch.

    To not break the old application using "gso/gro_max_size" for
    IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
    in netif_set_gso/gro_max_size() if the new size isn't greater
    than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
    userspace doesn't realize the new netlink attributes.
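
The compatibility rule in the paragraph above can be modelled in a few lines of Python. This is a sketch, not the kernel code: `Dev` and `set_gso_max_size` are hypothetical stand-ins for the net_device fields and netif_set_gso_max_size(), and the value used for GSO_LEGACY_MAX_SIZE is assumed here to be 65536.

```python
from dataclasses import dataclass

GSO_LEGACY_MAX_SIZE = 65536  # assumed value of the kernel constant


@dataclass
class Dev:
    gso_max_size: int = GSO_LEGACY_MAX_SIZE
    gso_ipv4_max_size: int = GSO_LEGACY_MAX_SIZE


def set_gso_max_size(dev: Dev, size: int) -> None:
    """Set the generic limit; mirror it to the IPv4 limit only for legacy sizes."""
    dev.gso_max_size = size
    if size <= GSO_LEGACY_MAX_SIZE:
        dev.gso_ipv4_max_size = size
```

An application that only knows "gso_max_size" and stays within the legacy limit therefore still controls IPv4 GSO, while a BIG TCP sized value leaves the IPv4 limit untouched until the new attribute is set explicitly. The GRO side follows the same rule.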

    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-05-02 10:36:11 -04:00
Jeff Moyer 82f65d6ce4 net: inline dev_queue_xmit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit c526fd8f9f4f21cb83c0b1c9a1ee9c0ac9be9e2e
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Apr 28 11:58:46 2022 +0100

    net: inline dev_queue_xmit()
    
    Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small
    proxy functions doing nothing but redirecting the control flow to
    __dev_queue_xmit().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:56:02 -04:00
Antoine Tenart af98894a33 net: openvswitch: fix race on port output
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190207
Upstream Status: linux.git

commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8
Author: Felix Huettner <felix.huettner@mail.schwarz>
Date:   Wed Apr 5 07:53:41 2023 +0000

    net: openvswitch: fix race on port output

    assume the following setup on a single machine:
    1. An openvswitch instance with one bridge and default flows
    2. two network namespaces "server" and "client"
    3. two ovs interfaces "server" and "client" on the bridge
    4. for each ovs interface a veth pair with a matching name and 32 rx and
       tx queues
    5. move the ends of the veth pairs to the respective network namespaces
    6. assign ip addresses to each of the veth ends in the namespaces (needs
       to be the same subnet)
    7. start some http server on the server network namespace
    8. test if a client in the client namespace can reach the http server

    when following the actions below the host has a chance of getting a cpu
    stuck in an infinite loop:
    1. send a large amount of parallel requests to the http server (around
       3000 curls should work)
    2. in parallel delete the network namespace (do not delete interfaces or
       stop the server, just kill the namespace)

    there is a low chance that this will cause the below kernel cpu stuck
    message. If this does not happen just retry.
    Below there is also the output of bpftrace for the functions mentioned
    in the output.

    The series of events happening here is:
    1. the network namespace is deleted calling
       `unregister_netdevice_many_notify` somewhere in the process
    2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
       then runs `synchronize_net`
    3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
    4. this is then handled by `dp_device_event` which calls
       `ovs_netdev_detach_dev` (if a vport is found, which is the case for
       the veth interface attached to ovs)
    5. this removes the rx_handlers of the device but does not prevent
       packets from being sent to the device
    6. `dp_device_event` then queues the vport deletion to work in
       background as an ovs_lock is needed that we do not hold in the
       unregistration path
    7. `unregister_netdevice_many_notify` continues to call
       `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
    8. port deletion continues (but details are not relevant for this issue)
    9. at some future point the background task deletes the vport

    If, after 7. but before 9., a packet is sent to the ovs vport (which is
    not deleted at this point in time), the vport forwards it to the
    `dev_queue_xmit` flow even though the device is unregistering.
    In `skb_tx_hash` (which is called in the `dev_queue_xmit` path) there is
    a while loop (if the packet has an rx_queue recorded) that is infinite if
    `dev->real_num_tx_queues` is zero.

    To prevent this from happening we update `do_output` to handle devices
    without carrier the same as if the device is not found (which would
    be the code path after 9. is done).

    Additionally we now produce a warning in `skb_tx_hash` if we will hit
    the infinite loop.
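
Why the loop cannot terminate with zero queues is easy to see in a small Python model of a "reduce by repeated subtraction" loop. The sketch is illustrative only: `reduce_by_subtraction` and `pick_queue_guarded` are hypothetical names, the iteration cap exists just so the demonstration returns, and the guard stands in for the idea of detecting the condition rather than reproducing the patch.

```python
def reduce_by_subtraction(rx_queue: int, qcount: int, max_steps: int = 1000):
    """Fold rx_queue into [0, qcount) by subtracting qcount repeatedly.

    Returns (queue, steps), or (None, max_steps) once the cap is hit,
    which is what happens when qcount is 0 and no progress is possible.
    """
    value = rx_queue
    steps = 0
    while value >= qcount:
        if steps == max_steps:
            return None, steps
        value -= qcount          # subtracting 0 never changes value
        steps += 1
    return value, steps


def pick_queue_guarded(rx_queue: int, qcount: int):
    """Refuse to pick a queue on a device that has none left."""
    if qcount == 0:
        return None
    queue, _ = reduce_by_subtraction(rx_queue, qcount)
    return queue
```

With four queues, a recorded rx queue of 5 folds to 1 after one step; with zero queues the unguarded loop only stops because of the artificial cap.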

    bpftrace (first word is function name):

    __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
    netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
    ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
    netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
    dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
    netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
    synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
    ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
    broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
    ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
    synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604

    stuck message:

    watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
    Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
    CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
    Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:netdev_pick_tx+0xf1/0x320
    Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
    RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
    RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
    R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
    FS:  00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
    Call Trace:
     <IRQ>
     netdev_core_pick_tx+0xa4/0xb0
     __dev_queue_xmit+0xf8/0x510
     ? __bpf_prog_exit+0x1e/0x30
     dev_queue_xmit+0x10/0x20
     ovs_vport_send+0xad/0x170 [openvswitch]
     do_output+0x59/0x180 [openvswitch]
     do_execute_actions+0xa80/0xaa0 [openvswitch]
     ? kfree+0x1/0x250
     ? kfree+0x1/0x250
     ? kprobe_perf_func+0x4f/0x2b0
     ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
     ovs_execute_actions+0x4c/0x120 [openvswitch]
     ovs_dp_process_packet+0xa1/0x200 [openvswitch]
     ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
     ? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
     ? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
     ovs_vport_receive+0x77/0xd0 [openvswitch]
     ? __htab_map_lookup_elem+0x4e/0x60
     ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
     ? trace_call_bpf+0xc8/0x150
     ? kfree+0x1/0x250
     ? kfree+0x1/0x250
     ? kprobe_perf_func+0x4f/0x2b0
     ? kprobe_perf_func+0x4f/0x2b0
     ? __mod_memcg_lruvec_state+0x63/0xe0
     netdev_port_receive+0xc4/0x180 [openvswitch]
     ? netdev_port_receive+0x180/0x180 [openvswitch]
     netdev_frame_hook+0x1f/0x40 [openvswitch]
     __netif_receive_skb_core.constprop.0+0x23d/0xf00
     __netif_receive_skb_one_core+0x3f/0xa0
     __netif_receive_skb+0x15/0x60
     process_backlog+0x9e/0x170
     __napi_poll+0x33/0x180
     net_rx_action+0x126/0x280
     ? ttwu_do_activate+0x72/0xf0
     __do_softirq+0xd9/0x2e7
     ? rcu_report_exp_cpu_mult+0x1b0/0x1b0
     do_softirq+0x7d/0xb0
     </IRQ>
     <TASK>
     __local_bh_enable_ip+0x54/0x60
     ip_finish_output2+0x191/0x460
     __ip_finish_output+0xb7/0x180
     ip_finish_output+0x2e/0xc0
     ip_output+0x78/0x100
     ? __ip_finish_output+0x180/0x180
     ip_local_out+0x5e/0x70
     __ip_queue_xmit+0x184/0x440
     ? tcp_syn_options+0x1f9/0x300
     ip_queue_xmit+0x15/0x20
     __tcp_transmit_skb+0x910/0x9c0
     ? __mod_memcg_state+0x44/0xa0
     tcp_connect+0x437/0x4e0
     ? ktime_get_with_offset+0x60/0xf0
     tcp_v4_connect+0x436/0x530
     __inet_stream_connect+0xd4/0x3a0
     ? kprobe_perf_func+0x4f/0x2b0
     ? aa_sk_perm+0x43/0x1c0
     inet_stream_connect+0x3b/0x60
     __sys_connect_file+0x63/0x70
     __sys_connect+0xa6/0xd0
     ? setfl+0x108/0x170
     ? do_fcntl+0xe8/0x5a0
     __x64_sys_connect+0x18/0x20
     do_syscall_64+0x5c/0xc0
     ? __x64_sys_fcntl+0xa9/0xd0
     ? exit_to_user_mode_prepare+0x37/0xb0
     ? syscall_exit_to_user_mode+0x27/0x50
     ? do_syscall_64+0x69/0xc0
     ? __sys_setsockopt+0xea/0x1e0
     ? exit_to_user_mode_prepare+0x37/0xb0
     ? syscall_exit_to_user_mode+0x27/0x50
     ? __x64_sys_setsockopt+0x1f/0x30
     ? do_syscall_64+0x69/0xc0
     ? irqentry_exit+0x1d/0x30
     ? exc_page_fault+0x89/0x170
     entry_SYSCALL_64_after_hwframe+0x61/0xcb
    RIP: 0033:0x7f7b8101c6a7
    Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
    RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
    RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
    RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
    R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
    R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
     </TASK>

    Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
    Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
    Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
    Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-04-27 16:30:10 +02:00
Jan Stancek 8e94775eed Merge: CNB: rebase/update devlink for RHEL 9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2191

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273
Tested: selftests, basic devlink features on ice and mlx5
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175249
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2175250
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2176150

Update devlink up to v6.3.

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Herbert Xu <zxu@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-27 07:47:22 +02:00
Hangbin Liu a149ec5e7d net/core: Allow live renaming when an interface is up
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189406
Upstream Status: net.git commit bd039b5ea2a9

commit bd039b5ea2a91ea707ee8539df26456bd5be80af
Author: Andy Ren <andy.ren@getcruise.com>
Date:   Mon Nov 7 09:42:42 2022 -0800

    net/core: Allow live renaming when an interface is up

    Allow a network interface to be renamed when the interface
    is up.

    As described in the netconsole documentation [1], when netconsole is
    used as a built-in, it will bring up the specified interface as soon as
    possible. As a result, user space will not be able to rename the
    interface since the kernel disallows renaming of interfaces that are
    administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
    by the kernel.

    The original solution [2] to this problem was to add a new parameter to
    the netconsole configuration parameters that allows renaming of
    the interface used by netconsole while it is administratively up.
    However, during the discussion that followed, it became apparent that we
    have no reason to keep the current restriction and instead we should
    allow user space to rename interfaces regardless of their administrative
    state:

    1. The restriction was put in place over 20 years ago when renaming was
    only possible via IOCTL and before rtnetlink started notifying user
    space about such changes like it does today.

    2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
    5.2 and no regressions were reported.

    3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
    the administrative state of the interface.

    Therefore, allow user space to rename running interfaces by removing the
    restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
    possible triage by emitting a message to the kernel log that an
    interface was renamed while UP.

    [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
    [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/

    Signed-off-by: Andy Ren <andy.ren@getcruise.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2023-04-25 15:26:55 +08:00
Petr Oros 59e7861deb devlink: Fix netdev notifier chain corruption
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Conflicts:
-  adjusted upstream merge conflict which was resolved in 675f176b4dcc2b
   ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net")

Upstream commit(s):
commit b20b8aec6ffc07bb547966b356780cd344f20f5b
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Wed Feb 15 09:31:39 2023 +0200

    devlink: Fix netdev notifier chain corruption

    Cited commit changed devlink to register its netdev notifier block on
    the global netdev notifier chain instead of on the per network namespace
    one.

    However, when changing the network namespace of the devlink instance,
    devlink still tries to unregister its notifier block from the chain of
    the old namespace and register it on the chain of the new namespace.
    This results in corruption of the notifier chains, as the same notifier
    block is registered on two different chains: The global one and the per
    network namespace one. In turn, this causes other problems such as the
    inability to dismantle namespaces due to netdev reference count issues.

    Fix by preventing devlink from moving its notifier block between
    namespaces.

    Reproducer:

     # echo "10 1" > /sys/bus/netdevsim/new_device
     # ip netns add test123
     # devlink dev reload netdevsim/netdevsim10 netns test123
     # ip netns del test123
     [   71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
     [   71.938348] leaked reference.

    Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-04 11:12:28 +02:00
Petr Oros 8df3e0fd3b net: introduce a helper to move notifier block to different namespace
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Upstream commit(s):
commit 3e52fba03a20234abc65a656cef063a1045d9723
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Tue Nov 8 14:22:06 2022 +0100

    net: introduce a helper to move notifier block to different namespace

    Currently, net_dev() netdev notifier variant follows the netdev with
    per-net notifier from namespace to namespace. This is implemented
    by move_netdevice_notifiers_dev_net() helper.

    For devlink it is needed to re-register per-net notifier during
    devlink reload. Introduce a new helper called
    move_netdevice_notifier_net() and share the unregister/register code
    with existing move_netdevice_notifiers_dev_net() helper.

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-03 14:05:59 +02:00
Petr Oros afc2a59634 net: devlink: track netdev with devlink_port assigned
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172273

Upstream commit(s):
commit 02a68a47eadedf95748facfca6ced31fb0181d52
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Wed Nov 2 17:02:03 2022 +0100

    net: devlink: track netdev with devlink_port assigned

    Currently, ethernet drivers are using devlink_port_type_eth_set() and
    devlink_port_type_clear() to set devlink port type and link to related
    netdev.

    Instead of calling them directly, let the driver use the
    SET_NETDEV_DEVLINK_PORT macro to assign the devlink_port pointer and let
    devlink track it. Note that the devlink port pointer is static while
    the netdevice is registered.

    In devlink code, use per-namespace netdev notifier to track
    the netdevices with devlink_port assigned and change the internal
    devlink_port type and related type pointer accordingly.

    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-04-03 10:57:13 +02:00
Íñigo Huguet 3a91b473a8 net: rename reference+tracking helpers
Bugzilla: https://bugzilla.redhat.com/2175258

Conflicts:
 - Removed chunks of unsupported protocol AX.25
 - Renamed the functions also in ipvlan. Commit 40b9d1ab63f5 ("ipvlan: hold lower
   dev to avoid possible use-after-free") was backported out of order so it had
   to use the old function names.

commit d62607c3fe45911b2331fac073355a8c914bbde2
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Jun 7 21:39:55 2022 -0700

    net: rename reference+tracking helpers

    Netdev reference helpers have a dev_ prefix for historic
    reasons. Renaming the old helpers would be too much churn
    but we can rename the tracking ones which are relatively
    recent and should be the default for new code.

    Rename:
     dev_hold_track()    -> netdev_hold()
     dev_put_track()     -> netdev_put()
     dev_replace_track() -> netdev_ref_replace()

    Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-03-23 16:19:21 +01:00
Xin Long 3a75ec1506 net: avoid quadratic behavior in netdev_wait_allrefs_any()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to cc26c2661fef already in RHEL-9.

commit 86213f80da1b1d007721cc22e04b5f5d0da33127
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 17 22:54:30 2022 -0800

    net: avoid quadratic behavior in netdev_wait_allrefs_any()

    If the list of devices has N elements, netdev_wait_allrefs_any()
    is called N times, and linkwatch_forget_dev() is called N*(N-1)/2 times.

    Fix this by calling linkwatch_forget_dev() only once per device.

    Fixes: faab39f63c1f ("net: allow out-of-order netdev unregistration")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220218065430.2613262-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:39:40 -04:00
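The call counts quoted in the netdev_wait_allrefs_any() commit above can be checked with a little arithmetic. This is only a sketch of the formulas stated in that message (N*(N-1)/2 calls before the fix, one call per device after it); the function names are hypothetical and nothing here models the kernel code itself.

```python
def forget_calls_before(n: int) -> int:
    """linkwatch_forget_dev() calls for n devices before the fix,
    using the N*(N-1)/2 figure from the commit message."""
    return n * (n - 1) // 2


def forget_calls_after(n: int) -> int:
    """Calls after the fix: once per device."""
    return n


# Compare the two counts as the unregistration batch grows.
for n in (10, 1_000, 100_000):
    print(n, forget_calls_before(n), forget_calls_after(n))
```

For a batch of 1,000 devices the quadratic figure is already 499,500 calls against 1,000 after the fix.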
Xin Long b1a4490d48 net: allow out-of-order netdev unregistration
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to 05e49cfc89e4 already in RHEL-9.

commit faab39f63c1fc4bcdf135690f03bd596b578c67e
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Feb 15 14:53:10 2022 -0800

    net: allow out-of-order netdev unregistration

    Sprinkle for each loops to allow netdevices to be unregistered
    out of order, as their refs are released.

    This prevents problems caused by dependencies between netdevs
    which want to release references in their ->priv_destructor.
    See commit d6ff94afd90b ("vlan: move dev_put into vlan_dev_uninit")
    for example.

    Eric has removed the only known ordering requirement in
    commit c002496babfd ("Merge branch 'ipv6-loopback'")
    so let's try this and see if anything explodes...

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Xin Long <lucien.xin@gmail.com>
    Link: https://lore.kernel.org/r/20220215225310.3679266-2-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:39:26 -04:00
Xin Long bfdcece7f8 net: transition netdev reg state earlier in run_todo
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

Conflicts:
  - context difference due to cc26c2661fef already in RHEL-9.

commit ae68db14b6164ce46beffaf35eb7c9bb2f92fee3
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Feb 15 14:53:09 2022 -0800

    net: transition netdev reg state earlier in run_todo

    In prep for unregistering netdevs out of order move the netdev
    state validation and change outside of the loop.

    While at it modernize this code and use WARN() instead of
    pr_err() + dump_stack().

    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Xin Long <lucien.xin@gmail.com>
    Link: https://lore.kernel.org/r/20220215225310.3679266-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:38:42 -04:00
Herton R. Krzesinski 05d2a7216e Merge: CNB: net: add netdev_sw_irq_coalesce_default_on()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1970

Bugzilla: https://bugzilla.redhat.com/2161921

commit d93607082e982223cf92750f2d9039ff365b9d24
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Wed Nov 30 23:28:26 2022 +0100

    net: add netdev_sw_irq_coalesce_default_on()

    Add a helper for drivers wanting to set SW IRQ coalescing
    by default. The related sysfs attributes can be used to
    override the default values.

    Follow Jakub's suggestion and put this functionality into
    net core so that drivers wanting to use software interrupt
    coalescing per default don't have to open-code it.

    Note that this function needs to be called before the
    netdevice is registered.

    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Dan Campbell <dacampbe@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-02-08 01:41:42 +00:00
Dan Campbell bee4544aab net: add netdev_sw_irq_coalesce_default_on()
Bugzilla: https://bugzilla.redhat.com/2161921

commit d93607082e982223cf92750f2d9039ff365b9d24
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Wed Nov 30 23:28:26 2022 +0100

    net: add netdev_sw_irq_coalesce_default_on()

    Add a helper for drivers wanting to set SW IRQ coalescing
    by default. The related sysfs attributes can be used to
    override the default values.

    Follow Jakub's suggestion and put this functionality into
    net core so that drivers wanting to use software interrupt
    coalescing per default don't have to open-code it.

    Note that this function needs to be called before the
    netdevice is registered.

    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Dan Campbell <dacampbe@redhat.com>
2023-01-27 12:28:55 -06:00
Paolo Abeni af86e36c42 net: Fix return value of qdisc ingress handling on success
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2162711
Tested: vs bz reproducer

Upstream commit:
commit 672e97ef689a38cb20c2cc6a1814298fea34461e
Author: Paul Blakey <paulb@nvidia.com>
Date:   Tue Oct 18 10:34:38 2022 +0300

    net: Fix return value of qdisc ingress handling on success

    Currently qdisc ingress handling (sch_handle_ingress()) doesn't
    set a return value and it is left to the old return value of
    the caller (__netif_receive_skb_core()) which is RX drop, so if
    the packet is consumed, caller will stop and return this value
    as if the packet was dropped.

    This causes a problem in the kernel tcp stack when having an
    egress tc rule forwarding to an ingress tc rule.
    The tcp stack sending packets on the device having the egress rule
    will see the packets as not successfully transmitted (although they
    actually were), will not advance its internal state of sent data,
    and packets returning on such tcp stream will be dropped by the tcp
    stack with reason ack-of-unsent-data. See reproduction in [0] below.

    Fix that by setting the return value to RX success if
    the packet was handled successfully.

    [0] Reproduction steps:
     $ ip link add veth1 type veth peer name peer1
     $ ip link add veth2 type veth peer name peer2
     $ ifconfig peer1 5.5.5.6/24 up
     $ ip netns add ns0
     $ ip link set dev peer2 netns ns0
     $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
     $ ifconfig veth2 0 up
     $ ifconfig veth1 0 up

     #ingress forwarding veth1 <-> veth2
     $ tc qdisc add dev veth2 ingress
     $ tc qdisc add dev veth1 ingress
     $ tc filter add dev veth2 ingress prio 1 proto all flower \
       action mirred egress redirect dev veth1
     $ tc filter add dev veth1 ingress prio 1 proto all flower \
       action mirred egress redirect dev veth2

     #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
     $ tc qdisc add dev peer1 clsact
     $ tc filter add dev peer1 egress prio 20 proto ip flower \
       action mirred ingress redirect dev veth1

     #run iperf and see connection not running
     $ iperf3 -s&
     $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

     #delete egress rule, and run again, now should work
     $ tc filter del dev peer1 egress
     $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

    Fixes: f697c3e8b3 ("[NET]: Avoid unnecessary cloning for ingress filtering")
    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-20 16:33:01 +01:00
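The return-value problem described in the qdisc ingress commit above can be pictured with a toy model. This is a simplified sketch, not the kernel code: `receive_core()` and its `fixed` switch are hypothetical, and the path where the packet is not consumed is reduced to a plain success. Only the two constants carry their real kernel values.

```python
NET_RX_SUCCESS = 0  # kernel value for "packet handled"
NET_RX_DROP = 1     # kernel value for "packet dropped"


def receive_core(consumed_by_ingress: bool, fixed: bool) -> int:
    """Toy stand-in for the caller of the ingress hook."""
    ret = NET_RX_DROP  # the caller's initial value, per the commit message
    if consumed_by_ingress:
        if fixed:
            # After the fix: a consumed packet is reported as handled.
            ret = NET_RX_SUCCESS
        # Before the fix nothing touched ret, so the stale value leaked out.
        return ret
    # Normal delivery is not modeled here; assume it succeeds.
    return NET_RX_SUCCESS


print(receive_core(consumed_by_ingress=True, fixed=False))
print(receive_core(consumed_by_ingress=True, fixed=True))
```

In this model the unfixed path reports a drop for a packet that was in fact consumed, which is the mismatch the sending TCP stack trips over.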
Herton R. Krzesinski 19ce0cbd76 Merge: bpf, xdp: update to 5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1533

bpf, xdp: update to 5.19

Bugzilla: http://bugzilla.redhat.com/2120968
Bugzilla: http://bugzilla.redhat.com/2130850
Bugzilla: http://bugzilla.redhat.com/2140077


Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-21 20:49:27 +00:00
Herton R. Krzesinski 09736a3a30 Merge: udp: some performance optimizations
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test

This series improves UDP RX throughput, to keep it on equal footing with rhel-8.

Patches 1,3,4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling unrelated features.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-13 17:35:03 +00:00
Felix Maurer 1e3ab14088 xdp: Fix spurious packet loss in generic XDP TX path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120968

commit 1fd6e5675336daf4747940b4285e84b0c114ae32
Author: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Date:   Tue Jul 5 10:23:45 2022 +0200

    xdp: Fix spurious packet loss in generic XDP TX path

    The byte queue limits (BQL) mechanism is intended to move queuing from
    the driver to the network stack in order to reduce latency caused by
    excessive queuing in hardware. However, when transmitting or redirecting
    a packet using generic XDP, the qdisc layer is bypassed and there are no
    additional queues. Since netif_xmit_stopped() also takes BQL limits into
    account, but without having any alternative queuing, packets are
    silently dropped.

    This patch modifies the drop condition to only consider cases when the
    driver itself cannot accept any more packets. This is analogous to the
    condition in __dev_direct_xmit(). Dropped packets are also counted on
    the device.

    Bypassing the qdisc layer in the generic XDP TX path means that XDP
    packets are able to starve other packets going through a qdisc, and
    DDoS attacks will be more effective. In-driver XDP uses dedicated TX
    queues, so it does not have this starvation issue.

    Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:11 +02:00
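The change in drop condition described in the generic XDP commit above can be sketched as two predicates over a queue's state. The flag names and both helper functions below are illustrative assumptions, not kernel identifiers; the only facts taken from the commit message are that the old check also honoured BQL limits and the new one considers only the driver being unable to accept packets.

```python
from enum import Flag


class QueueState(Flag):
    """Hypothetical TX queue state bits for this sketch."""
    DRV_XOFF = 1    # the driver cannot accept more packets
    STACK_XOFF = 2  # a stack-side limit such as BQL was reached


def drops_before(state: QueueState) -> bool:
    """Old behaviour: drop if either the driver or BQL stopped the queue."""
    return bool(state & (QueueState.DRV_XOFF | QueueState.STACK_XOFF))


def drops_after(state: QueueState) -> bool:
    """New behaviour: drop only when the driver itself is stopped."""
    return bool(state & QueueState.DRV_XOFF)


for state in (QueueState(0), QueueState.STACK_XOFF, QueueState.DRV_XOFF):
    print(state, drops_before(state), drops_after(state))
```

The only state where the two predicates disagree is the BQL-only case, which is where the spurious loss came from.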
Felix Maurer b06bbd83be net: Use this_cpu_inc() to increment net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850

commit 6510ea973d8d9d4a0cb2fb557b36bd1ab3eb49f6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Apr 25 18:39:46 2022 +0200

    net: Use this_cpu_inc() to increment net->core_stats

    The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
    netdev_core_stats_alloc() to return a per-CPU pointer.
    netdev_core_stats_alloc() will allocate memory on its first invocation
    which breaks on PREEMPT_RT because it requires non-atomic context for
    memory allocation.

    This can be avoided by enabling preemption in netdev_core_stats_alloc()
    assuming the caller always disables preemption.

    It might be better to replace local_inc() with this_cpu_inc() now that
    dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
    not rely on already disabled preemption. This results in fewer
    instructions on x86-64:
    local_inc:
    |          incl %gs:__preempt_count(%rip)  # __preempt_count
    |          movq    488(%rdi), %rax # _1->core_stats, _22
    |          testq   %rax, %rax      # _22
    |          je      .L585   #,
    |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
    |  .L586:
    |          testq   %rax, %rax      # _27
    |          je      .L587   #,
    |          incq (%rax)            # _6->a.counter
    |  .L587:
    |          decl %gs:__preempt_count(%rip)  # __preempt_count

    this_cpu_inc(), this patch:
    |         movq    488(%rdi), %rax # _1->core_stats, _5
    |         testq   %rax, %rax      # _5
    |         je      .L591   #,
    | .L585:
    |         incq %gs:(%rax) # _18->rx_dropped

    Use unsigned long as type for the counter. Use this_cpu_inc() to
    increment the counter. Use a plain read of the counter.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
Felix Maurer a320271336 net: add per-cpu storage and net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850
Conflicts:
- drivers/net/vxlan.c: file is not moved to drivers/net/vxlan/vxlan_core.c
  due to missing 6765393614ea8 ("vxlan: move to its own directory");
  context difference due to missing 4095e0e1328a3 ("drivers: vxlan:
  vnifilter: per vni stats")
- net/core/dev.c: code difference in __netif_receive_skb_core due to
  already applied 9f8ed577c2881 ("net: skb: rename
  SKB_DROP_REASON_PTYPE_ABSENT"). Result is like upstream now.
- net/core/gro_cells.c: context difference due to already applied
  5dcd08cd1991 ("net: Fix data-races around netdev_max_backlog.")

commit 625788b5844511cf4c30cffa7fa0bc3a69cebc82
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 10 21:14:20 2022 -0800

    net: add per-cpu storage and net->core_stats

    Before adding yet another possibly contended atomic_long_t,
    it is time to add per-cpu storage for existing ones:
     dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler

    Because many devices do not have to increment such counters,
    allocate the per-cpu storage on demand, so that dev_get_stats()
    does not have to spend considerable time folding zero counters.

    Note that some drivers have abused these counters which
    were supposed to be only used by core networking stack.

    v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
    v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
    v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
        Change in netdev_core_stats_alloc() (Jakub)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: jeffreyji <jeffreyji@google.com>
    Reviewed-by: Brian Vazquez <brianvv@google.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
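The on-demand per-CPU storage described in the core_stats commit above can be illustrated with a small user-space model. This is a sketch under stated assumptions: the `Device` class, its methods and `NR_CPUS` are hypothetical, and a plain list of dicts stands in for real per-CPU memory. The three counter names are the ones listed in the commit message.

```python
from typing import Dict, List, Optional

NR_CPUS = 4  # hypothetical CPU count for the sketch
FIELDS = ("rx_dropped", "tx_dropped", "rx_nohandler")


class Device:
    def __init__(self) -> None:
        # Nothing is allocated until a counter is first incremented.
        self.core_stats: Optional[List[Dict[str, int]]] = None

    def _alloc(self) -> List[Dict[str, int]]:
        """Allocate one counter set per CPU on first use."""
        if self.core_stats is None:
            self.core_stats = [dict.fromkeys(FIELDS, 0) for _ in range(NR_CPUS)]
        return self.core_stats

    def inc(self, cpu: int, field: str) -> None:
        """Increment a counter in the slot belonging to the given CPU."""
        self._alloc()[cpu][field] += 1

    def get_stats(self) -> Dict[str, int]:
        """Fold per-CPU counters; skip the walk if nothing was allocated."""
        totals = dict.fromkeys(FIELDS, 0)
        if self.core_stats is None:
            return totals
        for per_cpu in self.core_stats:
            for name, value in per_cpu.items():
                totals[name] += value
        return totals


dev = Device()
dev.inc(0, "rx_dropped")
dev.inc(3, "rx_dropped")
print(dev.get_stats())
```

A device that never drops a packet keeps `core_stats` unset, so reading its stats costs no per-CPU walk, which is the saving the commit message is after.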
Frantisek Hrbata a03fbb1743 Merge: CNB: Update TC subsystem to upstream v6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1567

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
Tested: Using self-tests, results present in the BZ
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2133511
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128185

Commits:
```
b20dc3c68458 ("gtp: Allow to create GTP device without FDs")
9af41cc33471 ("gtp: Implement GTP echo response")
d33bd757d362 ("gtp: Implement GTP echo request")
e3acda7ade0a ("net/sched: Allow flower to match on GTP options")
81dd9849fa49 ("gtp: Add support for checking GTP device type")
02f393381d14 ("gtp: Fix inconsistent indenting")
4c096ea2d67c ("net/sched: matchall: Take verbose flag into account when logging error messages")
11c95317bc1a ("net/sched: flower: Take verbose flag into account when logging error messages")
c2ccf84ecb71 ("net/sched: act_api: Add extack to offload_act_setup() callback")
69642c2ab2f5 ("net/sched: act_gact: Add extack messages for offload failure")
4dcaa50d0292 ("net/sched: act_mirred: Add extack message for offload failure")
bca3821d19d9 ("net/sched: act_mpls: Add extack messages for offload failure")
bf3b99e4f9ce ("net/sched: act_pedit: Add extack message for offload failure")
b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure")
a9c64939b669 ("net/sched: act_skbedit: Add extack messages for offload failure")
ee367d44b936 ("net/sched: act_tunnel_key: Add extack message for offload failure")
f8fab3169464 ("net/sched: act_vlan: Add extack message for offload failure")
c440615ffbcb ("net/sched: cls_api: Add extack message for unsupported action offload")
0cba5c34b8f4 ("net/sched: matchall: Avoid overwriting error messages")
fd23e0e250c6 ("net/sched: flower: Avoid overwriting error messages")
c9a40d1c87e9 ("net_sched: make qdisc_reset() smaller")
7463acfbe52a ("netfilter: Rename ingress hook include file")
17d20784223d ("netfilter: Generalize ingress hook include file")
42df6e1d221d ("netfilter: Introduce egress hook")
2f1e85b1aee4 ("net: sched: use queue_mapping to pick tx queue")
38a6f0865796 ("net: sched: support hash selecting tx queue")
285ba06b0edb ("net/sched: flower: Helper function for vlan ethtype checks")
6ee59e554d33 ("net/sched: flower: Reduce identation after is_key_vlan refactoring")
b40003128226 ("net/sched: flower: Add number of vlan tags filter")
99fdb22bc5e9 ("net/sched: flower: Consider the number of tags for vlan filters")
b57c7e8b76c6 ("selftests: forwarding: tc_actions: allow mirred egress test to run on non-offloaded h2")
70f87de9fa0d ("net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()")
a2b1a5d40bd1 ("net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms")
1da9e27415bf ("tc-testing: gitignore, delete plugins directory")
6deb209dc6b0 ("net: Print hashed skb addresses for all net and qdisc events")
76b39b94382f ("net/sched: act_api: Notify user space if any actions were flushed before error")
88153e29c1e0 ("selftests: tc-testing: Add testcases to test new flush behaviour")
837ced3a1a5d ("time64.h: consolidate uses of PSEC_PER_NSEC")
d7be266adbfd ("net: sched: provide shim definitions for taprio_offload_{get,free}")
fc54d9065f90 ("net/sched: act_ct: set 'net' pointer when creating new nf_flow_table")
b038177636f8 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
b06ada6df9cf ("netfilter: flowtable: fix incorrect Kconfig dependencies")
83d85bb06915 ("net: extract port range fields from fl_flow_key")
bc5c8260f411 ("net/sched: remove return value of unregister_tcf_proto_ops")
88b3822cdf2f ("net/sched: sch_cbq: Delete unused delay_timer")
ca0cab119288 ("net/sched: remove qdisc_root_lock() helper")
c0f47c2822aa ("net/sched: cls_api: Fix flow action initialization")
5008750eff5d ("net/sched: flower: Add PPPoE filter")
a482d47d33ac ("net/sched: sch_cbq: change the type of cbq_set_lss to void")
06799a9085e1 ("net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS")
4873a1b2024d ("net/sched: remove hacks added to dev_trans_start() for bonding to work")
9ad36309e271 ("net_sched: cls_route: remove from list when handle is 0")
02799571714d ("net_sched: cls_route: disallow handle of 0")
b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
f612466ebecb ("net/sched: fix netdevice reference leaks in attach_default_qdiscs()")
9efd23297cca ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
2f09707d0c97 ("sch_sfb: Also store skb len before calling child enqueue")
db46e3a88a09 ("net/sched: taprio: avoid disabling offload when it was never enabled")
1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
c2e1cfefcac3 ("net: sched: fix possible refcount leak in tc_new_tfilter()")
6e23ec0ba92d ("net: sched: act_ct: fix possible refcount leak in tcf_ct_init()")
ffdd33dd9c12 ("netfilter: core: Fix clang warnings about unused static inlines")
6316136ec6e3 ("netfilter: egress: avoid a lockdep splat")
d645552e9bd9 ("netfilter: egress: Report interface as outgoing")
af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"")
8bdc2acd420c ("net: sched: Fix use after free in red_enqueue()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-23 02:46:05 -05:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Frantisek Hrbata 27a89b8946 Merge: tcp: BIG TCP implementation
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1

The series introduces support for BIG TCP.

- Patch 1-2: Preliminary dependencies
- Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16)
- Patch 15-19: Follow-ups

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:55 -05:00
Frantisek Hrbata 6fd36e2149 Merge: CNB: net: drop the weight argument from netif_napi_add
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1577

Bugzilla: https://bugzilla.redhat.com/2139498
Tested: build, boot

Change the API of the netif_napi_add family of functions so that `netif_napi_add` and `netif_napi_add_tx` use weight = NAPI_POLL_WEIGHT by default (as most drivers were already doing in one way or another), and add `netif_napi_add_weight` and `netif_napi_add_tx_weight` for drivers that want to specify a custom NAPI weight.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 10:28:04 -05:00
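As a rough illustration of the API shape described in the entry above, the following Python sketch models the four helpers. The real functions are C kernel APIs; the `registered` list and the tuple layout are hypothetical stand-ins introduced only for this example.

```python
NAPI_POLL_WEIGHT = 64  # default NAPI weight used by the kernel

# Hypothetical stand-in for the per-device list of NAPI instances.
registered = []


def netif_napi_add_weight(dev, napi, poll, weight):
    """Register an RX NAPI instance with an explicit weight."""
    registered.append((dev, napi, poll, weight, "rx"))


def netif_napi_add(dev, napi, poll):
    """Register an RX NAPI instance with the default weight."""
    netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT)


def netif_napi_add_tx_weight(dev, napi, poll, weight):
    """Register a TX-only NAPI instance with an explicit weight."""
    registered.append((dev, napi, poll, weight, "tx"))


def netif_napi_add_tx(dev, napi, poll):
    """Register a TX-only NAPI instance with the default weight."""
    netif_napi_add_tx_weight(dev, napi, poll, NAPI_POLL_WEIGHT)
```

In this model, only drivers that need a non-default weight call the `_weight()` variants; everything else drops the argument.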
Ivan Vecera f31181025a net: sched: use queue_mapping to pick tx queue
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 2f1e85b1aee459b7d0fd981839042c6a38ffaf0c
Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date:   Sat Apr 16 00:40:45 2022 +0800

    net: sched: use queue_mapping to pick tx queue

    This patch fixes the following issue:
    * If we install tc filters with act_skbedit in the clsact hook,
      it doesn't work, because netdev_core_pick_tx() overwrites
      queue_mapping.

      $ tc filter ... action skbedit queue_mapping 1

    And this patch is useful:
    * We can use FQ + EDT to implement efficient policies. Tx queues
      are picked by xps, ndo_select_queue of the netdev driver, or the skb hash
      in netdev_core_pick_tx(). In fact, the netdev driver and the skb
      hash are _not_ under our control. xps uses the CPU map to select Tx
      queues, but in most cases we can't figure out which pod/container
      task_struct is running on a given CPU. We can use clsact filters to
      classify one pod/container's traffic to one Tx queue. Why?

      In a container networking environment, there are two kinds of pod/
      container/net-namespace. For one kind (e.g. P1, P2), high throughput
      is key. But to avoid running out of network resources, the outbound
      traffic of these pods is limited, and they use or share dedicated Tx
      queues assigned an HTB/TBF/FQ Qdisc. For the other kind of pods
      (e.g. Pn), low latency of data access is key, and the traffic is not
      limited. These pods use or share other dedicated Tx queues assigned a FIFO Qdisc.
      This choice provides two benefits. First, contention on the HTB/FQ Qdisc
      lock is significantly reduced since fewer CPUs contend for the same queue.
      More importantly, Qdisc contention can be eliminated completely if each
      CPU has its own FIFO Qdisc for the second kind of pods.

      There must be a mechanism in place to support classifying traffic based on
      pods/container to different Tx queues. Note that clsact is outside of Qdisc
      while Qdisc can run a classifier to select a sub-queue under the lock.

      In general recording the decision in the skb seems a little heavy handed.
      This patch introduces a per-CPU variable, suggested by Eric.

      The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
      - A Tx Qdisc may also install skbedit actions, in which case the
        xmit.skip_txqueue flag is set in qdisc->enqueue() even though the Tx
        queue has already been selected by netdev_tx_queue_mapping() or
        netdev_core_pick_tx(). Clearing the flag first in __dev_queue_xmit()
        is useful to:
      - Avoid picking the Tx queue with netdev_tx_queue_mapping() in the next
        netdev in a case such as: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy.
        For example, eth0 (macvlan in a pod), whose root Qdisc installs skbedit
        queue_mapping, sends packets to eth0.3 (vlan in the host). __dev_queue_xmit()
        of eth0.3 clears the flag and does not select the Tx queue according to
        skb->queue_mapping, because there are no filters in clsact or the Tx Qdisc
        of this netdev. The same action is taken in eth0 (ixgbe in the host).
      - Avoid picking the Tx queue for the next packet. If we set xmit.skip_txqueue
        in the Tx Qdisc (qdisc->enqueue()), the proper way to clear it is to do so
        in __dev_queue_xmit() when processing the next packet.

      For performance reasons, a static key is used. If the user does not configure
      NET_EGRESS, the code added by this patch is not compiled in.

      +----+      +----+      +----+
      | P1 |      | P2 |      | Pn |
      +----+      +----+      +----+
        |           |           |
        +-----------+-----------+
                    |
                    | clsact/skbedit
                    |      MQ
                    v
        +-----------+-----------+
        | q0        | q1        | qn
        v           v           v
      HTB/FQ      HTB/FQ  ...  FIFO

    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: Paolo Abeni <pabeni@redhat.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Kevin Hao <haokexin@gmail.com>
    Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Cc: Antoine Tenart <atenart@kernel.org>
    Cc: Wei Wang <weiwan@google.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:02 +01:00
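The flag handling described in the entry above can be sketched as a small Python model. This is a simplification under stated assumptions, not the kernel code: `CpuState`, `Packet` and `skbedit_queue_mapping` are hypothetical names, the hash fallback stands in for netdev_core_pick_tx(), and falling back to queue 0 for an out-of-range mapping is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CpuState:
    # Stands in for the per-CPU xmit.skip_txqueue flag.
    skip_txqueue: bool = False


@dataclass
class Packet:
    flow_hash: int
    queue_mapping: int = 0


def skbedit_queue_mapping(queue: int) -> Callable[[CpuState, Packet], None]:
    """Model of 'action skbedit queue_mapping N' on the clsact egress hook."""
    def action(cpu: CpuState, pkt: Packet) -> None:
        pkt.queue_mapping = queue
        cpu.skip_txqueue = True  # ask the xmit path to honour queue_mapping
    return action


def dev_queue_xmit(cpu: CpuState, num_tx_queues: int, pkt: Packet,
                   egress_filter: Optional[Callable] = None) -> int:
    """Return the Tx queue index chosen for pkt on one device."""
    cpu.skip_txqueue = False  # always cleared first, per device and packet
    txq = None
    if egress_filter is not None:
        egress_filter(cpu, pkt)
        if cpu.skip_txqueue:
            # Honour the mapping chosen by the filter (capped to a valid queue).
            txq = pkt.queue_mapping if pkt.queue_mapping < num_tx_queues else 0
    if txq is None:
        # Default selection; a plain hash stands in for xps/driver/hash logic.
        txq = pkt.flow_hash % num_tx_queues
    return txq
```

Sending the same packet through a second device without filters falls back to the default selection, because the flag is cleared on entry even though `queue_mapping` is still set on the packet.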
Ivan Vecera d545c120ec netfilter: Introduce egress hook
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 42df6e1d221dddc0f2acf2be37e68d553ad65f96
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:03 2021 +0200

    netfilter: Introduce egress hook

    Support classifying packets with netfilter on egress to satisfy user
    requirements such as:
    * outbound security policies for containers (Laura)
    * filtering and mangling intra-node Direct Server Return (DSR) traffic
      on a load balancer (Laura)
    * filtering locally generated traffic coming in through AF_PACKET,
      such as local ARP traffic generated for clustering purposes or DHCP
      (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
    * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
      and gPTP with nftables (Pablo)
    * in the future: in-kernel NAT64/NAT46 (Pablo)

    The egress hook introduced herein complements the ingress hook added by
    commit e687ad60af ("netfilter: add netfilter ingress hook after
    handle_ing() under unique static key").  A patch for nftables to hook up
    egress rules from user space has been submitted separately, so users may
    immediately take advantage of the feature.

    Alternatively or in addition to netfilter, packets can be classified
    with traffic control (tc).  On ingress, packets are classified first by
    tc, then by netfilter.  On egress, the order is reversed for symmetry.
    Conceptually, tc and netfilter can be thought of as layers, with
    netfilter layered above tc.

    Traffic control is capable of redirecting packets to another interface
    (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
    host namespace to a container via a veth connection:
    tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

    In this case, netfilter egress classifying is not performed when leaving
    the host namespace!  That's because the packet is still on the tc layer.
    If tc redirects the packet to a physical interface in the host namespace
    such that it leaves the system, the packet is never subjected to
    netfilter egress classifying.  That is only logical since it hasn't
    passed through netfilter ingress classifying either.

    Packets can alternatively be redirected at the netfilter layer using
    nft fwd.  Such a packet *is* subjected to netfilter egress classifying
    since it has reached the netfilter layer.

    Internally, the skb->nf_skip_egress flag controls whether netfilter is
    invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
    be called recursively by tunnel drivers such as vxlan, the flag is
    reverted to false after sch_handle_egress().  This ensures that
    netfilter is applied both on the overlay and underlying network.

    Interaction between tc and netfilter is possible by setting and querying
    skb->mark.

    If netfilter egress classifying is not enabled on any interface, it is
    patched out of the data path by way of a static_key and doesn't make a
    performance difference that is discernible from noise:

    Before:             1537 1538 1538 1537 1538 1537 Mb/sec
    After:              1536 1534 1539 1539 1539 1540 Mb/sec
    Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
    After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
    Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
    After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec

    When netfilter egress classifying is enabled on at least one interface,
    a minimal performance penalty is incurred for every egress packet, even
    if the interface it's transmitted over doesn't have any netfilter egress
    rules configured.  That is caused by checking dev->nf_hooks_egress
    against NULL.

    Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
    ip link add dev foo type dummy
    ip link set dev foo up
    modprobe pktgen
    echo "add_device foo" > /proc/net/pktgen/kpktgend_3
    samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

    Accept all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

    Drop all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

    Apply this patch when measuring packet drops to avoid errors in dmesg:
    https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Cc: Laura García Liébana <nevola@gmail.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
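The layering and the skb->nf_skip_egress behaviour described in the entry above can be illustrated with a toy Python model. It is a sketch under assumptions: the function names are borrowed from the kernel for readability only, the `trace` list is a hypothetical device for showing hook order, and the real data path has many more steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Skb:
    nf_skip_egress: bool = False
    trace: List[str] = field(default_factory=list)


def dev_queue_xmit(skb: Skb, dev: str) -> None:
    """Egress: netfilter first (unless skipped), then tc."""
    if not skb.nf_skip_egress:
        skb.trace.append(f"nf egress ({dev})")
    skb.nf_skip_egress = True   # a tc redirect from here stays on the tc layer
    skb.trace.append(f"tc egress ({dev})")
    skb.nf_skip_egress = False  # reverted so nested xmit calls see netfilter


def netif_receive(skb: Skb, dev: str, tc_redirect_to: Optional[str] = None) -> None:
    """Ingress: tc first, then netfilter, unless tc redirects the packet."""
    skb.nf_skip_egress = True
    skb.trace.append(f"tc ingress ({dev})")
    if tc_redirect_to is not None:
        # Redirected at the tc layer: netfilter egress is not run.
        dev_queue_xmit(skb, tc_redirect_to)
        return
    skb.nf_skip_egress = False
    skb.trace.append(f"nf ingress ({dev})")
```

A packet redirected by tc on ingress therefore records only tc hooks, while a locally transmitted packet records netfilter egress followed by tc egress.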
Ivan Vecera 866706749c netfilter: Generalize ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 17d20784223d52bf1671f984c9e8d5d9b8ea171b
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:02 2021 +0200

    netfilter: Generalize ingress hook include file

    Prepare for addition of a netfilter egress hook by generalizing the
    ingress hook include file.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Ivan Vecera 3ccbb377fc netfilter: Rename ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 7463acfbe52ae8b7e0ea6890c1886b3f8ba8bddd
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:01 2021 +0200

    netfilter: Rename ingress hook include file

    Prepare for addition of a netfilter egress hook by renaming
    <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

    The egress hook also necessitates a refactoring of the include file,
    but that is done in a separate commit to ease reviewing.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Frantisek Hrbata 0fe0e3e4d8 Merge: CNB: net: HW counters for soft devices
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1580

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149
Tested: Using netdevsim hw_stats_l3.sh self-test

Commits:
```
22b67d17194f ("net: rtnetlink: rtnl_stats_get(): Emit an extack for unset filter_mask")
6b524a1d012b ("net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_*")
f6e0fb812988 ("net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed")
46efc97b7306 ("net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests")
05415bccbb09 ("net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()")
216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns")
9309f97aef6d ("net: dev: Add hardware stats support")
0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
03ba35667091 ("net: rtnetlink: Add RTM_SETSTATS")
5fd0b838efac ("net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS")
ba95e7930957 ("selftests: forwarding: hw_stats_l3: Add a new test")
57d29a2935c9 ("net: rtnetlink: fix error handling in rtnl_fill_statsinfo()")
23cfe941b52e ("rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS replies")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-08 09:08:22 -05:00
Frantisek Hrbata 5ac5a1dfd0 Merge: CNB: net: disambiguate the TSO and GSO limits
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1419

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using iperf3 and toggling gso/tso offloading knobs

Commits:
```
2106efda785b ("net: remove .ndo_change_proto_down")
2cc6cdd44a16 ("net: unexport a handful of dev_* functions")
6264f58ca0e5 ("net: extract a few internals from netdevice.h")
6df6398f7c8b ("net: add netif_inherit_tso_max()")
14d7b8122fd5 ("net: don't allow user space to lift the device limits")
ee8b7a1156f3 ("net: make drivers set the TSO limit not the GSO limit")
744d49daf8bd ("net: move netif_set_gso_max helpers")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-05 02:54:07 -04:00
Ivan Vecera a5a7be252a net: dev: Add hardware stats support
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149

commit 9309f97aef6d8250bb484dabeac925c3a7c57716
Author: Petr Machata <petrm@nvidia.com>
Date:   Wed Mar 2 18:31:20 2022 +0200

    net: dev: Add hardware stats support

    Offloading switch device drivers may be able to collect statistics of the
    traffic taking place in the HW datapath that pertains to a certain soft
    netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
    these statistics to the offloaded netdevice in question. The API was shaped
    by the following considerations:

    - Collection of HW statistics is not free: there may be a finite number of
      counters, and the act of counting may have a performance impact. It is
      therefore necessary to allow toggling whether HW counting should be done
      for any particular SW netdevice.

    - As the drivers are loaded and removed, a particular device may get
      offloaded and unoffloaded again. At the same time, the statistics values
      need to stay monotonic (modulo the eventual 64-bit wraparound),
      increasing only to reflect traffic measured in the device.

      To that end, the netdevice keeps around a lazily-allocated copy of struct
      rtnl_link_stats64. Device drivers then contribute to the values kept
      therein at various points. Even as the driver goes away, the struct stays
      around to maintain the statistics values.

    - Different HW devices may be able to count different things. The
      motivation behind this patch in particular is exposure of HW counters on
      Nvidia Spectrum switches, where the only practical approach to counting
      traffic on offloaded soft netdevices currently is to use router interface
      counters, and count L3 traffic. Correspondingly that is the statistics
      suite added in this patch.

      Other devices may be able to measure different kinds of traffic, and for
      that reason, the APIs are built to allow uniform access to different
      statistics suites.

    - Because soft netdevices and offloading drivers are only loosely bound, a
      netdevice uses a notifier chain to communicate with the drivers. Several
      new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
      to the offloading drivers.

    - Devices can have various conditions for when a particular counter is
      available. As the device is configured and reconfigured, the device
      offload may become or cease being suitable for counter binding. A
      netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
      ping offloading drivers and determine whether anyone currently implements
      a given statistics suite. This information can then be propagated to user
      space.

      When the driver decides to unoffload a netdevice, it can use a
      newly-added function, netdev_offload_xstats_report_delta(), to record
      outstanding collected statistics, before destroying the HW counter.

    This patch adds a helper, call_netdevice_notifiers_info_robust(), for
    dispatching a notifier with the possibility of unwind when one of the
    consumers bails. Given the wish to eventually get rid of the global
    notifier block altogether, this helper only invokes the per-netns notifier
    block.

    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-04 17:15:40 +01:00
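The monotonic-statistics idea from the entry above (a lazily allocated stats copy on the netdevice that drivers add deltas to) can be sketched as follows. This is an illustrative Python model, not the kernel implementation: `SoftNetdev`, `enable_l3_stats` and the two-field stats dictionary are hypothetical, and only `report_delta` loosely mirrors the role of netdev_offload_xstats_report_delta().

```python
from typing import Dict, Optional


class SoftNetdev:
    """Toy soft netdevice that keeps accumulated HW statistics."""

    def __init__(self) -> None:
        # Lazily allocated: stays None until HW counting is enabled.
        self.l3_stats: Optional[Dict[str, int]] = None

    def enable_l3_stats(self) -> None:
        if self.l3_stats is None:
            self.l3_stats = {"rx_packets": 0, "tx_packets": 0}

    def report_delta(self, delta: Dict[str, int]) -> None:
        """Called by an offloading driver to contribute collected counts."""
        if self.l3_stats is None:
            return  # counting was never enabled for this netdevice
        for key, value in delta.items():
            self.l3_stats[key] += value
```

Because the accumulated values live on the netdevice rather than in the driver, a driver that unoffloads the device can report its outstanding counts first, and a later driver simply keeps adding to the same totals.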
Íñigo Huguet 4ed32c17b9 netdev: reshuffle netif_napi_add() APIs to allow dropping weight
Bugzilla: https://bugzilla.redhat.com/2139498

commit 58caed3dacb4354a25a1aa8d2febc3e9648ba1f4
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon May 2 16:27:03 2022 -0700

    netdev: reshuffle netif_napi_add() APIs to allow dropping weight
    
    Most drivers should not have to worry about selecting the right
    weight for their NAPI instances and pass NAPI_POLL_WEIGHT.
    It'd be best if we didn't require the argument at all and selected
    the default internally.
    
    This change prepares the ground for such reshuffling, allowing
    for a smooth transition. The following API should remain after
    the next release cycle:
      netif_napi_add()
      netif_napi_add_weight()
      netif_napi_add_tx()
      netif_napi_add_tx_weight()
    Where the _weight() variants take an explicit weight argument.
    I opted for a _weight() suffix rather than a __ prefix, because
    we use __ in places to mean that caller needs to also issue a
    synchronize_net() call.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2022-11-04 16:46:33 +01:00
Ivan Vecera fccce056fa net: allow gro_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 0fe79f28bfaf73b66b7b1562d2468f94aa03bd12
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:34:03 2022 -0700

    net: allow gro_max_size to exceed 65536

    Allow gro_max_size to exceed 65536.

    There weren't really any external limitations that prevented this other
    than the fact that IPv4 only supports a 16 bit length field. Since we have
    the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
    exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
    via a constant rather than relying on gro_max_size.

    [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:56:09 +01:00
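The size rules in the entry above can be sketched as a small check. The two constants follow the commit message (65536 as the legacy cap, 8 * 65535 as the upper bound); the function names and the single `is_ipv6_tcp` flag are simplifying assumptions of this Python model, and the real GRO code checks more conditions.

```python
GRO_LEGACY_MAX_SIZE = 65536     # cap for IPv4 and non-TCP flows
GRO_MAX_SIZE = 8 * 65535        # upper bound, chosen to avoid overflows


def validate_gro_max_size(requested: int) -> int:
    """Reject a per-device gro_max_size above the global upper bound."""
    if requested > GRO_MAX_SIZE:
        raise ValueError("too big gro_max_size")
    return requested


def gro_can_merge(held_len: int, new_len: int, gro_max_size: int,
                  is_ipv6_tcp: bool) -> bool:
    """Decide whether new_len bytes may be merged into a held GRO packet."""
    total = held_len + new_len
    if total >= gro_max_size:
        return False
    # Only TCP over IPv6 may grow past the legacy 64K limit.
    if total >= GRO_LEGACY_MAX_SIZE and not is_ipv6_tcp:
        return False
    return True
```

With a device limit of 185000, for example, a 60000-byte held packet can absorb another 9000 bytes only for a TCP/IPv6 flow; an IPv4 flow stops at the legacy cap.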
Ivan Vecera d513603ec1 net: allow gso_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:33:57 2022 -0700

    net: allow gso_max_size to exceed 65536

    The code for gso_max_size was added originally to allow for debugging and
    workaround of buggy devices that couldn't support TSO with blocks 64K in
    size. The original reason for limiting it to 64K was that this was the
    existing limit of IPv4 and non-jumbogram IPv6 length fields.

    With the addition of Big TCP we can remove this limit and allow the value
    to potentially go up to UINT_MAX and instead be limited by the tso_max_size
    value.

    So in order to support this we need to go through and clean up the
    remaining users of the gso_max_size value so that the values will cap at
    64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
    so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
    limit for GSO_MAX_SIZE.

    v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                   in a new sk_trim_gso_size() helper.
                   netif_set_tso_max_size() caps the requested TSO size
                   with GSO_MAX_SIZE.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:52 +01:00
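The capping behaviour described in the entry above might look roughly like this in a Python model. The constants follow the commit message (64K as GSO_LEGACY_MAX_SIZE, UINT_MAX as GSO_MAX_SIZE at the time of this commit); `cap_tso_max_size` and `trim_gso_size` are hypothetical names that only approximate what netif_set_tso_max_size() and sk_trim_gso_size() are described as doing.

```python
UINT_MAX = 2**32 - 1
GSO_LEGACY_MAX_SIZE = 65536   # old 64K limit
GSO_MAX_SIZE = UINT_MAX       # new upper limit as of this commit


def cap_tso_max_size(requested: int) -> int:
    """Cap a driver's requested TSO size at GSO_MAX_SIZE."""
    return min(requested, GSO_MAX_SIZE)


def trim_gso_size(sk_gso_max_size: int, is_tcp_over_ipv6: bool) -> int:
    """Return the GSO size a socket may use.

    Only TCP over IPv6 flows keep a limit above 64K; all other
    flows are trimmed back to the legacy value.
    """
    if sk_gso_max_size <= GSO_LEGACY_MAX_SIZE:
        return sk_gso_max_size
    if is_tcp_over_ipv6:
        return sk_gso_max_size
    return GSO_LEGACY_MAX_SIZE
```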
Ivan Vecera 017d0aca36 gro: add ability to control gro max packet size
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

Conflicts:
- context due to existing backport of 14d7b8122fd5 ("net: don't allow
  user space to lift the device limits")

commit eac1b93c14d645ef147b049ace0d5230df755548
Author: Coco Li <lixiaoyan@google.com>
Date:   Wed Jan 5 02:48:38 2022 -0800

    gro: add ability to control gro max packet size

    Eric Dumazet suggested allowing users to modify the max GRO packet size.

    We have seen GRO being disabled by users of appliances (such as
    wifi access points) because of claimed bufferbloat issues,
    or some workarounds in sch_cake to split GRO/GSO packets.

    Instead of disabling GRO completely, one can choose to limit
    the maximum packet size of GRO packets, depending on their
    latency constraints.

    This patch adds a per-device gro_max_size attribute
    that can be changed with the ip link command.

    ip link set dev eth0 gro_max_size 16000

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Coco Li <lixiaoyan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:37 +01:00
Paolo Abeni 022665bacd net: skb: introduce and use a single page frag cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 28 10:43:09 2022 +0200

    net: skb: introduce and use a single page frag cache

    After commit 3226b158e6 ("net: avoid 32 x truesize under-estimation
    for tiny skbs") we are observing 10-20% regressions in performance
    tests with small packets. The perf trace points to high pressure on
    the slab allocator.

    This change tries to improve the allocation scheme for small packets
    using an idea originally suggested by Eric: a new per-CPU page frag is
    introduced and used in __napi_alloc_skb to cope with small allocation
    requests.

    To ensure that the above does not lead to excessive truesize
    underestimation, the frag size for small allocations is inflated to 1K
    and all of the above is restricted to builds with a 4K page size.

    Note that we need to update the run-time check introduced with
    commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper") accordingly.

    Alex suggested a smart page refcount scheme to reduce the number
    of atomic operations and deal properly with pfmemalloc pages.

    Under a small-packet UDP flood, I measure a 15% peak throughput increase.

    Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:12:04 +02:00
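The carving of a 4K page into 1K fragments described in the entry above can be modelled in a few lines of Python. This is a sketch under assumptions: `SmallPageFragCache` and `napi_alloc` are hypothetical names, page references are reduced to a counter, the "small" threshold is taken as exactly 1K, and the refcount and pfmemalloc handling mentioned in the commit is not modelled.

```python
from typing import Optional, Tuple

PAGE_SIZE = 4096
SMALL_FRAG_SIZE = 1024  # small allocations are inflated to 1K


class SmallPageFragCache:
    """Toy per-CPU cache handing out fixed 1K fragments of a 4K page."""

    def __init__(self) -> None:
        self.pages_allocated = 0
        self.offset = PAGE_SIZE  # forces a fresh page on first use

    def alloc(self) -> Tuple[int, int]:
        """Return (page index, offset within page) of a 1K fragment."""
        if self.offset + SMALL_FRAG_SIZE > PAGE_SIZE:
            # Current page is used up: take a new one.
            self.pages_allocated += 1
            self.offset = 0
        frag = (self.pages_allocated - 1, self.offset)
        self.offset += SMALL_FRAG_SIZE
        return frag


def napi_alloc(cache: SmallPageFragCache, length: int) -> Optional[Tuple[int, int]]:
    """Serve small requests from the cache; larger ones are not modelled."""
    if length <= SMALL_FRAG_SIZE:
        return cache.alloc()
    return None
```

Four small allocations share one page in this model, so the page allocator is hit once per four tiny packets instead of the slab allocator being hit for each one.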
Paolo Abeni 7822d83322 net: add napi_get_frags_check() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 09:04:38 2022 -0700

    net: add napi_get_frags_check() helper

    This is a follow up of commit 3226b158e6
    ("net: avoid 32 x truesize under-estimation for tiny skbs")

    When/if we increase MAX_SKB_FRAGS, we had better make sure
    the old bug does not come back.

    Adding a check in napi_get_frags() would be costly,
    even if using DEBUG_NET_WARN_ON_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:10:48 +02:00
Jiri Benc 2da69cb317 net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit cd14e9b7b8d312dfbf75ce1f78552902e51b9045
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:56:22 2022 -0800

    net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally

    The previous patches handled the delivery_time in the ingress path
    before the routing decision is made.  This patch can postpone clearing
    delivery_time in a skb until knowing it is delivered locally and also
    set the (rcv) timestamp if needed.  This patch moves the
    skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
    and ip6_input_finish().

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e0f797236e net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit d98d58a002619b5c165f1eedcd731e2fe2c19088
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:50 2022 -0800

    net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()

    The previous patches handled the delivery_time before sch_handle_ingress().

    This patch can now set the skb->mono_delivery_time to flag the skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
    and also clear it with skb_clear_delivery_time() after
    sch_handle_ingress().  This will make the bpf_redirect_*()
    to keep the mono delivery_time and used by a qdisc (fq) of
    the egress-ing interface.

    A latter patch will postpone the skb_clear_delivery_time() until the
    stack learns that the skb is being delivered locally and that will
    make other kernel forwarding paths (ip[6]_forward) able to keep
    the delivery_time also.  Thus, like the previous patches on using
    the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
    is not limited to CONFIG_NET_INGRESS, to avoid too much code
    churn in this set.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e17e09a099 net: Clear mono_delivery_time bit in __skb_tstamp_tx()
Bugzilla: https://bugzilla.redhat.com/2120966

commit d93376f503c7a586707925957592c0f16f4db0b1
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:44 2022 -0800

    net: Clear mono_delivery_time bit in __skb_tstamp_tx()

    __skb_tstamp_tx() may clone the egress skb and queue the clone to
    the sk_error_queue.  The outgoing skb may have the mono delivery_time
    while the (rcv) timestamp is expected for the clone, so the
    skb->mono_delivery_time bit needs to be cleared from the clone.

    This patch adds the skb->mono_delivery_time clearing to the existing
    __net_timestamp() and uses it in __skb_tstamp_tx().
    The __net_timestamp() fast path usage in dev.c is changed to directly
    call ktime_get_real() since the mono_delivery_time bit is not set at
    that point.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc c387356f8d net: Handle delivery_time in skb->tstamp during network tapping with af_packet
Bugzilla: https://bugzilla.redhat.com/2120966

commit 27942a15209f564ed8ee2a9e126cb7b105181355
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:38 2022 -0800

    net: Handle delivery_time in skb->tstamp during network tapping with af_packet

    A later patch will set the skb->mono_delivery_time to flag the skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
    skb_clear_tstamp() will then keep this delivery_time during forwarding.

    This patch makes network tapping (with af_packet) handle
    the delivery_time stored in skb->tstamp.

    Regardless of tapping at the ingress or egress,  the tapped skb is
    received by the af_packet socket, so it is ingress to the af_packet
    socket and it expects the (rcv) timestamp.

    When tapping at egress, dev_queue_xmit_nit() is used.  It already
    expects that skb->tstamp may hold a delivery_time, so it does
    skb_clone()+net_timestamp_set() to ensure the cloned skb has
    the (rcv) timestamp before passing it to the af_packet sk.
    This patch only adds clearing of the skb->mono_delivery_time
    bit to net_timestamp_set().

    When tapping at ingress, the code currently expects skb->tstamp to be
    either 0 or the (rcv) timestamp.  That is, the tapping at ingress path
    already expects that skb->tstamp could be 0 and gets
    the (rcv) timestamp by ktime_get_real() when needed.

    There are two cases for tapping at ingress:

    In the first case, af_packet queues the skb to its sk_receive_queue.
    The skb is either not shared or a new clone is created.  The newly
    added skb_clear_delivery_time() is called to clear the
    delivery_time (if any) and set the (rcv) timestamp if
    needed before the skb is queued to the sk_receive_queue.

    In the second case, the ingress skb is directly copied to the rx_ring
    and tpacket_get_timestamp() is used to get the (rcv) timestamp.
    The newly added skb_tstamp() is used in tpacket_get_timestamp()
    to check the skb->mono_delivery_time bit before returning skb->tstamp.
    As mentioned earlier, the tapping@ingress has already expected
    the skb may not have the (rcv) timestamp (because no sk has asked
    for it) and has handled this case by directly calling ktime_get_real().

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
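
Editorial sketch: the two helpers this commit message introduces, skb_tstamp() and skb_clear_delivery_time(), can be modelled in a few lines of user-space Python. This is a toy model of the behaviour described above, not the kernel code. The `Skb` class, the `rcv_tstamp_needed` flag (standing in for "some socket has asked for receive timestamps") and the injectable `now_ns` clock are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Skb:
    """Toy stand-in for struct sk_buff; only the two fields discussed here."""
    tstamp: int = 0                   # nanoseconds; 0 means "no timestamp"
    mono_delivery_time: bool = False  # True: tstamp holds a monotonic EDT


def skb_tstamp(skb):
    """Return the (rcv) timestamp, or 0 when tstamp holds a delivery time."""
    if skb.mono_delivery_time:
        return 0
    return skb.tstamp


def skb_clear_delivery_time(skb, rcv_tstamp_needed, now_ns=time.time_ns):
    """Drop a delivery time and, if needed, replace it with a (rcv) timestamp.

    An skb that does not carry a delivery time is left untouched.
    """
    if skb.mono_delivery_time:
        skb.mono_delivery_time = False
        skb.tstamp = now_ns() if rcv_tstamp_needed else 0
```

In this model a reader such as tpacket_get_timestamp() would see 0 from skb_tstamp() while the bit is set and fall back to taking a fresh timestamp, which matches the ingress behaviour the message describes.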
2022-10-25 14:58:00 +02:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```
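
With many events it can help to tally the reasons instead of reading them line by line. The following is a minimal sketch that assumes the `perf script` output format shown above (a `reason: NAME` suffix on each `skb:kfree_skb` line); the function name `count_drop_reasons` is made up for this example.

```python
import re
from collections import Counter

# Matches the "reason: NAME" suffix on skb:kfree_skb lines, as printed in
# the sample output above.
REASON_RE = re.compile(r"skb:kfree_skb:.*\breason: (\w+)")


def count_drop_reasons(lines):
    """Return a Counter mapping drop reason name -> number of events."""
    counts = Counter()
    for line in lines:
        match = REASON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_report(counts):
    """Render the tally, most frequent reason first."""
    return "\n".join(f"{n:8d} {reason}" for reason, n in counts.most_common())
```

Feeding it the lines of `perf script` output (for example via `sys.stdin`) and printing `format_report()` gives one line per drop reason.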

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Ivan Vecera 4ba4dadfe4 net: make drivers set the TSO limit not the GSO limit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c
  - small context conflicts
* drivers/net/usb/ax88179_178a.c
  - hunk removed, the driver does not call netif_set_gso_max_size()
* drivers/net/usb/lan78xx.c
  - modified due to absence of commits d383216a7efe ("lan78xx: Introduce
    Tx URB processing improvements") and 0dd87266c133 ("lan78xx: Remove
    hardware-specific header update")

commit ee8b7a1156f357613646d6c69d07ac5a087a1071
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:33 2022 -0700

    net: make drivers set the TSO limit not the GSO limit

    Drivers should call the TSO setting helper, GSO is controllable
    by user space.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 8f95afcecf net: don't allow user space to lift the device limits
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 14d7b8122fd591693a2388b98563707ba72c6780
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:32 2022 -0700

    net: don't allow user space to lift the device limits

    Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
    be set on device creation") the gso_max_segs and gso_max_size
    of a device were not controlled from user space.

    The quoted commit added the ability to control them because of
    the following setup:

     netns A  |  netns B
         veth<->veth   eth0

    If eth0 has TSO limitations and user wants to efficiently forward
    traffic between eth0 and the veths they should copy the TSO
    limitations of eth0 onto the veths. This would happen automatically
    for macvlans or ipvlan but veth users are not so lucky (given the
    loose coupling).

    Unfortunately the commit in question allowed users to also override
    the limits on real HW devices.

    It may be useful to control the max GSO size and someone may be using
    that ability (not that I know of any user), so create a separate set
    of knobs to reliably record the TSO limitations. Validate the user
    requests.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
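
Editorial sketch: the split between driver-recorded TSO limits and the user-tunable GSO size can be illustrated with a small model. It assumes only what the message above states, namely that the driver's limit is recorded separately and that user requests are validated against it. The class, function names and error text are hypothetical, not the kernel's API.

```python
class NetDevice:
    """Toy model holding only the two size limits discussed in the commit."""

    def __init__(self, tso_max_size):
        self.tso_max_size = tso_max_size  # recorded by the driver
        self.gso_max_size = tso_max_size  # tunable from user space


def driver_set_tso_max_size(dev, size):
    """Driver side: record the hardware limit and keep GSO within it."""
    dev.tso_max_size = size
    dev.gso_max_size = min(dev.gso_max_size, size)


def user_set_gso_max_size(dev, requested):
    """User side: the request may lower the GSO size but never lift it
    above the limit the driver recorded."""
    if requested > dev.tso_max_size:
        raise ValueError("gso_max_size exceeds the device TSO limit")
    dev.gso_max_size = requested
```

A veth user can still copy the limits of eth0 onto the veths in this model, but a request above a real device's recorded limit is rejected instead of silently overriding the hardware constraint.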
2022-10-18 10:27:21 +02:00
Ivan Vecera f9b471a989 net: add netif_inherit_tso_max()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 6df6398f7c8b481ce83f28143bc08a5231616deb
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:31 2022 -0700

    net: add netif_inherit_tso_max()

    To make later patches smaller create a helper for inheriting
    the TSO limitations of a lower device. The TSO in the name
    is not an accident, subsequent patches will replace GSO
    with TSO in more names.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 5a0eef8003 net: extract a few internals from netdevice.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- slightly modified due to missing 0b5c21bbc01e ("net: ensure
  net_todo_list is processed quickly") and d07b26f5bbea ("dev_addr:
  add a modification check")

commit 6264f58ca0e54e41d63c2d00334a48bac28fbf30
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:54 2022 -0700

    net: extract a few internals from netdevice.h

    There's a number of functions and static variables used
    under net/core/ but not from the outside. We currently
    dump most of them into netdevice.h. That's bad for many
    reasons:
     - netdevice.h is very cluttered, hard to figure out
       what the APIs are;
     - netdevice.h is very long;
     - we have to touch netdevice.h more which causes expensive
       incremental builds.

    Create a header under net/core/ and move some declarations.

    The new header is also a bit of a catch-all but that's
    fine, if we create more specific headers people will
    likely over-think where their declarations fit best
    and end up putting them in netdevice.h, again.

    More work should be done on splitting netdevice.h into more
    targeted headers, but that'd be more time consuming so small
    steps.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:16 +02:00
Antoine Tenart d3b8b917fb net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflicts:
- In __netif_receive_skb_core due to missing upstream commit
  625788b58445 ("net: add per-cpu storage and net->core_stats") in c9s.

commit 9f8ed577c28813410614b418bad42285840c1a00
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Apr 7 14:20:50 2022 +0800

    net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT

    As David Ahern suggested, the reasons for skb drops should be more
    general and not be code based.

    Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to
    SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no
    L3 protocol handler, no L4 protocol handler, version extensions, etc.

    Following the previous discussion, the aim is now to make these reasons
    more abstract and user-oriented rather than tied to the code.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 3f421c9474 net: dev: use kfree_skb_reason() for __netif_receive_skb_core()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 6c2728b7c14164928cb7cb9c847dead101b2d503
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:46 2022 +0800

    net: dev: use kfree_skb_reason() for __netif_receive_skb_core()

    Add a reason for skb drops to __netif_receive_skb_core() when no
    packet_type is found to handle the skb. For this purpose, the drop reason
    SKB_DROP_REASON_PTYPE_ABSENT is introduced. For Ethernet packets, for
    example, this case mainly happens when the L3 protocol is not supported.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 4fa8044e89 net: dev: use kfree_skb_reason() for sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit a568aff26ac03ee9eb1482683514914a5ec3b4c3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:45 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_ingress()

    Replace kfree_skb() used in sch_handle_ingress() with
    kfree_skb_reason(). The following drop reason is introduced:

    SKB_DROP_REASON_TC_INGRESS

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 9c9aa3ee0a net: dev: use kfree_skb_reason() for do_xdp_generic()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7e726ed81e1ddd5fdc431e02b94fcfe2a9876d42
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:44 2022 +0800

    net: dev: use kfree_skb_reason() for do_xdp_generic()

    Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_XDP is introduced for this case.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart db388f3375 net: dev: use kfree_skb_reason() for enqueue_to_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 44f0bd40803c0e04f1c8cd59df3c7acce783ae9c
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:43 2022 +0800

    net: dev: use kfree_skb_reason() for enqueue_to_backlog()

    Replace kfree_skb() used in enqueue_to_backlog() with
    kfree_skb_reason(). The skb drop reason SKB_DROP_REASON_CPU_BACKLOG is
    introduced for the case of failing to enqueue the skb to the per CPU
    backlog queue. The underlying cause can be a full backlog queue or the RPS
    flow limit, and I think we needn't make further distinctions.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
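
Editorial sketch: the single drop reason covering both failure causes can be modelled as below. This is a toy queue, not the kernel's per-CPU backlog; the `Backlog` class, the pluggable `flow_limited` callback and the exact `<=` comparison are assumptions for illustration.

```python
from collections import deque

SKB_DROP_REASON_CPU_BACKLOG = "CPU_BACKLOG"


class Backlog:
    """Toy per-CPU backlog that records why packets were dropped."""

    def __init__(self, max_backlog, flow_limited=lambda skb, qlen: False):
        self.queue = deque()
        self.max_backlog = max_backlog
        self.flow_limited = flow_limited  # hypothetical RPS flow limit check
        self.dropped = []                 # list of (skb, reason) pairs

    def enqueue(self, skb):
        """Queue the skb, or drop it with one reason for both causes."""
        qlen = len(self.queue)
        if qlen <= self.max_backlog and not self.flow_limited(skb, qlen):
            self.queue.append(skb)
            return True
        # Queue full and flow limit are deliberately not distinguished.
        self.dropped.append((skb, SKB_DROP_REASON_CPU_BACKLOG))
        return False
```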
2022-10-13 14:53:23 +02:00
Antoine Tenart b63c068d65 net: dev: add skb drop reasons to __dev_xmit_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7faef0547f4c29031a68d058918b031a8e520d49
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:42 2022 +0800

    net: dev: add skb drop reasons to __dev_xmit_skb()

    Add reasons for skb drops to __dev_xmit_skb() by replacing
    kfree_skb_list() with kfree_skb_list_reason(). The drop reason of
    SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 694219a303 net: dev: use kfree_skb_reason() for sch_handle_egress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 98b4d7a4e7374a44c4afd9f08330e72f6ad0d644
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:40 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_egress()

    Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering
    the code path of tc egress, we make it distinct from the drop reason
    of SKB_DROP_REASON_QDISC_DROP in the next commit.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Paolo Abeni 7403d40195 net: Fix a data-race around netdev_unregister_timeout_secs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: chunk applied into netdev_wait_allrefs() instead of
 netdev_wait_allrefs_any() and with different context as rhel-9
 lacks the upstream commit faab39f63c1fc ("net: allow out-of-order
 netdev unregistration")

Upstream commit:
commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:59 2022 -0700

    net: Fix a data-race around netdev_unregister_timeout_secs.

    While reading netdev_unregister_timeout_secs, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 48e48d197a net: Fix a data-race around netdev_budget_usecs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:55 2022 -0700

    net: Fix a data-race around netdev_budget_usecs.

    While reading netdev_budget_usecs, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 3d0c78c5c1 net: Fix a data-race around netdev_budget.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:53 2022 -0700

    net: Fix a data-race around netdev_budget.

    While reading netdev_budget, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 51b0bdedb8 ("[NET]: Separate two usages of netdev_max_backlog.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 08060d0717 net: Fix data-races around netdev_tstamp_prequeue.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 61adf447e38664447526698872e21c04623afb8e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:47 2022 -0700

    net: Fix data-races around netdev_tstamp_prequeue.

    While reading netdev_tstamp_prequeue, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 3b098e2d7c ("net: Consistent skb timestamping")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 13d50816f6 net: Fix data-races around netdev_max_backlog.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 5dcd08cd19912892586c6082d56718333e2d19db
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:46 2022 -0700

    net: Fix data-races around netdev_max_backlog.

    While reading netdev_max_backlog, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    While at it, we remove the unnecessary spaces in the doc.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 05d6206bdc net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:45 2022 -0700

    net: Fix data-races around weight_p and dev_weight_[rt]x_bias.

    While reading weight_p, it can be changed concurrently.  Thus, we need
    to add READ_ONCE() to its reader.

    Also, dev_[rt]x_weight can be read/written at the same time.  So, we
    need to use READ_ONCE() and WRITE_ONCE() for its access.  Moreover, to
    use the same weight_p while changing dev_[rt]x_weight, we add a mutex
    in proc_do_dev_weight().

    Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
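
Editorial sketch: the reason for the mutex in proc_do_dev_weight() is that both derived weights should be computed from one and the same snapshot of weight_p. A user-space model of that idea follows, assuming a lock-protected update. The `DevWeight` class and its `write()` method are hypothetical, and the default values are illustrative only.

```python
import threading


class DevWeight:
    """Toy model of weight_p, the rx/tx biases and the derived weights."""

    _TUNABLES = ("weight_p", "rx_bias", "tx_bias")

    def __init__(self, weight_p=64, rx_bias=1, tx_bias=1):
        self._lock = threading.Lock()
        self.weight_p = weight_p
        self.rx_bias = rx_bias
        self.tx_bias = tx_bias
        self.dev_rx_weight = weight_p * rx_bias
        self.dev_tx_weight = weight_p * tx_bias

    def write(self, **changes):
        """Apply tunable changes and recompute both weights atomically."""
        unknown = [name for name in changes if name not in self._TUNABLES]
        if unknown:
            raise KeyError(unknown[0])
        with self._lock:
            for name, value in changes.items():
                setattr(self, name, value)
            weight = self.weight_p  # one snapshot used for both products
            self.dev_rx_weight = weight * self.rx_bias
            self.dev_tx_weight = weight * self.tx_bias
```

Without the lock, two concurrent writers could leave dev_rx_weight and dev_tx_weight derived from different values of weight_p.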
2022-10-13 13:00:03 +02:00
Ivan Vecera 7ca7843425 net: unexport a handful of dev_* functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

commit 2cc6cdd44a1655ac5a9863529a2fd6dbed2d092c
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:53 2022 -0700

    net: unexport a handful of dev_* functions

    We have a bunch of functions which are only used under
    net/core/ yet they get exported. Remove the exports.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:03:08 +02:00
Ivan Vecera 616826f600 net: remove .ndo_change_proto_down
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to existing backport of 3b89b511ea0c ("net:
  fix IFF_TX_SKB_NO_LINEAR definition")

commit 2106efda785b55a8957efed9a52dfa28ee0d7280
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Nov 22 17:24:47 2021 -0800

    net: remove .ndo_change_proto_down

    .ndo_change_proto_down was added seemingly to enable out-of-tree
    implementations. Over 2.5yrs later we still have no real users
    upstream. Hardwire the generic implementation for now, we can
    revert once real users materialize. (rocker is a test vehicle,
    not a user.)

    We need to drop the optimization on the sysfs side, because
    unlike ndos priv_flags will be changed at runtime, so we'd
    need READ_ONCE/WRITE_ONCE everywhere.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:02:55 +02:00
Felix Maurer 8611666ff2 xdp: check prog type before updating BPF link
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 382778edc8262b7535f00523e9eb22edba1b9816
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Fri Jan 7 23:11:13 2022 +0100

    xdp: check prog type before updating BPF link

    The bpf_xdp_link_update() function didn't check the program type before
    updating the program, which made it possible to install any program type as
    an XDP program, which is obviously not good. Syzbot managed to trigger this
    by swapping in an LWT program on the XDP hook which would crash in a helper
    call.

    Fix this by adding a check and bailing out if the types don't match.

    Fixes: 026a4c28e1 ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link")
    Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
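
Editorial sketch: the fix amounts to comparing program types before swapping the program behind a link. The model below shows that check in isolation; the `Prog` class, the dict-based link and the choice of `-EINVAL` as the error value are assumptions for illustration, not the kernel's data structures.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class Prog:
    """Toy BPF program: a name and a program type string."""
    name: str
    type: str


def link_update(link, new_prog):
    """Replace link['prog'] only if the new program has the same type.

    Returns 0 on success or a negative errno value on mismatch.
    """
    old_prog = link["prog"]
    if new_prog.type != old_prog.type:
        return -errno.EINVAL  # bail out: types do not match
    link["prog"] = new_prog
    return 0
```

In this model, swapping an LWT program onto an XDP link is refused and the original program stays attached.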
2022-08-24 16:56:03 +02:00
Patrick Talbert 95ad1a9fa6 Merge: CNB: bpf: Let bpf_warn_invalid_xdp_action() report more info
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1070

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454
Tested: Build, boot.

The commit lets bpf_warn_invalid_xdp_action() report more info.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Mohamed Gamal Morsy <mgamal@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-15 09:40:47 +02:00
Patrick Talbert 5f85d33e47 Merge: net/core: backport fixes from upstream for 9.1 P2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1057

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278

The latest patch depends on the second latest patch.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-14 12:07:49 +02:00
Patrick Talbert c2f72a65cf Merge: CNB: gro: get out of core files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1066

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789
Tested: Just built - there is no functional change

The series moves GRO related definitions, declarations and code from core files into net/core/gro.h and include/net/gro.h, slimming down the oversized include/linux/netdevice.h and net/core/dev.c. Backporting this series provides <net/gro.h> for NIC drivers and avoids conflicts in future GRO related backports and fixes.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Conflicts:
- include/linux/netdevice.h: fuzz.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-12 10:36:03 +02:00
Patrick Talbert f063b56239 Merge: net: backport netdevice and netns refcount tracking and enable them for debug kernels
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1003

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377
Tested: Basic networking tasks using namespaces, vlans, veths, macvlans etc. with kernel-debug flavor

The upstream kernel recently introduced refcount tracking infrastructure for network devices and namespaces to help avoid resource leaks and use-after-free issues. This infrastructure should help our support teams debug customers' issues.
The series backports the following commits and enables both trackers for kernel debug flavors:

```
95d1d2490c27 ("netdevice: move xdp_rxq within netdev_rx_queue")
2a12ae5d433d ("net: inline sock_prot_inuse_add()")
d477eb900484 ("net: make sock_inuse_add() available")
4199bae10c49 ("net: merge net->core.prot_inuse and net->core.sock_inuse")
b3cb764aa1d7 ("net: drop nopreempt requirement on sock_prot_inuse_add()")
4e66934eaadc ("lib: add reference counting tracking infrastructure")
914a7b5000d0 ("lib: add tests for reference tracker")
4d92b95ff2f9 ("net: add net device refcount tracker infrastructure")
80e8921b2b72 ("net: add net device refcount tracker to struct netdev_rx_queue")
0b688f24b7d6 ("net: add net device refcount tracker to struct netdev_queue")
5ae2195088d0 ("net: add net device refcount tracker to ethtool_phys_id()")
14ed029b5eb5 ("net: add net device refcount tracker to dev_ifsioc()")
4dbd24f65c60 ("drop_monitor: add net device refcount tracker")
9038c320001d ("net: dst: add net device refcount tracking to dst_entry")
fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()")
c0fd407a0666 ("sit: add net device refcount tracking to ip_tunnel")
56c1c77948ba ("ipv6: add net device refcount tracker to struct ip6_tnl")
85662c9f8cbd ("net: add net device refcount tracker to struct neighbour")
77a23b1f9543 ("net: add net device refcount tracker to struct pneigh_entry")
08d622568e5a ("net: add net device refcount tracker to struct neigh_parms")
f77159a348f2 ("net: add net device refcount tracker to struct netdev_adjacent")
8c727003c4d0 ("ipv6: add net device refcount tracker to struct inet6_dev")
c04438f58d14 ("ipv4: add net device refcount tracker to struct in_device")
606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc")
63f13937cbe9 ("net: linkwatch: add net device refcount tracker")
095e200f175f ("net: failover: add net device refcount tracker")
42120a864383 ("ipmr, ip6mr: add net device refcount tracker to struct vif_device")
5fa5ae605821 ("netpoll: add net device refcount tracker to struct netpoll")
c0e5e11af12b ("vrf: use dev_replace_track() for better tracking")
08f0b22d731f ("net: eql: add net device refcount tracker")
19c9ebf6ed70 ("vlan: add net device refcount tracker")
b2dcdc7f731d ("net: bridge: add net device refcount tracker")
f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
4fc003fe0313 ("net: switchdev: add net device refcount tracker")
e44b14ebae10 ("inet: add net device refcount tracker to struct fib_nh_common")
66ce07f7802b ("ax25: add net device refcount tracker")
615d069dcf12 ("llc: add net device refcount tracker")
035f1f2b96ae ("pktgen add net device refcount tracker")
b60645248af3 ("net/smc: add net device tracker to struct smc_pnetentry")
e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info")
e7c8ab8419d7 ("openvswitch: add net device refcount tracker to struct vport")
ada066b2e02c ("net: sched: act_mirred: add net device refcount tracker")
4177e4960594 ("xfrm: use net device refcount tracker helpers")
9ba74e6c9e9d ("net: add networking namespace refcount tracker")
ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
04a931e58d19 ("net: add netns refcount tracker to struct seq_net_private")
dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts")
285ec2fef4b8 ("l2tp: add netns refcount tracker to l2tp_dfs_seq_data")
11b311a867b6 ("ppp: add netns refcount tracker")
0976b888a150 ("ethtool: fix null-ptr-deref on ref tracker")
e1b539bd73a7 ("xfrm: add net device refcount tracker to struct xfrm_state_offload")
8b40a9d53d4f ("ipv6: use GFP_ATOMIC in rt6_probe()")
1d2f3d3c6268 ("mptcp: adjust to use netns refcount tracker")
123e495ecc25 ("net: linkwatch: be more careful about dev->linkwatch_dev_tracker")
9280ac2e6f19 ("net: dev_replace_track() cleanup")
34ac17ecbf57 ("ethtool: use ethnl_parse_header_dev_put()")
f1d9268e0618 ("net: add net device refcount tracker to struct packet_type")
3bc14ea0d12a ("ethtool: always write dev in ethnl_parse_header_dev_get")
a9382d9389a0 ("netfilter: nfnetlink: add netns refcount tracker to struct nfulnl_instance")
30db406923b9 ("netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic")
7970a19b7104 ("netfilter: nf_nat_masquerade: defer conntrack walk to work queue")
fc0d026a2fad ("netfilter: nf_nat_masquerade: add netns refcount tracker to masq_dev_work")
88248c357c2a ("net/sched: add missing tracker information in qdisc_create()")
2d6ec25539b0 ("netlink: do not allocate a device refcount tracker in ethnl_default_notify()")
bf44077c1b3a ("af_packet: fix tracking issues in packet_do_bind()")
cb963a19d99f ("net: sched: do not allocate a tracker in tcf_exts_init()")
c12837d1bb31 ("ref_tracker: use __GFP_NOFAIL more carefully")
fcfb894d5952 ("net: bridge: fix net device refcount tracking issue in error path")
7b9b1d449a7c ("net/smc: fix possible NULL deref in smc_pnet_add_eth()")
6cdef8a6ee74 ("SUNRPC: add netns refcount tracker to struct svc_xprt")
9b1831e56c7f ("SUNRPC: add netns refcount tracker to struct gss_auth")
b9a0d6d143ec ("SUNRPC: add netns refcount tracker to struct rpc_xprt")
e3ececfe668f ("ref_tracker: implement use-after-free detection")
8fd5522f44dc ("ref_tracker: add a count of untracked references")
4c6c11ea0f7b ("net: refine dev_put()/dev_hold() debugging")
28f922213886 ("net/smc: fix ref_tracker issue in smc_pnet_add()")
94fdd7c02a56 ("net/smc: use GFP_ATOMIC allocation in smc_pnet_add_eth()")
b2309a71c1f2 ("net: add dev->dev_registered_tracker")
3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()")
ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-01 09:17:32 +02:00
Ivan Vecera ca7c7d9c0c bpf: Let bpf_warn_invalid_xdp_action() report more info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454

Conflicts:
- N/A hunk for unsupported octeontx2 driver omitted

commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Nov 30 11:08:07 2021 +0100

    bpf: Let bpf_warn_invalid_xdp_action() report more info

    In non-trivial scenarios, the action id alone is not sufficient to
    identify the program causing the warning. Before the previous patch,
    the generated stack-trace pointed out at least the involved device
    driver.

    Let's additionally include the program name and id, and the relevant
    device name.

    If users need additional information, they can fetch it via a kernel
    probe, leveraging the arguments added here.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 16:13:14 +02:00
Ivan Vecera 7ba9ae4395 net: gro: populate net/core/gro.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

Conflicts:
- adjusted due to existing backport of 7881453e4adf ("net: gro: avoid
  re-computing truesize twice on recycle")

commit 587652bbdd06ab38a4c1b85e40f933d2cf4a1147
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:05:54 2021 -0800

    net: gro: populate net/core/gro.c

    Move gro code and data from net/core/dev.c to net/core/gro.c
    to ease maintenance.

    gro_normal_list() and gro_normal_one() are inlined
    because they are called from both files.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:41 +02:00
Ivan Vecera e9721641ed net:dev: Change napi_gro_complete return type to void
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit 1643771eeb2db9b487cbbde12e2a3f6ed0171490
Author: Gyumin Hwang <hkm73560@gmail.com>
Date:   Sat Oct 2 08:11:36 2021 +0000

    net:dev: Change napi_gro_complete return type to void

    napi_gro_complete always returned the same value, NET_RX_SUCCESS,
    and the value was not used anywhere.

    Signed-off-by: Gyumin Hwang <hkm73560@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:40 +02:00
Ivan Vecera 2119ff5330 move netdev_boot_setup into Space.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit 5ea2f5ffde39251115ef9a566262fb9e52b91cb7
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Tue Aug 3 13:40:46 2021 +0200

    move netdev_boot_setup into Space.c

    This is now only used by a handful of old ISA drivers,
    and can be moved into the file they already all depend on.

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:39 +02:00
Hangbin Liu e4c3a2b313 net: fix data-race in dev_isalive()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit cc26c2661fef

Conflicts: context conflicts due to missing ae68db14b616 ("net: transition
netdev reg state earlier in run_todo") and 86213f80da1b ("net: avoid quadratic
behavior in netdev_wait_allrefs_any()")

commit cc26c2661fefea215f41edb665193324a5f99021
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 16 00:34:34 2022 -0700

    net: fix data-race in dev_isalive()

    dev_isalive() is called under RTNL or dev_base_lock protection.

    This means that changes to dev->reg_state should be done with both locks held.

    syzbot reported:

    BUG: KCSAN: data-race in register_netdevice / type_show

    write to 0xffff888144ecf518 of 1 bytes by task 20886 on cpu 0:
    register_netdevice+0xb9f/0xdf0 net/core/dev.c:10050
    lapbeth_new_device drivers/net/wan/lapbether.c:414 [inline]
    lapbeth_device_event+0x4a0/0x6c0 drivers/net/wan/lapbether.c:456
    notifier_call_chain kernel/notifier.c:87 [inline]
    raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:455
    __dev_notify_flags+0x1d6/0x3a0
    dev_change_flags+0xa2/0xc0 net/core/dev.c:8607
    do_setlink+0x778/0x2230 net/core/rtnetlink.c:2780
    __rtnl_newlink net/core/rtnetlink.c:3546 [inline]
    rtnl_newlink+0x114c/0x16a0 net/core/rtnetlink.c:3593
    rtnetlink_rcv_msg+0x811/0x8c0 net/core/rtnetlink.c:6089
    netlink_rcv_skb+0x13e/0x240 net/netlink/af_netlink.c:2501
    rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6107
    netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
    netlink_unicast+0x58a/0x660 net/netlink/af_netlink.c:1345
    netlink_sendmsg+0x661/0x750 net/netlink/af_netlink.c:1921
    sock_sendmsg_nosec net/socket.c:714 [inline]
    sock_sendmsg net/socket.c:734 [inline]
    __sys_sendto+0x21e/0x2c0 net/socket.c:2119
    __do_sys_sendto net/socket.c:2131 [inline]
    __se_sys_sendto net/socket.c:2127 [inline]
    __x64_sys_sendto+0x74/0x90 net/socket.c:2127
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    read to 0xffff888144ecf518 of 1 bytes by task 20423 on cpu 1:
    dev_isalive net/core/net-sysfs.c:38 [inline]
    netdev_show net/core/net-sysfs.c:50 [inline]
    type_show+0x24/0x90 net/core/net-sysfs.c:112
    dev_attr_show+0x35/0x90 drivers/base/core.c:2095
    sysfs_kf_seq_show+0x175/0x240 fs/sysfs/file.c:59
    kernfs_seq_show+0x75/0x80 fs/kernfs/file.c:162
    seq_read_iter+0x2c3/0x8e0 fs/seq_file.c:230
    kernfs_fop_read_iter+0xd1/0x2f0 fs/kernfs/file.c:235
    call_read_iter include/linux/fs.h:2052 [inline]
    new_sync_read fs/read_write.c:401 [inline]
    vfs_read+0x5a5/0x6a0 fs/read_write.c:482
    ksys_read+0xe8/0x1a0 fs/read_write.c:620
    __do_sys_read fs/read_write.c:630 [inline]
    __se_sys_read fs/read_write.c:628 [inline]
    __x64_sys_read+0x3e/0x50 fs/read_write.c:628
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    value changed: 0x00 -> 0x01

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 20423 Comm: udevd Tainted: G W 5.19.0-rc2-syzkaller-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 16:39:41 +08:00
Hangbin Liu ca3a0598a6 net: Write lock dev_base_lock without disabling bottom halves.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit fd888e85fe6b

commit fd888e85fe6b661e78044dddfec0be5271afa626
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri Nov 26 17:15:29 2021 +0100

    net: Write lock dev_base_lock without disabling bottom halves.

    The writer acquires dev_base_lock with disabled bottom halves.
    The reader can acquire dev_base_lock without disabling bottom halves
    because there is no writer in softirq context.

    On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
    as a lock to ensure that resources, that are protected by disabling
    bottom halves, remain protected.
    This leads to a circular locking dependency if the lock is acquired with
    disabled bottom halves (as in write_lock_bh()) and somewhere else with
    enabled bottom halves (as by read_lock() in netstat_show()) followed by
    disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
    -> spin_lock_bh()). This is the reverse locking order.

    All read_lock() invocations are from sysfs callbacks, which are not invoked
    from softirq context. Therefore there is no need to disable bottom
    halves while acquiring a write lock.

    Acquire the write lock of dev_base_lock without disabling bottom halves.

    Reported-by: Pei Zhang <pezhang@redhat.com>
    Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 16:37:14 +08:00
Hangbin Liu 7b9f2507ce net: fix dev_fill_forward_path with pppoe + bridge
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit cf2df74e202d

commit cf2df74e202d81b09f09d84c2d8903e0e87e9274
Author: Felix Fietkau <nbd@nbd.name>
Date:   Mon May 9 14:26:15 2022 +0200

    net: fix dev_fill_forward_path with pppoe + bridge

    When calling dev_fill_forward_path on a pppoe device, the provided destination
    address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
    code needs to update ctx->daddr to the correct value.
    Fix this by storing the address inside struct net_device_path_ctx.

    Fixes: f6efc675c9 ("net: ppp: resolve forwarding path for bridge pppoe devices")
    Signed-off-by: Felix Fietkau <nbd@nbd.name>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 12:26:09 +08:00
Patrick Talbert 164ce13234 Merge: CNB: Update TC subsystem to upstream v5.18
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/971

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2094002
Tested: Using TC related kernel self-tests

The series rebases TC subsystem to upstream v5.18

Commits:
```
f79a3bcb1a50 ("net/sched: Remove unnecessary if statement")
409f386b8e5d ("qdisc: add new field for qdisc_enqueue tracepoint")
56af5e749f20 ("net/sched: act_skbmod: Add SKBMOD_F_ECN option support")
68f9884837c6 ("tc-testing: Add control-plane selftest for skbmod SKBMOD_F_ECN option")
695176bfe5de ("net_sched: refactor TC action init API")
625af9f0298b ("tc-testing: Add control-plane selftests for sch_mq")
a5397d68b2db ("net/sched: cls_api, reset flags on replay")
efe487fce306 ("fix array-index-out-of-bounds in taprio_change")
1e080f17750d ("net: sched: update default qdisc visibility after Tx queue cnt changes")
2e367522ce6b ("netdevsim: add ability to change channel count")
2d6a58996ee2 ("selftests: net: test ethtool -L vs mq")
f7116fb46085 ("net: sched: move and reuse mq_change_real_num_tx()")
b193e15ac69d ("net: prevent user from passing illegal stab size")
69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers")
129291980f49 ("net: sched: Use struct_size() helper in kvmalloc()")
fbf307c89eb0 ("gen_stats: Add instead Set the value in __gnet_stats_copy_basic().")
448e163f8b9b ("gen_stats: Add gnet_stats_add_queue().")
7361df4606ba ("mq, mqprio: Use gnet_stats_add_queue().")
10940eb746d4 ("gen_stats: Move remaining users to gnet_stats_add_queue().")
f2efdb179289 ("u64_stats: Introduce u64_stats_set()")
67c9e6270f30 ("net: sched: Protect Qdisc::bstats with u64_stats")
f56940daa5a7 ("net: sched: Use _bstats_update/set() instead of raw writes")
50dc9a8572aa ("net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types")
29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter")
4c57e2fac41c ("net: sched: fix logic error in qdisc_run_begin()")
97604c65bcda ("net: sched: remove one pair of atomic operations")
6b3efbfa4e68 ("net: sch_tbf: Add a graft command")
e22db7bd552f ("net: sched: Allow statistics reads from softirq.")
c5c6e589a8c8 ("net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding.")
f25c0515c521 ("net: sched: gred: dynamically allocate tc_gred_qopt_offload")
267463823adb ("net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()")
85c0c3eb9a66 ("net: sch: simplify condtion for selecting mini_Qdisc_pair buffer")
648a991cf316 ("sch_htb: Add extack messages for EOPNOTSUPP errors")
6de6e46d27ef ("cls_flower: Fix inability to match GRE/IPIP packets")
af0a51113cb7 ("selftests: forwarding: Fix packet matching in mirroring selftests")
cb3ef7b00042 ("net: sched: sch_netem: Refactor code in 4-state loss generator")
bdf1565fe03d ("selftests/tc-testing: match any qdisc type")
b43c2793f5e9 ("netfilter: nfnetlink_queue: silence bogus compiler warning")
43332cf97425 ("net/sched: act_ct: Offload only ASSURED connections")
40bd094d65fc ("flow_offload: fill flags to action structure")
144d4c9e800d ("flow_offload: reject to offload tc actions in offload drivers")
5a9959008fb6 ("flow_offload: add index to flow_action_entry structure")
9c1c0e124ca2 ("flow_offload: rename offload functions with offload instead of flow")
c54e1d920f04 ("flow_offload: add ops to tc_action_ops for flow action setup")
8cbfe939abe9 ("flow_offload: allow user to offload tc action to net device")
7adc57651211 ("flow_offload: add skip_hw and skip_sw to control if offload the action")
bcd64368584b ("flow_offload: rename exts stats update functions with hw")
c7a66f8d8a94 ("flow_offload: add process to update action stats from hardware")
e8cb5bcf6ed6 ("net: sched: save full flags for tc action")
13926d19a11e ("flow_offload: add reoffload process to update hw_count")
c86e0209dc77 ("flow_offload: validate flags of filter and actions")
eb473bac4a4b ("selftests: tc-testing: add action offload selftest for action and filter")
c48c94b0ab75 ("net/sched: use min() macro instead of doing it manually")
963178a06352 ("flow_offload: fix suspicious RCU usage when offloading tc action")
9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx")
b702436a51df ("net: openvswitch: Fill act ct extension")
7d18a07897d0 ("sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc")
c25af830ab26 ("sch_cake: revise Diffserv docs")
719774377622 ("netfilter: conntrack: convert to refcount_t api")
3fce16493dc1 ("netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook")
285c8a7a5815 ("netfilter: make function op structures const")
6ae7989c9af0 ("netfilter: conntrack: avoid useless indirection during conntrack destruction")
408bdcfce8df ("net: prefer nf_ct_put instead of nf_conntrack_put")
fb80445c438c ("net_sched: restore "mpu xxx" handling")
973bf8fdd12f ("net: sched: Clarify error message when qdisc kind is unknown")
bb62a765b1b5 ("netfilter: conntrack: make all extensions 8-byte alignned")
5f31edc0676b ("netfilter: conntrack: move extension sizes into core")
1bc91a5ddf3e ("netfilter: conntrack: handle ->destroy hook via nat_ops instead")
1015c3de23ee ("netfilter: conntrack: remove extension register api")
34243b9ec856 ("netfilter: nft_ct: fix use after free when attaching zone template")
429c3be8a5e2 ("sch_htb: Fail on unsupported parameters when offload is requested")
98b608629746 ("net: sched: remove psched_tdiff_bounded()")
a459bc9a3a68 ("net: sched: remove qdisc_qlen_cpu()")
04c2a47ffb13 ("net: sched: fix use-after-free in tc_new_tfilter()")
35d39fecbc24 ("net/sched: Enable tc skb ext allocation on chain miss only when needed")
4ddc844eb81d ("net/sched: act_police: more accurate MTU policing")
5891cd5ec46c ("net_sched: add __rcu annotation to netdev->qdisc")
5740d0689096 ("net: sched: limit TC_ACT_REPEAT loops")
2f131de361f6 ("net/sched: act_ct: Fix flow table lookup after ct clear or switching zones")
ecf4a24cf978 ("net: sched: avoid newline at end of message in NL_SET_ERR_MSG_MOD")
b8cd5831c61c ("net: flow_offload: add tc police action parameters")
d97b4b105ce7 ("flow_offload: reject offload for all drivers with invalid police parameters")
fcb6aa86532c ("act_ct: Support GRE offload")
db6140e5e35a ("net/sched: act_ct: Fix flow table lookup failure with no originating ifindex")
d922a99b96d0 ("flow_offload: improve extack msg for user when adding invalid filter")
ab95465cde23 ("net/sched: add vlan push_eth and pop_eth action to the hardware IR")
054d5575cd6e ("net/sched: fix incorrect vlan_push_eth dest field")
bcb74e132a76 ("net/sched: act_ct: fix ref leak when switching zones")
2105f700b53c ("net/sched: flower: fix parsing of ethertype following VLAN header")
e65812fd22eb ("net/sched: fix initialization order when updating chain 0 head")
e8a64bbaaad1 ("net/sched: taprio: Check if socket flags are valid")
3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()")
ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()")
8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable")
4d42d54a7d6a ("net/sched: act_pedit: sanitize shift argument before usage")
86360030cc51 ("net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-21 10:07:08 +02:00
Ivan Vecera 056507f0cb net: add dev->dev_registered_tracker
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit b2309a71c1f2fc841feb184195b2e46b2e139bf4
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 7 10:41:07 2022 -0800

    net: add dev->dev_registered_tracker

    Convert one dev_hold()/dev_put() pair in register_netdevice()
    and unregister_netdevice_many() to dev_hold_track()
    and dev_put_track().

    This would allow detecting a rogue dev_put() a bit earlier.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:39:34 +02:00
Ivan Vecera 859ed7a9a3 net: refine dev_put()/dev_hold() debugging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit 4c6c11ea0f7b00a1894803efe980dfaf3b074886
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 4 14:42:37 2022 -0800

    net: refine dev_put()/dev_hold() debugging

    We are still chasing some syzbot reports where we think a rogue dev_put()
    is called with no corresponding prior dev_hold().
    Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
    dev_hold_track(), meaning that the refcount saturation splat comes
    too late to be useful.

    Make sure that 'not tracked' dev_put() and dev_hold() better use
    CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:

    Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
    to be called with a NULL @trackerp parameter, and to use a separate refcount
    only to detect too many put() even in the following case:

    dev_hold_track(dev, tracker_1, GFP_ATOMIC);
     dev_hold(dev);
     dev_put(dev);
     dev_put(dev); // Should complain loudly here.
    dev_put_track(dev, tracker_1); // instead of here

    Add clarification about netdev_tracker_alloc() role.

    v2: I replaced the dev_put() in linkwatch_do_dev()
        with __dev_put() because callers called netdev_tracker_free().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:39:33 +02:00
Ivan Vecera 6ce56701da net: add net device refcount tracker to struct netdev_adjacent
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit f77159a348f2d6078af7fe4933a60229d7c7aae2
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Dec 4 20:22:10 2021 -0800

    net: add net device refcount tracker to struct netdev_adjacent

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:38:19 +02:00
Ivan Vecera f516b70a26 net: add net device refcount tracker infrastructure
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

Conflicts:
- context conflict due to missing commit 5ea2f5ffde392 ("move
  netdev_boot_setup into Space.c")

commit 4d92b95ff2f95f13df9bad0b5a25a9f60e72758d
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Dec 4 20:21:57 2021 -0800

    net: add net device refcount tracker infrastructure

    Net devices are refcounted. Over the years we had numerous bugs
    caused by imbalanced dev_hold() and dev_put() calls.

    The general idea is to be able to precisely pair each decrement with
    a corresponding prior increment. Both share a cookie, basically
    a pointer to private data storing stack traces.

    This patch adds dev_hold_track() and dev_put_track().

    To use these helpers, each data structure owning a refcount
    should also use a "netdevice_tracker" to pair the hold and put.

    netdevice_tracker dev_tracker;
    ...
    dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
    ...
    dev_put_track(dev, &dev_tracker);

    Whenever a leak happens, we will get precise stack traces
    of the point dev_hold_track() happened, at device dismantle phase.

    We will also get a stack trace if too many dev_put_track() for the same
    netdevice_tracker are attempted.

    This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:36:42 +02:00
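The pairing idea described in the tracker infrastructure commit above can be illustrated with a small userspace model. This is only a sketch: `model_dev`, `model_tracker`, and the `model_*` functions are hypothetical stand-ins invented for illustration, not the kernel's `netdevice_tracker` or `dev_hold_track()`/`dev_put_track()` implementation, and a string replaces the saved stack trace.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcounted device. */
struct model_dev {
	int refcnt;        /* plain reference count */
	int live_trackers; /* tracked holds that have not been released */
	int errors;        /* unbalanced puts detected so far */
};

/* Hypothetical cookie shared by one hold and its matching put. */
struct model_tracker {
	bool active;       /* true between hold and put */
	const char *where; /* stands in for the recorded stack trace */
};

/* Take a reference and remember who took it. */
static void model_hold_track(struct model_dev *dev, struct model_tracker *t,
			     const char *where)
{
	dev->refcnt++;
	dev->live_trackers++;
	t->active = true;
	t->where = where;
}

/* Release a reference; a second put on the same tracker is reported. */
static void model_put_track(struct model_dev *dev, struct model_tracker *t)
{
	if (!t->active) {
		dev->errors++; /* put without a matching prior hold */
		return;
	}
	t->active = false;
	dev->live_trackers--;
	dev->refcnt--;
}

/* At dismantle time, any tracker still live is a leak. */
static int model_leaks_at_dismantle(const struct model_dev *dev)
{
	return dev->live_trackers;
}
```

In this model a double put is caught at the second call instead of silently consuming someone else's reference, and a leaked hold can be attributed to the tracker that is still active at dismantle.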
Patrick Talbert 0b353d8be8 Merge: CNB: net: consolidate netif_rx() and make it callable from any context
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/968

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703
Tested: basic network tests on lo, tun, veth

The series consolidates netif_rx() and makes it callable from any context.
It is a backport of these upstream series:
 da54d75bebf4d8 ("Merge branch 'netdev-RT'")
 9f9919f73c94ae ("Merge branch 'netif_rx'")
 83b7b77af37a89 ("Merge branch 'netif_rx-conversions-part2'")
 e21af12622c0fb ("Merge branch 'netif_rx-part3'")

Omitted-fix: b903117b48681e12fae38e09c874f38c45186dc6
Omitted-fix: e1f9e434617fb28097223d9484de66218bc0b52d

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-10 09:44:49 +02:00
Ivan Vecera 0cdfbe9c70 net: sched: update default qdisc visibility after Tx queue cnt changes
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Sep 13 15:53:30 2021 -0700

    net: sched: update default qdisc visibility after Tx queue cnt changes

    mq / mqprio make the default child qdiscs visible. They only do
    so for the qdiscs which are within real_num_tx_queues when the
    device is registered. Depending on order of calls in the driver,
    or if user space changes config via ethtool -L the number of
    qdiscs visible under tc qdisc show will differ from the number
    of queues. This is confusing to users and potentially to system
    configuration scripts which try to make sure qdiscs have the
    right parameters.

    Add a new Qdisc_ops callback and make relevant qdiscs TTRT.

    Note that this uncovers the "shortcut" created by
    commit 1f27cde313 ("net: sched: use pfifo_fast for non real queues")
    The default child qdiscs beyond initial real_num_tx are always
    pfifo_fast, no matter what the sysfs setting is. Fixing this
    gets a little tricky because we'd need to keep a reference
    on whatever the default qdisc was at the time of creation.
    In practice this is likely a non-issue: the qdiscs likely have
    to be configured to non-default settings, so whatever user space
    is doing such configuration can replace the pfifos... now that
    it will see them.

    Reported-by: Matthew Massey <matthewmassey@fb.com>
    Reviewed-by: Dave Taht <dave.taht@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:29:55 +02:00
Ivan Vecera bfa8b4c7ce net: add netif_set_real_num_queues() for device reconfig
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2094002

commit 271e5b7d00aeff7c61fb6c5415d14dbedb783b68
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Aug 3 06:05:26 2021 -0700

    net: add netif_set_real_num_queues() for device reconfig

    netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
    can fail which breaks drivers trying to implement reconfiguration
    in a way that can't leave the device half-broken. In other words
    those functions are incompatible with prepare/commit approach.

    Luckily setting real number of queues can fail only if the number
    is increased, meaning that if we order operations correctly we
    can guarantee ending up with either new config (success), or
    the old one (on error).

    Provide a helper implementing such logic so that drivers don't
    have to duplicate it.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:24:58 +02:00
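The ordering argument in the netif_set_real_num_queues() commit above (do increases first, because only they can fail, then decreases) can be sketched in userspace. The `qdev` struct, the `set_real_rx()`/`set_real_tx()` setters, and the `fail_tx_increase` knob are assumptions made for this illustration. The kernel helper operates on `struct net_device` and is not reproduced here.

```c
#include <errno.h>

/* Hypothetical model of a device's queue counts. */
struct qdev {
	unsigned int num_rx, num_tx;   /* allocated maxima */
	unsigned int real_rx, real_tx; /* currently active queues */
	int fail_tx_increase;          /* test knob: make a TX increase fail */
};

/* Model setter: RX changes always succeed here. */
static int set_real_rx(struct qdev *d, unsigned int n)
{
	d->real_rx = n;
	return 0;
}

/* Model setter: only an increase can fail, mirroring the commit text. */
static int set_real_tx(struct qdev *d, unsigned int n)
{
	if (n > d->real_tx && d->fail_tx_increase)
		return -ENOMEM;
	d->real_tx = n;
	return 0;
}

/* Ends with either the new configuration (0) or the old one (error). */
static int set_real_num_queues(struct qdev *d, unsigned int txq,
			       unsigned int rxq)
{
	unsigned int old_rx = d->real_rx;
	int err;

	if (txq < 1 || txq > d->num_tx || rxq < 1 || rxq > d->num_rx)
		return -EINVAL;

	/* Increases first: they are the only steps that can fail. */
	if (rxq > d->real_rx) {
		err = set_real_rx(d, rxq);
		if (err)
			return err;
	}
	if (txq > d->real_tx) {
		err = set_real_tx(d, txq);
		if (err) {
			/* Undo is a decrease or a no-op, so it cannot fail. */
			set_real_rx(d, old_rx);
			return err;
		}
	}

	/* Decreases last: nothing can fail from here on. */
	if (rxq < d->real_rx)
		set_real_rx(d, rxq);
	if (txq < d->real_tx)
		set_real_tx(d, txq);
	return 0;
}
```

Because the error path only ever shrinks a count back to its previous value, a caller in this model never observes a half-applied configuration.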
Petr Oros 6f8d815bcf net: dev: Use netif_rx().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit ad0a043fc26c17522ede3cc986d559f05ece20f4
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Mar 3 18:15:05 2022 +0100

    net: dev: Use netif_rx().

    Since commit
       baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")

    the function netif_rx() can be used in preemptible/thread context as
    well as in interrupt context.

    Use netif_rx().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:54:24 +02:00
Petr Oros ee3d25c7a3 net: Correct wrong BH disable in hard-interrupt.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit 167053f8dd0ed60287858448696b4784d7e1d899
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Feb 16 18:50:46 2022 +0100

    net: Correct wrong BH disable in hard-interrupt.

    I missed the obvious case where netif_rx() is invoked from hard-IRQ
    context.

    Disabling bottom halves is only needed in process context. This ensures
    that the code remains on the current CPU and that the soft-interrupts
    are processed at local_bh_enable() time.
    In hard- and soft-interrupt context this is already the case and the
    soft-interrupts will be processed once the context is left (at irq-exit
    time).

    Disable bottom halves if neither hard-interrupts nor soft-interrupts are
    disabled. Update the kernel-doc, mention that interrupts must be enabled
    if invoked from process context.

    Fixes: baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/r/Yg05duINKBqvnxUc@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:54:20 +02:00
Petr Oros 32c9187bad net: dev: Make rps_lock() disable interrupts.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit e722db8de6e6932267457ace2657a19015f3db4a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:39 2022 +0100

    net: dev: Make rps_lock() disable interrupts.

    Disabling interrupts and in the RPS case locking input_pkt_queue is
    split into local_irq_disable() and optional spin_lock().

    This breaks on PREEMPT_RT because the spinlock_t typed lock can not be
    acquired with disabled interrupts.
    The sections in which the lock is acquired are usually short in the sense
    that they do not cause long and unbounded latencies. One exception is the
    skb_flow_limit() invocation which may invoke a BPF program (and may
    require sleeping locks).

    By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep
    interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels.
    Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens
    as part of local_bh_disable() on the local CPU.
    ____napi_schedule() is only invoked if sd is from the local CPU. Replace
    it with __napi_schedule_irqoff() which already disables interrupts on
    PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the
    function to napi_schedule_rps as suggested by Jakub.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:38 +02:00
Petr Oros 56766d1469 net: dev: Makes sure netif_rx() can be invoked in any context.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Conflicts:
- drivers/net/amt.c: hunk not merged because the file is missing in RHEL

Upstream commit(s):
commit baebdf48c360080710f80699eea3affbb13d6c65
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:38 2022 +0100

    net: dev: Makes sure netif_rx() can be invoked in any context.

    Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
    work in all contexts and get rid of netif_rx_ni()". Eric agreed and
    pointed out that modern devices should use netif_receive_skb() to avoid
    the overhead.
    In the meantime someone added another variant, netif_rx_any_context(),
    which behaves as suggested.

    netif_rx() must be invoked with disabled bottom halves to ensure that
    pending softirqs, which were raised within the function, are handled.
    netif_rx_ni() can be invoked only from process context (bottom halves
    must be enabled) because the function handles pending softirqs without
    checking if bottom halves were disabled or not.
    netif_rx_any_context() invokes one of the former functions by checking
    in_interrupt().

    netif_rx() could be taught to handle both cases (disabled and enabled
    bottom halves) by simply disabling bottom halves while invoking
    netif_rx_internal(). The local_bh_enable() invocation will then invoke
    pending softirqs only if the BH-disable counter drops to zero.

    Eric is concerned about the overhead of BH-disable+enable especially in
    regard to the loopback driver. As critical as this driver is, it will
    receive a shortcut to avoid the additional overhead which is not needed.

    Add a local_bh_disable() section in netif_rx() to ensure softirqs are
    handled if needed.
    Provide __netif_rx() which does not disable BH and has a lockdep assert
    to ensure that interrupts are disabled. Use this shortcut in the
    loopback driver and in drivers/net/*.c.
    Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
    can be removed once there are no more users left.

    Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
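
A minimal userspace sketch of the nesting rule described in the message above, under a simplified model. The names `struct bh_model`, `model_bh_disable()`, `model_bh_enable()` and `model_netif_rx()` are hypothetical stand-ins, not the kernel functions. It only illustrates why wrapping the internal call in a disable/enable pair can work in both contexts: pending work runs when the counter drops back to zero, and not earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not kernel code. */
struct bh_model {
    int disable_depth;    /* models the BH-disable counter */
    bool softirq_pending; /* a softirq was raised and has not run yet */
    int softirqs_run;     /* number of times pending work was processed */
};

static void model_bh_disable(struct bh_model *m)
{
    m->disable_depth++;
}

static void model_bh_enable(struct bh_model *m)
{
    assert(m->disable_depth > 0);
    m->disable_depth--;
    /* Pending softirqs are handled only by the outermost enable. */
    if (m->disable_depth == 0 && m->softirq_pending) {
        m->softirq_pending = false;
        m->softirqs_run++;
    }
}

/* Stand-in for the internal receive path, which may raise a softirq. */
static void model_netif_rx_internal(struct bh_model *m)
{
    m->softirq_pending = true;
}

/* Safe to call with the counter at zero or already elevated. */
static void model_netif_rx(struct bh_model *m)
{
    model_bh_disable(m);
    model_netif_rx_internal(m);
    model_bh_enable(m);
}
```

Called with the counter at zero, the model processes the raised softirq before returning; called inside an already disabled section, the work stays pending until the outermost enable.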

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:37 +02:00
Petr Oros c15df5c592 net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit f234ae2947612825686b25cae3e9579188a6ba95
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:37 2022 +0100

    net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().

    The preempt_disable() section was introduced in commit
        cece1945bf ("net: disable preemption before call smp_processor_id()")

    which added it in case this function is invoked from preemptible context,
    and because get_cpu() was added later on.

    The get_cpu() usage was added in commit
        b0e28f1eff ("net: netif_rx() must disable preemption")

    because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption
    causing a warning in smp_processor_id(). The function netif_rx() should
    only be invoked from an interrupt context which implies disabled
    preemption. The commit
       e30b38c298 ("ip: Fix ip_dev_loopback_xmit()")

    was addressing this and replaced netif_rx() with netif_rx_ni() in
    ip_dev_loopback_xmit().

    Based on the discussion on the list, the former patch (b0e28f1eff)
    should not have been applied, only the latter (e30b38c298).

    Remove get_cpu() and preempt_disable() since the function is supposed to
    be invoked from context with stable per-CPU pointers. Bottom halves have
    to be disabled at this point because the function may raise softirqs
    which need to be processed.

    Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:37 +02:00
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00
Patrick Talbert 092af648a0 Merge: bpf: update to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/572

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041365

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-26 09:27:25 +02:00
Jiri Benc 7e6f15045c net: in_irq() cleanup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit afa79d08c6c8e1901cb1547591e3ccd3ec6965d9
Author: Changbin Du <changbin.du@intel.com>
Date:   Fri Aug 13 22:57:49 2021 +0800

    net: in_irq() cleanup

    Replace the obsolete and ambiguous macro in_irq() with the new
    macro in_hardirq().

    Signed-off-by: Changbin Du <changbin.du@gmail.com>
    Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:49 +02:00
Jiri Benc c773bf00b4 net, core: Allow netdev_lower_get_next_private_rcu in bh context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 689186699931313c7a42462602bd5c03eef77f9f
Author: Jussi Maki <joamaki@gmail.com>
Date:   Sat Jul 31 05:57:36 2021 +0000

    net, core: Allow netdev_lower_get_next_private_rcu in bh context

    For the XDP bonding slave lookup to work in the NAPI poll context in which
    the redundant rcu_read_lock() has been removed we have to follow the same
    approach as in 694cea395f ("bpf: Allow RCU-protected lookups to happen
    from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().

    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:48 +02:00
Jiri Benc 88b4e5f8ea net, core: Add support for XDP redirection to slave device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- Using lower case __bpf_prog_run in bpf_prog_run_xdp due to out of order
  backport of fb7dd8bca013 ("bpf: Refactor BPF_PROG_RUN into a function")

commit 879af96ffd72706c6e3278ea6b45b0b0e37ec5d7
Author: Jussi Maki <joamaki@gmail.com>
Date:   Sat Jul 31 05:57:33 2021 +0000

    net, core: Add support for XDP redirection to slave device

    This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
    into XDP_REDIRECT after BPF program run when the ingress device
    is a bond slave.

    The dev_xdp_prog_count is exposed so that slave devices can be checked
    for loaded XDP programs in order to avoid the situation where both
    bond master and slave have programs loaded according to xdp_state.

    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Jay Vosburgh <j.vosburgh@gmail.com>
    Cc: Veaceslav Falico <vfalico@gmail.com>
    Cc: Andy Gospodarek <andy@greyhouse.net>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:47 +02:00
Hangbin Liu b2ce8f1b0b net: initialize init_net earlier
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 9c1be1935fb6

Conflicts: context conflicts due to missing commit
41467d2ff4df ("net: net_namespace: Optimize the code")

commit 9c1be1935fb68b2413796cdc03d019b8cf35ab51
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Feb 5 09:01:25 2022 -0800

    net: initialize init_net earlier

    While testing a patch that will follow later
    ("net: add netns refcount tracker to struct nsproxy")
    I found that devtmpfs_init() was called before init_net
    was initialized.

    This is a bug, because devtmpfs_setup() calls
    ksys_unshare(CLONE_NEWNS);

    This has the effect of increasing init_net refcount,
    which will be later overwritten to 1, as part of setup_net(&init_net)

    We had too many prior patches [1] trying to work around the root cause.

    Really, make sure init_net is in BSS section, and that net_ns_init()
    is called earlier at boot time.

    Note that another patch ("vfs: add netns refcount tracker
    to struct fs_context") also will need net_ns_init() being called
    before vfs_caches_init()

    As a bonus, this patch saves around 4KB in .data section.

    [1]

    f8c46cb390 ("netns: do not call pernet ops for not yet set up init_net namespace")
    b5082df801 ("net: Initialise init_net.count to 1")
    734b65417b ("net: Statically initialize init_net.dev_base_head")

    v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu 970a02e10a net: gro: avoid re-computing truesize twice on recycle
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 7881453e4adf

Conflicts: there is no net/core/gro.c due to missing commit
587652bbdd06 ("net: gro: populate net/core/gro.c")

commit 7881453e4adf497cf9109c84fa21eedda9ac6164
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Feb 4 12:28:36 2022 +0100

    net: gro: avoid re-computing truesize twice on recycle

    After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
    carring sock reference") and commit af352460b465 ("net: fix GRO
    skb truesize update") the truesize of the skb with stolen head is
    properly updated by the GRO engine, so we no longer need to reset
    it at recycle time.

    v1 -> v2:
     - clarify the commit message (Alexander)

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Hangbin Liu 9ef759e929 net: annotate data-races on txq->xmit_lock_owner
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 7a10d8c810cf

commit 7a10d8c810cfad3e79372d7d1c77899d86cd6662
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 30 09:01:55 2021 -0800

    net: annotate data-races on txq->xmit_lock_owner

    syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner
    without annotations.

    No serious issue there, let's document what is happening there.

    BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit

    write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0:
     __netif_tx_unlock include/linux/netdevice.h:4437 [inline]
     __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229
     dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
     macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
     macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
     __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
     netdev_start_xmit include/linux/netdevice.h:5001 [inline]
     xmit_one+0x105/0x2f0 net/core/dev.c:3590
     dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
     sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
     __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
     __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
     dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
     neigh_hh_output include/net/neighbour.h:511 [inline]
     neigh_output include/net/neighbour.h:525 [inline]
     ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126
     __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
     ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
     NF_HOOK_COND include/linux/netfilter.h:296 [inline]
     ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
     dst_output include/net/dst.h:450 [inline]
     NF_HOOK include/linux/netfilter.h:307 [inline]
     ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
     ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
     addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
     call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
     expire_timers+0x116/0x240 kernel/time/timer.c:1466
     __run_timers+0x368/0x410 kernel/time/timer.c:1734
     run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
     __do_softirq+0x158/0x2de kernel/softirq.c:558
     __irq_exit_rcu kernel/softirq.c:636 [inline]
     irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
     sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097
     asm_sysvec_apic_timer_interrupt+0x12/0x20

    read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1:
     __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213
     dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
     macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
     macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
     __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
     netdev_start_xmit include/linux/netdevice.h:5001 [inline]
     xmit_one+0x105/0x2f0 net/core/dev.c:3590
     dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
     sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
     __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
     __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
     dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
     neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523
     neigh_output include/net/neighbour.h:527 [inline]
     ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126
     __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
     ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
     NF_HOOK_COND include/linux/netfilter.h:296 [inline]
     ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
     dst_output include/net/dst.h:450 [inline]
     NF_HOOK include/linux/netfilter.h:307 [inline]
     ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
     ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
     addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
     call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
     expire_timers+0x116/0x240 kernel/time/timer.c:1466
     __run_timers+0x368/0x410 kernel/time/timer.c:1734
     run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
     __do_softirq+0x158/0x2de kernel/softirq.c:558
     __irq_exit_rcu kernel/softirq.c:636 [inline]
     irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
     sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097
     asm_sysvec_apic_timer_interrupt+0x12/0x20
     kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443
     folio_test_anon include/linux/page-flags.h:581 [inline]
     PageAnon include/linux/page-flags.h:586 [inline]
     zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347
     zap_pmd_range mm/memory.c:1467 [inline]
     zap_pud_range mm/memory.c:1496 [inline]
     zap_p4d_range mm/memory.c:1517 [inline]
     unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538
     unmap_single_vma+0x157/0x210 mm/memory.c:1583
     unmap_vmas+0xd0/0x180 mm/memory.c:1615
     exit_mmap+0x23d/0x470 mm/mmap.c:3170
     __mmput+0x27/0x1b0 kernel/fork.c:1113
     mmput+0x3d/0x50 kernel/fork.c:1134
     exit_mm+0xdb/0x170 kernel/exit.c:507
     do_exit+0x608/0x17a0 kernel/exit.c:819
     do_group_exit+0xce/0x180 kernel/exit.c:929
     get_signal+0xfc3/0x1550 kernel/signal.c:2852
     arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868
     handle_signal_work kernel/entry/common.c:148 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
     exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207
     __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
     syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300
     do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x00000000 -> 0xffffffff

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G        W         5.16.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Hangbin Liu 1928aa8364 net: multicast: calculate csum of looped-back and forwarded packets
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 9122a70a6333

commit 9122a70a6333705c0c35614ddc51c274ed1d3637
Author: Cyril Strejc <cyril.strejc@skoda.cz>
Date:   Sun Oct 24 22:14:25 2021 +0200

    net: multicast: calculate csum of looped-back and forwarded packets

    While testing a user-space application which transmits UDP
    multicast datagrams and utilizes multicast routing to send the UDP
    datagrams out of defined network interfaces, I've found a multicast
    router does not fill-in UDP checksum into locally produced, looped-back
    and forwarded UDP datagrams, if an original output NIC the datagrams
    are sent to has UDP TX checksum offload enabled.

    The datagrams are sent malformed out of the NIC the datagrams have been
    forwarded to.

    It is because:

    1. If TX checksum offload is enabled on the output NIC, UDP checksum
       is not calculated by kernel and is not filled into skb data.

    2. dev_loopback_xmit(), which is called solely by
       ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
       unconditionally.

    3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
       CHECKSUM_COMPLETE"), the ip_summed value is preserved during
       forwarding.

    4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
       a packet egress.

    The minimum fix in dev_loopback_xmit():

    1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
       case when the original output NIC has TX checksum offload enabled.
       The effects are:

         a) If the forwarding destination interface supports TX checksum
            offloading, the NIC driver is responsible to fill-in the
            checksum.

         b) If the forwarding destination interface does NOT support TX
            checksum offloading, checksums are filled-in by kernel before
            skb is submitted to the NIC driver.

         c) For local delivery, checksum validation is skipped as in the
            case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().

    2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
       means, for CHECKSUM_NONE, the behavior is unmodified and is there
       to skip a looped-back packet local delivery checksum validation.

    Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
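
The fix described in the message above can be sketched as a small userspace model. The enum, `struct model_skb` and both helpers are hypothetical names chosen for illustration; the kernel's own constants and their numeric values are not reproduced here. The sketch assumes only what the message states: CHECKSUM_PARTIAL is preserved, CHECKSUM_NONE becomes CHECKSUM_UNNECESSARY, and a checksum is filled in on egress only for CHECKSUM_PARTIAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's ip_summed states. */
enum model_ip_summed {
    MODEL_CHECKSUM_NONE,
    MODEL_CHECKSUM_UNNECESSARY,
    MODEL_CHECKSUM_PARTIAL,
};

struct model_skb {
    enum model_ip_summed ip_summed;
};

/* Models the loopback fix-up: translate NONE, leave PARTIAL untouched. */
static void model_loopback_fixup(struct model_skb *skb)
{
    assert(skb != NULL);
    if (skb->ip_summed == MODEL_CHECKSUM_NONE)
        skb->ip_summed = MODEL_CHECKSUM_UNNECESSARY;
}

/* Per point 4 of the analysis: only PARTIAL gets a checksum on egress. */
static bool model_egress_fills_checksum(const struct model_skb *skb)
{
    return skb->ip_summed == MODEL_CHECKSUM_PARTIAL;
}
```

With the previous unconditional assignment, a packet that relied on TX checksum offload would reach the forwarding path marked as not needing a checksum, which is the malformed-datagram case the message reports.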

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Jerome Marchand 850123ac6a bpf: devmap: Implement devmap prog execution for generic XDP
Bugzilla: http://bugzilla.redhat.com/2041365

commit 2ea5eabaf04a1829383aefe98ac38a2e5ae2d698
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:24 2021 +0530

    bpf: devmap: Implement devmap prog execution for generic XDP

    This lifts the restriction on running devmap BPF progs in generic
    redirect mode. To match native XDP behavior, it is invoked right before
    generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
    XDP_DROP actions.

    We also return 0 even if devmap program drops the packet, as
    semantically redirect has already succeeded and the devmap prog is the
    last point before TX of the packet to device where it can deliver a
    verdict on the packet.

    This also means it must take care of freeing the skb, as
    xdp_do_generic_redirect callers only do that in case an error is
    returned.

    Since devmap entry prog is supported, remove the check in
    generic_xdp_install entirely.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Jerome Marchand 01fb58edc6 bpf: cpumap: Implement generic cpumap
Bugzilla: http://bugzilla.redhat.com/2041365

commit 11941f8a85362f612df61f4aaab0e41b64d2111d
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:23 2021 +0530

    bpf: cpumap: Implement generic cpumap

    This change implements CPUMAP redirect support for generic XDP programs.
    The idea is to reuse the cpu map entry's queue that is used to push
    native xdp frames for redirecting skb to a different CPU. This will
    match native XDP behavior (in that RPS is invoked again for packet
    reinjected into networking stack).

    To be able to determine whether the incoming skb is from the driver or
    cpumap, we reuse skb->redirected bit that skips generic XDP processing
    when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
    it has been lifted and it is always available.

    From the redirect side, we add the skb to ptr_ring with its lowest bit
    set to 1.  This should be safe as skb is not 1-byte aligned. This allows
    kthread to discern between xdp_frames and sk_buff. On consumption of the
    ptr_ring item, the lowest bit is unset.

    In the end, the skb is simply added to the list that kthread is anyway
    going to maintain for xdp_frames converted to skb, and then received
    again by using netif_receive_skb_list.

    Bulking optimization for generic cpumap is left as an exercise for a
    future patch for now.

    Since cpumap entry progs are now supported, also remove check in
    generic_xdp_install for the cpumap.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
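
The low-bit tagging described in the message above can be illustrated with a short userspace sketch. `struct fake_skb`, `struct fake_frame`, `SKB_TAG` and the three helpers are hypothetical names, not the kernel's; the sketch assumes only that both object types are aligned to more than one byte, so bit 0 of a valid pointer is always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKB_TAG ((uintptr_t)1) /* hypothetical tag bit: lowest pointer bit */

struct fake_skb   { int len; }; /* stand-in for sk_buff */
struct fake_frame { int len; }; /* stand-in for xdp_frame */

/* Producer side: mark the ring entry as carrying an skb. */
static void *tag_skb(struct fake_skb *skb)
{
    /* Relies on alignment: a valid pointer never has bit 0 set. */
    assert(((uintptr_t)skb & SKB_TAG) == 0);
    return (void *)((uintptr_t)skb | SKB_TAG);
}

/* Consumer side: tell the two entry kinds apart. */
static bool entry_is_skb(const void *entry)
{
    return ((uintptr_t)entry & SKB_TAG) != 0;
}

/* Consumer side: clear the tag before using the pointer. */
static struct fake_skb *untag_skb(void *entry)
{
    return (struct fake_skb *)((uintptr_t)entry & ~SKB_TAG);
}
```

An untagged entry is treated as a frame pointer and used as-is, so the existing frame path needs no change in this model.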

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Jerome Marchand 9d29c832f5 net: core: Split out code to run generic XDP prog
Bugzilla: http://bugzilla.redhat.com/2041365

commit fe21cb91ae7bca1ae7805454be80b6d03bec85f7
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:21 2021 +0530

    net: core: Split out code to run generic XDP prog

    This helper can later be utilized in code that runs cpumap and devmap
    programs in generic redirect mode and adjust skb based on changes made
    to xdp_buff.

    When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
    generic redirect path invokes devmap/cpumap prog if set, it must
    __skb_pull again as we expect mac header to be pulled.

    It also drops the skb_reset_mac_len call after do_xdp_generic, as the
    mac_header and network_header are advanced by the same offset, so the
    difference (mac_len) remains constant.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Ivan Vecera 85520fc44a net: annotate accesses to dev->gso_max_segs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073465

Conflicts:
- small context conflicts in octeontx2 driver

commit 6d872df3e3b91532b142de9044e5b4984017a55f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 19 07:43:32 2021 -0800

    net: annotate accesses to dev->gso_max_segs

    dev->gso_max_segs is written under RTNL protection, or when the device is
    not yet visible, but is read locklessly.

    Add netif_set_gso_max_segs() helper.

    Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
    where we can to better document what is going on.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
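
A rough userspace sketch of the annotated-access pattern named in the message above. `MODEL_READ_ONCE()`, `MODEL_WRITE_ONCE()`, `struct model_net_device` and both helpers are hypothetical and only approximate the kernel macros with a volatile cast; `__typeof__` is a GCC/Clang extension and is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force a single access through a volatile lvalue. */
#define MODEL_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct model_net_device {
    unsigned int gso_max_segs; /* written under a lock, read locklessly */
};

/* Writer helper: the single place that documents the annotated store. */
static void model_set_gso_max_segs(struct model_net_device *dev,
                                   unsigned int segs)
{
    assert(dev != NULL);
    MODEL_WRITE_ONCE(dev->gso_max_segs, segs);
}

/* Lockless reader: pairs with the annotated store above. */
static unsigned int model_get_gso_max_segs(struct model_net_device *dev)
{
    assert(dev != NULL);
    return MODEL_READ_ONCE(dev->gso_max_segs);
}
```

Routing every store through one helper keeps the pairing visible at a glance, which is the documentation benefit the message mentions.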

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-04-08 16:46:08 +02:00
Herton R. Krzesinski 90182f8b73 Merge: ovs: backports P2 for 9.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/431

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048
Tested: Sanity only

A bit large for a P2 backport; but those patches are needed and were
requested by members of the OVS team.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-15 22:45:52 +00:00
Herton R. Krzesinski 4f893751ba Merge: net: introduce kfree_skb_reason
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/405

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931
Tested: Instructions in bz

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-26 22:28:46 +00:00
Herton R. Krzesinski adc4082e23 Merge: CNB: net: Remove redundant if statements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/328

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Series moving dev NULL check into dev_put()/dev_hold()

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-26 22:11:25 +00:00
Antoine Tenart b5e24650b7 net/sched: Extend qdisc control block with tc control block
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048
Upstream Status: linux.git
Tested: Sanity only

commit ec624fe740b416fb68d536b37fb8eef46f90b5c2
Author: Paul Blakey <paulb@nvidia.com>
Date:   Tue Dec 14 19:24:33 2021 +0200

    net/sched: Extend qdisc control block with tc control block

    BPF layer extends the qdisc control block via struct bpf_skb_data_end
    and because of that there is no more room to add variables to the
    qdisc layer control block without going over the skb->cb size.

    Extend the qdisc control block with a tc control block,
    and move all tc related variables to there as a pre-step for
    extending the tc control block with additional members.

    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-26 16:54:01 +01:00
Antoine Tenart 4a0269b225 net: skb: introduce kfree_skb_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931
Upstream Status: linux.git
Tested: Instructions in bz

commit c504e5c2f9648a1e5c2be01e8c3f59d394192bd3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Jan 9 14:36:26 2022 +0800

    net: skb: introduce kfree_skb_reason()

    Introduce the interface kfree_skb_reason(), which is able to pass
    the reason why the skb is dropped to 'kfree_skb' tracepoint.

    Add the 'reason' field to 'trace_kfree_skb', so that users can get
    more detailed information about abnormal skbs with 'drop_monitor' or
    eBPF.

    All drop reasons are defined in the enum 'skb_drop_reason', and
    they will be printed as strings in the 'kfree_skb' tracepoint in the
    format 'reason: XXX'.

    ( Maybe the reasons should be defined in a uapi header file, so that
    user space can use them? )

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
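
The enum-to-string reporting described in the message above might look roughly like the following userspace sketch. The enum members, `model_reason_str()` and `model_format_reason()` are hypothetical; they are not the kernel's drop reasons or tracepoint code, and only the 'reason: XXX' output format is taken from the message.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical drop reasons; the last member only counts the entries. */
enum model_drop_reason {
    MODEL_DROP_NOT_SPECIFIED,
    MODEL_DROP_NO_SOCKET,
    MODEL_DROP_MAX,
};

/* Map a reason to its printable name. */
static const char *model_reason_str(enum model_drop_reason r)
{
    static const char *const names[MODEL_DROP_MAX] = {
        "NOT_SPECIFIED",
        "NO_SOCKET",
    };

    assert(r < MODEL_DROP_MAX);
    return names[r];
}

/* Format the reason the way the message describes: "reason: XXX". */
static int model_format_reason(char *buf, size_t len,
                               enum model_drop_reason r)
{
    return snprintf(buf, len, "reason: %s", model_reason_str(r));
}
```

Keeping the names in a table indexed by the enum means a new reason needs one enum member and one string, with the counting member guarding against out-of-range values.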

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 10:05:00 +01:00
Herton R. Krzesinski b8f20958b7 Merge: net: core stable backport for rhel 9.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/212

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

This includes a few critical bugfixes for the core network stack.

Notably it includes 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly") and a whole series of pre-requisites. The bug addressed there is nasty and present even prior to skb_expand_head() introduction.

commit 719c57197010 ("net: make napi_disable() symmetric with enable") instead has been explicitly excluded, as it's not really a fix, is known to introduce problems and it's still quite new

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-14 16:53:21 +00:00
Herton R. Krzesinski 911d813798 Merge: net/sched: 9.0 P1 backports from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/197

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552  
Upstream Status: all mainline in net.git  
Conflicts: None  
Tested: boot-tested only  

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-12 15:43:04 +00:00
Petr Oros ea6b084bc4 net: Remove redundant if statements
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Upstream commit(s):
commit 1160dfa178eb848327e9dec39960a735f4dc1685
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Thu Aug 5 19:55:27 2021 +0800

    net: Remove redundant if statements

    The 'if (dev)' check has already been moved into dev_{put,hold}(), so
    remove the redundant if statements.

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>
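
A small userspace sketch of the pattern this cleanup relies on, with hypothetical names (`struct model_dev`, `model_dev_hold()`, `model_dev_put()`, `model_release()`) rather than the kernel's: once the helpers tolerate a NULL argument themselves, callers can drop their own check.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev {
    int refcnt; /* simplified reference count */
};

/* The NULL check lives in the helper, not in every caller. */
static void model_dev_hold(struct model_dev *dev)
{
    if (dev)
        dev->refcnt++;
}

static void model_dev_put(struct model_dev *dev)
{
    if (dev) {
        assert(dev->refcnt > 0);
        dev->refcnt--;
    }
}

/* A caller no longer needs "if (*slot)" around the put. */
static void model_release(struct model_dev **slot)
{
    model_dev_put(*slot);
    *slot = NULL;
}
```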

Signed-off-by: Petr Oros <poros@redhat.com>
2022-01-10 16:20:08 +01:00
Herton R. Krzesinski adc818bf26 Merge: Replace deprecated CPU-hotplug functions for kernel-rt
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/134
Bugzilla: http://bugzilla.redhat.com/2023079

Depends: https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/99

The kernel-rt variant requires these changes in order to make future
changes to the RHEL9 kernel.  These changes were found by code inspection
and affect not only kernel-rt but the regular kernel variants as well.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: John W. Linville <linville@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Vladis Dronov <vdronov@redhat.com>
RH-Acked-by: Jiri Benc <jbenc@redhat.com>
RH-Acked-by: Jarod Wilson <jarod@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-10 11:46:27 -03:00
Paolo Abeni d27bdebcab sk_buff: avoid potentially clearing 'slow_gro' field
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit a432934a30679c0e3c47b87f13e4901bc1a3fc03
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Jul 30 18:30:53 2021 +0200

    sk_buff: avoid potentially clearing 'slow_gro' field

    If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro'
    field is cleared, too. That could lead to wrong behavior if
    the skb later enters the GRO stage.

    Fix the potential issue by preserving a non-zero value of
    the 'slow_gro' field.
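
The idea can be sketched in user space as follows. `struct fake_skb` and both setter functions are hypothetical simplifications for illustration, not the kernel's `struct sk_buff` or `skb_dst_set_noref()`; the "buggy" variant is an assumed reconstruction of the behavior the message describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sk_buff. */
struct fake_skb {
    const void *dst;   /* non-refcounted dst pointer, may be NULL */
    bool slow_gro;     /* set when GRO must take the slow path */
};

/* Buggy variant: a NULL dst overwrites (clears) an already-set flag. */
static void set_dst_noref_buggy(struct fake_skb *skb, const void *dst)
{
    skb->dst = dst;
    skb->slow_gro = (dst != NULL);
}

/* Fixed variant: OR the new condition in, so a non-zero flag survives. */
static void set_dst_noref_fixed(struct fake_skb *skb, const void *dst)
{
    skb->dst = dst;
    skb->slow_gro |= (dst != NULL);
}
```

The fixed variant can only set the flag, never clear it, which matches "preserving a non-zero value".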

    Additionally, fix a comment typo.

    Reported-by: Sabrina Dubroca <sd@queasysnail.net>
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:58:21 +01:00
Paolo Abeni 2bea014388 skbuff: allow 'slow_gro' for skb carring sock reference
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 5e10da5385d20c4bae587bc2921e5fdd9655d5fc
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:03 2021 +0200

    skbuff: allow 'slow_gro' for skb carring sock reference

    This change leverages the infrastructure introduced by the previous
    patches to allow soft devices to pass owned skbs to the GRO engine
    without impacting the fast-path.

    It is up to the GRO caller to ensure the slow_gro bit is valid before
    invoking the GRO engine. The new helper skb_prepare_for_gro() is
    introduced for that purpose.

    On slow_gro, skbs are aggregated only with equal sk.
    Additionally, skb truesize on GRO recycle and free is correctly
    updated so that sk wmem is not changed by the GRO processing.

    rfc-> v1:
     - fixed bad truesize on dev_gro_receive NAPI_FREE
     - use the existing state bit

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:52 +01:00
Paolo Abeni 9ce6ef4e71 net: optimize GRO for the common case.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 9efb4b5baf6ce851b247288992b0632cb4d31c17
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:02 2021 +0200

    net: optimize GRO for the common case.

    After the previous patches, at GRO time, skb->slow_gro is
    usually 0, unless the packet comes from some H/W offload
    slowpath or tunnel.

    We can optimize the GRO code assuming !skb->slow_gro is likely.
    This removes multiple conditionals in the most common path, at the
    price of an additional one when we hit the above "slow-paths".

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:26 +01:00
Prarit Bhargava 286c7df21b net: Replace deprecated CPU-hotplug functions.
Bugzilla: http://bugzilla.redhat.com/2023079

commit 372bbdd5bb3fc454d9c280dc0914486a3c7419d5
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Aug 3 16:16:06 2021 +0200

    net: Replace deprecated CPU-hotplug functions.

    The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2021-12-09 09:04:08 -05:00
Davide Caratti bee2c235ef net/sched: store the last executed chain also for clsact egress
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552
Upstream Status: net-next.git commit 3aa260559455

commit 3aa2605594556c676fb88744bd9845acae60683d
Author: Davide Caratti <dcaratti@redhat.com>
Date:   Wed Jul 28 20:08:00 2021 +0200

    net/sched: store the last executed chain also for clsact egress

    Currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain
    id' in the skb extension. However, userspace programs (like ovs) are able
    to set up egress rules, and the datapath gets confused if it doesn't find
    the 'chain id' for a packet that's "recirculated" by tc.
    Change tcf_classify() to have the same semantics as tcf_classify_ingress()
    so that a single function can be called in ingress / egress, using the tc
    ingress / egress block respectively.

    Suggested-by: Alaa Hleilel <alaa@nvidia.com>
    Signed-off-by: Davide Caratti <dcaratti@redhat.com>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2021-12-09 12:01:45 +01:00
Paolo Abeni 96d14cbcf2 net: Prevent infinite while loop in skb_tx_hash()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 0c57eeecc559ca6bc18b8c4e2808bc78dbe769b0
Author: Michael Chan <michael.chan@broadcom.com>
Date:   Mon Oct 25 05:05:28 2021 -0400

    net: Prevent infinite while loop in skb_tx_hash()

    Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
    to set the queue count and offset for each TC.  So the queue count
    and offset for the TCs may be zero for a short period after dev->num_tc
    has been set.  If a TX packet is being transmitted at this time in the
    code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
    nonzero dev->num_tc but zero qcount for the TC.  The while loop that
    keeps looping while hash >= qcount will not end.

    Fix it by checking the TC's qcount to be nonzero before using it.
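
A user-space sketch of the queue-selection arithmetic, to show why a zero qcount hangs the loop and how a guard avoids it. `pick_queue()` and its parameters are hypothetical; falling back to the device-wide queue count is an assumption of this sketch, not something the message above spells out.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the queue-selection arithmetic only.
 * qoffset/qcount describe the traffic class; real_num_tx_queues is the
 * device-wide fallback assumed here when the class has no queues yet.
 */
static uint16_t pick_queue(uint32_t hash, uint16_t qoffset, uint16_t qcount,
                           uint16_t real_num_tx_queues)
{
    if (qcount == 0) {
        /* Without this guard the loop below would never terminate. */
        qoffset = 0;
        qcount = real_num_tx_queues;
    }
    if (qcount == 0)
        return 0; /* defensive: nothing sensible to pick from */

    if (hash >= qoffset)
        hash -= qoffset;
    while (hash >= qcount)
        hash -= qcount;

    return (uint16_t)(hash + qoffset);
}
```

For example, `pick_queue(10, 4, 0, 4)` takes the fallback path and reduces the hash modulo 4 instead of spinning forever.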

    Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
    Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
Paolo Abeni a1950c1dcf napi: fix race inside napi_enable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 3765996e4f0b8a755cab215a08df744490c76052
Author: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date:   Sat Sep 18 16:52:32 2021 +0800

    napi: fix race inside napi_enable

    The sequence below leaves NAPI_STATE_SCHED set in napi.state while the
    napi is not on the poll_list, which causes napi_disable() to get stuck.

    The prefix "NAPI_STATE_" is removed in the figure below, and
    NAPI_STATE_HASHED is ignored in napi.state.

                          CPU0       |                   CPU1       | napi.state
    ===============================================================================
    napi_disable()                   |                              | SCHED | NPSVC
    napi_enable()                    |                              |
    {                                |                              |
        smp_mb__before_atomic();     |                              |
        clear_bit(SCHED, &n->state); |                              | NPSVC
                                     | napi_schedule_prep()         | SCHED | NPSVC
                                     | napi_poll()                  |
                                     |   napi_complete_done()       |
                                     |   {                          |
                                     |      if (n->state & (NPSVC | | (1)
                                     |               _BUSY_POLL)))  |
                                     |           return false;      |
                                     |     ................         |
                                     |   }                          | SCHED | NPSVC
                                     |                              |
        clear_bit(NPSVC, &n->state); |                              | SCHED
    }                                |                              |
                                     |                              |
    napi_schedule_prep()             |                              | SCHED | MISSED (2)

    (1) Returns immediately because NAPI_STATE_NPSVC is set.
    (2) NAPI_STATE_SCHED is already set, so napi.poll_list is not added to
        sd->poll_list.

    Since NAPI_STATE_SCHED is already set and the napi is not on the
    sd->poll_list queue, NAPI_STATE_SCHED can never be cleared.

    1. This queue will no longer receive packets.
    2. If napi_disable() is called under the protection of rtnl_lock, it
       blocks while holding rtnl_lock, affecting the overall
       system.

    This patch uses cmpxchg to implement napi_enable(), which ensures that
    there is no race caused by clearing the two bits separately.
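
The compare-and-exchange approach can be sketched in user space with C11 atomics. The bit values and `enable_model()` are hypothetical and only model the idea of clearing both bits in one atomic step; the kernel uses its own `cmpxchg()` and NAPI state definitions.

```c
#include <stdatomic.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define ST_SCHED (1UL << 0)
#define ST_NPSVC (1UL << 1)

/* Clear both bits in one atomic step so no observer sees only one cleared. */
static void enable_model(_Atomic unsigned long *state)
{
    unsigned long old = atomic_load(state);
    unsigned long new_val;

    do {
        new_val = old & ~(ST_SCHED | ST_NPSVC);
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(state, &old, new_val));
}
```

Because the update is a single compare-and-exchange, there is no intermediate state in which SCHED is clear but NPSVC is still set, which is the window shown in the figure above.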

    Fixes: 2d8bff1269 ("netpoll: Close race condition between poll_one_napi and napi_disable")
    Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
    Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
David S. Miller 20192d9c9f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Andrii Nakryiko says:

====================
pull-request: bpf 2021-07-15

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 37 insertions(+), 15 deletions(-).

The main changes are:

1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and
   BPF_XDP_CPUMAP programs, from Xuan Zhuo.

2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo.

3) Follow-up fix to subprog poke descriptor use-after-free problem, from
   Daniel Borkmann and John Fastabend.

4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King.

5) Fix memory leak in BPF sockmap, from John Fastabend.

6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend
   and Jakub Sitnicki.

7) Fix NULL pointer dereference in bpftool, from Tobias Klauser.

8) AF_XDP documentation fixes, from Baruch Siach.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 14:39:45 -07:00
Qitao Xu 70713dddf3 net_sched: introduce tracepoint trace_qdisc_enqueue()
Tracepoint trace_qdisc_enqueue() is introduced to trace skb at
the entrance of TC layer on TX side. This is similar to
trace_qdisc_dequeue():

1. For both we only trace successful cases. The failure cases
   can be traced via trace_kfree_skb().

2. They are called at entrance or exit of TC layer, not for each
   ->enqueue() or ->dequeue(). This is intentional, because
   we want to make trace_qdisc_enqueue() symmetric to
   trace_qdisc_dequeue(), which is easier to use.

The return value of qdisc_enqueue() is not interesting here:
some Qdiscs drop packets in ->dequeue(), and those drops cannot be
traced even with the return value. The only way to trace them is
by tracing kfree_skb().

We only add the information we need to the trace ring buffer. If any other
information is needed, it is easy to extend it without breaking ABI,
see commit 3dd344ea84 ("net: tracepoint: exposing sk_family in all
tcp:tracepoints").

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 10:32:38 -07:00
Xuan Zhuo 5acc7d3e8d xdp, net: Fix use-after-free in bpf_xdp_link_release
The problem occurs when dev_xdp_uninstall() is called between
dev_get_by_index() and dev_xdp_attach_link(). In that case the xdp link
is not detached automatically when dev is released. Since link->dev
already points to dev, releasing the xdp link later accesses dev after
it has been released.

dev_get_by_index()        |
link->dev = dev           |
                          |      rtnl_lock()
                          |      unregister_netdevice_many()
                          |          dev_xdp_uninstall()
                          |      rtnl_unlock()
rtnl_lock();              |
dev_xdp_attach_link()     |
rtnl_unlock();            |
                          |      netdev_run_todo() // dev released
bpf_xdp_link_release()    |
    /* access dev.        |
       use-after-free */  |

[   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
[   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
[   45.968297]
[   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
[   45.969222] Hardware name: linux,dummy-virt (DT)
[   45.969795] Call trace:
[   45.970106]  dump_backtrace+0x0/0x4c8
[   45.970564]  show_stack+0x30/0x40
[   45.970981]  dump_stack_lvl+0x120/0x18c
[   45.971470]  print_address_description.constprop.0+0x74/0x30c
[   45.972182]  kasan_report+0x1e8/0x200
[   45.972659]  __asan_report_load8_noabort+0x2c/0x50
[   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.973834]  bpf_link_free+0xd0/0x188
[   45.974315]  bpf_link_put+0x1d0/0x218
[   45.974790]  bpf_link_release+0x3c/0x58
[   45.975291]  __fput+0x20c/0x7e8
[   45.975706]  ____fput+0x24/0x30
[   45.976117]  task_work_run+0x104/0x258
[   45.976609]  do_notify_resume+0x894/0xaf8
[   45.977121]  work_pending+0xc/0x328
[   45.977575]
[   45.977775] The buggy address belongs to the page:
[   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
[   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
[   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
[   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   45.982259] page dumped because: kasan: bad access detected
[   45.982948]
[   45.983153] Memory state around the buggy address:
[   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.986419]                                               ^
[   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988895] ==================================================================
[   45.989773] Disabling lock debugging due to kernel taint
[   45.990552] Kernel panic - not syncing: panic_on_warn set ...
[   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
[   45.991929] Hardware name: linux,dummy-virt (DT)
[   45.992448] Call trace:
[   45.992753]  dump_backtrace+0x0/0x4c8
[   45.993208]  show_stack+0x30/0x40
[   45.993627]  dump_stack_lvl+0x120/0x18c
[   45.994113]  dump_stack+0x1c/0x34
[   45.994530]  panic+0x3a4/0x7d8
[   45.994930]  end_report+0x194/0x198
[   45.995380]  kasan_report+0x134/0x200
[   45.995850]  __asan_report_load8_noabort+0x2c/0x50
[   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.997007]  bpf_link_free+0xd0/0x188
[   45.997474]  bpf_link_put+0x1d0/0x218
[   45.997942]  bpf_link_release+0x3c/0x58
[   45.998429]  __fput+0x20c/0x7e8
[   45.998833]  ____fput+0x24/0x30
[   45.999247]  task_work_run+0x104/0x258
[   45.999731]  do_notify_resume+0x894/0xaf8
[   46.000236]  work_pending+0xc/0x328
[   46.000697] SMP: stopping secondary CPUs
[   46.001226] Dumping ftrace buffer:
[   46.001663]    (ftrace buffer empty)
[   46.002110] Kernel Offset: disabled
[   46.002545] CPU features: 0x00000001,23202c00
[   46.003080] Memory Limit: none

Fixes: aa8d3a716b ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
2021-07-13 08:22:31 -07:00
Antoine Tenart 28b34f01a7 net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache
Some socket buffers allocated in the fclone cache (in __alloc_skb) can
end up in the following path[1]:

napi_skb_finish
  __kfree_skb_defer
    napi_skb_cache_put

The issue is napi_skb_cache_put is not fclone friendly and will put
those skbuff in the skb cache to be reused later, although this cache
only expects skbuff allocated from skbuff_head_cache. When this happens
the skbuff is eventually freed using the wrong origin cache, and we can
see traces similar to:

[ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
[ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
[ 1223.950211] Modules linked in:
[ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
[ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
[ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0

This sometimes leads to other memory-related issues.

Fix this by using __kfree_skb for fclone skbuff, similar to what is done
in the other place __kfree_skb_defer is called.

[1] At least in setups using veth pairs and tunnels. Building a kernel
    with KASAN we can for example see packets allocated in
    sk_stream_alloc_skb hit the above path and later the issue arises
    when the skbuff is reused.

Fixes: 9243adfc31 ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
Cc: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 11:26:27 -07:00
Florian Fainelli 9615fe36b3 skbuff: Fix build with SKB extensions disabled
We will fail to build with CONFIG_SKB_EXTENSIONS disabled after
8550ff8d8c ("skbuff: Release nfct refcount on napi stolen or re-used
skbs") since there is an unconditional use of skb_ext_find() without
an appropriate stub. Simply build the code conditionally and properly
guard against both CONFIG_SKB_EXTENSIONS as well as
CONFIG_NET_TC_SKB_EXT being disabled.

Fixes: 8550ff8d8c ("skbuff: Release nfct refcount on napi stolen or re-used skbs")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08 00:07:14 -07:00
Paul Blakey 8550ff8d8c skbuff: Release nfct refcount on napi stolen or re-used skbs
When multiple SKBs are merged to a new skb under napi GRO,
or SKB is re-used by napi, if nfct was set for them in the
driver, it will not be released while freeing their stolen
head state or on re-use.

Release nfct on napi's stolen or re-used SKBs, and
in gro_list_prepare, check conntrack metadata diff.

Fixes: 5c6b946047 ("net/mlx5e: CT: Handle misses after executing CT action")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:26:29 -07:00
Linus Torvalds dbe69e4337 Networking changes for 5.14.
Core:
 
  - BPF:
    - add syscall program type and libbpf support for generating
      instructions and bindings for in-kernel BPF loaders (BPF loaders
      for BPF), this is a stepping stone for signed BPF programs
    - infrastructure to migrate TCP child sockets from one listener
      to another in the same reuseport group/map to improve flexibility
      of service hand-off/restart
    - add broadcast support to XDP redirect
 
  - allow bypass of the lockless qdisc to improve performance
    (for pktgen: +23% with one thread, +44% with 2 threads)
 
  - add a simpler version of "DO_ONCE()" which does not require
    jump labels, intended for slow-path usage
 
  - virtio/vsock: introduce SOCK_SEQPACKET support
 
  - add getsockopt to retrieve netns cookie
 
  - ip: treat lowest address of an IPv4 subnet as ordinary unicast address
        allowing reclaiming of precious IPv4 addresses
 
  - ipv6: use prandom_u32() for ID generation
 
  - ip: add support for more flexible field selection for hashing
        across multi-path routes (w/ offload to mlxsw)
 
  - icmp: add support for extended RFC 8335 PROBE (ping)
 
  - seg6: add support for SRv6 End.DT46 behavior
 
  - mptcp:
     - DSS checksum support (RFC 8684) to detect middlebox meddling
     - support Connection-time 'C' flag
     - time stamping support
 
  - sctp: packetization Layer Path MTU Discovery (RFC 8899)
 
  - xfrm: speed up state addition with seq set
 
  - WiFi:
     - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
     - aggregation handling improvements for some drivers
     - minstrel improvements for no-ack frames
     - deferred rate control for TXQs to improve reaction times
     - switch from round robin to virtual time-based airtime scheduler
 
  - add trace points:
     - tcp checksum errors
     - openvswitch - action execution, upcalls
     - socket errors via sk_error_report
 
 Device APIs:
 
  - devlink: add rate API for hierarchical control of max egress rate
             of virtual devices (VFs, SFs etc.)
 
  - don't require RCU read lock to be held around BPF hooks
    in NAPI context
 
  - page_pool: generic buffer recycling
 
 New hardware/drivers:
 
  - mobile:
     - iosm: PCIe Driver for Intel M.2 Modem
     - support for Qualcomm MSM8998 (ipa)
 
  - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
 
  - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
 
  - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
 
  - NXP SJA1110 Automotive Ethernet 10-port switch
 
  - Qualcomm QCA8327 switch support (qca8k)
 
  - Mikrotik 10/25G NIC (atl1c)
 
 Driver changes:
 
  - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
    (our first foray into MAC/PHY description via ACPI)
 
  - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
 
  - Mellanox/Nvidia NIC (mlx5)
    - NIC VF offload of L2 bridging
    - support IRQ distribution to Sub-functions
 
  - Marvell (prestera):
     - add flower and match all
     - devlink trap
     - link aggregation
 
  - Netronome (nfp): connection tracking offload
 
  - Intel 1GE (igc): add AF_XDP support
 
  - Marvell DPU (octeontx2): ingress ratelimit offload
 
  - Google vNIC (gve): new ring/descriptor format support
 
  - Qualcomm mobile (rmnet & ipa): inline checksum offload support
 
  - MediaTek WiFi (mt76)
     - mt7915 MSI support
     - mt7915 Tx status reporting
     - mt7915 thermal sensors support
     - mt7921 decapsulation offload
     - mt7921 enable runtime pm and deep sleep
 
  - Realtek WiFi (rtw88)
     - beacon filter support
     - Tx antenna path diversity support
     - firmware crash information via devcoredump
 
  - Qualcomm 60GHz WiFi (wcn36xx)
     - Wake-on-WLAN support with magic packets and GTK rekeying
 
  - Micrel PHY (ksz886x/ksz8081): add cable test support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDb+fUACgkQMUZtbf5S
 Irs2Jg//aqN0Q8CgIvYCVhPxQw1tY7pTAbgyqgBZ01vwjyvtIOgJiWzSfFEU84mX
 M8fcpFX5eTKrOyJ9S6UFfQ/JG114n3hjAxFFT4Hxk2gC1Tg0vHuFQTDHcUl28bUE
 mTm61e1YpdorILnv2k5JVQ/wu0vs5QKDrjcYcrcPnh+j93wvnPOgAfDBV95nZzjS
 OTt4q2fR8GzLcSYWWsclMbDNkzyTG50RW/0Yd6aGjr5QGvXfrMeXfUJNz533PMf/
 w5lNyjRKv+x9mdTZJzU0+msNUrZgUdRz7W8Ey8lD3hJZRE+D6/uU7FtsE8Mi3+uc
 HWxeZUyzA3YF1MfVl/eesbxyPT7S/OkLzk4O5B35FbqP0YltaP+bOjq1/nM3ce1/
 io9Dx9pIl/2JANUgRCAtLi8Z2dkvRoqTaBxZ/nPudCCljFwDwl6joTMJ7Ow22i5Y
 5aIkcXFmZq4LbJDiHvbTlqT7yiuaEvu2UK/23bSIg/K3nF4eAmkY9Y1EgiMf60OF
 78Ttw0wk2tUegwaS5MZnCniKBKDyl9gM2F6rbZ/IxQRR2LTXFc1B6gC+ynUxgXfh
 Ub8O++6qGYGYZ0XvQH4pzco79p3qQWBTK5beIp2eu6BOAjBVIXq4AibUfoQLACsu
 hX7jMPYd0kc3WFgUnKgQP8EnjFSwbf4XiaE7fIXvWBY8hzCw2h4=
 =LvtX
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - BPF:
      - add syscall program type and libbpf support for generating
        instructions and bindings for in-kernel BPF loaders (BPF loaders
        for BPF), this is a stepping stone for signed BPF programs
      - infrastructure to migrate TCP child sockets from one listener to
        another in the same reuseport group/map to improve flexibility
        of service hand-off/restart
      - add broadcast support to XDP redirect

   - allow bypass of the lockless qdisc to improve performance (for
     pktgen: +23% with one thread, +44% with 2 threads)

   - add a simpler version of "DO_ONCE()" which does not require jump
     labels, intended for slow-path usage

   - virtio/vsock: introduce SOCK_SEQPACKET support

   - add getsockopt to retrieve netns cookie

   - ip: treat lowest address of an IPv4 subnet as ordinary unicast
     address allowing reclaiming of precious IPv4 addresses

   - ipv6: use prandom_u32() for ID generation

   - ip: add support for more flexible field selection for hashing
     across multi-path routes (w/ offload to mlxsw)

   - icmp: add support for extended RFC 8335 PROBE (ping)

   - seg6: add support for SRv6 End.DT46 behavior

   - mptcp:
      - DSS checksum support (RFC 8684) to detect middlebox meddling
      - support Connection-time 'C' flag
      - time stamping support

   - sctp: packetization Layer Path MTU Discovery (RFC 8899)

   - xfrm: speed up state addition with seq set

   - WiFi:
      - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
      - aggregation handling improvements for some drivers
      - minstrel improvements for no-ack frames
      - deferred rate control for TXQs to improve reaction times
      - switch from round robin to virtual time-based airtime scheduler

   - add trace points:
      - tcp checksum errors
      - openvswitch - action execution, upcalls
      - socket errors via sk_error_report

  Device APIs:

   - devlink: add rate API for hierarchical control of max egress rate
     of virtual devices (VFs, SFs etc.)

   - don't require RCU read lock to be held around BPF hooks in NAPI
     context

   - page_pool: generic buffer recycling

  New hardware/drivers:

   - mobile:
      - iosm: PCIe Driver for Intel M.2 Modem
      - support for Qualcomm MSM8998 (ipa)

   - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices

   - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches

   - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)

   - NXP SJA1110 Automotive Ethernet 10-port switch

   - Qualcomm QCA8327 switch support (qca8k)

   - Mikrotik 10/25G NIC (atl1c)

  Driver changes:

   - ACPI support for some MDIO, MAC and PHY devices from Marvell and
     NXP (our first foray into MAC/PHY description via ACPI)

   - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx

   - Mellanox/Nvidia NIC (mlx5)
      - NIC VF offload of L2 bridging
      - support IRQ distribution to Sub-functions

   - Marvell (prestera):
      - add flower and match all
      - devlink trap
      - link aggregation

   - Netronome (nfp): connection tracking offload

   - Intel 1GE (igc): add AF_XDP support

   - Marvell DPU (octeontx2): ingress ratelimit offload

   - Google vNIC (gve): new ring/descriptor format support

   - Qualcomm mobile (rmnet & ipa): inline checksum offload support

   - MediaTek WiFi (mt76)
      - mt7915 MSI support
      - mt7915 Tx status reporting
      - mt7915 thermal sensors support
      - mt7921 decapsulation offload
      - mt7921 enable runtime pm and deep sleep

   - Realtek WiFi (rtw88)
      - beacon filter support
      - Tx antenna path diversity support
      - firmware crash information via devcoredump

   - Qualcomm WiFi (wcn36xx)
      - Wake-on-WLAN support with magic packets and GTK rekeying

   - Micrel PHY (ksz886x/ksz8081): add cable test support"

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...
2021-06-30 15:51:09 -07:00
Jakub Kicinski b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Tanner Love 127d7355ab net: update netdev_rx_csum_fault() print dump only once
Printing this stack dump multiple times does not provide additional
useful information, and consumes time in the data path. Printing once
is sufficient.
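
A user-space sketch of a "print only once" guard, using a C11 `atomic_flag`. The names `first_time()`, `report_csum_fault()` and `dumps_printed` are hypothetical; the message above does not say which kernel helper the patch uses, so this only illustrates the general technique.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true only for the first caller; later calls return false. */
static bool first_time(atomic_flag *done)
{
    return !atomic_flag_test_and_set(done);
}

static int dumps_printed; /* counter kept only so the sketch is observable */

/* Hypothetical fault handler: the expensive dump is emitted once. */
static void report_csum_fault(const char *ifname)
{
    static atomic_flag dumped = ATOMIC_FLAG_INIT;

    if (first_time(&dumped)) {
        fprintf(stderr, "%s: hw csum failure\n", ifname);
        dumps_printed++;
    }
}
```

Repeated faults after the first one cost a single atomic test-and-set instead of a full dump in the data path.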

Changes
  v2: Format indentation properly

Signed-off-by: Tanner Love <tannerlove@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 15:54:57 -07:00
Yunsheng Lin c4fef01ba4 net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for an empty lockless qdisc, which avoids the enqueuing
and dequeuing operations.

qdisc->empty is not a reliable indication of an empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicates there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out-of-order
problems.

In order to make the MISSED state a reliable indication of an empty
qdisc, we need to ensure that testing and clearing of the MISSED state
is within the protection of qdisc->seqlock; only setting the MISSED
state can be done without the protection of qdisc->seqlock. A MISSED
state test is added without the protection of qdisc->seqlock to
avoid doing an unnecessary spin_trylock() in the contention case.
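
A heavily simplified, single-process model of the bypass decision described above. `struct fake_qdisc`, `try_direct_xmit()` and `end_run()` are hypothetical; the model only captures the order of checks (bypass flag, unlocked MISSED test, try-lock, set MISSED on contention) and omits the re-check and clearing of MISSED under the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a lockless qdisc; not the kernel's struct Qdisc. */
struct fake_qdisc {
    bool can_bypass;      /* models TCQ_F_CAN_BYPASS */
    atomic_bool missed;   /* models the MISSED state bit */
    atomic_flag seqlock;  /* models qdisc->seqlock as a try-lock */
};

/* Returns true when the caller may transmit directly, owning the lock. */
static bool try_direct_xmit(struct fake_qdisc *q)
{
    if (!q->can_bypass)
        return false;

    /* Cheap unlocked test first: skip the try-lock under contention. */
    if (atomic_load(&q->missed))
        return false;

    /* Try-lock: test_and_set returns the previous value. */
    if (atomic_flag_test_and_set(&q->seqlock)) {
        atomic_store(&q->missed, true); /* tell the owner work was missed */
        return false;
    }

    return true;
}

/* Releases the try-lock taken by a successful try_direct_xmit(). */
static void end_run(struct fake_qdisc *q)
{
    atomic_flag_clear(&q->seqlock);
}
```

In this model, once MISSED has been set by a contending sender, later senders fall back to the normal enqueue path until the flag is cleared by the lock owner (not modelled here).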

As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:

      thread1               thread2             thread3
qdisc_run_begin() # true
                        qdisc_run_begin(q)
                             set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                            enqueue skb1
                                             qdisc empty # true
                                          qdisc_run_begin() # true
                                          sch_direct_xmit() # skb2
                         qdisc_run_begin()
                            set(MISSED)

When the above happens, skb1 enqueued by thread2 is transmitted after
skb2 is transmitted by thread3, because MISSED state setting and
enqueuing are not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has a better chance of being transmitted sooner than
skb2.

This patch does not take care of the above data race, because we
view it as similar to the following case:
Even when CPU1 and CPU2 write skbs at the same time to two sockets
that both lead to the same qdisc, there is no guarantee which skb
will hit the qdisc first, because many factors such as
interrupt/softirq/cache miss/scheduling affect that.

The following cases need special handling:
1. When the MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), __qdisc_run() might not be able to
   dequeue all skbs in one round and might call __netif_schedule(),
   which might result in a non-empty qdisc without MISSED set. In
   order to avoid this, the MISSED state is set for lockless qdisc
   and __netif_schedule() will be called at the end of qdisc_run_end.

2. The MISSED state also needs to be set for lockless qdisc instead
   of calling __netif_schedule() directly when requeuing an skb, for
   a similar reason.

3. For the netdev queue stopped case, the MISSED state needs clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calls. So a new DRAINING state
   is added to indicate this case, which also indicates a non-empty
   qdisc.

4. There is already a netif_xmit_frozen_or_stopped() check in
   dequeue_skb() and sch_direct_xmit(), both of which are within the
   protection of qdisc->seqlock, but the same check in
   __dev_xmit_skb() is without that protection, which might make
   the empty indication of a lockless qdisc unreliable. So
   remove the check in __dev_xmit_skb(); the checks within
   the protection of qdisc->seqlock seem enough to avoid the cpu
   consumption problem for the netdev queue stopped case.

1. https://lkml.org/lkml/2021/5/29/215
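
The bypass decision described above can be modelled in userspace roughly
as follows. This is a simplified sketch, not the kernel code: struct
toy_qdisc, the TOY_* bits and all function names are hypothetical, the
seqlock is reduced to a C11 atomic_flag used as a trylock, and the
dequeue path that clears MISSED is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy bit values; the kernel's state bits are defined elsewhere. */
enum { TOY_MISSED = 1u << 0, TOY_DRAINING = 1u << 1 };

struct toy_qdisc {
    atomic_uint state;    /* TOY_MISSED / TOY_DRAINING bits */
    atomic_flag seqlock;  /* stand-in for qdisc->seqlock, used as a trylock */
};

void toy_qdisc_init(struct toy_qdisc *q)
{
    assert(q != NULL);
    atomic_init(&q->state, 0u);
    atomic_flag_clear(&q->seqlock);
}

bool toy_run_begin(struct toy_qdisc *q)
{
    if (!atomic_flag_test_and_set(&q->seqlock))
        return true;                  /* uncontended: we own the qdisc */
    /* Contention is already recorded, so skip a second trylock. */
    if (atomic_load(&q->state) & TOY_MISSED)
        return false;
    atomic_fetch_or(&q->state, TOY_MISSED);
    /* Retry once in case the owner released the lock in the meantime. */
    return !atomic_flag_test_and_set(&q->seqlock);
}

void toy_run_end(struct toy_qdisc *q)
{
    atomic_flag_clear(&q->seqlock);
}

/* True: the caller holds the lock and may transmit directly, then must
 * call toy_run_end(). False: the caller should enqueue instead. */
bool toy_can_bypass(struct toy_qdisc *q)
{
    if (!toy_run_begin(q))
        return false;
    if (atomic_load(&q->state) & (TOY_MISSED | TOY_DRAINING)) {
        toy_run_end(q);   /* possibly non-empty: do not jump the queue */
        return false;
    }
    return true;
}
```

The point of the sketch is that MISSED and DRAINING are only tested for
the bypass decision after the lock is held, while setting MISSED happens
without the lock.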

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
Sebastian Andrzej Siewior 2b4cd14fd9 net/netif_receive_skb_core: Use migrate_disable()
The preempt disable around do_xdp_generic() has been introduced in
commit
   bbbe211c29 ("net: rcu lock and preempt disable missing around generic xdp")

For BPF it is enough to use migrate_disable() and the code was updated
as it can be seen in commit
   3c58482a38 ("bpf: Provide bpf_prog_run_pin_on_cpu() helper")

This is a leftover which was not converted.

Use migrate_disable() before invoking do_xdp_generic().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21 12:08:02 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Jakub Kicinski 5ada57a9a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 09:55:10 -07:00
Yunsheng Lin dcad9ee9e0 net: sched: fix tx action reschedule issue with stopped queue
The netdev queue might be stopped when the byte queue limit has been
reached or the tx hw ring is full. net_tx_action() may still be
rescheduled if STATE_MISSED is set, which consumes unnecessary
cpu without dequeuing and transmitting any skb because the
netdev queue is stopped, see qdisc_run_end().

This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if the netdev queue is
stopped during qdisc_run(). net_tx_action() is rescheduled
again when the netdev queue is restarted, see netif_tx_wake_queue().

As there is a time window between the netif_xmit_frozen_or_stopped()
check and STATE_MISSED clearing, during which STATE_MISSED
may be set by net_tx_action() scheduled by netif_tx_wake_queue(),
set STATE_MISSED again if the netdev queue is restarted.

Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:05:46 -07:00
Yunsheng Lin 102b55ee92 net: sched: fix tx action rescheduling issue during deactivation
Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
qdisc before calling __qdisc_run(), which ultimately clears the
STATE_MISSED when all the skbs are dequeued. If STATE_DEACTIVATED
is set before clearing STATE_MISSED, there may be rescheduling
of net_tx_action() at the end of qdisc_run_end(), see below:

CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
          .                   .                     .
          .            set STATE_MISSED             .
          .           __netif_schedule()            .
          .                   .           set STATE_DEACTIVATED
          .                   .                qdisc_reset()
          .                   .                     .
          .<---------------   .              synchronize_net()
clear __QDISC_STATE_SCHED  |  .                     .
          .                |  .                     .
          .                |  .            some_qdisc_is_busy()
          .                |  .               return *false*
          .                |  .                     .
  test STATE_DEACTIVATED   |  .                     .
__qdisc_run() *not* called |  .                     .
          .                |  .                     .
   test STATE_MISSED       |  .                     .
 __netif_schedule()--------|  .                     .
          .                   .                     .
          .                   .                     .

__qdisc_run() is not called by net_tx_action() in CPU0 because
CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(), and
STATE_MISSED is only cleared in __qdisc_run(). __netif_schedule()
is called at the end of qdisc_run_end(), causing the tx action
rescheduling problem.

qdisc_run() called by net_tx_action() runs in the softirq context,
which should have the same semantics as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). And as there is a
synchronize_net() between the STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely
bail out for the deactivated lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skbs not dequeued yet.

So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().

The STATE_DEACTIVATED checking in qdisc_run() is to avoid a race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed86 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for
the deactivated lockless qdisc in net_tx_action() provides better
protection against the race without calling qdisc_run() at all,
remove the STATE_DEACTIVATED checking in qdisc_run().

After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() avoids
    the endless rescheduling problem, but there may still
    be an unnecessary rescheduling, so adjust the commit log.
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:05:46 -07:00
Sebastian Andrzej Siewior 8380c81d5c net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
__napi_schedule_irqoff() is an optimized version of __napi_schedule()
which can be used where it is known that interrupts are disabled,
e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer
callbacks.

On PREEMPT_RT enabled kernels this assumption is not true. Force-
threaded interrupt handlers and spinlocks are not disabling interrupts
and the NAPI hrtimer callback is forced into softirq context which runs
with interrupts enabled as well.

Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole
game so make __napi_schedule_irqoff() invoke __napi_schedule() for
PREEMPT_RT kernels.
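
The fallback described above can be sketched in userspace like this. It
is only a model under stated assumptions: the build option is replaced by
a runtime flag (toy_preempt_rt), the interrupt state by a plain boolean,
and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_preempt_rt;      /* stand-in for the PREEMPT_RT build option */
static bool toy_irqs_disabled;   /* modelled interrupt state of the "CPU" */
static int  toy_scheduled;       /* number of NAPI instances queued */

/* Core enqueue step: only valid while interrupts are disabled. */
static void toy_do_schedule(void)
{
    assert(toy_irqs_disabled);
    toy_scheduled++;
}

/* Safe variant: disables interrupts itself and restores the old state. */
void toy_napi_schedule(void)
{
    bool saved = toy_irqs_disabled;

    toy_irqs_disabled = true;
    toy_do_schedule();
    toy_irqs_disabled = saved;
}

/* "irqoff" variant: trusts the caller, except on a PREEMPT_RT-like build. */
void toy_napi_schedule_irqoff(void)
{
    if (!toy_preempt_rt)
        toy_do_schedule();
    else
        toy_napi_schedule();
}
```

With toy_preempt_rt set, a caller that wrongly assumes interrupts are off
no longer trips the assertion, because the safe variant is used instead.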

The callers of ____napi_schedule() in the networking core have been
audited and are correct on PREEMPT_RT kernels as well.

Reported-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13 13:11:19 -07:00
David S. Miller 6876a18d33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
2021-04-26 12:00:00 -07:00
David S. Miller 5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Martin Willi 22b6034323 net, xdp: Update pkt_type if generic XDP changes unicast MAC
If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.
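
A reduced sketch of the host/otherhost bookkeeping follows. It covers
only the unicast case from this message and assumes the caller already
has the destination MAC from before and after the program ran; the
toy_* names and the two-value enum are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

enum toy_pkt_type { TOY_PACKET_HOST, TOY_PACKET_OTHERHOST };

static bool toy_addr_equal(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, TOY_ETH_ALEN) == 0;
}

/* Recompute the packet type only if the "addressed to this NIC" property
 * changed while the XDP program ran; otherwise keep the current value. */
enum toy_pkt_type toy_fixup_pkt_type(const uint8_t *dest_before,
                                     const uint8_t *dest_after,
                                     const uint8_t *dev_addr,
                                     enum toy_pkt_type current)
{
    assert(dest_before && dest_after && dev_addr);

    bool was_host = toy_addr_equal(dest_before, dev_addr);
    bool is_host  = toy_addr_equal(dest_after, dev_addr);

    if (was_host == is_host)
        return current;
    return is_host ? TOY_PACKET_HOST : TOY_PACKET_OTHERHOST;
}
```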

The use case for this is when a XDP program wants to push received
packets up the stack by rewriting the MAC to the NICs MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
2021-04-22 23:18:02 +02:00
Alexander Lobakin 7ad18ff644 gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check
Commit 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that the napi_gro_frags() logic
calls skb_gro_reset_offset() *before* pulling the Ethernet header
into the skb linear space.
As a result, the introduced check for the frag0 address being aligned
to 4 always fails for it, as the Ethernet header is 14 bytes long
and, with NET_IP_ALIGN, its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.
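
The arithmetic behind the fix can be checked with a small sketch. It
assumes the check boils down to "the address where the IP header starts
must be a multiple of 4 unless the architecture does not care";
toy_frag0_usable and its parameters are hypothetical, with ip_align
standing in for NET_IP_ALIGN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* frag0_addr: start of the first fragment.
 * nhoff:      offset of the IP header inside that fragment (0 if the
 *             fragment starts with the IP header, 14 if an Ethernet
 *             header still precedes it).
 * ip_align:   0 on arches that tolerate unaligned headers. */
bool toy_frag0_usable(uintptr_t frag0_addr, size_t nhoff, unsigned int ip_align)
{
    assert(nhoff <= UINTPTR_MAX - frag0_addr);   /* no address wrap-around */

    if (ip_align == 0)
        return true;   /* the check is a no-op on such arches */
    return ((frag0_addr + nhoff) & 3) == 0;
}
```

For a frame placed 2 bytes past a 4-byte boundary, the check fails with
nhoff = 0 (the Ethernet header start is tested) and passes with
nhoff = 14 (the IP header start is tested), which is the case the @nhoff
argument restores.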

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized
   more effectively (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me

Fixes: 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 16:03:32 -07:00
Jakub Kicinski 8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Eric Dumazet 38ec4944b5 gro: ensure frag0 meets IP header alignment
After commit 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Guenter Roeck reported one failure in his tests using sh architecture.

After much debugging, we have been able to spot silent unaligned accesses
in inet_gro_receive()

The issue at hand is that upper networking stacks assume their header
is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
bytes before the Ethernet header to make that happen.

This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
if the fragment is not properly aligned.

Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN
as 0, this extra check will be a NOP for them.

Note that if frag0 is not used, GRO will call pskb_may_pull()
as many times as needed to pull network and transport headers.

Fixes: 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Fixes: 78a478d0ef ("gro: Inline skb_gro_header and cache frag0 virtual address")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 15:09:31 -07:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Paolo Abeni 27f0ad7169 net: fix hangup on napi_disable for threaded napi
napi_disable() is subject to a hangup when the threaded
mode is enabled and the napi is under heavy traffic.

If the relevant napi has been scheduled and napi_disable()
kicks in before the next napi_thread_wait() completes, so
that the latter quits due to the napi_disable_pending() condition,
the existing code leaves the NAPI_STATE_SCHED bit set and the
napi_disable() loop waiting for that bit will hang.

This patch addresses the issue by dropping the NAPI_STATE_DISABLE
bit test in napi_thread_wait(). The later napi_threaded_poll()
iteration will take care of clearing the NAPI_STATE_SCHED.

This also addresses a related problem reported by Jakub:
before this patch a napi_disable()/napi_enable() pair killed
the napi thread, effectively disabling the threaded mode.
On the patched kernel napi_disable() simply stops scheduling
the relevant thread.

v1 -> v2:
  - let the main napi_threaded_poll() loop clear the SCHED bit

Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 29863d41bb ("net: implement threaded-able napi poll loop support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 12:50:31 -07:00
Andrei Vagin 0854fa82c9 net: remove the new_ifindex argument from dev_change_net_namespace
There is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.

Fixes: eeb85a14ee ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:43:28 -07:00
Andrei Vagin eeb85a14ee net: Allow to specify ifindex when device is moved to another namespace
Currently, we can specify ifindex on link creation. This change allows
to specify ifindex when a device is moved to another network namespace.

Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.

CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they have before C/R.

Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-05 14:49:40 -07:00
Dmitry Vyukov 6c996e1994 net: change netdev_unregister_timeout_secs min value to 1
netdev_unregister_timeout_secs=0 can lead to printing the
"waiting for dev to become free" message every jiffy.
This is too frequent and unnecessary.
Set the min value to 1 second.

Also fix the merge issue introduced by
"net: make unregister netdev warning timeout configurable":
it changed "refcnt != 1" to "refcnt".

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:24:06 -07:00
David S. Miller efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Pablo Neira Ayuso ddb94eafab net: resolve forwarding path from virtual netdevice and HW destination address
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.

For instance, assuming the following topology:

               IP forwarding
              /             \
           br0              eth0
           / \
       eth1  eth2
        .
        .
        .
       ethX
 ab:cd:ef:ab:cd:ef

where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.

For packets going through IP forwarding to br0 whose destination MAC
address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:

	br0 -> eth1

.ndo_fill_forward_path for br0 looks up the destination MAC address in
the FDB to get the bridge port eth1.

This information makes it possible to create a fast path that bypasses
the classic bridge and IP forwarding paths, so packets go directly from
the bridge port eth1 to eth0 (wan interface) and vice versa.

             fast path
      .------------------------.
     /                          \
    |           IP forwarding   |
    |          /             \  \/
    |       br0               eth0
    .       / \
     -> eth1  eth2
        .
        .
        .
       ethX
 ab:cd:ef:ab:cd:ef
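
The walk from br0 down to eth1 can be sketched in userspace as below.
This is an illustrative model only: struct toy_dev, the per-device
callback signature and the flat FDB array are assumptions made for the
example and do not mirror the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

struct toy_dev;

struct toy_fdb_entry {
    uint8_t mac[TOY_ETH_ALEN];
    struct toy_dev *port;
};

struct toy_dev {
    const char *name;
    /* NULL for a real device; otherwise returns the next lower device. */
    struct toy_dev *(*fill_forward_path)(const struct toy_dev *dev,
                                         const uint8_t *daddr);
    const struct toy_fdb_entry *fdb;   /* bridge only */
    size_t fdb_len;
};

/* Bridge callback: map the destination MAC to a bridge port via the FDB. */
struct toy_dev *toy_bridge_fill_path(const struct toy_dev *br,
                                     const uint8_t *daddr)
{
    for (size_t i = 0; i < br->fdb_len; i++)
        if (memcmp(br->fdb[i].mac, daddr, TOY_ETH_ALEN) == 0)
            return br->fdb[i].port;
    return NULL;   /* unknown address: no path */
}

/* Walk down from dev, recording each hop; returns hop count or -1. */
int toy_fill_forward_path(struct toy_dev *dev, const uint8_t *daddr,
                          struct toy_dev **path, int max_hops)
{
    int n = 0;

    assert(dev && daddr && path);
    while (dev->fill_forward_path) {
        if (n == max_hops)
            return -1;
        path[n++] = dev;
        dev = dev->fill_forward_path(dev, daddr);
        if (!dev)
            return -1;
    }
    if (n == max_hops)
        return -1;
    path[n++] = dev;   /* the real device terminates the walk */
    return n;
}
```

For the topology above, a bridge whose FDB maps ab:cd:ef:ab:cd:ef to eth1
yields the two-hop path br0 -> eth1.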

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:38 -07:00
Dmitry Vyukov 5aa3afe107 net: make unregister netdev warning timeout configurable
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While a 10 second wait generally should not happen
under a normal workload in a normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 17:22:50 -07:00
Eric Dumazet add2d73631 net: set initial device refcount to 1
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.

When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

Fix is to change initial (and final) refcount to be 1.
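
Why a counter that starts at 0 is a problem can be seen in a small model.
The sketch below is an assumption-laden simplification: it refuses the
increment from 0 instead of warning and saturating, and struct
toy_refcount and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_refcount { atomic_int refs; };

void toy_refcount_init(struct toy_refcount *r, int n)
{
    assert(r != NULL && n >= 0);
    atomic_init(&r->refs, n);
}

int toy_refcount_read(struct toy_refcount *r)
{
    return atomic_load(&r->refs);
}

/* Take a reference. A count of 0 means "no owner left", so an increment
 * from 0 is treated as a bug and refused in this model. */
bool toy_refcount_inc(struct toy_refcount *r)
{
    int old = atomic_load(&r->refs);

    do {
        if (old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
    return true;
}
```

Starting the count at 1 makes the first toy_refcount_inc() legal, and
"device is free" then corresponds to the count dropping back to 1 rather
than 0.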

Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.

Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 16:57:36 -07:00
Vladimir Oltean 5da9ace340 net: make xps_needed and xps_rxqs_needed static
Since their introduction in commit 04157469b7 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.

This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.

Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:13:55 -07:00
Eric Dumazet 919067cc84 net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1

unregister_netdevice: waiting for sit0 to become free. Usage count = -1

It would be nice if syzbot could trigger a warning at the time
this reference count became negative.

This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.

v2: free_dev label in alloc_netdev_mqs() is moved to avoid
    a compiler warning (-Wunused-label), as reported
    by kernel test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 13:38:46 -07:00
Antoine Tenart 75b2758abc net: NULL the old xps map entries when freeing them
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entries in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 2d05bf0153 net: fix use after free in xps
When setting up a new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function, by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.

This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 132f743b01 net: improve queue removal readability in __netif_set_xps_queue
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 402fbb992e net: add an helper to copy xps maps to the new dev_maps
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with
improved code readability and maintainability.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 044ab86d43 net: move the xps maps to an array
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify the code a lot, removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 6f36158e05 net: remove the xps possible_mask
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 5478fcd0f4 net: embed nr_ids in the xps maps
Embed nr_ids (the number of cpus for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That will help avoid out of
bound memory accesses if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 255c04a87f net: embed num_tc in the xps maps
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.
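
The idea of carrying the dimensions inside the map can be sketched as
follows. The layout is an assumption for illustration (a flat array
indexed by id * num_tc + tc); struct toy_dev_maps and toy_xps_lookup are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev_maps {
    unsigned int nr_ids;   /* CPUs or rx queues when the map was built */
    int num_tc;            /* traffic classes when the map was built */
    const int *queue;      /* nr_ids * num_tc entries */
};

/* Look up the queue for (id, tc) using the dimensions stored in the map
 * itself, so a later change of the device-wide values cannot cause an
 * out of bound access. Returns -1 if the request does not fit the map. */
int toy_xps_lookup(const struct toy_dev_maps *maps, unsigned int id, int tc)
{
    assert(maps != NULL && maps->queue != NULL);

    if (id >= maps->nr_ids || tc < 0 || tc >= maps->num_tc)
        return -1;
    return maps->queue[(size_t)id * (size_t)maps->num_tc + (size_t)tc];
}
```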

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Jiri Bohac 6c015a2256 net: check all name nodes in __dev_alloc_name
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.

Since commit ff92741270 ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names.  __dev_alloc_name() does not take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

    # rmmod dummy 2>/dev/null
    # ip link property add dev lo altname dummy0
    # modprobe dummy numdummies=1
    modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.
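
A userspace sketch of the corrected search is shown below. It is a
simplification under stated assumptions: names are matched as a fixed
prefix followed by decimal digits rather than through a '%d' format,
the number space is capped at TOY_MAX_IDS, and struct toy_netdev and the
toy_* helpers are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_IDS 64

struct toy_netdev {
    const char *name;
    const char **altnames;
    size_t n_alt;
};

/* Returns N if name is prefix followed only by decimal digits forming N
 * (leading zeros are accepted in this sketch), else -1. */
static int toy_parse_index(const char *name, const char *prefix)
{
    size_t plen = strlen(prefix);
    char *end;
    long v;

    if (strncmp(name, prefix, plen) != 0 ||
        !isdigit((unsigned char)name[plen]))
        return -1;
    v = strtol(name + plen, &end, 10);
    if (*end != '\0' || v >= TOY_MAX_IDS)
        return -1;
    return (int)v;
}

static void toy_mark(const char *name, const char *prefix, bool *inuse)
{
    int i = toy_parse_index(name, prefix);

    if (i >= 0)
        inuse[i] = true;
}

/* First free number for prefix, or -1 if all TOY_MAX_IDS are taken. */
int toy_alloc_index(const struct toy_netdev *devs, size_t n, const char *prefix)
{
    bool inuse[TOY_MAX_IDS] = { false };

    assert(devs != NULL && prefix != NULL);
    for (size_t d = 0; d < n; d++) {
        toy_mark(devs[d].name, prefix, inuse);
        /* The fix: alternate names occupy numbers too. */
        for (size_t a = 0; a < devs[d].n_alt; a++)
            toy_mark(devs[d].altnames[a], prefix, inuse);
    }
    for (int i = 0; i < TOY_MAX_IDS; i++)
        if (!inuse[i])
            return i;
    return -1;
}
```

With "lo" carrying the altname "dummy0", the search for prefix "dummy"
skips 0 and returns 1, matching the dummy1 outcome expected above.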

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:40:53 -07:00
Wei Wang cb03835793 net: fix race between napi kthread mode and busy poll
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
____napi_schedule() if the threaded mode is enabled, and gets cleared
in napi_complete_done(), and we only poll the napi in kthread if this
bit is set. This helps distinguish the ownership of the napi between
kthread and other scenarios and fixes the race issue.
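
The ownership rule can be modelled with three state bits. This is a
sketch under assumptions, not the kernel implementation: the TOY_* bit
values, struct toy_napi and the helper names are hypothetical, and
wake-ups and the actual polling are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    TOY_SCHED          = 1u << 0,  /* someone owns the napi */
    TOY_THREADED       = 1u << 1,  /* threaded mode enabled */
    TOY_SCHED_THREADED = 1u << 2,  /* the owner is the kthread */
};

struct toy_napi { atomic_uint state; };

/* Any path (schedule, busy poll, disable) may win the SCHED bit. */
bool toy_grab_sched(struct toy_napi *n)
{
    return !(atomic_fetch_or(&n->state, TOY_SCHED) & TOY_SCHED);
}

/* Scheduling path: in threaded mode, also mark the kthread as owner. */
bool toy_napi_schedule(struct toy_napi *n)
{
    assert(n != NULL);
    if (!toy_grab_sched(n))
        return false;
    if (atomic_load(&n->state) & TOY_THREADED)
        atomic_fetch_or(&n->state, TOY_SCHED_THREADED);
    return true;
}

/* The kthread polls only when it was made the owner explicitly. */
bool toy_kthread_may_poll(struct toy_napi *n)
{
    return (atomic_load(&n->state) & TOY_SCHED_THREADED) != 0;
}

void toy_napi_complete(struct toy_napi *n)
{
    atomic_fetch_and(&n->state, ~(unsigned int)(TOY_SCHED | TOY_SCHED_THREADED));
}
```

When a busy-poll style caller takes SCHED through toy_grab_sched()
alone, toy_kthread_may_poll() stays false, so the kthread does not poll
a napi it does not own.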

Fixes: 29863d41bb ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17 14:31:17 -07:00
Martin Willi 3a5ca85707 can: dev: Move device back to init netns on owning netns delete
When a non-initial netns is destroyed, the usual policy is to delete
all virtual network interfaces contained, but move physical interfaces
back to the initial netns. This keeps the physical interface visible
on the system.

CAN devices are somewhat special, as they define rtnl_link_ops even
if they are physical devices. If a CAN interface is moved into a
non-initial netns, destroying that netns lets the interface vanish
instead of moving it back to the initial netns. default_device_exit()
skips CAN interfaces due to having rtnl_link_ops set. Reproducer:

  ip netns add foo
  ip link set can0 netns foo
  ip netns delete foo

WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)

To properly restore physical CAN devices to the initial netns on owning
netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
For CAN devices setting this flag, default_device_exit() considers them
non-virtual, applying the usual namespace move.
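
The decision described above can be sketched as a small predicate. This is a simplified model, not the kernel code: the struct names are hypothetical, the field name `netns_refund` is an assumption (the message does not name the flag), and other conditions that the real default_device_exit() checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for rtnl_link_ops, reduced to the new flag. */
struct link_ops_model {
	bool netns_refund; /* assumed name: driver asks to be moved back */
};

/* Hypothetical stand-in for a net_device. */
struct dev_model {
	const struct link_ops_model *link_ops; /* NULL for plain physical devices */
};

/*
 * Returns true when the device should be moved back to the initial
 * netns on namespace exit, false when it is treated as virtual and
 * left to be deleted with the namespace.
 */
static bool moves_back_to_init_netns(const struct dev_model *dev)
{
	assert(dev != NULL);

	/* Link ops without the flag: treated as a virtual interface. */
	if (dev->link_ops && !dev->link_ops->netns_refund)
		return false;

	/* No link ops, or link ops with the flag set: treated as physical. */
	return true;
}
```

Under this model, a CAN device whose driver sets the flag takes the same path as a device without link ops.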

The issue was introduced in the commit mentioned below, as at that time
CAN devices did not have a dellink() operation.

Fixes: e008b5fc8d ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-16 08:40:04 +01:00
Lorenzo Bianconi 8f64860f8b net: export dev_set_threaded symbol
For wireless devices (e.g. mt76 driver) multiple net_devices belong to
the same wireless phy and the napi object is registered in a dummy
netdevice related to the wireless phy.
Export dev_set_threaded in order to be reused in device drivers enabling
threaded NAPI.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15 12:35:23 -07:00
Alexander Lobakin d0eed5c325 gro: give 'hash' variable in dev_gro_receive() a less confusing name
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.
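
To illustrate the distinction, here is a minimal sketch of how a flow hash could be reduced to a bucket index that is then used as a bit position. The bucket count, the masking scheme and the helper names are assumptions made for illustration; the message itself only says that the variable holds a bucket index.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bucket count; must be a power of two for the mask to work. */
#define GRO_BUCKETS 8u

/* Reduce a 32-bit flow hash to the index of its bucket. */
static unsigned int gro_bucket_index(uint32_t flow_hash)
{
	return flow_hash & (GRO_BUCKETS - 1);
}

/* Mark a bucket as non-empty; the bit position is the bucket, not the hash. */
static void gro_mark_bucket(unsigned long *bitmask, unsigned int bucket)
{
	assert(bucket < GRO_BUCKETS);
	*bitmask |= 1UL << bucket;
}
```

Passing a raw 32-bit hash as the bit position would be meaningless here, which is why a name like 'bucket' reads better at such call sites.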

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:09 -07:00
Alexander Lobakin 9dc2c31337 gro: consistentify napi->gro_hash[x] access in dev_gro_receive()
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.
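
The pointer equality holds whenever the list head is the first member of the bucket structure. A minimal sketch, with hypothetical `*_model` types standing in for the kernel ones:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a doubly linked list head. */
struct list_head_model {
	struct list_head_model *next, *prev;
};

/* Hypothetical stand-in for a GRO bucket: the list head comes first. */
struct gro_list_model {
	struct list_head_model list;
	int count;
};

/* The first member sits at offset 0, so both pointers share one address. */
static_assert(offsetof(struct gro_list_model, list) == 0,
	      "list must be the first member");

/* Returns 1 when &g->list and g refer to the same address. */
static int same_address(const struct gro_list_model *g)
{
	return (const void *)&g->list == (const void *)g;
}
```

Because the offset is zero, taking the address of the member costs no extra arithmetic compared with holding the member pointer directly.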

Misc: remove superfluous braces around single-line branches.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Alexander Lobakin 0ccf4d50d1 gro: simplify gro_list_prepare()
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (the first time is at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has the index of the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Gustavo A. R. Silva b1866bfff9 net: core: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.
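
As an illustration of the kind of change described, the sketch below contrasts an implicit fall-through into a case that only breaks with the explicit form. The enum and function names are hypothetical and not taken from the patched code.

```c
#include <assert.h>

enum event_model { EV_UP, EV_DOWN, EV_OTHER };

/* Before: EV_UP silently falls through into the default case. */
static int handle_implicit(enum event_model ev)
{
	int handled = 0;

	switch (ev) {
	case EV_UP:
		handled = 1;
		/* no break here: this is what the warning flags */
	default:
		break;
	}
	return handled;
}

/* After: the explicit break makes the intent visible. */
static int handle_explicit(enum event_model ev)
{
	int handled = 0;

	switch (ev) {
	case EV_UP:
		handled = 1;
		break;
	default:
		break;
	}
	return handled;
}

/* Both variants return the same result for every event. */
static void check_equivalence(void)
{
	for (int ev = EV_UP; ev <= EV_OTHER; ev++)
		assert(handle_implicit((enum event_model)ev) ==
		       handle_explicit((enum event_model)ev));
}
```

Behavior is unchanged in this sketch; only the warning goes away.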

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:15 -08:00
David S. Miller b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Add support for pointers to types with known size among global function
   args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow bpf_get_socket_cookie() helper to also be called from [sleepable] BPF
   tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Alexander Lobakin 9243adfc31 skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside
NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
got NAPI_MERGED_FREE verdict instead of immediate freeing.
Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
to NAPI cache.
As many drivers call napi_alloc_skb()/napi_get_frags() on their
receive path, this becomes especially useful.
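
The idea of deferring a free into a small cache that the allocation path can reuse can be sketched in userspace as follows. This is a loose model with hypothetical names and sizes: it uses malloc()/free() in place of the kernel allocators, assumes all cached objects have the same size, and does not reproduce the kernel's actual cache policy.

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_SIZE 4u /* hypothetical capacity */

struct head_cache {
	void *slots[CACHE_SIZE];
	unsigned int count;      /* objects currently cached */
	unsigned int bulk_frees; /* how many times the cache was flushed */
};

/* Defer a free: park the object in the cache, flush in bulk when full. */
static void cache_defer_free(struct head_cache *c, void *obj)
{
	assert(c->count < CACHE_SIZE);
	c->slots[c->count++] = obj;

	if (c->count == CACHE_SIZE) {
		for (unsigned int i = 0; i < c->count; i++)
			free(c->slots[i]);
		c->count = 0;
		c->bulk_frees++;
	}
}

/* Allocate: prefer a cached object, fall back to the allocator. */
static void *cache_alloc(struct head_cache *c, size_t size)
{
	if (c->count)
		return c->slots[--c->count];
	return malloc(size);
}
```

An object parked by `cache_defer_free()` can be handed straight back by the next `cache_alloc()`, which is the benefit for receive paths that allocate frequently.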

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin fec6e49b63 skbuff: remove __kfree_skb_flush()
This function is hardly needed, as the NAPI skb queue gets bulk-freed
anyway when there's no more room, and it may even reduce the efficiency
of bulk operations.
It will be even less needed after reusing skb cache on allocation path,
so remove it and this way lighten network softirqs a bit.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00