JIRA: https://issues.redhat.com/browse/RHEL-59091
commit 05c1280a2bcfca187fe7fa90bb240602cf54af0a
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu Aug 29 14:33:38 2024 +0200
netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_local
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Conflicts:
drivers/net/amt.c
drivers/net/ethernet/adi/adin1110.c
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59091
commit beb5a9bea8239cdf4adf6b62672e30db3e9fa5ce
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu Aug 29 14:33:36 2024 +0200
netdevice: convert private flags > BIT(31) to bitfields
Make dev->priv_flags `u32` back and define bits higher than 31 as
bitfield booleans as per Jakub's suggestion. This simplifies code
which accesses these bits with no optimization loss (testb both
before/after), allows to not extend &netdev_priv_flags each time,
but also scales better as bits > 63 in the future would only add
a new u64 to the structure with no complications, comparing to
that extending ::priv_flags would require converting it to a bitmap.
Note that I picked `unsigned long :1` to not lose any potential
optimizations comparing to `bool :1` etc.
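The layout described above can be sketched in userspace roughly as follows. This is only an illustration: `struct demo_netdev`, `DEMO_PRIV_FLAG_A` and `demo_flag_hi` are hypothetical names, not the real `struct net_device` definition, and `unsigned long` bitfields rely on a common compiler extension (GCC and Clang accept them).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PRIV_FLAG_A (1u << 0)	/* hypothetical flag that fits in 32 bits */

/* Hypothetical stand-in for the relevant part of struct net_device. */
struct demo_netdev {
	uint32_t priv_flags;		/* mask-style flags, bits 0..31 */
	unsigned long demo_flag_hi:1;	/* hypothetical flag formerly above BIT(31) */
	unsigned long netns_local:1;	/* interface cannot change network namespaces */
};

/* Mask-style test for a flag that still lives in priv_flags. */
bool demo_has_flag_a(const struct demo_netdev *dev)
{
	return (dev->priv_flags & DEMO_PRIV_FLAG_A) != 0;
}

/* Bitfield flags are read as plain booleans; no mask is needed. */
bool demo_can_change_netns(const struct demo_netdev *dev)
{
	return !dev->netns_local;
}
```

Adding another flag in this scheme means adding one more `:1` member, without widening `priv_flags` or touching an enum of bit positions.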
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Conflicts:
drivers/net/ethernet/microchip/lan966x/lan966x_main.c
- Driver not present in RHEL 9.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit 87925151191b64d9623e63ccf11e517eacc99d7d
Author: Edward Cree <ecree.xilinx@gmail.com>
Date: Thu Jun 27 16:33:51 2024 +0100
net: ethtool: add a mutex protecting RSS contexts
While this is not needed to serialise the ethtool entry points (which
are all under RTNL), drivers may have cause to asynchronously access
dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
do this safely without needing to take the RTNL.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/7f9c15eb7525bf87af62c275dde3a8570ee8bf0a.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit 30a32cdf6b130356805b3193a6208de25cbb2015
Author: Edward Cree <ecree.xilinx@gmail.com>
Date: Thu Jun 27 16:33:50 2024 +0100
net: ethtool: add an extack parameter to new rxfh_context APIs
Currently passed as NULL, but will allow drivers to report back errors
when ethnl support for these ops is added.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/6e0012347d175fdd1280363d7bfa76a2f2777e17.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit 847a8ab186767be6ee95643f9739fa9d0f839589
Author: Edward Cree <ecree.xilinx@gmail.com>
Date: Thu Jun 27 16:33:49 2024 +0100
net: ethtool: let the core choose RSS context IDs
Add a new API to create/modify/remove RSS contexts, that passes in the
newly-chosen context ID (not as a pointer) rather than leaving the
driver to choose it on create. Also pass in the ctx, allowing drivers
to easily use its private data area to store their hardware-specific
state.
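A rough userspace sketch of the division of labour described above: the core picks the context ID and hands both the ID (by value) and the context to the driver. Everything here is hypothetical (`demo_core`, `demo_rss_ctx`, the callback type, the fixed-size array); the real kernel API and its data structures differ.

```c
#include <assert.h>

#define DEMO_MAX_CTX 8

/* Hypothetical RSS context with a small driver-private area. */
struct demo_rss_ctx {
	int in_use;
	unsigned char priv[16];
};

/* Hypothetical core state; index 0 stands for the default context. */
struct demo_core {
	struct demo_rss_ctx ctx[DEMO_MAX_CTX];
};

/* Driver callback: receives the ctx and the ID already chosen by the core. */
typedef int (*demo_create_fn)(struct demo_rss_ctx *ctx, unsigned int id);

/* Core side: choose the lowest free custom ID, then call the driver.
 * Returns the new ID, or -1 if no ID is free or the driver refuses. */
int demo_core_create(struct demo_core *core, demo_create_fn drv_create)
{
	for (unsigned int id = 1; id < DEMO_MAX_CTX; id++) {
		struct demo_rss_ctx *ctx = &core->ctx[id];

		if (ctx->in_use)
			continue;
		if (drv_create(ctx, id) != 0)
			return -1;
		ctx->in_use = 1;
		return (int)id;
	}
	return -1;
}

/* Example driver: keeps its "hardware state" in the private area. */
int demo_drv_create(struct demo_rss_ctx *ctx, unsigned int id)
{
	ctx->priv[0] = (unsigned char)id;
	return 0;
}
```

Because the driver no longer invents IDs, it needs no ID bookkeeping of its own and can use the private area directly.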
Keep the existing .set_rxfh API for now as a fallback, but deprecate it
for custom contexts (rss_context != 0).
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/45f1fe61df2163c091ec394c9f52000c8b16cc3b.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit 6ad2962f8adfd53fca52dce7f830783e95d99ce7
Author: Edward Cree <ecree.xilinx@gmail.com>
Date: Thu Jun 27 16:33:47 2024 +0100
net: ethtool: attach an XArray of custom RSS contexts to a netdevice
Each context stores the RXFH settings (indir, key, and hfunc) as well
as optionally some driver private data.
Delete any still-existing contexts at netdev unregister time.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/cbd1c402cec38f2e03124f2ab65b4ae4e08bd90d.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit 3ebbd9f6de7ec6d538639ebb657246f629ace81e
Author: Edward Cree <ecree.xilinx@gmail.com>
Date: Thu Jun 27 16:33:46 2024 +0100
net: move ethtool-related netdev state into its own struct
net_dev->ethtool is a pointer to new struct ethtool_netdev_state, which
currently contains only the wol_enabled field.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/293a562278371de7534ed1eb17531838ca090633.1719502239.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conflicts:
drivers/net/ethernet/wangxun/ngbe/ngbe_ethtool.c
drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
- The driver is not present in RHEL 9.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57750
commit c1742dcb6bda5fd535fbaa2145f0a180bc329aa6
Author: Eric Dumazet <edumazet@google.com>
Date: Thu May 2 17:39:26 2024 +0000
net: no longer acquire RTNL in threaded_show()
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59092
commit c661050f93d3fd37a33c06041bb18a89688de7d2
Author: Breno Leitao <leitao@debian.org>
Date: Mon Apr 22 05:38:56 2024 -0700
net: create a dummy net_device allocator
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.
This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzeroes them all. This causes
some problems as reported here:
https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/
Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.
init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().
Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
simplifies the allocation of dummy devices by allocating and
initializing them in one step. Freeing the device continues to be done
through free_netdev().
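The split described above can be modelled in userspace as follows. This is a sketch under assumptions, not the kernel code: `struct demo_netdev`, `demo_alloc()` and the other `demo_*` names are hypothetical stand-ins for `struct net_device`, `alloc_netdev()`, `init_dummy_netdev_core()` and `init_dummy_netdev()`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum demo_reg_state { DEMO_UNINITIALIZED = 0, DEMO_DUMMY };

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct demo_netdev {
	char name[16];
	enum demo_reg_state reg_state;
};

/* Core init: marks the device as dummy without zeroing the structure. */
void demo_init_dummy_core(struct demo_netdev *dev)
{
	dev->reg_state = DEMO_DUMMY;
}

/* Legacy init for caller-owned storage: zero first, then run the core init. */
void demo_init_dummy(struct demo_netdev *dev)
{
	memset(dev, 0, sizeof(*dev));
	demo_init_dummy_core(dev);
}

/* Allocator in the style of alloc_netdev(): fills fields, then calls setup. */
struct demo_netdev *demo_alloc(const char *name,
			       void (*setup)(struct demo_netdev *))
{
	struct demo_netdev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	snprintf(dev->name, sizeof(dev->name), "%s", name);
	setup(dev);
	return dev;
}

/* Helper: allocate and initialize a dummy device in one call. */
struct demo_netdev *demo_alloc_dummy(void)
{
	return demo_alloc("dummy", demo_init_dummy_core);
}
```

Passing `demo_init_dummy` as the setup callback instead would wipe the name that the allocator has just filled in, which is the kind of problem the commit describes.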
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59092
commit f8d05679fb3faae478d604177b0c188b340371cd
Author: Breno Leitao <leitao@debian.org>
Date: Mon Apr 22 05:38:55 2024 -0700
net: free_netdev: exit earlier if dummy
For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.
Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device does not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59092
commit d160c66cda0ac8614adc53a5b5b0e6d6f1a05a5b
Author: Amit Cohen <amcohen@nvidia.com>
Date: Mon Feb 5 12:30:22 2024 +0200
net: Do not return value from init_dummy_netdev()
init_dummy_netdev() always returns zero and all the callers do not check
the returned value. Set the function to not return value, as it is not
really used today.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240205103022.440946-1-amcohen@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Izabela Bakollari <ibakolla@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
Conflicts:
- net/core/dev.c:
context conflict due to missing 8e15aee62161 ("net: move
altnames together with the netdevice")
commit 1b3ef46cb7f2618cc0b507393220a69810f6da12
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 13 06:32:45 2024 +0000
net: remove dev_base_lock
dev_base_lock is not needed anymore, all remaining users also hold RTNL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
commit e51b962438741f5482c82fb225c1d59136f0fd87
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 13 06:32:44 2024 +0000
net: remove dev_base_lock from register_netdevice() and friends.
RTNL already protects writes to dev->reg_state, we no longer need to hold
dev_base_lock to protect the readers.
unlist_netdevice() second argument can be removed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
commit c7d52737e7ebd31cc5fef46380d94b58becf9479
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 13 06:32:38 2024 +0000
net-sysfs: use dev_addr_sem to remove races in address_show()
Using dev_base_lock is not preventing from reading garbage.
Use dev_addr_sem instead.
v4: place dev_addr_sem extern in net/core/dev.h (Jakub Kicinski)
Link: https://lore.kernel.org/netdev/20240212175845.10f6680a@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
commit 4d42b37def70327b2bb19f823d42289aed2cd7c7
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 13 06:32:36 2024 +0000
net: convert dev->reg_state to u8
Prepares things so that dev->reg_state reads can be lockless,
by adding WRITE_ONCE() on write side.
READ_ONCE()/WRITE_ONCE() do not support bitfields.
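Why the bitfield has to go can be seen in a simplified userspace model. `DEMO_READ_ONCE()` and `DEMO_WRITE_ONCE()` below are reduced stand-ins for the kernel macros (which do considerably more checking), and `struct demo_netdev` and the `DEMO_NETREG_*` values are hypothetical. The sketch relies on the GCC/Clang `__typeof__` extension.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins: force a single volatile access to the object. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

enum { DEMO_NETREG_UNINITIALIZED = 0, DEMO_NETREG_REGISTERED = 1 };

struct demo_netdev {
	/*
	 * A bitfield such as "unsigned int reg_state:8" would not compile
	 * with the macros above: they take the member's address, and C does
	 * not allow taking the address of a bitfield.
	 */
	uint8_t reg_state;
};

/* Writer: serialized by a lock (RTNL in the kernel), annotated for readers. */
void demo_set_registered(struct demo_netdev *dev)
{
	DEMO_WRITE_ONCE(dev->reg_state, DEMO_NETREG_REGISTERED);
}

/* Reader: may run without holding the lock. */
int demo_is_registered(const struct demo_netdev *dev)
{
	return DEMO_READ_ONCE(dev->reg_state) == DEMO_NETREG_REGISTERED;
}
```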
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
commit 1c07dbb0cccfe85060b6eb089db3d6bfeb6aaf31
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 13 06:32:33 2024 +0000
net: annotate data-races around dev->name_assign_type
name_assign_type_show() runs locklessly, we should annotate
accesses to dev->name_assign_type.
Alternative would be to grab devnet_rename_sem semaphore
from name_assign_type_show(), but this would not bring
more accuracy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-59100
commit facd15dfd69122042502d99ab8c9f888b48ee994
Author: Johannes Berg <johannes.berg@intel.com>
Date: Mon Dec 4 21:47:07 2023 +0100
net: core: synchronize link-watch when carrier is queried
There are multiple ways to query for the carrier state: through
rtnetlink, sysfs, and (possibly) ethtool. Synchronize linkwatch
work before these operations so that we don't have a situation
where userspace queries the carrier state between the driver's
carrier off->on transition and linkwatch running and expects it
to work, when really (at least) TX cannot work until linkwatch
has run.
I previously posted a longer explanation of how this applies to
wireless [1] but with this wireless can simply query the state
before sending data, to ensure the kernel is ready for it.
[1] https://lore.kernel.org/all/346b21d87c69f817ea3c37caceb34f1f56255884.camel@sipsolutions.net/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231204214706.303c62768415.I1caedccae72ee5a45c9085c5eb49c145ce1c0dd5@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts: we already have d80ce17d20 ("page_pool: allow caching
from safely localized NAPI")
commit 8b43fd3d1d7d88293eb15e92090826e6b7cc13e4
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Mar 28 23:50:21 2023 +0000
net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ
____napi_schedule() adds a napi into current cpu softnet_data poll_list,
then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly
from net_rx_action(), because we can process poll_list from this point,
without going to full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
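The idea can be modelled with a toy per-CPU state. This is a deliberately simplified sketch: `struct demo_softnet` and the `demo_*` functions are hypothetical, the softirq is reduced to a counter, and the budget and time limits of the real net_rx_action() are left out. The `in_net_rx_action` flag mirrors the softnet_data field added by a related commit in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, loosely modelled on softnet_data. */
struct demo_softnet {
	int poll_list_len;	/* number of NAPI instances queued */
	bool in_net_rx_action;	/* true while the RX loop is running */
	int softirq_raised;	/* counts simulated NET_RX_SOFTIRQ raises */
};

/* Queue a NAPI instance; raise the softirq only if the RX loop is not
 * already running on this CPU. */
void demo_napi_schedule(struct demo_softnet *sd)
{
	sd->poll_list_len++;
	if (!sd->in_net_rx_action)
		sd->softirq_raised++;
}

/* Simulated RX loop. The first `resched` polls re-queue a NAPI instance
 * from inside the loop. Returns the number of polls performed. */
int demo_net_rx_action(struct demo_softnet *sd, int resched)
{
	int polled = 0;

	sd->in_net_rx_action = true;
	/* Keep going while the list was refilled without a softirq. */
	while (sd->poll_list_len > 0) {
		sd->poll_list_len--;
		polled++;
		if (resched > 0) {
			resched--;
			demo_napi_schedule(sd);	/* indirect call from the loop */
		}
	}
	sd->in_net_rx_action = false;
	return polled;
}
```

In this model, entries queued from inside the loop are still processed, but only the initial schedule from outside the loop costs a softirq raise.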
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 909876500251b3b48480a840bbf9053588254eee
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:56 2022 -0700
net: call skb_defer_free_flush() before each napi_poll()
skb_defer_free_flush() can consume cpu cycles,
it seems better to call it in the inner loop:
- Potentially frees page/skb that will be reallocated while hot.
- Account for the cpu cycles in the @time_limit determination.
- Keep softnet_data.defer_count small to reduce chances for
skb_attempt_defer_free() to send an IPI.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 5086f0fe46dcf8687c8c2e41e1f07826affebbba
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Mar 28 17:34:48 2024 +0000
net: do not consume a cacheline for system_page_pool
There is no reason to consume a full cacheline to store system_page_pool.
We can eventually move it to softnet_data later for full locality control.
Fixes: 2b0cfa6e4956 ("net: add generic percpu page_pool allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20240328173448.2262593-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit f3412b3879b4f7c4313b186b03940d4791345534
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Apr 27 13:41:47 2022 -0700
net: make sure net_rx_action() calls skb_defer_free_flush()
I missed a stray return; in net_rx_action(), which very well
is taken whenever trigger_rx_softirq() has been called on
a cpu that is no longer receiving network packets,
or receiving too few of them.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220427204147.1310161-1-eric.dumazet@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 1200097fa8f0d8e8ddfe5c554d8fa2bc03b2df92
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Feb 27 21:01:04 2024 +0000
net: call skb_defer_free_flush() from __napi_busy_loop()
skb_defer_free_flush() is currently called from net_rx_action()
and napi_threaded_poll().
We should also call it from __napi_busy_loop() otherwise
there is the risk the percpu queue can grow until an IPI
is forced from skb_attempt_defer_free() adding a latency spike.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227210105.3815474-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 931e93bdf8ca71cef1f8759c43bc2c5385392b8b
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 21 09:43:54 2023 +0000
net: do not provide hard irq safety for sd->defer_lock
kfree_skb() can be called from hard irq handlers,
but skb_attempt_defer_free() is meant to be used
from process or BH contexts, and skb_defer_free_flush()
is meant to be called from BH contexts.
Not having to mask hard irq can save some cycles.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 3176eb82681ec9c8af31c6588ddedcc6cfb9e445
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri Jan 20 13:07:43 2023 +0100
net: avoid irqsave in skb_defer_free_flush
The spin_lock irqsave/restore API variant in skb_defer_free_flush can
be replaced with the faster spin_lock irq variant, which doesn't need
to read and restore the CPU flags.
Using the unconditional irq "disable/enable" API variant is safe,
because the skb_defer_free_flush() function is only called during
NAPI-RX processing in net_rx_action(), where it is known the IRQs
are enabled.
Expected gain is 14 cycles from avoiding reading and restoring CPU
flags in a spin_lock_irqsave/restore operation, measured via a
microbenchmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz.
Microbenchmark overhead of spin_lock+unlock:
- spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns
- spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns
We don't expect to see a measurable packet performance gain, as
skb_defer_free_flush() is called infrequently once per NIC device NAPI
bulk cycle and conditionally only if SKBs have been deferred by other
CPUs via skb_attempt_defer_free().
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 39564c3fdc6684c6726b63e131d2a9f3809811cb
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:55 2022 -0700
net: add skb_defer_max sysctl
commit 68822bdf76f1 ("net: generalize skb freeing
deferral to per-cpu lists") added another per-cpu
cache of skbs. It was expected to be small,
and an IPI was forced whenever the list reached 128
skbs.
We might need to be able to control more precisely
queue capacity and added latency.
An IPI is generated whenever queue reaches half capacity.
Default value of the new limit is 64.
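The half-capacity rule can be illustrated with a small counter model. The names `demo_defer_queue`, `demo_defer()` and the result enum are hypothetical, and the behaviour on a full queue (refusing to defer so the caller frees the skb itself) is an assumption made for this sketch rather than something stated above.

```c
#include <assert.h>

/* Hypothetical per-CPU deferral queue; only the counter is modelled. */
struct demo_defer_queue {
	unsigned int count;	/* skbs currently queued */
	unsigned int max;	/* capacity; 64 is the default limit */
};

enum demo_defer_result {
	DEMO_NOT_DEFERRED,	/* queue full (assumed: caller frees directly) */
	DEMO_DEFERRED,		/* queued, no IPI needed */
	DEMO_DEFERRED_KICK,	/* queued, and an IPI should be sent */
};

enum demo_defer_result demo_defer(struct demo_defer_queue *q)
{
	if (q->count >= q->max)
		return DEMO_NOT_DEFERRED;
	q->count++;
	/* Request an IPI exactly when the queue reaches half capacity. */
	if (q->count == q->max / 2)
		return DEMO_DEFERRED_KICK;
	return DEMO_DEFERRED;
}
```

With the default limit of 64, this model requests a single IPI when the 32nd skb is queued, which leaves headroom for the remote CPU to flush before the queue fills.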
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 2db60eed1a957423cf06ee1060fc45ed3971990d
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:54 2022 -0700
net: use napi_consume_skb() in skb_defer_free_flush()
skb_defer_free_flush() runs from softirq context,
we have the opportunity to refill the napi_alloc_cache,
and/or use kmem_cache_free_bulk() when this cache is full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 765b11f8f4e20b7433e4ba4a3e9106a0d59501ed
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Mar 25 08:40:31 2024 +0100
net: Rename rps_lock to backlog_lock.
The rps_lock.*() functions use the inner lock of a sk_buff_head for
locking. This lock is used if RPS is enabled, otherwise the list is
accessed lockless and disabling interrupts is enough for the
synchronisation because it is only accessed CPU local. Not only the list
is protected but also the NAPI state.
With the addition of backlog threads, the lock is also needed because of
the cross CPU access even without RPS. The clean up of the defer_list
list is also done via backlog threads (if enabled).
It has been suggested to rename the locking function since it is no
longer just RPS.
Rename the rps_lock*() functions to backlog_lock*().
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 80d2eefcb4c84aa9018b2a997ab3a4c567bc821a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Mar 25 08:40:30 2024 +0100
net: Use backlog-NAPI to clean up the defer_list.
The defer_list is a per-CPU list which is used to free skbs outside of
the socket lock and on the CPU on which they have been allocated.
The list is processed during NAPI callbacks so ideally the list is
cleaned up.
Should the amount of skbs on the list exceed a certain water mark then
the softirq is triggered remotely on the target CPU by invoking a remote
function call. The raise of the softirqs via a remote function call
leads to waking the ksoftirqd on PREEMPT_RT which is undesired.
The backlog-NAPI threads already provide the infrastructure which can be
utilized to perform the cleanup of the defer_list.
The NAPI state is updated with the input_pkt_queue.lock acquired. In
order not to break the state, the backlog-NAPI thread must also be woken
with the lock held. This requires acquiring the lock in
rps_lock_irq*() if the backlog-NAPI threads are used even with RPS
disabled.
Move the logic of remotely starting softirqs to clean up the defer_list
into kick_defer_list_purge(). Make sure a lock is held in
rps_lock_irq*() if backlog-NAPI threads are used. Schedule backlog-NAPI
for defer_list cleanup if backlog-NAPI is available.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts: we don't have d6dbbb11247c ("net: report RCU QS on threaded NAPI repolling")
commit dad6b97702639fba27a2bd3e986982ad6f0db3a7
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Mar 25 08:40:29 2024 +0100
net: Allow to use SMP threads for backlog NAPI.
Backlog NAPI is a per-CPU NAPI struct only (with no device behind it)
used by drivers which don't do NAPI themselves, RPS and parts of the
stack which need to avoid recursive deadlocks while processing a packet.
Non-NAPI drivers use the CPU local backlog NAPI. If RPS is enabled
then a flow for the skb is computed and based on the flow the skb can be
enqueued on a remote CPU. Scheduling/ raising the softirq (for backlog's
NAPI) on the remote CPU isn't trivial because the softirq is only
scheduled on the local CPU and performed after the hardirq is done.
In order to schedule a softirq on the remote CPU, an IPI is sent to the
remote CPU which schedules the backlog-NAPI on the then local CPU.
On PREEMPT_RT interrupts are force-threaded. The soft interrupts are
raised within the interrupt thread and processed after the interrupt
handler completed still within the context of the interrupt thread. The
softirq is handled in the context where it originated.
With force-threaded interrupts enabled, ksoftirqd is woken up if a
softirq is raised from hardirq context. This is the case if it is raised
from an IPI. Additionally there is a warning on PREEMPT_RT if the
softirq is raised from the idle thread.
This was done for two reasons:
- With threaded interrupts the processing should happen in thread
context (where it originated) and ksoftirqd is the only thread for
this context if raised from hardirq. Using the currently running task
instead would "punish" a random task.
- Once ksoftirqd is active it consumes all further softirqs until it
stops running. This changed recently and is no longer the case.
Instead of keeping the backlog NAPI in ksoftirqd (in force-threaded/
PREEMPT_RT setups) I am proposing NAPI-threads for backlog.
The "proper" setup with threaded-NAPI is not doable because the threads
are not pinned to an individual CPU and can be modified by the user.
Additionally a dummy network device would have to be assigned. Also
CPU-hotplug has to be considered if additional CPUs show up.
All this can be probably done/ solved but the smpboot-threads already
provide this infrastructure.
Sending UDP packets over loopback expects that the packet is processed
within the call. Delaying it by handing it over to the thread hurts
performance. It is not beneficial to the outcome if the context switch
happens immediately after enqueue or after a while to process a few
packets in a batch.
There is no need to always use the thread if the backlog NAPI is
requested on the local CPU. This restores the loopback throughput. The
performance drops mostly to the same value after enabling RPS on the
loopback comparing the IPI and the thread result.
Create NAPI-threads for backlog if requested during boot. The thread runs
the inner loop from napi_threaded_poll(), the wait part is different. It
checks for NAPI_STATE_SCHED (the backlog NAPI can not be disabled).
The NAPI threads for backlog are optional, it has to be enabled via the boot
argument "thread_backlog_napi". It is mandatory for PREEMPT_RT to avoid the
wakeup of ksoftirqd from the IPI.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 56364c910691f6d10ba88c964c9041b9ab777bd6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon Mar 25 08:40:28 2024 +0100
net: Remove conditional threaded-NAPI wakeup based on task state.
A NAPI thread is scheduled by first setting NAPI_STATE_SCHED bit. If
successful (the bit was not yet set) then the NAPI_STATE_SCHED_THREADED
is set but only if thread's state is not TASK_INTERRUPTIBLE (is
TASK_RUNNING) followed by task wakeup.
If the task is idle (TASK_INTERRUPTIBLE) then the
NAPI_STATE_SCHED_THREADED bit is not set. The thread is not relying on
the bit but always leaving the wait-loop after returning from schedule()
because there must have been a wakeup.
The smpboot-threads implementation for per-CPU threads requires an
explicit condition and does not support "if we get out of schedule()
then there must be something to do".
Removing this optimisation simplifies the following integration.
Set NAPI_STATE_SCHED_THREADED unconditionally on wakeup and rely on it
in the wait path by removing the `woken' condition.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts: we already have 490a79faf95e ("net: introduce include/net/rps.h")
commit 2b0cfa6e49566c8fa6759734cf821aa6e8271a9e
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon Feb 12 10:50:54 2024 +0100
net: add generic percpu page_pool allocator
Introduce generic percpu page_pools allocator.
Moreover add page_pool_create_percpu() and cpuid field in page_pool struct
in order to recycle the page in the page_pool "hot" cache if
napi_pp_put_page() is running on the same cpu.
This is a preliminary patch to add xdp multi-buff support for xdp running
in generic mode.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/80bc4285228b6f4220cd03de1999d86e46e3fcbd.1707729884.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 87eff2ec57b6d68d294013d8dd21e839a1175e3a
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 21 09:43:57 2023 +0000
net: optimize napi_threaded_poll() vs RPS/RFS
We use napi_threaded_poll() in order to reduce our softirq dependency.
We can add a followup of 821eba962d95 ("net: optimize napi_schedule_rps()")
to further remove the need of firing NET_RX_SOFTIRQ whenever
RPS/RFS are used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit a1aaee7f8f79d1b0595e24f8c3caed24630d6cb6
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 21 09:43:56 2023 +0000
net: make napi_threaded_poll() aware of sd->defer_list
If we call skb_defer_free_flush() from napi_threaded_poll(),
we can avoid raising an IPI from skb_attempt_defer_free()
when the list becomes too big.
This allows napi_threaded_poll() to rely less on softirqs,
and lowers latency caused by a too big list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 821eba962d95806beb0440742c4062a9da8a386b
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Mar 28 23:50:20 2023 +0000
net: optimize napi_schedule_rps()
Based on initial patch from Jason Xing.
Idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we queued a packet into another cpu backlog.
We can do this only in the context of us being called indirectly
from net_rx_action(), to have the guarantee our rps_ipi_list
will be processed before we exit from net_rx_action().
Link: https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
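The condition described in the napi_schedule_rps() change above can be modelled in plain C. This is a minimal userspace sketch under stated assumptions, not the kernel code: `struct sd_model`, `model_schedule_rps()` and the two counters are hypothetical stand-ins, and the same-cpu path is reduced to a counter bump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-cpu state; not the kernel's softnet_data. */
struct sd_model {
	struct sd_model *rps_ipi_next;	/* link used while queued on a list */
	struct sd_model *rps_ipi_list;	/* remote cpus waiting for an IPI */
	bool in_net_rx_action;		/* true while this cpu runs the rx loop */
	int softirq_raised;		/* counts simulated NET_RX_SOFTIRQ raises */
	int local_scheduled;		/* counts simulated local backlog schedules */
};

/* Queue a packet for 'target'; 'mysd' is the state of the calling cpu. */
static void model_schedule_rps(struct sd_model *mysd, struct sd_model *target)
{
	assert(mysd && target);

	if (target != mysd) {
		/* Remember that the remote cpu needs an IPI. */
		target->rps_ipi_next = mysd->rps_ipi_list;
		mysd->rps_ipi_list = target;
		/*
		 * Inside the rx loop the list is processed before the loop
		 * exits, so the softirq only has to be raised otherwise.
		 */
		if (!mysd->in_net_rx_action)
			mysd->softirq_raised++;
		return;
	}
	mysd->local_scheduled++;	/* simplified same-cpu path */
}
```

In this model, a call made while `in_net_rx_action` is set queues the remote cpu but leaves `softirq_raised` untouched, which is the saving the commit message describes.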
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit c59647c0dc679008886756a888368da1c6d4ccd3
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Mar 28 23:50:19 2023 +0000
net: add softnet_data.in_net_rx_action
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require knowing whether these helpers are
called from net_rx_action() or from other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 8fcb76b934daff12cde76adeab3d502eeb0734b1
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Mar 28 23:50:18 2023 +0000
net: napi_schedule_rps() cleanup
napi_schedule_rps() return value is ignored, remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-9145
commit 97e719a82b43c6c2bb5eebdb3c5d479a332ac2ac
Author: Eric Dumazet <edumazet@google.com>
Date: Sun May 15 21:24:53 2022 -0700
net: fix possible race in skb_attempt_defer_free()
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async().
The problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.
Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.
This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.
I triggered this issue while experimenting with a smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.
Commit 5a18ceca63 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
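One way to get the serialization this commit message asks for is a single "kick pending" flag that only one caller can claim at a time. The following is a minimal userspace sketch of that idea, assuming C11 atomics in place of kernel primitives; `struct defer_state`, `try_schedule_ipi()` and `ipi_handler()` are hypothetical names, and the counter stands in for the asynchronous cross-cpu call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu state; not the kernel's layout. */
struct defer_state {
	atomic_int ipi_scheduled;	/* 1 while a kick is in flight */
	int ipis_sent;			/* counts simulated IPIs */
};

/* Returns true only for the caller that wins the right to send the kick. */
static bool try_schedule_ipi(struct defer_state *st)
{
	int expected = 0;

	assert(st);
	if (!atomic_compare_exchange_strong(&st->ipi_scheduled, &expected, 1))
		return false;	/* another kick is already pending */
	st->ipis_sent++;	/* stands in for the async cross-cpu call */
	return true;
}

/* Simulated handler on the remote cpu: re-arm so a later kick can be sent. */
static void ipi_handler(struct defer_state *st)
{
	assert(st);
	atomic_store(&st->ipi_scheduled, 0);
}
```

Because the flag is claimed with a compare-and-exchange rather than read and then written, two cpus that both see the threshold crossed cannot both send a kick before the handler has run.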
JIRA: https://issues.redhat.com/browse/RHEL-9145
Conflicts:
net/tls/tls_sw.c: we already have:
* 4cbc325ed6b4 ("tls: rx: allow only one reader at a time")
net/ipv4/tcp_ipv4.c: we already have:
* 67b688aecd tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
* 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
* 0d5e52df56 bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
* 7a26dc9e7b43 net: tcp: add skb drop reasons to tcp_add_backlog()
commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 22 13:12:37 2022 -0700
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skb
frees outside of the critical section where the socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPCs fit in one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead().
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() look better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
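The per-cpu deferral scheme described above can be sketched as a small single-threaded model. Everything here is illustrative: `struct buf`, `defer_free()`, `defer_flush()`, `NR_CPUS_MODEL` and the threshold value are assumptions made for the example, and the locking the commit message mentions (sd->defer_lock) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL	4	/* illustrative cpu count */
#define DEFER_KICK_AT	128	/* illustrative kick threshold */

/* Hypothetical buffer that remembers which cpu allocated it. */
struct buf {
	struct buf *next;
	int alloc_cpu;
	bool freed;
};

/* One single-linked list and counter per cpu. */
struct cpu_defer {
	struct buf *head;
	int count;
};

static struct cpu_defer defer[NR_CPUS_MODEL];

/*
 * Free 'b' from 'this_cpu'. Buffers allocated elsewhere are queued on the
 * allocating cpu's list. Returns true when that cpu should be kicked.
 */
static bool defer_free(struct buf *b, int this_cpu)
{
	struct cpu_defer *d;

	assert(b->alloc_cpu >= 0 && b->alloc_cpu < NR_CPUS_MODEL);
	if (b->alloc_cpu == this_cpu) {
		b->freed = true;	/* no deferral needed */
		return false;
	}
	d = &defer[b->alloc_cpu];
	b->next = d->head;
	d->head = b;
	return ++d->count == DEFER_KICK_AT;
}

/* Run by the owning cpu at the end of its rx loop; returns buffers freed. */
static int defer_flush(int cpu)
{
	struct cpu_defer *d = &defer[cpu];
	struct buf *b = d->head;
	int n = 0;

	d->head = NULL;
	d->count = 0;
	while (b) {
		struct buf *next = b->next;

		b->freed = true;
		b = next;
		n++;
	}
	return n;
}
```

The point of the design is visible in `defer_free()`: the cpu running recvmsg() only pays for a list insertion, while the actual free happens later on the cpu that allocated the buffer.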
JIRA: https://issues.redhat.com/browse/RHEL-57767
commit 047f340b36fc550c0fc6a8947fc0a1f8e429e9ab
Author: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Date: Mon Mar 25 20:47:36 2024 +0000
net: sched: make skip_sw actually skip software
TC filters come in 3 variants:
- no flag (try to process in hardware, but fallback to software)
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)
However, skip_sw is implemented so that the skip_sw
flag can only be checked after the filter has been matched.
IMHO it's common, when using skip_sw, to use it on all rules.
So if all filters in a block are skip_sw filters, then
we can bail early and thus avoid having to match
the filters just to check for the skip_sw flag.
This patch adds a bypass, for when only TC skip_sw rules
are used. The bypass is guarded by a static key, to avoid
harming other workloads.
There are 3 ways that a packet from a skip_sw ruleset can
end up in the kernel path. Although the "send packets to a
non-existent chain" way is only improved by a few percent,
I believe it's worth optimizing the trap and fall-through
use-cases.
+----------------------------+--------+--------+--------+
| Test description | Pre- | Post- | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x |
| switch to eswitch mode | 3081.8 | 3094.7 | 1.00x |
| add ingress qdisc | 3042.9 | 3063.6 | 1.01x |
| tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x |
| tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x |
+----------------------------+--------+--------+--------+
| tests with only skip_sw rules below: |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 2694.7 | 3058.7 | 1.14x |
| 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x |
| 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x |
| 5 non-matching rules | 1958.2 | 3061.3 | 1.56x |
| 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x |
| 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x |
| 10 non-matching rules | 1466.3 | 3062.8 | 2.09x |
| 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x |
| 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x |
| 25 non-matching rules | 838.5 | 3058.9 | 3.65x |
| 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x |
| 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x |
| 50 non-matching rules | 488.1 | 3054.7 | 6.26x |
| 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x |
| 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x |
+----------------------------+--------+--------+--------+
perf top (25 n-m skip_sw rules - pre patch):
20.39% [kernel] [k] __skb_flow_dissect
16.43% [kernel] [k] rhashtable_jhash2
10.58% [kernel] [k] fl_classify
10.23% [kernel] [k] fl_mask_lookup
4.79% [kernel] [k] memset_orig
2.58% [kernel] [k] tcf_classify
1.47% [kernel] [k] __x86_indirect_thunk_rax
1.42% [kernel] [k] __dev_queue_xmit
1.36% [kernel] [k] nft_do_chain
1.21% [kernel] [k] __rcu_read_lock
perf top (25 n-m skip_sw rules - post patch):
5.12% [kernel] [k] __dev_queue_xmit
4.77% [kernel] [k] nft_do_chain
3.65% [kernel] [k] dev_gro_receive
3.41% [kernel] [k] check_preemption_disabled
3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
2.88% [kernel] [k] __netif_receive_skb_core.constprop.0
2.49% [kernel] [k] mlx5e_xmit
2.15% [kernel] [k] ip_forward
1.95% [kernel] [k] mlx5e_tc_restore_tunnel
1.92% [kernel] [k] vlan_gro_receive
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
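The bypass described above can be sketched as a small userspace model. It assumes one counter per block for all filters and one for skip_sw filters, with a plain global integer standing in for the static key; `struct block`, `block_add_filter()` and `classify_in_software()` are hypothetical names and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block bookkeeping. */
struct block {
	int nfilters;	/* all filters in the block */
	int nskip_sw;	/* filters flagged skip_sw */
};

/* Stands in for the static key: number of blocks that can be bypassed. */
static int bypass_capable_blocks;

static bool block_all_skip_sw(const struct block *b)
{
	assert(b->nskip_sw <= b->nfilters);
	return b->nfilters > 0 && b->nfilters == b->nskip_sw;
}

/* Add a filter and keep the global bypass count in sync. */
static void block_add_filter(struct block *b, bool skip_sw)
{
	bool before = block_all_skip_sw(b);

	b->nfilters++;
	if (skip_sw)
		b->nskip_sw++;
	bypass_capable_blocks += block_all_skip_sw(b) - before;
}

/*
 * Returns false when software classification is skipped entirely.
 * 'matches_attempted' accumulates the per-filter matching work done.
 */
static bool classify_in_software(const struct block *b, int *matches_attempted)
{
	/* Cheap global check first, so other workloads pay almost nothing. */
	if (bypass_capable_blocks > 0 && block_all_skip_sw(b))
		return false;
	*matches_attempted += b->nfilters;
	return true;
}
```

With only skip_sw filters installed the model does no matching work at all, which corresponds to the near-constant post-patch rates in the table regardless of rule count.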
JIRA: https://issues.redhat.com/browse/RHEL-57767
commit c24831a13ba2e472f874483525084da2f2feb890
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Apr 17 08:53:47 2023 -0700
net: skbuff: hide csum_not_inet when CONFIG_IP_SCTP not set
SCTP is not universally deployed, allow hiding its bit
from the skb.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57740
Conflicts:
- include/linux/netdevice.h:
context conflict due to backported commit c353c7b7ffb7a ("net-device:
move lstats in net_device_read_txrx")
- drivers/net/veth.c:
modified due to backported commit 0bef512012b1c ("net: add
netdev_lockdep_set_classes() to virtual drivers")
- net/core/dev.c:
context conflict due to missing commit 1202cdd665315 ("Remove DECnet
support from kernel")
commit 34d21de99cea9cb17967874313e5b0262527833c
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Tue Nov 14 01:42:14 2023 +0100
net: Move {l,t,d}stats allocation to core and convert veth & vrf
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-6297
commit 7f3eb2174512fe6c9c0f062e96eccb0d3cc6d5cd
Author: Christian Marangi <ansuelsmth@gmail.com>
Date: Wed Oct 18 14:35:47 2023 +0200
net: introduce napi_is_scheduled helper
We currently have napi_if_scheduled_mark_missed that can be used to
check if napi is scheduled, but it does more than simply checking
and returning a bool. Some drivers already implement a custom function
to check if napi is scheduled.
Drop these custom functions and introduce napi_is_scheduled, which
simply checks atomically whether napi is scheduled.
Update any driver and code that implements a similar check to
use this new helper instead.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
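The difference between a pure check and a check with a side effect can be shown with a simplified userspace model. This is a sketch only: `struct napi_model`, the two state bits and both functions are hypothetical, C11 atomics replace the kernel bit operations, and other state bits the real helpers may consider are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
	STATE_SCHED  = 1 << 0,	/* poll is scheduled */
	STATE_MISSED = 1 << 1,	/* a reschedule was requested */
};

/* Hypothetical model of the instance state word. */
struct napi_model {
	atomic_ulong state;
};

/* Pure check: reads the state atomically and changes nothing. */
static bool model_is_scheduled(struct napi_model *n)
{
	assert(n);
	return (atomic_load(&n->state) & STATE_SCHED) != 0;
}

/* Check with a side effect: if scheduled, also set the "missed" bit. */
static bool model_if_scheduled_mark_missed(struct napi_model *n)
{
	unsigned long old;

	assert(n);
	old = atomic_load(&n->state);
	do {
		if (!(old & STATE_SCHED))
			return false;
	} while (!atomic_compare_exchange_weak(&n->state, &old,
					       old | STATE_MISSED));
	return true;
}
```

A caller that only wants to know the current state should use the first form, since the second one modifies the state word as part of answering.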
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765
JIRA: https://issues.redhat.com/browse/RHEL-48648
Various visibility improvements; mainly around drop reasons, reset reason and improved tracepoints this time.
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4307
JIRA: https://issues.redhat.com/browse/RHEL-30902
Tested: manual testing and preliminary LNST run show improvement in some
tests and no regressions.
The fields that the rx and tx paths use were placed all over the core
networking structs. Reorganize these structs so the fields of each
struct that are read/written in rx/tx paths are placed close to each
other to reduce the number of cache lines used.
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git
commit c53795d48ee8f385c6a9e394651e7ee914baaeba
Author: Yan Zhai <yan@cloudflare.com>
Date: Mon Jun 17 11:09:04 2024 -0700
net: add rx_sk to trace_kfree_skb
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
commit f6e0a4984c2e7244689ea87b62b433bed9d07e94
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Mar 14 20:08:45 2024 +0000
net: move dev->state into net_device_read_txrx group
dev->state can be read in rx and tx fast paths.
netif_running() which needs dev->state is called from
- enqueue_to_backlog() [RX path]
- __dev_direct_xmit() [TX path]
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Context difference due to missing 34d21de99cea
("net: Move {l,t,d}stats allocation to core and convert veth & vrf");
this doesn't affect that the stats pointer union itself is read in the rx
and tx fast paths.
commit c353c7b7ffb7ae6ed8f3339906fe33c8be6cf344
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 8 14:43:23 2024 +0000
net-device: move lstats in net_device_read_txrx
dev->lstats is notably used from loopback ndo_start_xmit()
and other virtual drivers.
Per cpu stats updates are dirtying per-cpu data,
but the pointer itself is read-only.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Simon Horman <horms@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: code difference because we are maintaining kABI
exclusions.
commit d3d344a1ca69d8fb2413e29e6400f3ad58a05c06
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Jan 2 16:22:20 2024 +0000
net-device: move xdp_prog to net_device_read_rx
xdp_prog is used in receive path, both from XDP enabled drivers
and from netif_elide_gro().
This patch also removes two 4-bytes holes.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
commit 993498e537af9260e697219ce41b41b22b6199cc
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Dec 21 14:07:47 2023 +0000
net-device: move gso_partial_features to net_device_read_tx
dev->gso_partial_features is read from tx fast path for GSO packets.
Move it to appropriate section to avoid a cache line miss.
Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30902
Conflicts:
- include/linux/netdevice.h: Conflicts due to kABI exclusions in the
struct. Reordering kABI excluded fields maintains the kABI exclusion.
- include/linux/netdevice.h: Context differences due to missing patches
from upstream.
commit 43a71cd66b9c0a4af3d15d8644359fde35bdbed0
Author: Coco Li <lixiaoyan@google.com>
Date: Mon Dec 4 20:12:30 2023 +0000
net-device: reorganize net_device fast path variables
Reorganize fast path variables in tx-txrx-rx order.
Fast path variables end after npinfo.
Below data generated with pahole on x86 architecture.
Fast path variables span cache lines before change: 12
Fast path variables span cache lines after change: 4
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
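The effect of grouping fast path fields can be illustrated with two toy layouts. This is a sketch under stated assumptions: 64-byte cache lines, structures that start on a cache line boundary, and hypothetical structs and field names that do not correspond to the real net_device members.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumption: 64-byte cache lines */

/* Hot fields interleaved with cold ones (hypothetical layout). */
struct dev_scattered {
	unsigned long hot_tx_flags;
	char cold_name[120];
	unsigned int hot_mtu;
	char cold_alias[120];
	unsigned long hot_features;
};

/* Same fields, with the hot ones grouped at the front. */
struct dev_grouped {
	unsigned long hot_tx_flags;
	unsigned int hot_mtu;
	unsigned long hot_features;
	char cold_name[120];
	char cold_alias[120];
};

/* Count the distinct cache lines that contain the given field offsets. */
static size_t lines_touched(const size_t *offs, size_t n)
{
	size_t i, j, lines = 0;

	assert(offs && n > 0);
	for (i = 0; i < n; i++) {
		for (j = 0; j < i; j++)
			if (offs[j] / CACHELINE == offs[i] / CACHELINE)
				break;
		if (j == i)	/* first field seen on this line */
			lines++;
	}
	return lines;
}

static size_t scattered_lines(void)
{
	const size_t offs[] = {
		offsetof(struct dev_scattered, hot_tx_flags),
		offsetof(struct dev_scattered, hot_mtu),
		offsetof(struct dev_scattered, hot_features),
	};

	return lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}

static size_t grouped_lines(void)
{
	const size_t offs[] = {
		offsetof(struct dev_grouped, hot_tx_flags),
		offsetof(struct dev_grouped, hot_mtu),
		offsetof(struct dev_grouped, hot_features),
	};

	return lines_touched(offs, sizeof(offs) / sizeof(offs[0]));
}
```

Comparing `scattered_lines()` with `grouped_lines()` gives the same kind of before/after figure that the commit message reports from pahole, on a much smaller structure.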
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4236
JIRA: https://issues.redhat.com/browse/RHEL-36217
Commits:
```
b534dc46c8ae ("net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP")
70f7457ad6d6 ("net: create device lookup API with reference tracking")
3515440df461 ("ipv6: also use netdev_hold() in ip6_route_check_nh()")
108a36d07c01 ("ethtool: Fix mod state of verbose no_mask bitset")
524515020f25 ("Revert "ethtool: Fix mod state of verbose no_mask bitset"")
f55d8e60f109 ("net: ethtool: Fix documentation of ethtool_sprintf()")
65c9fde15a65 ("net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
0bca3f7f9acd ("net: macvlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
c0dabeb4c666 ("net: bonding: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set()")
ef5eb9c5ce45 ("net: fec: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
547b006d1922 ("net: fec: delete fec_ptp_disable_hwts()")
fd770e856e22 ("net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers")
c35e927cbe09 ("net: omit ndo_hwtstamp_get() call when possible in dev_set_hwtstamp_phylib()")
446e2305827b ("net: Convert PHYs hwtstamp callback to use kernel_hwtstamp_config")
430dc3256d57 ("net: phy: Remove the call to phy_mii_ioctl in phy_hwstamp_get/set")
b8768dc40777 ("net: ethtool: Refactor identical get_ts_info implementations.")
202cb220026e ("net: macb: Convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()")
011dd3b3f83f ("net: Make dev_set_hwtstamp_phylib accessible")
915d25a9d69b ("net: phy: micrel: fix ts_info value in case of no phc")
acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
289354f21b2c ("net: partial revert of the "Make timestamping selectable: series")
cc124ad39288 ("Documentation: networking: add missing PLCA messages from the message list")
d0c3891db2d2 ("ethtool: reformat kerneldoc for struct ethtool_link_settings")
1271ca00aa7f ("ethtool: reformat kerneldoc for struct ethtool_fec_stats")
f1172f3ee3a9 ("ethtool: netlink: Add missing ethnl_ops_begin/complete")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to RH KABI reservations for z-stream
commit 26793bfb5d6072326d1465343e7cbf6156abca4f
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:29:07 2023 -0800
net: Add NAPI IRQ support
Add support to associate the interrupt vector number for a
NAPI instance.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147334728.5260.13221803396905901904.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to missing 9a675ba55a96 ("net, bpf: Add
a warning if NAPI cb missed xdp_do_flush().")
commit 27f91aaf49b3a50e5a02ad5fa27b7c453d029a72
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:28:56 2023 -0800
netdev-genl: Add netlink framework functions for napi
Implement the netdev netlink framework functions for
napi support. The netdev structure tracks all the napi
instances and napi fields. The napi instances and associated
parameters can be retrieved this way.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147333637.5260.14807433239805550815.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30139
Conflicts:
- context conflict due to RH KABI reservations for z-stream
commit 2a502ff0c4e42a739b5aa550c901bf3852795532
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date: Fri Dec 1 15:28:34 2023 -0800
net: Add queue and napi association
Add the napi pointer in netdev queue for tracking the napi
instance for each queue. This achieves the queue<->napi mapping.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4235
JIRA: https://issues.redhat.com/browse/RHEL-36218
Note that patch 2 is needed for patch 3 to avoid compiler warnings and patch 1 is a dependency for patch 2.
Commits:
```
4eb6bd55cfb2 ("compiler.h: drop fallback overflow checkers")
d219d2a9a92e ("overflow: Allow mixed type arguments")
8798481b667f ("net/sched: wrap open coded Qdics class filter counter")
daf8d9181b9b ("net/sched: sch_drr: warn about class in use while deleting")
e20e75017c5a ("net/sched: sch_qfq: warn about class in use while deleting")
a57c34a80cbe ("net: flow_dissector: Add IPSEC dissector")
4c13eda757e3 ("tc: flower: support for SPI")
c8915d7329d6 ("tc: flower: Enable offload support IPSEC SPI field.")
992b47851be9 ("net: pkt_cls: Remove unused inline helpers")
09e0c3bbde90 ("net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()")
25b0d4e4e41f ("net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode")
98766add2d55 ("net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()")
6e0ec800c174 ("net/sched: taprio: delete misleading comment about preallocating child qdiscs")
665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
40b0425f8ba1 ("net: ptp: create a mock-up PTP Hardware Clock driver")
b63e78fca889 ("net: netdevsim: use mock PHC driver")
35da47fe1c47 ("net: netdevsim: mimic tc-taprio offload")
355adce3010b ("selftests/tc-testing: add ptp_mock Kconfig dependency")
1890cf08bd99 ("selftests/tc-testing: test that taprio can only be attached as root")
29c298d2bc82 ("selftests/tc-testing: verify that a qdisc can be grafted onto a taprio class")
4072d97ddc44 ("netem: add prng attribute to netem_sched_data")
9c87b2aeccf1 ("netem: use a seeded PRNG for generating random losses")
3cad70bc74ef ("netem: use seeded PRNG for correlated loss events")
8c21ab1bae94 ("net/sched: fq_pie: avoid stalls in fq_pie_timer()")
8fc134fee27f ("net: sched: sch_qfq: Fix UAF in qfq_dequeue()")
a5e2151ff9d5 ("net/ipv6: SKB symmetric hash should incorporate transport ports")
70ad43333cbe ("selftests/tc-testing: cls_fw: add tests for classid")
7c339083616c ("selftests/tc-testing: cls_route: add tests for classid")
e2f2fb3c352d ("selftests/tc-testing: cls_u32: add tests for classid")
ef765c258759 ("net/sched: cls_route: make netlink errors meaningful")
98cfbe4234a4 ("selftests/tc-testing: localize test resources")
d227cc0b1ee1 ("selftests/tc-testing: update test definitions for local resources")
ac9b82930964 ("selftests/tc-testing: implement tdc parallel test run")
d3fc4eea9742 ("selftests/tc-testing: update tdc documentation")
1add90738cf5 ("net_sched: constify qdisc_priv()")
54ff8ad69c6e ("net_sched: sch_fq: struct sched_data reorg")
ee9af4e14d16 ("net_sched: sch_fq: change how @inactive is tracked")
076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
8f6c4ff9e052 ("net_sched: sch_fq: always garbage collect")
2ae45136a938 ("net_sched: sch_fq: remove q->ktime_cache")
5579ee462dfe ("net_sched: export pfifo_fast prio2band[]")
29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
49e7265fd098 ("net_sched: sch_fq: add TCA_FQ_WEIGHTS attribute")
0fef0907d6fa ("netem: Annotate struct disttable with __counted_by")
c4d49196ceec ("net: sched: cls_u32: Fix allocation size in u32_init()")
54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
39d08b91646d ("net, sched: Add tcf_set_drop_reason for {__,}tcf_classify")
f157b73d5114 ("selftests: tc-testing: add missing Kconfig options to 'config'")
35027c790970 ("selftests: tc-testing: move auxiliary scripts to a dedicated folder")
ee3d12285471 ("selftests: tc-testing: add test for 'rt' upgrade on hfsc")
06e4dd18f868 ("net_sched: sch_fq: fix off-by-one error in fq_dequeue()")
81a416985698 ("net_sched: sch_fq: fastpath needs to take care of sk->sk_pacing_status")
6d25d1dc76bf ("net: sched: sch_qfq: Use non-work-conserving warning handler")
70f06c115bcc ("sched: act_ct: switch to per-action label counting")
49b02a19c23a ("net: sched: Fill in MODULE_DESCRIPTION for act_gate")
a9c92771fa23 ("net: sched: Fill in missing MODULE_DESCRIPTION for classifiers")
f96118c5d86f ("net: sched: Fill in missing MODULE_DESCRIPTION for qdiscs")
40cb2fdfed34 ("net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config")
f1a3b283f852 ("net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP")
e316dd1cf135 ("net: don't dump stack on queue timeout")
9ffa01cab069 ("selftests: tc-testing: drop '-N' argument from nsPlugin")
fa63d353ddfb ("selftests: tc-testing: rework namespaces and devices setup")
bb9623c337f5 ("selftests: tc-testing: preload all modules in kselftests")
04fd47bf70f9 ("selftests: tc-testing: use parallel tdc in kselftests")
6b78debe1c07 ("net/sched: cls_u32: replace int refcounts with proper refcounts")
54293e4d6a62 ("selftests/tc-testing: add hashtable tests for u32")
025de7b6a6dd ("selftests: tc-testing: cap parallel tdc to 4 cores")
50a5988a7a54 ("selftests: tc-testing: move back to per test ns setup")
3d5026fc5adb ("selftests: tc-testing: use netns delete from pyroute2")
3f2d94a4ff48 ("selftests: tc-testing: leverage -all in suite ns teardown")
4b480cfb1066 ("selftests: tc-testing: timeout on unbounded loops")
4968afa0143d ("selftests: tc-testing: report number of workers in use")
a79d8ba734bd ("selftests: tc-testing: remove buildebpf plugin")
8059e68b9928 ("selftests: tc-testing: remove unnecessary time.sleep")
56e16bc69bb7 ("selftests: tc-testing: prefix iproute2 functions with "ipr2"")
501679f5d4a4 ("selftests: tc-testing: cleanup on Ctrl-C")
ed346fccfc40 ("selftests: tc-testing: remove unused import")
000db9e9ad42 ("net/sched: cbs: Use units.h instead of the copy of a definition")
f7580f00cc6e ("selftests: tc-testing: remove spurious nsPlugin usage")
74f7e7eeb1d2 ("selftests: tc-testing: remove spurious './' from Makefile")
7de8b2efafeb ("selftests: tc-testing: rename concurrency.json to flower.json")
0fbb5a54f941 ("selftests: tc-testing: remove filters/tests.json")
3872347e0a16 ("net/sched: act_api: use tcf_act_for_each_action")
a0e947c9ccff ("net/sched: act_api: avoid non-contiguous action array")
e09ac779f736 ("net/sched: act_api: stop loop over ops array on NULL in tcf_action_init")
f9bfc8eb1342 ("net/sched: act_api: use tcf_act_for_each_action in tcf_idr_insert_many")
c5e2a973448d ("rtnl: add helper to check if rtnl group has listeners")
8439109b76a3 ("rtnl: add helper to check if a notification is needed")
ddb6b284bdc3 ("rtnl: add helper to send if skb is not null")
c73724bfde09 ("net/sched: act_api: don't open code max()")
8d4390f51920 ("net/sched: act_api: conditional notification of events")
e522755520ef ("net/sched: cls_api: remove 'unicast' argument from delete notification")
93775590b1ee ("net/sched: cls_api: conditional notification of events")
4b55e86736d5 ("net/sched: act_api: rely on rcu in tcf_idr_check_alloc")
1dd7f18fc0ed ("net/sched: act_api: skip idr replace on bound actions")
fb2780721ca5 ("net: sched: Move drop_reason to struct tc_skb_cb")
b6a3c6066afc ("net: sched: Make tc-related drop reason more flexible for remaining qdiscs")
2f57dd94bdef ("packet: add a generic drop reason for receive")
4cf24dc89340 ("net: sched: Add initial TC error skb drop reasons")
913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
a7042cf8f231 ("net/sched: cls_api: Expose tc block to the datapath")
415e38bf1d8d ("net/sched: act_mirred: Add helper function tcf_mirred_replace_dev")
42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
8fcb0382af6f ("net: sched: em_text: fix possible memory leak in em_text_destroy()")
ba24ea129126 ("net/sched: Retire ipt action")
6d6d80e4f6bc ("net/sched: Remove CONFIG_NET_ACT_IPT from default configs")
41bc3e8fc1f7 ("net/sched: Remove uapi support for rsvp classifier")
82b2545ed9a4 ("net/sched: Remove uapi support for tcindex classifier")
fe3b739a5472 ("net/sched: Remove uapi support for dsmark qdisc")
26cc8714fc7f ("net/sched: Remove uapi support for ATM qdisc")
33241dca4862 ("net/sched: Remove uapi support for CBQ qdisc")
2ab1efad60ad ("net/sched: cls_api: complement tcf_tfilter_dump_policy")
c2a67de9bb54 ("net/sched: introduce ACT_P_BOUND return code")
530496985cea ("net/sched: sch_api: conditional netlink notifications")
94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
405cd9fc6f44 ("net/sched: simplify tc_action_load_ops parameters")
2ffca83aa39c ("net/sched: Remove ipt action tests")
e18405d0be80 ("net: sched: track device in tcf_block_get/put_ext() only for clsact binder types")
ea937f772083 ("net: netdevsim: don't try to destroy PHC on VFs")
93590849a05e ("selftests: forwarding: Fix layer 2 miss test flakiness")
aae09a6c7783 ("net/sched: act_mirred: Don't zero blockid when net device is being deleted")
a46c31bf2744 ("net: fill in MODULE_DESCRIPTION()s for net/sched")
86fe596b588f ("net: sched: Remove NET_ACT_IPT from Kconfig")
eb2c11b27c58 ("net: bql: fix building with BQL disabled")
51270d573a8d ("tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29681
Upstream Status: linux.git
commit ee403248fa6db5ca23031fc51b06284d6855cd02
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Feb 7 20:50:38 2022 -0800
net: remove default_device_exit()
For some reason default_device_ops kept two exit methods:
1) default_device_exit() is called for each netns being dismantled in
a cleanup_net() round. This acquires rtnl for each invocation.
2) default_device_exit_batch() is called once with the list of all netns
in the batch, allowing for a single rtnl invocation.
Get rid of the .exit() method to handle the logic from
default_device_exit_batch(), to decrease the number of rtnl acquisitions
to one.
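The difference between the two exit paths can be modelled in user space. This is a minimal sketch, not kernel code: `model_rtnl`, `model_exit_per_netns()` and `model_exit_batch()` are hypothetical names, and rtnl is reduced to a counter of acquisitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for rtnl: only counts how often it is taken. */
struct model_rtnl {
	unsigned long acquisitions;
	int held;
};

void model_rtnl_lock(struct model_rtnl *l)
{
	l->acquisitions++;
	l->held = 1;
}

void model_rtnl_unlock(struct model_rtnl *l)
{
	l->held = 0;
}

/* Old .exit() style: one lock round trip per dismantled netns. */
void model_exit_per_netns(struct model_rtnl *l, size_t nr_netns)
{
	for (size_t i = 0; i < nr_netns; i++) {
		model_rtnl_lock(l);
		/* per-netns device cleanup would run here */
		model_rtnl_unlock(l);
	}
}

/* .exit_batch() style: one lock round trip for the whole list. */
void model_exit_batch(struct model_rtnl *l, size_t nr_netns)
{
	model_rtnl_lock(l);
	for (size_t i = 0; i < nr_netns; i++) {
		/* per-netns device cleanup would run here */
	}
	model_rtnl_unlock(l);
}
```

For a batch of N namespaces the first variant takes the lock N times and the second takes it once, which is the saving the commit is after.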
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-35058
CVE: CVE-2024-27010
Upstream Status: net.git commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
commit 0f022d32c3eca477fbf79a205243a6123ed0fe11
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Apr 15 18:07:28 2024 -0300
net/sched: Fix mirred deadlock on device recursion
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.
[..... other info removed for brevity....]
[ 82.890906]
[ 82.890906] ============================================
[ 82.890906] WARNING: possible recursive locking detected
[ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W
[ 82.890906] --------------------------------------------
[ 82.890906] ping/418 is trying to acquire lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] but task is already holding lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] other info that might help us debug this:
[ 82.890906] Possible unsafe locking scenario:
[ 82.890906]
[ 82.890906] CPU0
[ 82.890906] ----
[ 82.890906] lock(&sch->q.lock);
[ 82.890906] lock(&sch->q.lock);
[ 82.890906]
[ 82.890906] *** DEADLOCK ***
[ 82.890906]
[..... other info removed for brevity....]
Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
Another example (eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth1
tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.
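The owner check can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel change itself: `model_qdisc`, `model_xmit()` and `NO_OWNER` are hypothetical names, a plain recursive call stands in for the mirred redirect, and no real locking is involved.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for struct Qdisc; only the fields needed here. */
struct model_qdisc {
	int owner;                    /* CPU id inside the root qdisc, or NO_OWNER */
	struct model_qdisc *redirect; /* mirred-style redirect target, or NULL */
	unsigned long drops;          /* packets dropped to break a loop */
};

/* Returns true if the packet was transmitted, false if it was dropped. */
bool model_xmit(struct model_qdisc *q, int cpu)
{
	bool sent = true;

	/* Second entry on the same CPU: drop instead of taking the lock again. */
	if (q->owner == cpu) {
		q->drops++;
		return false;
	}

	q->owner = cpu;          /* models entering the root qdisc */
	if (q->redirect)
		sent = model_xmit(q->redirect, cpu);
	q->owner = NO_OWNER;     /* models leaving the root qdisc */

	return sent;
}
```

With two model qdiscs redirecting to each other (the eth0->eth1->eth0 case), the third entry finds the first qdisc already owned by the same CPU and drops the packet, so the recursion ends instead of deadlocking.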
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4 ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c025 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32098
Conflicts:
- drivers/net/ethernet/mellanox/mlx5/core/dpll.c: chunk omitted due
to missing 496fd0a26bbf73 ("mlx5: Implement SyncE support using DPLL
infrastructure")
Upstream commit(s):
commit 289e922582af5b4721ba02e86bde4d9ba918158a
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Mar 4 17:35:32 2024 -0800
dpll: move all dpll<>netdev helpers to dpll code
Older versions of GCC really want to know the full definition
of the type involved in rcu_assign_pointer().
struct dpll_pin is defined in a local header, net/core can't
reach it. Move all the netdev <> dpll code into dpll, where
the type is known. Otherwise we'd need multiple function calls
to jump between the compilation units.
This is the same problem the commit under fixes was trying to address,
but with rcu_assign_pointer() not rcu_dereference().
Some of the exports are not needed, networking core can't
be a module, we only need exports for the helpers used by
drivers.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32098
Upstream commit(s):
commit 0d60d8df6f493bb46bf5db40d39dd60a1bafdd4e
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 23 12:32:08 2024 +0000
dpll: rely on rcu for netdev_dpll_pin()
This fixes a possible UAF in if_nlmsg_size(),
which can run without RTNL.
Add rcu protection to "struct dpll_pin"
Move netdev_dpll_pin() from netdevice.h to dpll.h to
decrease name pollution.
Note: It looks possible to no longer acquire RTNL in
netdev_dpll_pin_assign() later in net-next.
v2: do not force rcu_read_lock() in rtnl_dpll_pin_size() (Jiri Pirko)
Fixes: 5f1842692880 ("netdev: expose DPLL pin handle for netdevice")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240223123208.3543319-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
Conflicts:
- hunk for lan966x removed as it does not exist in RHEL
- context conflict caused by presence of RH_KABI macros
commit 289354f21b2c3fac93e956efd45f256a88a4d997
Author: Jakub Kicinski <kuba@kernel.org>
Date: Sat Nov 18 18:38:05 2023 -0800
net: partial revert of the "Make timestamping selectable" series
Revert following commits:
commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")
They need more time for reviews.
Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
Conflicts:
- context conflict caused by presence of RH_KABI macros
commit 0f7f463d4821a4f52fa5c0a961389e651d50c384
Author: Kory Maincent <kory.maincent@bootlin.com>
Date: Tue Nov 14 12:28:41 2023 +0100
net: Change the API of PHY default timestamp to MAC
Change the API to select MAC time stamping by default instead of the PHY.
The PHY is closer to the wire, so in theory it adds less delay than MAC
timestamping, but the reality is different. Due to lower time stamping
clock frequency, latency in the MDIO bus and no PHC hardware
synchronization between different PHYs, PHY PTP is often less precise
than the MAC. The exception is PHYs designed specifically for PTP, but
these devices are not very widespread. To avoid breaking compatibility, I
introduce a default_timestamp flag in phy_device that is set by the phy
driver to indicate that the old API behavior is in use.
The phy_set_timestamp function is called at each call of phy_attach_direct.
In case of MAC driver using phylink this function is called when the
interface is turned up. Then if the interface goes down and up again the
last choice of timestamp will be overwritten by the default choice.
A solution could be to cache the timestamp status but it can bring other
issues. In case of SFP, if we change the module, it doesn't make sense to
blindly re-set the timestamp back to PHY, if the new module has a PHY with
mediocre timestamping capabilities.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
commit 70f7457ad6d655e65f1b93cbba2a519e4b11c946
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Jun 12 14:49:43 2023 -0700
net: create device lookup API with reference tracking
New users of dev_get_by_index() and dev_get_by_name() keep
getting added and it would be nice to steer them towards
the APIs with reference tracking.
Add variants of those calls which allocate the reference
tracker and use them in a couple of places.
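The idea of pairing a lookup with a reference tracker can be sketched in user space. Everything below is hypothetical (`model_dev`, `model_tracker`, `model_get_by_index()`, `model_put()`); it only illustrates the pattern of recording who holds a reference and is not the kernel API.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's net_device or its tracker type. */
struct model_dev {
	int ifindex;
	int refcnt;   /* total references held */
	int tracked;  /* references that have a tracker attached */
};

struct model_tracker {
	struct model_dev *owner; /* device this tracker holds, or NULL */
};

/* Look up by ifindex and record who took the reference. */
struct model_dev *model_get_by_index(struct model_dev *devs, size_t n,
				     int ifindex, struct model_tracker *t)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].ifindex == ifindex) {
			devs[i].refcnt++;
			devs[i].tracked++;
			t->owner = &devs[i];
			return &devs[i];
		}
	}
	t->owner = NULL;
	return NULL;
}

/* Release; returns 0 on success, -1 if the tracker does not match. */
int model_put(struct model_dev *dev, struct model_tracker *t)
{
	if (!dev || t->owner != dev)
		return -1;
	dev->refcnt--;
	dev->tracked--;
	t->owner = NULL;
	return 0;
}
```

Because every reference carries a tracker, a double put or a put with the wrong tracker is detected at the call site instead of silently corrupting the count.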
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4000
JIRA: https://issues.redhat.com/browse/RHEL-30145
Depends: !3939
The series updates netlink and devlink core to upstream version v6.8.
Both have to be updated at once due to circular dependencies.
Signed-off-by: Petr Oros <poros@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit b6a3c6066afc2cb7b92f45c67ab0b12ded81cb11
Author: Victor Nogueira <victor@mojatatu.com>
Date: Sat Dec 16 17:44:35 2023 -0300
net: sched: Make tc-related drop reason more flexible for remaining qdiscs
Building on Daniel's patch [1], make tc-related drop reason more
flexible for remaining qdiscs - that is, all qdiscs aside from clsact.
In essence, the drop reason will be set by cls_api and act_api in case
any error occurred in the data path. With that, we can give the user more
detailed information so that they can distinguish between a policy drop
or an error drop.
[1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit fb2780721ca5e9f78bbe4544b819b929a982df9c
Author: Victor Nogueira <victor@mojatatu.com>
Date: Sat Dec 16 17:44:34 2023 -0300
net: sched: Move drop_reason to struct tc_skb_cb
Move drop_reason from struct tcf_result to skb cb - more specifically to
struct tc_skb_cb. With that, we'll be able to also set the drop reason for
the remaining qdiscs (aside from clsact) that do not have access to
tcf_result when time comes to set the skb drop reason.
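Carrying the drop reason in the packet's control buffer can be sketched as follows. This is a user-space model under stated assumptions: `model_skb`, `model_tc_cb`, the `MODEL_REASON_*` values and all function names are hypothetical, and the 48-byte buffer size is chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced set of reasons for illustration only. */
enum model_drop_reason {
	MODEL_REASON_NOT_SPECIFIED = 0,
	MODEL_REASON_TC_POLICY,
	MODEL_REASON_TC_ERROR,
};

/* Hypothetical stand-in for an skb with its control buffer. */
struct model_skb {
	unsigned char cb[48];
};

/* Hypothetical stand-in for the tc view of the control buffer. */
struct model_tc_cb {
	uint32_t drop_reason;
};

_Static_assert(sizeof(struct model_tc_cb) <= sizeof(((struct model_skb *)0)->cb),
	       "tc view must fit in the control buffer");

void model_set_drop_reason(struct model_skb *skb, enum model_drop_reason reason)
{
	struct model_tc_cb cb;

	/* memcpy avoids aliasing the byte buffer through a struct pointer. */
	memcpy(&cb, skb->cb, sizeof(cb));
	cb.drop_reason = (uint32_t)reason;
	memcpy(skb->cb, &cb, sizeof(cb));
}

enum model_drop_reason model_get_drop_reason(const struct model_skb *skb)
{
	struct model_tc_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return (enum model_drop_reason)cb.drop_reason;
}

/* Classifier side: records why the packet is about to be dropped. */
void model_classify(struct model_skb *skb, int err)
{
	model_set_drop_reason(skb, err < 0 ? MODEL_REASON_TC_ERROR
					   : MODEL_REASON_TC_POLICY);
}

/* Qdisc side: has only the packet, no classifier result structure. */
enum model_drop_reason model_qdisc_drop(const struct model_skb *skb)
{
	return model_get_drop_reason(skb);
}
```

The point of the design is that the reason travels with the packet, so code that never sees the classifier's result structure can still report whether a drop was a policy decision or an error.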
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36218
commit 54a59aed395ce0f4177b5212e5746a6462de3ad9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon Oct 9 11:26:54 2023 +0200
net, sched: Make tc-related drop reason more flexible
Currently, the kfree_skb_reason() in sch_handle_{ingress,egress}() can only
express a basic SKB_DROP_REASON_TC_INGRESS or SKB_DROP_REASON_TC_EGRESS reason.
Victor kicked off an initial proposal to make this more flexible by disambiguating
verdict from return code by moving the verdict into struct tcf_result and
letting tcf_classify() return a negative error. If hit, then two new drop
reasons were added in the proposal, that is SKB_DROP_REASON_TC_INGRESS_ERROR
as well as SKB_DROP_REASON_TC_EGRESS_ERROR. Further analysis of the actual
error codes would have required to attach to tcf_classify via kprobe/kretprobe
to more deeply debug skb and the returned error.
In order to make the kfree_skb_reason() in sch_handle_{ingress,egress}() more
extensible, it can be addressed in a more straightforward way, that is: Instead
of placing the verdict into struct tcf_result, we can just put the drop reason
in there, which does not require changes throughout various classful schedulers
given the existing verdict logic can stay as is.
Then, SKB_DROP_REASON_TC_ERROR{,_*} can be added to the enum skb_drop_reason
to disambiguate between an error or an intentional drop. New drop reason error
codes can be added successively to the tc code base.
For internal error locations which have not yet been annotated with a
SKB_DROP_REASON_TC_ERROR{,_*}, the fallback is SKB_DROP_REASON_TC_INGRESS and
SKB_DROP_REASON_TC_EGRESS, respectively. Generic errors could be marked with a
SKB_DROP_REASON_TC_ERROR code until they are converted to more specific ones
if it is found that they would be useful for troubleshooting.
While drop reasons have infrastructure for subsystem specific error codes which
are currently used by mac80211 and ovs, Jakub mentioned that it is preferred
for tc to use the enum skb_drop_reason core codes given it is a better fit and
currently the tooling support is better, too.
With regards to the latter:
[...] I think Alastair (bpftrace) is working on auto-prettifying enums when
bpftrace outputs maps. So we can do something like:
$ bpftrace -e 'tracepoint:skb:kfree_skb { @[args->reason] = count(); }'
Attaching 1 probe...
^C
@[SKB_DROP_REASON_TC_INGRESS]: 2
@[SKB_CONSUMED]: 34
^^^^^^^^^^^^ names!!
Auto-magically. [...]
Add a small helper tcf_set_drop_reason() which can be used to set the drop reason
into the tcf_result.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Link: https://lore.kernel.org/netdev/20231006063233.74345d36@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231009092655.22025-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3966
# Merge Request Required Information
This is the first pass at drm dependencies for backporting 6.8 or 6.9 into RHEL 9.5
Marked as draft as I think there will be a few more patches needed, and maybe some other teams might be in the same area (e.g. kunit).
JIRA: https://issues.redhat.com/browse/RHEL-24101
Signed-off-by: Dave Airlie <airlied@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Merged-by: Patrick Talbert <ptalbert@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30145
Upstream commit(s):
commit 9f30831390ede02d9fcd54fd9ea5a585ab649f4a
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 9 18:12:48 2024 +0000
net: add rcu safety to rtnl_prop_list_size()
rtnl_prop_list_size() can be called while alternative names
are added or removed concurrently.
if_nlmsg_size() / rtnl_calcit() can indeed be called
without RTNL held.
Use explicit RCU protection to avoid UAF.
Fixes: 88f4fb0c74 ("net: rtnetlink: put alternative names to getlink message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240209181248.96637-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Oros <poros@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3939
JIRA: https://issues.redhat.com/browse/RHEL-30656
Tested: LNST
Depends: !3918
The series updates netlink and devlink core to upstream version v6.6. Both have to be updated at once due to circular dependencies.
Omitted-fix: 83f2df9d66bc
The fix needs additional devlink dependencies and will be applied in the next rebase series, covered by RHEL-30145
Commits:
```
6978052448f9 ("netlink: remove unused 'compare' function")
74bf6477c18b ("netlink-specs: add partial specification for devlink")
82b3297009b6 ("netlink: specs: allow uapi-header in genetlink")
56c874f7dbca ("tools: ynl: skip the explicit op array size when not needed")
8da3a5598f75 ("ynl: allow to encode u8 attr")
bc77f7318da8 ("tools: ynl: add the Python requirements.txt file")
dd3a7d58dcc2 ("tools: ynl: Add missing types to encode/decode")
4c6170d1ae2c ("tools: ynl: default to treating enums as flags for mask generation")
bec0b7a2db35 ("tools: ynl: Add struct parsing to nlspec")
b423c3c86325 ("tools: ynl: Add C array attribute decoding to ynl")
2607191395bd ("tools: ynl: Add struct attr decoding to ynl")
f036d936ca57 ("tools: ynl: Add fixed-header support to ynl")
643ef4a676e3 ("netlink: specs: add partial specification for openvswitch")
88e288968412 ("docs: netlink: document struct support for genetlink-legacy")
04eac39361d3 ("docs: netlink: document the sub-type attribute property")
9f7cc57fe550 ("tools: ynl: support byte-order in cli")
a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
48993e22d23a ("tools: ynl: replace print with NlError")
f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
ebe3bdc4359e ("tools: ynl: throw a more meaningful exception if family not supported")
3ea31e66644b ("tools: ynl: Remove absolute paths to yaml files from ethtool testing tool")
85a4abed1554 ("tools: ynl: Rename ethtool to ethtool.py")
d913d32cc270 ("netlink: Use copy_to_user() for optval in netlink_getsockopt().")
a939d14919b7 ("netlink: annotate accesses to nlk->cb_running")
7c2435ef76e5 ("tools: ynl: Use dict of predefined Structs to decode scalar types")
bddd2e561b0a ("tools: ynl: Handle byte-order in struct members")
081e8df68199 ("tools: ynl: avoid dict errors on older Python versions")
9b66ee06e5ca ("net: ynl: prefix uAPI header include with uapi/")
0684f29a89e5 ("netlink: specs: correct types of legacy arrays")
6d6bae63053d ("doc: ynl: Add doc attr to struct members in genetlink-legacy spec")
5ac18889bde0 ("tools: ynl: Initialise fixed headers to 0 in genetlink-legacy")
313a7a808ca8 ("tools: ynl: Support enums in struct members in genetlink-legacy")
93b230b549bc ("netlink: specs: add ynl spec for ovs_flow")
f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
91dfaef243cd ("tools: ynl-gen: add extra headers for user space")
6ad49839ba9b ("tools: ynl-gen: fix unused / pad attribute handling")
67c65ce762ad ("tools: ynl-gen: don't override pure nested struct")
5605f102378f ("tools: ynl-gen: loosen type consistency check for events")
eef9b794eac8 ("tools: ynl-gen: add error checking for nested structs")
21b6e302789c ("tools: ynl-gen: generate enum-to-string helpers")
dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
5d58f911c755 ("tools: ynl-gen: generate alloc and free helpers for req")
8cb6afb33541 ("tools: ynl-gen: switch to family struct")
59d814f0f285 ("tools: ynl-gen: generate static descriptions of notifications")
a99bfdf64795 ("tools: ynl-gen: clean up stray new lines at the end of reply-less requests")
86878f14d71a ("tools: ynl: user space helpers")
d75fdfbc6f26 ("tools: ynl: support fou and netdev in C")
ee0202e2e731 ("tools: ynl: add sample for netdev")
f6ca5baf2a86 ("netlink: specs: ethtool: fix random typos")
2cc9671a82e3 ("tools: ynl-gen: fill in support for MultiAttr scalars")
58da455b31ba ("tools: ynl-gen: improve unwind on parsing errors")
7a11f70ce882 ("tools: ynl: generate code for the handshake family")
8947e5037371 ("netlink: specs: devlink: fill in some details important for C")
9858bfc271de ("tools: ynl-gen: use enum names in op strmap more carefully")
6f115d4575ab ("tools: ynl-gen: refactor strmap helper generation")
ff6db4b58c93 ("tools: ynl-gen: enable code gen for directional specs")
6afaa0ef9b0e ("tools: ynl-gen: try to sort the types more intelligently")
37487f93b125 ("tools: ynl-gen: inherit struct use info")
eae7af21bdb9 ("tools: ynl-gen: walk nested types in depth")
168dea20ecef ("tools: ynl-gen: don't generate forward declarations for policies")
0a9471219672 ("tools: ynl-gen: don't generate forward declarations for policies - regen")
5d1a30eb989a ("tools: ynl: generate code for the devlink family")
fff8660b5425 ("tools: ynl: add sample for devlink")
30b5c720e1a9 ("tools: ynl-gen: cleanup user space header includes")
9b52fd4b6305 ("tools: ynl: regen: cleanup user space header includes")
820343ccbb2e ("tools: ynl-gen: complete the C keyword list")
2c0f1466867c ("tools: ynl-gen: combine else with closing bracket")
e4ea3cc68472 ("tools: ynl-gen: get attr type outside of if()")
7234415b8f86 ("tools: ynl: regen: regenerate the if ladders")
f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
d0915d64c3a6 ("tools: ynl: regen: stop generating common notification handlers")
ced1568862bd ("tools: ynl-gen: sanitize notification tracking")
6da3424fd629 ("tools: ynl-gen: support code gen for events")
6f96ec73cb5a ("tools: ynl-gen: don't pass op_name to RenderInfo")
76abff37f0d7 ("tools: ynl-gen: support / skip pads on the way to kernel")
008bcd6835a2 ("tools: ynl-gen: support excluding tricky ops")
33eedb0071c8 ("tools: ynl-gen: record extra args for regen")
ed2042cc77f1 ("netlink: specs: support setting prefix-name per attribute")
d4813b11d679 ("netlink: specs: ethtool: add C render hints")
dddc9f53da3e ("tools: ynl-gen: don't generate enum types if unnamed")
2c9d47a095f7 ("tools: ynl-gen: resolve enum vs struct name conflicts")
180ad455273a ("netlink: specs: ethtool: add empty enum stringset")
37c852222712 ("netlink: specs: ethtool: untangle UDP tunnels and cable test a bit")
709d0c3b3d4c ("netlink: specs: ethtool: untangle stats-get")
68335713d2ea ("netlink: specs: ethtool: mark pads as pads")
2d7be507d65e ("tools: ynl: generate code for the ethtool family")
f561ff232a6b ("tools: ynl: add sample for ethtool")
10c4d2a7b88d ("tools: ynl-gen: correct enum policies")
be093a80dff0 ("tools: ynl-gen: inherit policy in multi-attr")
fa0e21fa4443 ("rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFO")
89da780aa4c7 ("rtnetlink: move validate_linkmsg out of do_setlink")
f0ec58d557d6 ("tools: ynl: work around stale system headers")
6907217a8054 ("netlink: specs: fixup openvswitch specs for code generation")
8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
0c3d6fd4b89c ("tools: ynl: improve the direct-include header guard logic")
737eab775d36 ("netlink: specs: add display-hint to schema definitions")
d8eea68d913c ("tools: ynl: add display-hint support to ynl")
334f39ce17ef ("netlink: specs: add display hints to ovs_flow")
25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
b8e39b38487e ("netlink: Make use of __assign_bit() API")
633d76ad01ad ("devlink: remove reload failed checks in params get/set callbacks")
4a59cdfd6699 ("rtnetlink: Move nesting cancellation rollback to proper function")
5766946ea511 ("genetlink: add explicit ordering break check for split ops")
a3377386b564 ("netlink: Reverse the patch which removed filtering")
a4c9a56e6a2c ("netlink: Add new netlink_release function")
d7ddf5f4269f ("tools: ynl-gen: fix enum index in _decode_enum(..)")
df15c15e6c98 ("tools: ynl-gen: fix parse multi-attr enum attribute")
5fac9b7c16c5 ("netlink: allow be16 and be32 types in all uint policy checks")
e5c157f081ab ("ynl: expose xdp-zc-max-segs")
37844828d290 ("ynl: mark max/mask as private for kdoc")
25b5a2a1905f ("ynl: regenerate all headers")
26fdb67e8b4a ("ynl: print xdp-zc-max-segs in the sample")
759ab1edb56c ("net: store netdevs in an xarray")
84e00d9bd4e4 ("net: convert some netlink netdev iterators to depend on the xarray")
2628d40899d1 ("devlink: Remove unused extern declaration devlink_port_region_destroy()")
78c96d7b7c9a ("netlink: specs: add dump-strict flag for dont-validate property")
dc7b81a828db ("ynl-gen-c.py: filter rendering of validate field values for split ops")
eab7be688b44 ("ynl-gen-c.py: allow directional model for kernel mode")
fa8ba3502ade ("ynl-gen-c.py: render netlink policies static for split ops")
ba0f66c95fa6 ("devlink: rename devlink_nl_ops to devlink_nl_small_ops")
d61aedcf628e ("devlink: rename couple of doit netlink callbacks to match generated names")
491a24872a64 ("devlink: introduce couple of dumpit callbacks for split ops")
8300dce542e4 ("devlink: un-static devlink_nl_pre/post_doit()")
759f661012d1 ("netlink: specs: devlink: add info-get dump op")
6b7c486cae81 ("devlink: add split ops generated according to spec")
b2551b1517d8 ("devlink: include the generated netlink header")
6e067d0cab68 ("devlink: use generated split ops and remove duplicated commands from small ops")
b876b71a6ac2 ("devlink: Remove unused devlink_dpipe_table_resource_set() declaration")
2c0e9f3806c4 ("tools: ynl-gen: avoid rendering empty validate field")
832140804e3b ("devlink: clear flag on port register error path")
cd3112ebbaf4 ("tools: ynl-gen: add missing empty line between policies")
8fe08d70a2b6 ("netlink: convert nlk->flags to atomic flags")
63618463cb94 ("devlink: parse linecard attr in doit() callbacks")
41a1d4d1399a ("devlink: parse rate attrs in doit() callbacks")
ee6d78ac28c7 ("devlink: introduce devlink_nl_pre_doit_port*() helper functions")
8fa995ad1f7f ("devlink: rename doit callbacks for per-instance dump commands")
24c8e56d4f98 ("devlink: introduce dumpit callbacks for split ops")
7d3c6fec6135 ("devlink: pass flags as an arg of dump_one() callback")
7199c86247e9 ("netlink: specs: devlink: add commands that do per-instance dump")
ddff283280ba ("devlink: remove duplicate temporary netlink callback prototypes")
833e479d330c ("devlink: remove converted commands from small ops")
4a1b5aa8b5c7 ("devlink: allow user to narrow per-instance dumps by passing handle attrs")
34493336e7d3 ("netlink: specs: devlink: extend per-instance dump commands to accept instance attributes")
b03f13cb67a5 ("devlink: extend health reporter dump selector by port index")
0149bca17262 ("netlink: specs: devlink: extend health reporter dump attributes by port index")
84817d8c6042 ("genetlink: push conditional locking into dumpit/done")
fde9bd4a4d41 ("genetlink: make genl_info->nlhdr const")
bffcc6882a1b ("genetlink: remove userhdr from struct genl_info")
9272af109fe6 ("genetlink: add struct genl_info to struct genl_dumpit_info")
7288dd2fd488 ("genetlink: use attrs from struct genl_info")
5c670a010de4 ("genetlink: add a family pointer to struct genl_info")
5aa51d9f889c ("genetlink: add genlmsg_iput() API")
0e19d3108aea ("netdev-genl: use struct genl_info for reply construction")
ec0e5b09b834 ("ethtool: netlink: simplify arguments to ethnl_default_parse()")
f946270d05c2 ("ethtool: netlink: always pass genl_info to .prepare_data")
956db0a13b47 ("net: warn about attempts to register negative ifindex")
ded67d90815a ("netlink: specs: add ovs_vport new command")
7582113c6917 ("tools: ynl: add more info to KeyErrors on missing attrs")
d56b699d76d1 ("Documentation: Fix typos")
f65f305ae008 ("tools: ynl-gen: use temporary file for rendering")
f534f6581ec0 ("net: validate veth and vxcan peer ifindexes")
649bde9004ac ("tools: ynl: allow passing binary data")
a149a3a13bbc ("tools: ynl-gen: set length of binary fields")
dc2ef94d8926 ("tools: ynl-gen: fix collecting global policy attrs")
4c8c24e801e6 ("tools: ynl-gen: support empty attribute lists")
e83d4e9b2d0f ("netlink: specs: fix indent in fou")
a02430c06f56 ("tools: ynl-gen: fix uAPI generation after tempfile changes")
52d08fda3516 ("doc/netlink: Add delete operation to ovs_vport spec")
ed68c58c0eb4 ("doc/netlink: Add a schema for netlink-raw families")
294f37fc8772 ("doc/netlink: Update genetlink-legacy documentation")
2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions")
88901b967958 ("tools/ynl: Add mcast-group schema parsing to ynl")
fb0a06d455d6 ("tools/net/ynl: Fix extack parsing with fixed header genlmsg")
e46dd903efe3 ("tools/net/ynl: Add support for netlink-raw families")
0493e56d021d ("tools/net/ynl: Implement nlattr array-nest decoding in ynl")
1768d8a767f8 ("tools/net/ynl: Add support for create flags")
dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
b2f63d904e72 ("doc/netlink: Add spec for rt link messages")
023289b4f582 ("doc/netlink: Add spec for rt route messages")
56e65312830e ("devlink: push object register/unregister notifications into separate helpers")
eec1e5ea1d71 ("devlink: push port related code into separate file")
2b4d8bb08889 ("devlink: push shared buffer related code into separate file")
2475ed158c47 ("devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper")
a9fd44b15fc5 ("devlink: push dpipe related code into separate file")
a9f960074ecd ("devlink: push resource related code into separate file")
830c41e1e987 ("devlink: push param related code into separate file")
1aa47ca1f52e ("devlink: push region related code into separate file")
85facf94fd80 ("devlink: use tracepoint_enabled() helper")
4bbdec80ff27 ("devlink: push trap related code into separate file")
7cc7194e85ca ("devlink: push rate related code into separate file")
9edbe6f36c5f ("devlink: push linecard related code into separate file")
890c55667437 ("devlink: move tracepoint definitions into core.c")
29a390d17748 ("devlink: move small_ops definition into netlink.c")
71179ac5c211 ("devlink: move devlink_notify_register/unregister() to dev.c")
ee940b57a929 ("doc/netlink: Fix missing classic_netlink doc reference")
d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
0f4d44f6ee04 ("netlink: specs: devlink: fix reply command values")
69844e335d8c ("selftests/bpf: Fix sockopt_sk selftest")
e4fe082c38cd ("tools: ynl: make sure we always pass yarg to mnl_cb_run")
5d78b73e8514 ("tools: ynl: don't leak mcast_groups on init error")
b6c65eb20ffa ("tools: ynl: fix handling of multiple mcast groups")
ceaac91dcd06 ("net: make sure we never create ifindex = 0")
0e0939c0adf9 ("net-procfs: use xarray iterator to implement /proc/net/dev")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3918
JIRA: https://issues.redhat.com/browse/RHEL-30344
Tested: LNST
Commits:
```
cfdf0d9ae75b ("rtnetlink: use nlmsg_notify() in rtnetlink_send()")
fef773fc8110 ("netlink: Deal with ESRCH error in nlmsg_notify()")
f9b282b36dfa ("net: netlink: add the case when nlh is NULL")
bc830525615d ("net: netlink: Remove unused function")
d3432bf10f17 ("net: Support filtering interfaces on no master")
4fc29989835a ("net: rtnetlink: convert rcu_assign_pointer to RCU_INIT_POINTER")
7707a4d01a64 ("netlink: annotate data races around nlk->bound")
549017aa1bb7 ("netlink: remove netlink_broadcast_filtered")
50af5969bb22 ("net/core: Remove unused assignment operations and variable")
efd38f75bb04 ("net: rtnetlink: use __dev_addr_set()")
f123cffdd8fe ("net: netlink: af_netlink: Prevent empty skb by adding a check on len.")
d59a67f2f3f3 ("netlink: remove nl_set_extack_cookie_u32()")
ede6c39c4f90 ("net: make net->dev_unreg_count atomic")
7b8135f4df98 ("rtnetlink: add new rtm tunnel api for tunnel id filtering")
5d26cff5bdbe ("net: account alternate interface name memory")
155fb43b70b5 ("net: limit altnames to 64k total")
0caf6d992219 ("af_netlink: Fix shift out of bounds in group mask calculation")
0b5c21bbc01e ("net: ensure net_todo_list is processed quickly")
ef2a7c9065ce ("rtnetlink: return ENODEV when ifname does not exist and group is given")
5ea08b5286f6 ("rtnetlink: enable alt_ifname for setlink/newlink")
dee04163e9f2 ("rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink")
b6177d3240a4 ("rtnetlink: return EINVAL when request cannot succeed")
99c07327ae11 ("netlink: reset network and mac headers in netlink_dump()")
6f37c9f9dfbf ("Revert "rtnetlink: return EINVAL when request cannot succeed"")
c92bf26ccebc ("rtnl: allocate more attr tables on the heap")
63105e83987a ("rtnl: split __rtnl_newlink() into two functions")
02839cc8d72b ("rtnl: move rtnl_newlink_create()")
d5076fe4049c ("netlink: do not reset transport header in netlink_recvmsg()")
f329a0ebeaba ("genetlink: correct uAPI defines")
5c221f0af68c ("net: add missing kdoc for struct genl_multicast_group::flags")
30b6055428a9 ("net: improve and fix netlink kdoc")
0bf73255d3a3 ("netlink: fix some kernel-doc comments")
8f1948bdcf2f ("genetlink: hold read cb_lock during iteration of genl_fam_idr in genl_bind()")
abbc79280abc ("net: rtnetlink: use netif_oper_up instead of open code")
710d21fdff9a ("netlink: Bounds-check struct nlmsgerr creation")
08724ef69907 ("netlink: introduce NLA_POLICY_MAX_BE")
e7af210e6dd0 ("netfilter: nft_payload: reject out-of-range attributes via policy")
a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up")
5493a2ad0d20 ("docs: netlink: clarify the historical baggage of Netlink flags")
7354c9024f28 ("netlink: hide validation union fields from kdoc")
738136a0e375 ("netlink: split up copies in the ack construction")
1d997f101307 ("rtnetlink: pass netlink message header and portid to rtnl_configure_link()")
77f4aa9a2a17 ("net: add new helper unregister_netdevice_many_notify")
d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
ecaf75ffd5f5 ("netlink: introduce bigendian integer types")
e69761483361 ("netlink: Fix potential skb memleak in netlink_ack")
8e18be7610ae ("lib: Fix some kernel-doc comments")
8032bf1233a7 ("treewide: use get_random_u32_below() instead of deprecated function")
c73a72f4cbb4 ("netlink: remove the flex array from struct nlmsghdr")
f0950402e8c7 ("netlink: prevent potential spectre v1 gadgets")
c1bb9484e3b0 ("netlink: annotate data races around nlk->portid")
004db64d185a ("netlink: annotate data races around dst_portid and dst_group")
9b663b5cbb15 ("netlink: annotate data races around sk_state")
9d6a65079c98 ("docs: add more netlink docs (incl. spec docs)")
e616c07ca518 ("netlink: add schemas for YAML specs")
be5bea1cc0bf ("net: add basic C code generators for Netlink")
4eb77b4ecd3c ("netlink: add a proto specification for FOU")
3a330496baa8 ("net: fou: regenerate the uAPI from the spec")
08d323234d10 ("net: fou: rename the source for linking")
1d562c32e439 ("net: fou: use policy and operation tables generated from the spec")
e4b48ed460d3 ("tools: ynl: add a completely generic client")
66fa34b9c2a5 ("tools: ynl: support kdocs for flags in code generation")
b49c34e217c6 ("tools: ynl: rename ops_list -> msg_list")
3a43ded081f8 ("tools: ynl: store ops in ordered dict to avoid random ordering")
70eb3911d80f ("net: netlink: recommend policy range validation")
eaf317e7d2bb ("tools: ynl-gen: prevent do / dump reordering")
4e4480e89c47 ("tools: ynl: move the cli and netlink code around")
3aacf8281336 ("tools: ynl: add an object hierarchy to represent parsed spec")
30a5c6c8104f ("tools: ynl: use the common YAML loading and validation code")
19b64b48a33e ("tools: ynl: add support for types needed by ethtool")
fd0616d34274 ("tools: ynl: support directional enum-model in CLI")
90256f3f8093 ("tools: ynl: support multi-attr")
4cd2796f3f8d ("tools: ynl: support pretty printing bad attribute names")
8dfec0a88868 ("tools: ynl: use operation names from spec on the CLI")
5c6674f6eb52 ("tools: ynl: load jsonschema on demand")
8403bf044530 ("netlink: specs: finish up operation enum-models")
01e47a372268 ("docs: netlink: add a starting guide for working with specs")
981cbcb030d9 ("tools: net: use python3 explicitly")
f1db99c07b4f ("string_helpers: Move string_is_valid() to the header")
d4545bf9c33b ("genetlink: Use string_is_terminated() helper")
f7cf644796fc ("tools: ynl-gen: fix single attribute structs with attr 0 only")
b9d3a3e4ae0c ("tools: ynl-gen: re-raise the exception instead of printing")
d77e7eceeac9 ("tools: net: add __pycache__ to gitignore")
7cf93538e087 ("tools: ynl: fully inherit attrs in subsets")
ad4fafcde5bc ("tools: ynl: use 1 as the default for first entry in attrs/ops")
bcec7171eba9 ("netlink: specs: update for codegen enumerating from 1")
37d9df224d1e ("ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause")
6517a60b0307 ("tools: ynl: move the enum classes to shared code")
c311aaa74ca1 ("tools: ynl: fix enum-as-flags in the generic CLI")
8f76a4f80fba ("tools: ynl: fix render-max for flags definition")
bf51d27704c9 ("tools: ynl: fix get_mask utility routine")
054abb515f34 ("tools: ynl: make definitions optional again")
4e16b6a748df ("ynl: broaden the license even more")
cfab77c0b545 ("ynl: make the tooling check the license")
758d29fb3a8b ("tools: ynl: Fix genlmsg header encoding formats")
a1865f2e7d10 ("netlink: annotate lockless accesses to nlk->max_recvmsg_len")
59d3efd27c11 ("rtnetlink: Restore RTM_NEW/DELLINK notification behavior")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3968
JIRA: https://issues.redhat.com/browse/RHEL-28590
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3841
Tested: bpf tc selftests pass, manual tests that the tcx hooks work as
expected.
Add the new tcx hook for bpf. It attaches at a similar place as the tc
hook but has several advantages: it is based on the new multi prog
infrastructure in the kernel to allow adding multiple bpf programs at
the same hook; it follows the link semantics most other bpf hooks use
which gives applications better control over the lifecycle of the bpf
program; and tcx does not require a qdisc making the setup simpler.
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-24101
Upstream Status: v6.5-rc1
This doesn't backport the namespace chunk that isn't
in RHEL yet.
Conflicts:
net/core/net_namespace.c
commit b6d7c0eb2dcbd238fa233a3a1737654e380e784a
Author: Andrzej Hajda <andrzej.hajda@intel.com>
AuthorDate: Fri Jun 2 12:21:34 2023 +0200
Commit: Jakub Kicinski <kuba@kernel.org>
CommitDate: Mon Jun 5 15:28:42 2023 -0700
In case the library is tracking a busy subsystem, simply
printing the stack for every active reference will spam the log
with long, hard to read, redundant stack traces. To improve
readability, the following changes have been made:
- reports are printed per stack_handle - log is more compact,
- added display name for ref_tracker_dir - it will differentiate
multiple subsystems,
- stack trace is printed indented, in the same printk call,
- info about dropped references is printed as well.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
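The per-stack grouping described in the commit above can be modeled in userspace. This is a minimal Python sketch, not the kernel implementation: `summarize_refs`, the integer stack handles, and the report format are all hypothetical and chosen only to show one line per distinct stack instead of one trace per reference.

```python
from collections import Counter


def summarize_refs(dir_name, stack_handles, dropped=0):
    """Return one report line per distinct stack handle (illustrative format)."""
    counts = Counter(stack_handles)
    total = len(stack_handles)
    lines = []
    # most_common() sorts by count; ties keep first-seen order
    for handle, count in counts.most_common():
        lines.append(f"{dir_name}: {count}/{total} users at stack {handle:#x}")
    if dropped:
        # the display name prefix lets several trackers share one log
        lines.append(f"{dir_name}: {dropped} references dropped")
    return lines


if __name__ == "__main__":
    for line in summarize_refs("example-dir", [0x10, 0x10, 0x20, 0x10], dropped=1):
        print(line)
```

Four references taken from two call sites produce two lines plus the dropped-reference line, rather than four full stack traces.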
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit ceaac91dcd065db781d1ed5dfaef0686b8ec44dc
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Jul 31 10:11:58 2023 -0700
net: make sure we never create ifindex = 0
Instead of allocating from 1 use proper xa_init flag,
to protect ourselves from IDs wrapping back to 0.
Fixes: 759ab1edb56c ("net: store netdevs in an xarray")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
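Why the lower bound has to belong to the allocator itself can be illustrated with a toy model. This is a sketch under assumptions, not xarray behaviour: `CyclicIdAllocator`, its wrap logic, and `wrap_demo` are hypothetical names. Both modes start handing out IDs at 1, but only the mode that reserves 0 stays above 0 after the ID space wraps.

```python
class CyclicIdAllocator:
    """Toy cyclic ID allocator over [0, limit]; not the kernel xarray."""

    def __init__(self, limit, reserve_zero):
        self.limit = limit
        # lowest ID the allocator may ever return, including after a wrap
        self.min_id = 1 if reserve_zero else 0
        self.next_id = 1  # the first allocation starts at 1 in both modes
        self.used = set()

    def alloc(self):
        # scan upward from next_id, then wrap to the lowest permitted ID
        order = list(range(self.next_id, self.limit + 1))
        order += list(range(self.min_id, self.next_id))
        for candidate in order:
            if candidate not in self.used:
                self.used.add(candidate)
                if candidate < self.limit:
                    self.next_id = candidate + 1
                else:
                    self.next_id = self.min_id
                return candidate
        raise RuntimeError("no free id")

    def free(self, value):
        self.used.discard(value)


def wrap_demo(reserve_zero):
    """Exhaust IDs 1..3, free one, and return what the next allocation yields."""
    ids = CyclicIdAllocator(limit=3, reserve_zero=reserve_zero)
    for _ in range(3):
        ids.alloc()
    ids.free(1)
    return ids.alloc()
```

In this model `wrap_demo(False)` returns 0 after the wrap, while `wrap_demo(True)` returns 1.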
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit 956db0a13b47df7f3d6d624394e602e8bf9b057e
Author: Jakub Kicinski <kuba@kernel.org>
Date: Mon Aug 14 13:56:25 2023 -0700
net: warn about attempts to register negative ifindex
Since the xarray changes we mix returning valid ifindex and negative
errno in a single int returned from dev_index_reserve(). This depends
on the fact that ifindexes can't be negative. Otherwise we may insert
into the xarray and return a very large negative value. This in turn
may break ERR_PTR().
OvS is susceptible to this problem and lacking validation (fix posted
separately for net).
Reject negative ifindex explicitly. Add a warning because the input
validation is better handled by the caller.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
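The ambiguity the commit above guards against can be shown with a small userspace model. This is a sketch only: `reserve_index_model` is a hypothetical stand-in, not dev_index_reserve(), and the choice of error codes is an assumption. One integer carries either a valid index or a negative errno, so a negative requested index has to be rejected before it is stored.

```python
import errno


def reserve_index_model(used, ifindex):
    """Return the reserved index (> 0) or a negative errno in a single int."""
    if ifindex < 0:
        # Storing and returning a negative index would look like an error code.
        return -errno.EINVAL
    if ifindex == 0:
        # 0 means "pick one for me" in this model
        ifindex = max(used, default=0) + 1
    elif ifindex in used:
        return -errno.EBUSY
    used.add(ifindex)
    return ifindex


if __name__ == "__main__":
    used = set()
    print(reserve_index_model(used, 5))   # a valid index comes back unchanged
    print(reserve_index_model(used, -7))  # rejected; nothing is stored
```

A caller can then treat any negative return value as an error without risk of confusing it with a stored index.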
JIRA: https://issues.redhat.com/browse/RHEL-30656
commit 759ab1edb56c88906830fd6b2e7b12514dd32758
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Jul 26 11:55:29 2023 -0700
net: store netdevs in an xarray
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.
Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..
To illustrate let's look at an example:
System state:
start: # [A, B, C]
del: B # [A, C]
With the hash table we may dump [A, B], missing C completely even
though it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:
System state:
start: # [A, B]
add: C # [A, C, B]
del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).
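The first example above can be replayed with a small userspace model. This is a sketch, not kernel code: the helper names are hypothetical, all three devices are assumed to share one hash bucket, and the chunk size is 2. The hash-style dump resumes by skipping a count of already-dumped entries, while the ordered dump resumes from the last index it returned.

```python
def dump_hash_chunk(bucket, skip, chunk):
    """Resume inside one hash bucket by skipping `skip` already-dumped entries."""
    out = bucket[skip:skip + chunk]
    return out, skip + len(out)


def dump_ordered_chunk(devs_by_index, last_index, chunk):
    """Resume by index: return up to `chunk` devices with index > last_index."""
    idxs = sorted(i for i in devs_by_index if i > last_index)[:chunk]
    out = [devs_by_index[i] for i in idxs]
    return out, (idxs[-1] if idxs else last_index)


def run_hash():
    bucket = ["A", "B", "C"]
    first, pos = dump_hash_chunk(bucket, 0, 2)   # dumps A and B
    bucket.remove("B")                           # "del B" between chunks
    second, _ = dump_hash_chunk(bucket, pos, 2)  # skips two entries: C is lost
    return first + second


def run_ordered():
    devs = {1: "A", 2: "B", 3: "C"}
    first, last = dump_ordered_chunk(devs, 0, 2)    # dumps indexes 1 and 2
    del devs[2]                                     # "del B" between chunks
    second, _ = dump_ordered_chunk(devs, last, 2)   # continues after index 2
    return first + second
```

In this model `run_hash()` yields `["A", "B"]` and misses C, while `run_ordered()` yields `["A", "B", "C"]`.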
To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (not counting loopback
to number of devs):
 #devs |  hash |   xa | delta
     2 |  18.3 | 20.1 | + 9.8%
    16 |  18.3 | 20.1 | + 9.5%
    64 |  18.3 | 26.3 | +43.8%
   128 |  20.4 | 26.3 | +28.6%
   256 |  20.0 | 26.4 | +32.1%
  1024 |  26.6 | 26.7 | + 0.2%
  8192 | 541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order, if the pattern is more
random xa starts to win at 512 devices already. But that's a lot
of devices, in practice.
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
Conflicts:
* net/core/dev.c
context conflict due to missing commit 2b0cfa6e49566 ("net: add
generic percpu page_pool allocator")
* net/core/sysctl_net_core.c
context conflict due to missing commit 2658b5a8a4eee ("net: introduce
struct net_hotdata")
commit 490a79faf95e705ba0ffd9ebf04a624b379e53c9
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Mar 6 16:00:30 2024 +0000
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
Conflicts:
* include/linux/netdevice.h
Adjusted due to KABI reservations made by RHEL
commit 3b3a52715a ("net: exclude BPF/XDP from kABI")
commit 49e47a5b6145d86c30022fe0e949bbb24bae28ba
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Aug 2 18:02:29 2023 -0700
net: move struct netdev_rx_queue out of netdevice.h
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.
In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.
We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-31916
commit 5c3b74a92aa285a3df722bf6329ba7ccf70346d6
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Jun 6 07:41:15 2023 +0000
rfs: annotate lockless accesses to RFS sock flow table
Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.
This also prevents a (smart?) compiler from removing the condition in:
    if (table->ents[index] != newval)
        table->ents[index] = newval;
We need the condition to avoid dirtying a shared cache line.
Fixes: fec5e652e5 ("rfs: Receive Flow Steering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
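The reason for keeping the condition can be illustrated by counting stores. This is a userspace sketch only: Python cannot express READ_ONCE()/WRITE_ONCE() or cache-line effects, and `FlowTableModel` is a hypothetical class. It models just the check-before-write part, where an unchanged value causes no store at all.

```python
class FlowTableModel:
    """Toy table that counts how often an entry is actually written."""

    def __init__(self, size):
        self.ents = [0] * size
        self.stores = 0

    def record(self, index, newval):
        # Write only when the value changes, so repeated identical updates
        # leave the shared entry untouched.
        if self.ents[index] != newval:
            self.ents[index] = newval
            self.stores += 1


if __name__ == "__main__":
    table = FlowTableModel(4)
    for _ in range(1000):
        table.record(1, 7)
    print(table.stores)
```

A thousand identical updates result in a single store in this model. In the kernel, each avoided store is one less chance to dirty the shared cache line.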
JIRA: https://issues.redhat.com/browse/RHEL-28590
commit 28d18b673ffa2d13112ddb6e4c32c60d9b0cda50
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri Aug 25 15:49:45 2023 +0200
net: Fix skb consume leak in sch_handle_egress
Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
[...]
unreferenced object 0xffff88818bcb4f00 (size 232):
comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff ..pa.....A1.....
backtrace:
[<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
[<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
[<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
[<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
[<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
[<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
[<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
[<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
[<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
[<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
[<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
[<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
[<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
[<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
[<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
[<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
[...]
I was able to reproduce this via:
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
ping 1.1.1.1
<stolen>
After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-28590
Conflicts:
- MAINTAINERS: The file has been restructured upstream, but this is not
relevant for us. All paths are already covered.
- include/linux/netdevice.h: We have excluded TC from kABI with
845ad79d11 ("net: exclude TC from kABI"). Keep this exclusion.
- include/linux/skbuff.h: The order of the fields has been changed upstream
in c0ba861117c3 ("net: skbuff: move the fields BPF cares about directly
next to the offset marker"). The actual change is just changing config
options. Do this instead of picking the field reordering to make
backporting easier.
- include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h: The changes
to these files were already backported through 1d5bff6a09 ("bpf: Add
fd-based tcx multi-prog infra with link support") to keep UAPI close to
upstream.
- kernel/bpf/syscall.c: Already backported 58ff9f1ec9 ("bpf: Add
attach_type checks under bpf_prog_attach_check_attach_type") moves one
switch block around. The case BPF_PROG_TYPE_SCHED_CLS was added during
that backport, therefore this hunk is missing now. This also causes
context differences.
- kernel/bpf/syscall.c: Already backported 81b5cf0a11 ("bpf: Fix
BPF_PROG_QUERY last field check") fixed the QUERY_LAST_FIELD.
commit e420bed025071a623d2720a92bc2245c84757ecb
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed Jul 19 16:08:52 2023 +0200
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve the mentioned issue of a safe ownership model, as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
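The four replacement rules in the paragraph above reduce to a small truth table. This is a sketch with a hypothetical helper name, `may_replace`. It encodes only what the paragraph states and is not taken from the kernel sources.

```python
def may_replace(new_is_link, existing_is_link):
    """Return True only when a non-link attachment replaces another non-link one."""
    return (not new_is_link) and (not existing_is_link)


if __name__ == "__main__":
    for new_is_link in (True, False):
        for existing_is_link in (True, False):
            print(new_is_link, existing_is_link,
                  may_replace(new_is_link, existing_is_link))
```

An application that is not link aware can therefore only ever replace other non-link attachments.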
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differentiate between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
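The return-code semantics described above can be sketched as a loop over the program array. This is a userspace model under assumptions: `Verdict` and `run_programs` are hypothetical names, and the enum members carry symbolic values rather than the kernel's numeric codes. When every program returns NEXT, the model hands NEXT back to the caller, which is assumed to continue with classic tc processing.

```python
from enum import Enum


class Verdict(Enum):
    NEXT = "next"          # non-terminating: go to the next program
    PASS = "pass"          # terminating
    DROP = "drop"          # terminating
    REDIRECT = "redirect"  # terminating


def run_programs(programs, packet):
    """Run programs in order and stop at the first terminating verdict."""
    verdict = Verdict.NEXT
    for prog in programs:
        verdict = prog(packet)
        if verdict is not Verdict.NEXT:
            break
    return verdict


if __name__ == "__main__":
    progs = [lambda pkt: Verdict.NEXT, lambda pkt: Verdict.DROP]
    print(run_programs(progs, b"payload"))
```

Programs placed after the first terminating verdict are never invoked in this model.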
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 59d3efd27c11c59b32291e5ebc307bed2edb65ee
Author: Martin Willi <martin@strongswan.org>
Date: Tue Apr 11 09:43:19 2023 +0200
rtnetlink: Restore RTM_NEW/DELLINK notification behavior
The commits referenced below allow userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.
For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.
To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.
Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 77f4aa9a2a1766a0b9343fd812b71f18d05178da
Author: Hangbin Liu <liuhangbin@gmail.com>
Date: Fri Oct 28 04:42:22 2022 -0400
net: add new helper unregister_netdevice_many_notify
Add a new helper, unregister_netdevice_many_notify(), which takes the
netlink message header and portid so that userspace can be notified when
the NLM_F_ECHO flag is set.
Make unregister_netdevice_many() a wrapper around the new function
unregister_netdevice_many_notify().
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit 1d997f1013079c05b642c739901e3584a3ae558d
Author: Hangbin Liu <liuhangbin@gmail.com>
Date: Fri Oct 28 04:42:21 2022 -0400
rtnetlink: pass netlink message header and portid to rtnl_configure_link()
This patch passes the netlink message header and portid to
rtnl_configure_link(). All the functions in this call chain need the new
parameters so that the last call, rtnl_notify(), can use them to notify
userspace about the new link info if the NLM_F_ECHO flag is set.
- rtnl_configure_link()
- __dev_notify_flags()
- rtmsg_ifinfo()
- rtmsg_ifinfo_event()
- rtmsg_ifinfo_build_skb()
- rtmsg_ifinfo_send()
- rtnl_notify()
Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub
suggested.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
Conflicts:
- we have already backported 6264f58ca0e54 ("net: extract a few
internals from netdevice.h") so the net_todo_list has to be placed in
net/core/dev.h instead of include/linux/netdevice.h
commit 0b5c21bbc01e92745ca1ca4f6fd87d878fa3ea5e
Author: Johannes Berg <johannes.berg@intel.com>
Date: Mon Apr 4 11:38:47 2022 +0200
net: ensure net_todo_list is processed quickly
In [1], Will raised a potential issue that the cfg80211 code,
which does (from a locking perspective)
rtnl_lock()
wiphy_lock()
rtnl_unlock()
might be susceptible to ABBA deadlocks, because rtnl_unlock()
calls netdev_run_todo(), which might end up calling rtnl_lock()
again, which could then deadlock (see the comment in the code
added here for the scenario).
Some back and forth and thinking ensued, but clearly this can't
happen if the net_todo_list is empty at the rtnl_unlock() here.
Clearly, the code here cannot actually put an entry on it, and
all other users of rtnl_unlock() will empty it since that will
always go through netdev_run_todo(), emptying the list.
So the only other way to get there would be to add to the list
and then unlock the RTNL without going through rtnl_unlock(),
which is only possible through __rtnl_unlock(). However, this
isn't exported and not used in many places, and none of them
seem to be able to unregister before using it.
Therefore, add a WARN_ON() in the code to ensure this invariant
won't be broken, so that the cfg80211 (or any similar) code
stays safe.
[1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-30344
commit ede6c39c4f9068cbeb4036448c45fff5393e0432
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Feb 9 18:59:32 2022 -0800
net: make net->dev_unreg_count atomic
Having to acquire rtnl from netdev_run_todo() for every dismantled
device is not desirable when/if rtnl is under stress.
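As a rough userspace illustration of the idea, the counter can be kept
as an atomic so that finishing a device teardown does not need a big
lock. The sketch uses C11 atomics rather than the kernel atomic_t API,
and the names (sketch_net, sketch_unreg_begin, sketch_unreg_end) are
hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-namespace state. */
struct sketch_net {
	atomic_int dev_unreg_count; /* devices currently being dismantled */
};

/* Called when a device starts unregistering. */
static void sketch_unreg_begin(struct sketch_net *net)
{
	atomic_fetch_add(&net->dev_unreg_count, 1);
}

/* Called when a device is fully dismantled. Returns the number of
 * devices still unregistering; 0 means a waiter could be woken up.
 * No lock is taken around the decrement. */
static int sketch_unreg_end(struct sketch_net *net)
{
	return atomic_fetch_sub(&net->dev_unreg_count, 1) - 1;
}
```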
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3584
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: mostly RHEL-only patches
This series adds reserved fields to networking structs, and excludes
some areas of networking from the kABI guarantee. These reserved
fields are only needed during backports to z-stream.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21447
Tested: LNST, Tier1
Upstream commit:
commit 24ab059d2ebd62fdccc43794796f6ffbabe49ebc
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Dec 19 12:53:31 2023 +0000
net: check dev->gso_max_size in gso_features_check()
Some drivers might misbehave if TSO packets get too big.
GVE for instance uses a 16bit field in its TX descriptor,
and will do bad things if a packet is bigger than 2^16 bytes.
Linux TCP stack honors dev->gso_max_size, but there are
other ways for too big packets to reach an ndo_start_xmit()
handler : virtio_net, af_packet, GRO...
Add a generic check in gso_features_check() and fall back
to GSO when needed.
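A minimal sketch of such a check, written as standalone C: the types,
the feature bit values and the exact comparison (packet length against
gso_max_size) are assumptions made for illustration and are not copied
from the patch.

```c
#include <stdint.h>

/* Hypothetical feature bits; the kernel uses netdev_features_t. */
#define SKETCH_F_SG       0x0001u
#define SKETCH_F_GSO_MASK 0x00f0u

struct sketch_dev {
	uint32_t gso_max_size; /* largest GSO packet the device accepts */
};

struct sketch_pkt {
	uint32_t len; /* total length of the (possibly aggregated) packet */
};

/* Clear every hardware segmentation feature when the packet is too big
 * for the device, so the stack segments it in software instead. */
static uint32_t sketch_gso_features_check(const struct sketch_pkt *pkt,
					  const struct sketch_dev *dev,
					  uint32_t features)
{
	if (pkt->len >= dev->gso_max_size)
		return features & ~SKETCH_F_GSO_MASK;
	return features;
}
```

Because the check sits in the generic path, it applies no matter how
the oversized packet was built (virtio_net, af_packet, GRO and so on).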
gso_max_size was added in the blamed commit.
Fixes: 82cc1a7a56 ("[NET]: Add per-connection option to set max TSO frame size")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231219125331.4127498-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21356
Upstream Status: RHEL-only
rtnl_link_stats and rtnl_link_stats64 are protected by kABI, add 4
reserved fields. We need to use a custom mechanism here, because those
structures are part of uapi.
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3460
JIRA: https://issues.redhat.com/browse/RHEL-18147
Tested: Just built... No way to test the new interface as no driver was converted yet.
Commits:
```
00d521b39307 ("net: don't abuse "default" case for unknown ioctl in dev_ifsioc()")
1193db2a55b6 ("net: simplify handling of dsa_ndo_eth_ioctl() return code")
4ee58e1e5680 ("net: promote SIOCSHWTSTAMP and SIOCGHWTSTAMP ioctls to dedicated handlers")
d5d5fd8f2552 ("net: move copy_from_user() out of net_hwtstamp_validate()")
c4bffeaa8d50 ("net: add struct kernel_hwtstamp_config and make net_hwtstamp_validate() use it")
88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master")
5a17818682cf ("net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub")
66f7223039c0 ("net: add NDOs for configuring hardware timestamping")
e47d01fea663 ("net: add hwtstamping helpers for stackable net devices")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-18147
Conflicts:
- DSA stuff removed except dsa_stubs.h that provides inline function
dsa_master_hwtstamp_validate()
commit 5a17818682cf43ad0fdd6035945f3b7a8c9dc5e9
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Thu Apr 6 14:42:46 2023 +0300
net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a
netdev notifier for DSA to reject PTP on DSA master"), due to a desire
to convert DSA's attempt to deny TX timestamping on a DSA master to
something that doesn't block the kernel-wide API conversion from
ndo_eth_ioctl() to ndo_hwtstamp_set().
What was required was a mechanism that did not depend on ndo_eth_ioctl(),
and what was provided was a mechanism that did not depend on
ndo_eth_ioctl(), while at the same time introducing something that
wasn't absolutely necessary - a new netdev notifier.
There have been objections from Jakub Kicinski that using notifiers in
general when they are not absolutely necessary creates complications to
the control flow and difficulties to maintainers who look at the code.
So there is a desire to not use notifiers.
In addition to that, the notifier chain gets called even if there is no
DSA in the system and no one is interested in applying any restriction.
Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
through which net/core/dev_ioctl.c can call into DSA even when
CONFIG_NET_DSA=m.
Compared to the code that existed prior to the notifier conversion, aka
what was added in commits:
- 4cfab35667 ("net: dsa: Add wrappers for overloaded ndo_ops")
- 3369afba1e ("net: Call into DSA netdevice_ops wrappers")
this is different because we are not overloading any struct
net_device_ops of the DSA master anymore, but rather, we are exposing a
rather specific functionality which is orthogonal to which API is used
to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().
Also, what is similar is that both approaches use function pointers to
get from built-in code to DSA.
There is no point in replicating the function pointers towards
__dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
Instead, it is sufficient to introduce a singleton struct dsa_stubs,
built into the kernel, which contains a single function pointer to
__dsa_master_hwtstamp_validate().
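The shape of such a stub mechanism can be sketched in plain C as
follows. Everything here is a simplified assumption: the names are
hypothetical, the "module" is just a pair of load/unload functions, and
the locking that real code needs around the pointer is left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct sketch_dev {
	int is_dsa_master;
};

/* One table of function pointers shared by the whole system. */
struct sketch_stubs {
	int (*master_hwtstamp_validate)(const struct sketch_dev *dev);
};

/* Built-in side: singleton pointer, NULL while the module is absent. */
static const struct sketch_stubs *sketch_stubs_ptr;

/* Built-in caller: a no-op unless the module registered its stubs. */
static int sketch_hwtstamp_validate(const struct sketch_dev *dev)
{
	if (sketch_stubs_ptr == NULL)
		return 0;
	return sketch_stubs_ptr->master_hwtstamp_validate(dev);
}

/* Module side: the actual policy, reached only through the stub. */
static int sketch_module_validate(const struct sketch_dev *dev)
{
	return dev->is_dsa_master ? -EOPNOTSUPP : 0;
}

static const struct sketch_stubs sketch_module_stubs = {
	.master_hwtstamp_validate = sketch_module_validate,
};

static void sketch_module_load(void)
{
	sketch_stubs_ptr = &sketch_module_stubs;
}

static void sketch_module_unload(void)
{
	sketch_stubs_ptr = NULL;
}
```

With this layout the built-in code never touches per-port data; it only
needs the single pointer, which is what makes the approach independent
of dev->dsa_ptr.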
I find this approach preferable to what we had originally, because
dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
attempts to add any data encapsulation and hide DSA data structures from
the outside world.
Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-18147
Conflicts:
- Omitted DSA changes as they are not applicable. Note that DSA is disabled
in RHEL.
commit 88c0a6b503b7f4fffb68a8d49c3987870c5b1d6b
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Sun Apr 2 15:37:55 2023 +0300
net: create a netdev notifier for DSA to reject PTP on DSA master
The fact that PTP 2-step TX timestamping is broken on DSA switches if
the master also timestamps the same packets is documented by commit
f685e609a3 ("net: dsa: Deny PTP on master if switch supports it").
We attempt to help the users avoid shooting themselves in the foot by
making DSA reject the timestamping ioctls on an interface that is a DSA
master, and the switch tree beneath it contains switches which are aware
of PTP.
The only problem is that there isn't an established way of intercepting
ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
stack by creating a struct dsa_netdevice_ops with overlaid function
pointers that are manually checked from the relevant call sites. There
used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
one left.
There is an ongoing effort to migrate driver-visible hardware timestamping
control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
model, but DSA actively prevents that migration, since dsa_master_ioctl()
is currently coded to manually call the master's legacy ndo_eth_ioctl(),
and so, whenever a network device driver would be converted to the new
API, DSA's restrictions would be circumvented, because any device could
be used as a DSA master.
The established way for unrelated modules to react on a net device event
is via netdevice notifiers. So we create a new notifier which gets
called whenever there is an attempt to change hardware timestamping
settings on a device.
Finally, there is another reason why a netdev notifier will be a good
idea, besides strictly DSA, and this has to do with PHY timestamping.
With ndo_eth_ioctl(), all MAC drivers must manually call
phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
otherwise they must pass this ioctl to the PHY driver via
phy_mii_ioctl().
With the new ndo_hwtstamp_set() API, it will be desirable to simply not
make any calls into the MAC device driver when timestamping should be
performed at the PHY level.
But there exist drivers, such as the lan966x switch, which need to
install packet traps for PTP regardless of whether they are the layer
that provides the hardware timestamps, or the PHY is. That would be
impossible to support with the new API.
The proposal there, too, is to introduce a netdev notifier which acts as
a better cue for switching drivers to add or remove PTP packet traps,
than ndo_hwtstamp_set(). The one introduced here "almost" works there as
well, except for the fact that packet traps should only be installed if
the PHY driver succeeded in enabling hardware timestamping, whereas here,
we need to deny hardware timestamping on the DSA master before it
actually gets enabled. This is why this notifier is called "PRE_", and
the notifier that would get used for PHY timestamping and packet traps
would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.
In expectation of future netlink UAPI, we also pass a non-NULL extack
pointer to the netdev notifier, and we make DSA populate it with an
informative reason for the rejection. To avoid making it go to waste, we
make the ioctl-based dev_set_hwtstamp() create a fake extack and print
the message to the kernel log.
Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
commit bf4ea1d0b2cb2251f9e5619c81daa98591087c33
Author: Leon Hwang <hffilwlqm@gmail.com>
Date: Tue Aug 1 22:26:20 2023 +0800
bpf, xdp: Add tracepoint to xdp attaching failure
When an error happens in dev_xdp_attach(), there should be a way to
report the error message to users, as the netlink approach does.
To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is
an appropriate way to convey the error message to users.
BPF libraries are then able to retrieve the error message through this
tracepoint and report it to users.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-862
commit 9b55d3f0a69af649c62cbc2633e6d695bb3cc583
Author: Felix Riemann <felix.riemann@sma.de>
Date: Fri Feb 10 13:36:44 2023 +0100
net: Fix unwanted sign extension in netdev_stats_to_stats64()
When converting net_device_stats to rtnl_link_stats64 sign extension
is triggered on ILP32 machines as 6c1c509778 changed the previous
"ulong -> u64" conversion to "long -> u64" by accessing the
net_device_stats fields through a (signed) atomic_long_t.
This causes, for example, the received bytes counter to jump to 16EiB after
having received 2^31 bytes. Casting the atomic value to "unsigned long"
before converting it into u64 avoids this.
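The effect can be reproduced on any machine with fixed-width types. In
the sketch below int32_t and uint32_t stand in for "long" and "unsigned
long" on an ILP32 target; the function names are hypothetical.

```c
#include <stdint.h>

/* Buggy path: the 32-bit signed value is widened directly, so a counter
 * with the top bit set is sign-extended into the upper 32 bits. */
static uint64_t sketch_widen_signed(int32_t counter)
{
	return (uint64_t)counter;
}

/* Fixed path: go through the unsigned 32-bit type first, so the upper
 * 32 bits of the result are always zero. */
static uint64_t sketch_widen_unsigned(int32_t counter)
{
	return (uint64_t)(uint32_t)counter;
}
```

A counter that has just reached 2^31 is stored as INT32_MIN in the
signed type. The first function turns it into 0xFFFFFFFF80000000 (just
under 16EiB), the second into 0x80000000, which is the intended value.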
Fixes: 6c1c5097781f ("net: add atomic_long_t to net_device_stats fields")
Signed-off-by: Felix Riemann <felix.riemann@sma.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>