JIRA: https://issues.redhat.com/browse/RHEL-77329
commit e361560a7912958ba3059f51e7dd21612d119169
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Wed Jan 15 18:55:43 2025 +0900
dev: Acquire netdev_rename_lock before restoring dev->name in dev_change_name().
The cited commit forgot to add netdev_rename_lock in one of the
error paths in dev_change_name().
Let's hold netdev_rename_lock before restoring the old dev->name.
Fixes: 0840556e5a3a ("net: Protect dev->name by seqlock.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-60271
JIRA: https://issues.redhat.com/browse/RHEL-61181
Upstream Status: net-next.git
Tested: compile only
commit a12c76a03386e32413ae8eaaefa337e491880632
Author: Xin Long <lucien.xin@gmail.com>
Date: Wed Jan 15 09:27:54 2025 -0500
net: sched: refine software bypass handling in tc_run
This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.
Previously, filtercnt and skipswcnt were introduced by:
Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")
filtercnt tracked all tp (tcf_proto) objects added to a block, and
skipswcnt counted tp objects with the skipsw attribute set.
The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:
When the first filter in a tp has skipsw, both skipswcnt and filtercnt
are incremented, then adding a second filter without skipsw to the same
tp does not modify these counters because tp->counted is already set.
This results in bypass software behavior based solely on skipswcnt
equaling filtercnt, even when the block includes filters without
skipsw. Consequently, filters without skipsw are inadvertently bypassed.
To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:
Whenever a filter without skipsw is added, its tp is marked with usesw
and counted in useswcnt. tc_run() now uses useswcnt to determine software
bypass, eliminating reliance on filtercnt and skipswcnt.
This refined approach prevents software bypass for blocks containing
mixed filters, ensuring correct behavior in tc_run().
Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.
The improvement can be demonstrated using the following script:
# cat insert_tc_rules.sh
tc qdisc add dev ens1f0np0 ingress
for i in $(seq 16); do
taskset -c $i tc -b rules_$i.txt &
done
wait
Each of rules_$i.txt files above includes 100000 tc filter rules to a
mlx5 driver NIC ens1f0np0.
Without this patch:
# time sh insert_tc_rules.sh
real 0m50.780s
user 0m23.556s
sys 4m13.032s
With this patch:
# time sh insert_tc_rules.sh
real 0m17.718s
user 0m7.807s
sys 3m45.050s
Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Xin Long <lxin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit 55a2c86c8db3d7aa2c1967efd37ed47d5ae37f43
Author: Eric Dumazet <edumazet@google.com>
Date: Fri May 3 19:20:55 2024 +0000
net: write once on dev->allmulti and dev->promiscuity
In the following patch we want to read dev->allmulti
and dev->promiscuity locklessly from rtnl_fill_ifinfo()
In this patch I change __dev_set_promiscuity() and
__dev_set_allmulti() to write these fields (and dev->flags)
only if they succeed, with WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit 802dcbd6f30feaa7c96a1fb4ecb1db57082df9d7
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue Feb 14 13:01:16 2023 -0800
net/core: print message for allmulticast
When the user sets or clears the IFF_ALLMULTI flag in the netdev, there are
no log messages printed to the kernel log to indicate anything happened.
This is inexplicably different from most other dev->flags changes, and
could suprise the user.
Typically this occurs from user-space when a user:
ip link set eth0 allmulticast <on|off>
However, other devices like bridge set allmulticast as well, and many
other flows might trigger entry into allmulticast as well.
The new message uses the standard netdev_info print and looks like:
[ 413.246110] ixgbe 0000:17:00.0 eth0: entered allmulticast mode
[ 415.977184] ixgbe 0000:17:00.0 eth0: left allmulticast mode
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit 3ba0bf47edf955d6f52fdb16b54acd1932cb9445
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue Feb 14 13:01:17 2023 -0800
net/core: refactor promiscuous mode message
The kernel stack can be more consistent by printing the IFF_PROMISC
aka promiscuous enable/disable messages with the standard netdev_info
message which can include bus and driver info as well as the device.
typical command usage from user space looks like:
ip link set eth0 promisc <on|off>
But lots of utilities such as bridge, tcpdump, etc put the interface into
promiscuous mode.
old message:
[ 406.034418] device eth0 entered promiscuous mode
[ 408.424703] device eth0 left promiscuous mode
new message:
[ 406.034431] ice 0000:17:00.0 eth0: entered promiscuous mode
[ 408.424715] ice 0000:17:00.0 eth0: left promiscuous mode
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit 5b92be649605504e1019a1ad0c95b0d74a4e2be1
Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue Oct 19 09:42:28 2021 -0700
net-core: use netdev_* calls for kernel messages
While loading a driver and changing the number of queues, I noticed this
message in the kernel log:
"[253489.070080] Number of in use tx queues changed invalidating tc
mappings. Priority traffic classification disabled!"
But I had no idea what interface was being talked about because this
message used pr_warn().
After investigating, it appears we can use the netdev_* helpers already
defined to create predictably formatted messages, and that already handle
<unknown netdev> cases, in more of the messages in dev.c.
After this change, this message (and others) will look like this:
"[ 170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues
changed invalidating tc mappings. Priority traffic classification
disabled!"
One goal here was not to change the message significantly from the
original format so as to not break user's expectations, so I just
changed messages that used pr_* and generally started with %s ==
dev->name.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit 6890ab31d1a35444741e6150db19d64797db2919
Author: Eric Dumazet <edumazet@google.com>
Date: Fri May 3 19:20:57 2024 +0000
rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()
Change dev_change_proto_down() and dev_change_proto_down_reason()
to write once on dev->proto_down and dev->proto_down_reason.
Then rtnl_fill_proto_down() can use READ_ONCE() annotations
and run locklessly.
rtnl_proto_down_size() should assume worst case,
because readng dev->proto_down_reason multiple
times would be racy without RTNL in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57756
Upstream commit(s):
commit ad13b5b0d1f9eb8e048394919e6393e520b14552
Author: Eric Dumazet <edumazet@google.com>
Date: Fri May 3 19:20:54 2024 +0000
rtnetlink: do not depend on RTNL for IFLA_TXQLEN output
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations.
Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-68117
Conflicts:
- context due to missing dd891b5b106fa ("net: do not send a MOVE event
when netdev changes netns")
commit 0840556e5a3a331b6932ef17dd4bc94445df3297
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon Apr 29 18:58:12 2024 -0700
net: Protect dev->name by seqlock.
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy
dev->name which is currently protected by rtnl_lock().
This patch does the following:
1) Add seqlock netdev_rename_lock to protect dev->name
2) Add netdev_copy_name() that copies dev->name to buffer
under netdev_rename_lock
3) Use netdev_copy_name() in netdev_get_name() and drop
devnet_rename_sem
Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit b4e8ae5c8c41355791a99fdf2fcac16deace1e79
Author: Stefan Roesch <shr@devkernel.io>
Date: Tue Feb 6 09:30:04 2024 -0700
net: add napi_busy_loop_rcu()
This adds the napi_busy_loop_rcu() function. This function assumes that
the calling function is already holding the rcu read lock and
napi_busy_loop() does not need to take the rcu read lock. Add a
NAPI_F_NO_SCHED flag, which tells __napi_busy_loop() to abort if we
need to reschedule rather than drop the RCU read lock and reschedule.
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20230608163839.2891748-3-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit 13d381b440ed84ec4cc92975de035efb1a9e5f7e
Author: Stefan Roesch <shr@devkernel.io>
Date: Tue Feb 6 09:30:03 2024 -0700
net: split off __napi_busy_poll from napi_busy_poll
This splits off the key part of the napi_busy_poll function into its own
function, __napi_busy_poll, and changes the prefer_busy_poll bool to be
flag based to allow passing in more flags in the future.
This is done in preparation for an additional napi_busy_poll() function,
that doesn't take the rcu_read_lock(). The new function is introduced
in the next patch.
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20230608163839.2891748-2-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-64867
commit c857946a4e262e0f798fe7625fadf85bf1190fc4
Author: Kurt Kanzenbach <kurt@linutronix.de>
Date: Tue May 23 13:15:18 2023 +0200
net/core: Enable socket busy polling on -RT
Busy polling is currently not allowed on PREEMPT_RT, because it disables
preemption while invoking the NAPI callback. It is not possible to acquire
sleeping locks with disabled preemption. For details see commit
20ab39d13e2e ("net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT").
However, strict cyclic and/or low latency network applications may prefer busy
polling e.g., using AF_XDP instead of interrupt driven communication.
The preempt_disable() is used in order to prevent the poll_owner and NAPI owner
to be preempted while owning the resource to ensure progress. Netpoll performs
busy polling in order to acquire the lock. NAPI is locked by setting the
NAPIF_STATE_SCHED flag. There is no busy polling if the flag is set and the
"owner" is preempted. Worst case is that the task owning NAPI gets preempted and
NAPI processing stalls. This is can be prevented by properly prioritising the
tasks within the system.
Allow RX_BUSY_POLL on PREEMPT_RT if NETPOLL is disabled. Don't disable
preemption on PREEMPT_RT within the busy poll loop.
Tested on x86 hardware with v6.1-RT and v6.3-RT on Intel i225 (igc) with
AF_XDP/ZC sockets configured to run in busy polling mode.
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57755
Upstream commit(s):
commit ffabe98cb576097b77d404d39e8b3df03caa986a
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 2 10:11:06 2024 +0000
net: make dev_unreg_count global
We can use a global dev_unreg_count counter instead
of a per netns one.
As a bonus we can factorize the changes done on it
for bulk device removals.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57755
Upstream commit(s):
commit 8a4fc54b07d756f1884af6c47ec84dfa3da663df
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Feb 18 09:58:56 2022 -0800
net: get rid of rtnl_lock_unregistering()
After recent patches, and in particular commits
faab39f63c1f ("net: allow out-of-order netdev unregistration") and
e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev")
we no longer need the barrier implemented in rtnl_lock_unregistering().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Petr Oros <poros@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-68063
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commit 794c24e9921f
("net-core: rx_otherhost_dropped to core_stats") in c9s.
commit 5247dbf16cee4e83eb89e4d3b87bd5e79c5d1655
Author: Yajun Deng <yajun.deng@linux.dev>
Date: Mon Oct 9 19:16:33 2023 +0800
net/core: Introduce netdev_core_stats_inc()
Although there is a kfree_skb_reason() helper function that can be used to
find the reason why this skb is dropped, but most callers didn't increase
one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped.
For the users, people are more concerned about why the dropped in ip
is increasing.
Introduce netdev_core_stats_inc() for trace the caller of
dev_core_stats_*_inc().
Also, add __code to netdev_core_stats_alloc(), as it's called with small
probability. And add noinline make sure netdev_core_stats_inc was never
inlined.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5657
JIRA: https://issues.redhat.com/browse/RHEL-57751
Depends: !5431
New features:
- Extend core and ethtool APIs to support many PHYs connected to a single interface (PHY topologies).
- Extend cable diagnostics to specify whether Time Domain Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was used.
- Support listing / dumping all allocated RSS contexts.
About half of the patches are simple cleanups in drivers' .get_ts_info implementations after moving the responsibility of reporting SOF_TIMESTAMPING_RX_SOFTWARE, SOF_TIMESTAMPING_SOFTWARE and setting phc_index to -1 to the core.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62849
JIRA: https://issues.redhat.com/browse/RHEL-64328
CVE: CVE-2024-49948
Tested: LNST, Tier1
Upstream commit:
commit ab9a9a9e9647392a19e7a885b08000e89c86b535
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Sep 24 15:02:57 2024 +0000
net: add more sanity checks to qdisc_pkt_len_init()
One path takes care of SKB_GSO_DODGY, assuming
skb->len is bigger than hdr_len.
virtio_net_hdr_to_skb() does not fully dissect TCP headers,
it only make sure it is at least 20 bytes.
It is possible for an user to provide a malicious 'GSO' packet,
total length of 80 bytes.
- 20 bytes of IPv4 header
- 60 bytes TCP header
- a small gso_size like 8
virtio_net_hdr_to_skb() would declare this packet as a normal
GSO packet, because it would see 40 bytes of payload,
bigger than gso_size.
We need to make detect this case to not underflow
qdisc_skb_cb(skb)->pkt_len.
Fixes: 1def9238d4 ("net_sched: more precise pkt_len computation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62849
Tested: LNST, Tier1
Upstream commit:
commit e609c959a939660c7519895f853dfa5624c6827a
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon Sep 23 23:22:42 2024 +0200
net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
Commit 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.
This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.
Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.
Fixes: 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62849
Tested: LNST, Tier1
Upstream commit:
commit cd42ba1c8ac9deb9032add6adf491110e7442040
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 26 06:42:22 2024 +0000
net: give more chances to rcu in netdev_wait_allrefs_any()
This came while reviewing commit c4e86b4363ac ("net: add two more
call_rcu_hurry()").
Paolo asked if adding one synchronize_rcu() would help.
While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.
Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.
Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.
Fixes: 0e4be9e57e ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62204
Upstream Status: linux.git
commit 590e92cdc835fcf435d8611f2477fff0e16877c7
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 29 11:40:15 2024 +0000
inet: prepare inet_base_seq() to run without RTNL
In the following patch, inet_base_seq() will no longer be called
with RTNL held.
Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65205
commit 024ee930cb3c9ae49e4266aee89cfde0ebb407e1
Author: Peilin Ye <peilin.ye@bytedance.com>
Date: Tue Nov 14 01:42:17 2023 +0100
bpf: Fix dev's rx stats for bpf_redirect_peer traffic
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Fixes: 9aa1206e8f ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57751
commit 3849687869092094003ba009dc00e2e0237a3b8a
Author: Maxime Chevallier <maxime.chevallier@bootlin.com>
Date: Wed Aug 21 17:09:55 2024 +0200
net: phy: Introduce ethernet link topology representation
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.
With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.
The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.
Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.
The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.
This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.
The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to. The PHY index can be
re-used for PHYs that are persistent.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts: Dropped a whitespace-only hunk.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 8afc7a78d55de726b2747d7775c54def79509ec5
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 22 10:50:10 2024 +0000
ipv6: prepare inet6_fill_ifinfo() for RCU protection
We want to use RCU protection instead of RTNL
for inet6_fill_ifinfo().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
Conflicts:
* drivers/net/netkit.c
- hunk omitted as the driver is not present in RHEL
* net/dsa/user.c
- the hunk applied in dsa/slave.c due to absence of DSA deps
commit e353ea9ce471331c13edffd5977eadd602d1bb80
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Feb 22 10:50:08 2024 +0000
rtnetlink: prepare nla_put_iflink() to run under RCU
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.
This patch prepares dev_get_iflink() and nla_put_iflink()
to run either with RTNL or RCU held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 723de3ebef03bc14bd72531f00f9094337654009
Author: Jakub Kicinski <kuba@kernel.org>
Date: Fri Jan 26 12:14:49 2024 -0800
net: free altname using an RCU callback
We had to add another synchronize_rcu() in recent fix.
Bite the bullet and add an rcu_head to netdev_name_node,
free from RCU.
Note that name_node does not hold any reference on dev
to which it points, but there must be a synchronize_rcu()
on device removal path, so we should be fine.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit d09486a04f5da0a812c26217213b89a3b1acf836
Author: Jakub Kicinski <kuba@kernel.org>
Date: Thu Jan 18 16:58:59 2024 -0800
net: fix removing a namespace with conflicting altnames
Mark reports a BUG() when a net namespace is removed.
kernel BUG at net/core/dev.c:11520!
Physical interfaces moved outside of init_net get "refunded"
to init_net when that namespace disappears. The main interface
name may get overwritten in the process if it would have
conflicted. We need to also discard all conflicting altnames.
Recent fixes addressed ensuring that altnames get moved
with the main interface, which surfaced this problem.
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/
Fixes: 7663d522099e ("net: check for altname conflicts when changing netdev's netns")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
Conflicts:
* net/core/dev.c
- simple context conflict due to existing backport of commit
1b3ef46cb7f26 ("net: remove dev_base_lock")
commit 8e15aee621618a3ee3abecaf1fd8c1428098b7ef
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Oct 17 18:38:16 2023 -0700
net: move altnames together with the netdevice
The altname nodes are currently not moved to the new netns
when netdevice itself moves:
[ ~]# ip netns add test
[ ~]# ip -netns test link add name eth0 type dummy
[ ~]# ip -netns test link property add dev eth0 altname some-name
[ ~]# ip -netns test link show dev some-name
2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip -netns test link set dev eth0 netns 1
[ ~]# ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
altname some-name
[ ~]# ip li show dev some-name
Device "some-name" does not exist.
Remove them from the hash table when device is unlisted
and add back when listed again.
Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 1a83f4a7c156fa6bbd6b530e89fa3270bf3d9d1b
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Oct 17 18:38:15 2023 -0700
net: avoid UAF on deleted altname
Altnames are accessed under RCU (dev_get_by_name_rcu())
but freed by kfree() with no synchronization point.
Each node has one or two allocations (node and a variable-size
name, sometimes the name is netdev->name). Adding rcu_heads
here is a bit tedious. Besides most code which unlists the names
already has rcu barriers - so take the simpler approach of adding
synchronize_rcu(). Note that the one on the unregistration path
(which matters more) is removed by the next fix.
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 7663d522099ecc464512164e660bc771b2ff7b64
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Oct 17 18:38:14 2023 -0700
net: check for altname conflicts when changing netdev's netns
It's currently possible to create an altname conflicting
with an altname or real name of another device by creating
it in another netns and moving it over:
[ ~]$ ip link add dev eth0 type dummy
[ ~]$ ip netns add test
[ ~]$ ip -netns test link add dev ethX netns test type dummy
[ ~]$ ip -netns test link property add dev ethX altname eth0
[ ~]$ ip -netns test link set dev ethX netns 1
[ ~]$ ip link
...
3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff
...
5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff
altname eth0
Create a macro for walking the altnames, this hopefully makes
it clearer that the list we walk contains only altnames.
Which is otherwise not entirely intuitive.
Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 311cca40661f428b7aa114fb5af578cfdbe3e8b6
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Oct 17 18:38:13 2023 -0700
net: fix ifname in netlink ntf during netns move
dev_get_valid_name() overwrites the netdev's name on success.
This makes it hard to use in prepare-commit-like fashion,
where we do validation first, and "commit" to the change
later.
Factor out a helper which lets us save the new name to a buffer.
Use it to fix the problem of notification on netns move having
incorrect name:
5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff
6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff
[ ~]# ip link set dev eth0 netns 1 name eth1
ip monitor inside netns:
Deleted inet eth0
Deleted inet6 eth0
Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7
Name is reported as eth1 in old netns for ifindex 5, already renamed.
Fixes: d90310243f ("net: device name allocation cleanups")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-62123
commit 75ea27d0d62281c31ee259c872dfdeb072cf5e39
Author: Antoine Tenart <atenart@kernel.org>
Date: Thu Oct 7 18:16:50 2021 +0200
net: introduce a function to check if a netdev name is in use
__dev_get_by_name is currently used to either retrieve a net device
reference using its name or to check if a name is already used by a
registered net device (per ns). In the later case there is no need to
return a reference to a net device.
Introduce a new helper, netdev_name_in_use, to check if a name is
currently used by a registered net device without leaking a reference
the corresponding net device. This helper uses netdev_name_node_lookup
instead of __dev_get_by_name as we don't need the extra logic retrieving
a reference to the corresponding net device.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57765
Conflicts:
- net/core/page_pool.c: context difference due to missing aaf153aecef1
("page_pool: halve BIAS_MAX for multiple user references of a fragment")
commit f853fa5c54e7a0364a52125074dedeaf2c7ddace
Author: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Fri Feb 16 10:25:43 2024 +0100
net: page_pool: fix recycle stats for system page_pool allocator
Use global percpu page_pool_recycle_stats counter for system page_pool
allocator instead of allocating a separate percpu variable for each
(also percpu) page pool instance.
Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/87f572425e98faea3da45f76c3c68815c01a20ee.1708075412.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5362
JIRA: https://issues.redhat.com/browse/RHEL-59091
Explanation from the upstream cover letter by Alexander Lobakin:
> NETDEV_FEATURE_COUNT is currently 64, which means we can't add any new
> features as netdev_features_t is u64.
> As per several discussions, instead of converting netdev_features_t to
> a bitmap, which would mean A LOT of changes, we can try cleaning up
> netdev feature bits.
> There's a bunch of bits which don't really mean features, rather device
> attributes/properties that can't be changed via Ethtool in any of the
> drivers. Such attributes can be moved to netdev private flags without
> losing any functionality.
>
> Start converting some read-only netdev features to private flags from
> the ones that are most obvious, like lockless Tx, inability to change
> network namespace etc. I was able to reduce NETDEV_FEATURE_COUNT from
> 64 to 60, which mean 4 free slots for new features. There are obviously
> more read-only features to convert, such as highDMA, "challenged VLAN",
> HSR (4 bits) - this will be done in subsequent series.
> Please note that netdev features are not uAPI/ABI by any means. Ethtool
> passes their names and bits to the userspace separately and there are no
> hardcoded names/bits in the userspace, so that new Ethtool could work
> on older kernels and vice versa. Even shell scripts won't most likely
> break since the removed bits were always read-only, meaning nobody would
> try touching them from a script.
I proposed a Release Note Text in the Jira to document that "tx-lockless", "netns-local", "fcoe-mtu" will no longer appear in "ethtool -k".
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5197
JIRA: https://issues.redhat.com/browse/RHEL-57750
Depends: !5196
This updates the ethtool subsystem to v6.11. At the end of this series, the only remaining diffs from v6.11 are the RH_KABI_RESERVES in struct ethtool_ops, as shown by:
`git diff v6.11 -- net/ethtool include/linux/ethtool.h include/uapi/linux/ethtool{,_netlink}.h Documentation/netlink/specs/ethtool.yaml Documentation/networking/ethtool-netlink.rst tools/net/ynl/ethtool.py`
Omitted-Fix: 9dbad38336a9 ("eth: bnxt: populate defaults in the RSS context struct")
- bnxt has not been converted to .create_rxfh_context yet.
This will be in a driver update later.
Omitted-Fix: cdc90f75387c ("pse-core: Conditionally set current limit during PI regulator registration")
Omitted-Fix: 326f442784c2 ("net: pse-pd: pse_core: Fix pse regulator type")
- All changes to pse_core.c omitted in the series.
Omitted-Fix: 2fa809b90617 ("net: pse-pd: Kconfig: Add missing Regulator API dependency")
- Irrelevant. CONFIG_PSE_CONTROLLER is disabled.
Omitted-Fix: 93c3a96c301f ("net: pse-pd: Do not return EOPNOSUPP if config is null")
- Contained in the merge conflict resolution backported in "net: ethtool: pse-pd: Fix possible null-deref".
Omitted-Fix: dda3529d2e84 ("net: pse-pd: Fix enabled status mismatch")
- Irrelevant. CONFIG_PSE_CONTROLLER is disabled.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>