798 Commits

Author SHA1 Message Date
Petr Oros 5597fb4160 net: fix crash when config small gso_max_size/gso_ipv4_max_size
JIRA: https://issues.redhat.com/browse/RHEL-57756

CVE: CVE-2024-50258

Upstream commit(s):
commit 9ab5cf19fb0e4680f95e506d6c544259bf1111c4
Author: Wang Liang <wangliang74@huawei.com>
Date:   Wed Oct 23 11:52:13 2024 +0800

    net: fix crash when config small gso_max_size/gso_ipv4_max_size

    Configuring a small gso_max_size/gso_ipv4_max_size leads to an underflow
    in sk_dst_gso_max_size(), which may trigger a BUG_ON crash,
    because sk->sk_gso_max_size would be much bigger than device limits.
    Call Trace:
    tcp_write_xmit
        tso_segs = tcp_init_tso_segs(skb, mss_now);
            tcp_set_skb_tso_segs
                tcp_skb_pcount_set
                    // skb->len = 524288, mss_now = 8
                    // u16 tso_segs = 524288/8 = 65536 -> 0
                    tso_segs = DIV_ROUND_UP(skb->len, mss_now)
        BUG_ON(!tso_segs)
    Add check for the minimum value of gso_max_size and gso_ipv4_max_size.

    Fixes: 46e6b992c2 ("rtnetlink: allow GSO maximums to be set on device creation")
    Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device")
    Signed-off-by: Wang Liang <wangliang74@huawei.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241023035213.517386-1-wangliang74@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:56 +01:00
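
The truncation in the call trace of the gso_max_size fix above can be reproduced in user space. This is a minimal sketch, not the kernel code: `DIV_ROUND_UP` is a stand-in that mirrors the kernel macro of the same name, and `tso_segs_u16()` is a hypothetical helper that stores the segment count in a 16-bit value, as the trace describes for tso_segs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel macro of the same name. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: computes the number of segments and narrows the
 * result to 16 bits, which is where the wrap-around happens.
 */
static uint16_t tso_segs_u16(uint32_t skb_len, uint32_t mss_now)
{
	return (uint16_t)DIV_ROUND_UP(skb_len, mss_now);
}
```

With skb_len = 524288 and mss_now = 8 the quotient is 2^16, which does not fit in 16 bits and wraps to 0. That is the condition `BUG_ON(!tso_segs)` catches.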
Petr Oros 7ed4c990cc rtnetlink: Add bulk registration helpers for rtnetlink message handlers.
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 07cc7b0b942bf55ef1a471470ecda8d2a6a6541f
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Oct 8 11:47:32 2024 -0700

    rtnetlink: Add bulk registration helpers for rtnetlink message handlers.

    Before commit addf9b90de ("net: rtnetlink: use rcu to free rtnl message
    handlers"), once rtnl_msg_handlers[protocol] was allocated, the following
    rtnl_register_module() for the same protocol never failed.

    However, after the commit, rtnl_msg_handler[protocol][msgtype] needs to
    be allocated in each rtnl_register_module(), so each call could fail.

    Many callers of rtnl_register_module() do not handle the returned error,
    so error handling would have to be added in many places.

    To handle that easily, let's add wrapper functions for bulk registration
    of rtnetlink message handlers.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:56 +01:00
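
The idea behind the bulk registration helpers above can be sketched as a loop that registers an array of handlers and rolls back on the first failure. This is a toy user-space model under stated assumptions: every `toy_*` name, the fixed-size table, and the failure conditions are hypothetical and are not the rtnetlink API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_MSGTYPE 8

/* Hypothetical handler descriptor: one message type, one callback. */
struct toy_handler {
	int msgtype;
	int (*doit)(void);
};

/* Hypothetical registration table, indexed by message type. */
static int (*toy_table[TOY_MAX_MSGTYPE])(void);

/* Registers one handler; fails for out-of-range or already-used slots. */
static int toy_register_one(const struct toy_handler *h)
{
	if (h->msgtype < 0 || h->msgtype >= TOY_MAX_MSGTYPE)
		return -1;
	if (toy_table[h->msgtype])
		return -1;
	toy_table[h->msgtype] = h->doit;
	return 0;
}

/* Unregisters the first n handlers of the array, last one first. */
static void toy_unregister_many(const struct toy_handler *h, size_t n)
{
	while (n--)
		toy_table[h[n].msgtype] = NULL;
}

/* Registers all n handlers, or none: on failure, undo what succeeded. */
static int toy_register_many(const struct toy_handler *h, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int err = toy_register_one(&h[i]);

		if (err) {
			toy_unregister_many(h, i);
			return err;
		}
	}
	return 0;
}

/* Trivial callback used only to have something to register. */
static int toy_doit(void)
{
	return 0;
}
```

A caller then checks a single return value instead of one per handler, which is the simplification the commit message is after.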
Petr Oros b24948627d rtnetlink: delete redundant judgment statements
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 2d522384fb5b8187cb7f8fe7d05c119ac38fd8f3
Author: Li Zetao <lizetao1@huawei.com>
Date:   Thu Aug 22 12:32:46 2024 +0800

    rtnetlink: delete redundant judgment statements

    The initial value of err is -ENOBUFS, and err is guaranteed to be
    less than 0 before every goto errout. Therefore, on the errout error
    path there is no need to check again that err is less than 0;
    delete the redundant checks to make the code more concise.

    Signed-off-by: Li Zetao <lizetao1@huawei.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:55 +01:00
Petr Oros 42016f1fd6 net: reduce rtnetlink_rcv_msg() stack usage
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit cef4902b0fadfc4181176ef5713f0b7cf2a40d8f
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 10 15:16:53 2024 +0000

    net: reduce rtnetlink_rcv_msg() stack usage

    IFLA_MAX is increasing slowly but surely.

    Some compilers use more than 512 bytes of stack in rtnetlink_rcv_msg()
    because it calls rtnl_calcit() for RTM_GETLINK message.

    Use noinline_for_stack attribute to not inline rtnl_calcit(),
    and directly use nla_for_each_attr_type() (Jakub suggestion)
    because we only care about IFLA_EXT_MASK at this stage.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Link: https://patch.msgid.link/20240710151653.3786604-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:55 +01:00
Petr Oros c976657153 rtnetlink: move rtnl_lock handling out of af_netlink
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 5380d64f8d766576ac5c0f627418b2d0e1d2641f
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu Jun 6 12:29:05 2024 -0700

    rtnetlink: move rtnl_lock handling out of af_netlink

    Now that we have an intermediate layer of code for handling
    rtnl-level netlink dump quirks, we can move the rtnl_lock
    taking there.

    For dump handlers with RTNL_FLAG_DUMP_SPLIT_NLM_DONE we can
    avoid taking rtnl_lock just to generate NLM_DONE, once again.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:55 +01:00
Petr Oros 5f99ca47f6 rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protection
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 9cf621bd5fcbeadc2804951d13d487e22e95b363
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:59 2024 +0000

    rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protection

    We want to be able to run rtnl_fill_ifinfo() under RCU protection
    instead of RTNL in the future.

    All rtnl_link_ops->get_link_net() methods already using dev_net()
    are ready. I added READ_ONCE() annotations on others.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros 1f803ff5bd rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 979aad40da9217d5e907ee4ad7c7f0dc555944a7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:58 2024 +0000

    rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()

    dev->xdp_prog is protected by RCU, we can lift RTNL requirement
    from rtnl_xdp_prog_skb().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros 861345bba2 rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 6890ab31d1a35444741e6150db19d64797db2919
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:57 2024 +0000

    rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()

    Change dev_change_proto_down() and dev_change_proto_down_reason()
    to write once on dev->proto_down and dev->proto_down_reason.

    Then rtnl_fill_proto_down() can use READ_ONCE() annotations
    and run locklessly.

    rtnl_proto_down_size() should assume worst case,
    because reading dev->proto_down_reason multiple
    times would be racy without RTNL in the future.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
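
The write-once/read-once pairing described in the rtnl_fill_proto_down() commit above can be illustrated in user space. This is a sketch under assumptions: the two macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on the GCC/Clang `__typeof__` extension), and `struct toy_dev` and the `toy_*` functions are hypothetical.

```c
#include <assert.h>

/* Simplified user-space stand-ins for the kernel macros. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical device with the two fields named in the commit message. */
struct toy_dev {
	unsigned char proto_down;
	unsigned long proto_down_reason;
};

/* Writer side: a single annotated store to the field. */
static void toy_change_proto_down(struct toy_dev *dev, int on)
{
	WRITE_ONCE(dev->proto_down, on ? 1 : 0);
}

/*
 * Reader side: each field is loaded exactly once into a local, so later
 * decisions use one consistent snapshot even if a writer runs concurrently.
 */
static unsigned long toy_fill_proto_down(const struct toy_dev *dev,
					 unsigned char *down)
{
	unsigned long reason = READ_ONCE(dev->proto_down_reason);

	*down = READ_ONCE(dev->proto_down);
	return reason;
}
```

Loading the reason once is also why a size estimate has to assume the worst case: a second read could observe a different value.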
Petr Oros 0f51110863 rtnetlink: do not depend on RTNL for many attributes
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 6747a5d4990b8c8d7392f7a06b7a4bb5f4ada80e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:56 2024 +0000

    rtnetlink: do not depend on RTNL for many attributes

    The following device fields can be read locklessly
    in rtnl_fill_ifinfo() :

    type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group,
    promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size,
    gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size,
    tso_max_segs, num_rx_queues.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros ce54afa357 rtnetlink: do not depend on RTNL for IFLA_TXQLEN output
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit ad13b5b0d1f9eb8e048394919e6393e520b14552
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:54 2024 +0000

    rtnetlink: do not depend on RTNL for IFLA_TXQLEN output

    rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
    provided we add corresponding READ_ONCE()/WRITE_ONCE() annotations.

    Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros 0db2362ca1 rtnetlink: do not depend on RTNL for IFLA_IFNAME output
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 8a58268133622c3d50155ac5798ad1d51d6bd3be
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:53 2024 +0000

    rtnetlink: do not depend on RTNL for IFLA_IFNAME output

    We can use netdev_copy_name() to no longer rely on RTNL
    to fetch dev->name.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros 83d54ab0fe rtnetlink: do not depend on RTNL for IFLA_QDISC output
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 698419ffb6fc83dd7b0359d9e8476e732967eed2
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 3 19:20:52 2024 +0000

    rtnetlink: do not depend on RTNL for IFLA_QDISC output

    dev->qdisc can be read using RCU protection.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros 5c66564865 rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 0feb396f7428b95710ea72c1dc33ae363019fae5
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 2 11:37:48 2024 +0000

    rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()

    Switch rtnl_stats_dump() to use for_each_netdev_dump()
    instead of net->dev_index_head[] hash table.

    This makes the code much easier to read, and fixes
    scalability issues.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros a50ab4a87a rtnetlink: change rtnl_stats_dump() return value
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 136c2a9a2a8760d8dae83ae7c882c50be02bdb63
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 2 11:37:47 2024 +0000

    rtnetlink: change rtnl_stats_dump() return value

    By returning 0 (or an error) instead of skb->len,
    we allow NLMSG_DONE to be appended to the current
    skb at the end of a dump, saving a couple of recvmsg()
    system calls.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:54 +01:00
Petr Oros b5e456d0ed netlink: let core handle error cases in dump operations
JIRA: https://issues.redhat.com/browse/RHEL-57756

Upstream commit(s):
commit 02e24903e5a46b7a7fca44bcfe0cd6fa5b240c34
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Mar 6 10:24:26 2024 +0000

    netlink: let core handle error cases in dump operations

    After commit b5a899154aa9 ("netlink: handle EMSGSIZE errors
    in the core"), we can remove some code that was not 100 % correct
    anyway.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240306102426.245689-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-12-10 10:37:52 +01:00
Rado Vrbovsky fb874c9815 Merge: CNB96: netlink/devlink: update devlink & netlink to the v6.9
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5257

JIRA: https://issues.redhat.com/browse/RHEL-57755
Depends: !5414
Depends: !4753
Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-27 11:19:20 +00:00
Rado Vrbovsky 18484e6ffa Merge: CNB96: net: RTNL pressure reduction
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5605

A series of patches reducing RTNL pressure in net, namely the following upstream series and their prerequisites / fixes / related changes:  
- 3cbab89268c6 Merge branch 'inet-implement-lockless-rtm_getnetconf-ops'  
- 9f780efa6eaa Merge branch 'ipv6-devconf-lockless'  
- e96082570933 Merge branch 'inet_dump_ifaddr-no-rtnl'  
- 570c86ed60cc Merge branch 'ipv6-lockless-dump-addrs'  
  
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5414  
  
JIRA: https://issues.redhat.com/browse/RHEL-62205  
JIRA: https://issues.redhat.com/browse/RHEL-62204  
JIRA: https://issues.redhat.com/browse/RHEL-62203  
JIRA: https://issues.redhat.com/browse/RHEL-62202  
  
Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:20:48 +00:00
Petr Oros f70dcaae94 net: make dev_unreg_count global
JIRA: https://issues.redhat.com/browse/RHEL-57755

Upstream commit(s):
commit ffabe98cb576097b77d404d39e8b3df03caa986a
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 2 10:11:06 2024 +0000

    net: make dev_unreg_count global

    We can use a global dev_unreg_count counter instead
    of a per netns one.

    As a bonus we can factorize the changes done on it
    for bulk device removals.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-11-20 10:13:42 +01:00
Paolo Abeni d49c5b08c8 rtnetlink: Don't ignore IFLA_TARGET_NETNSID when ifname is specified in rtnl_dellink().
JIRA: https://issues.redhat.com/browse/RHEL-62849
Tested: LNST, Tier1

Upstream commit:
commit 9415d375d8520e0ed55f0c0b058928da9a5b5b3d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 26 17:19:53 2024 -0700

    rtnetlink: Don't ignore IFLA_TARGET_NETNSID when ifname is specified in rtnl_dellink().

    The cited commit accidentally replaced tgt_net with net in rtnl_dellink().

    As a result, IFLA_TARGET_NETNSID is ignored if the interface is specified
    with IFLA_IFNAME or IFLA_ALT_IFNAME.

    Let's pass tgt_net to rtnl_dev_get().

    Fixes: cc6090e985 ("net: rtnetlink: introduce helper to get net_device instance by ifname")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-15 09:21:35 +01:00
Antoine Tenart ba114b046d rtnetlink: make the "split" NLM_DONE handling generic
JIRA: https://issues.redhat.com/browse/RHEL-62204
Upstream Status: linux.git

commit 5b4b62a169e10401cca34a6e7ac39161986f5605
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Jun 3 11:48:26 2024 -0700

    rtnetlink: make the "split" NLM_DONE handling generic

    Jaroslav reports Dell's OMSA Systems Management Data Engine
    expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
    and inet_dump_ifaddr(). We already added a similar fix previously in
    commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again")

    Instead of modifying all the dump handlers, and making them look
    different than modern for_each_netdev_dump()-based dump handlers -
    put the workaround in rtnetlink code. This will also help us move
    the custom rtnl-locking from af_netlink in the future (in net-next).

    Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
    is a different kettle of fish and a potential problem. We now mix families
    in a single recvmsg(), but NLM_DONE is not coalesced.

    Tested:

      ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
               --dump getaddr --json '{"ifa-family": 2}'

      ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
               --dump getroute --json '{"rtm-family": 2}'

      ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
               --dump getlink

    Fixes: 3e41af90767d ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
    Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
    Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
    Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-11-14 10:16:49 +01:00
Antoine Tenart 7ab8c5dc6d rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()
JIRA: https://issues.redhat.com/browse/RHEL-62204
Upstream Status: linux.git

commit 3e41af90767dcf8e5ca91cfbbbcb772584940df9
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Feb 11 21:44:04 2024 +0000

    rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()

    Adopt net->dev_by_index as I did in commit 0e0939c0adf9
    ("net-procfs: use xarray iterator to implement /proc/net/dev")

    This makes sure an existing device is always visible in the dump,
    regardless of concurrent insertions/deletions.

    v2: added suggestions from Jakub Kicinski and Ido Schimmel,
        thanks for the help !

    Link: https://lore.kernel.org/all/20240209142441.6c56435b@kernel.org/
    Link: https://lore.kernel.org/all/ZckR-XOsULLI9EHc@shredder/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Link: https://lore.kernel.org/r/20240211214404.1882191-3-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-11-14 10:16:49 +01:00
Ivan Vecera dd746452b2 rtnetlink: provide RCU protection to rtnl_fill_prop_list()
JIRA: https://issues.redhat.com/browse/RHEL-62123

commit 0ec4e48c3a233820e0bce1f5ba9ed3e4520f90e9
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 22 10:50:21 2024 +0000

    rtnetlink: provide RCU protection to rtnl_fill_prop_list()

    We want to be able to run rtnl_fill_ifinfo() under RCU protection
    instead of RTNL in the future.

    dev->name_node items are already rcu protected.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-10-24 16:14:44 +02:00
Ivan Vecera 938311f7cc rtnetlink: make rtnl_fill_link_ifmap() RCU ready
JIRA: https://issues.redhat.com/browse/RHEL-62123

commit 74808e72e0b2d7cac886151198c0330daadaee70
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 22 10:50:20 2024 +0000

    rtnetlink: make rtnl_fill_link_ifmap() RCU ready

    Use READ_ONCE() to read the following device fields:

            dev->mem_start
            dev->mem_end
            dev->base_addr
            dev->irq
            dev->dma
            dev->if_port

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-10-24 16:14:43 +02:00
Ivan Vecera 8da88cd9da rtnetlink: add RTNL_FLAG_DUMP_UNLOCKED flag
JIRA: https://issues.redhat.com/browse/RHEL-62123

commit 386520e0ecc01004d3a29c70c5a77d4bbf8a8420
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 22 10:50:15 2024 +0000

    rtnetlink: add RTNL_FLAG_DUMP_UNLOCKED flag

    Similarly to RTNL_FLAG_DOIT_UNLOCKED, this new flag
    allows dump operations registered via rtnl_register()
    or rtnl_register_module() to opt-out from RTNL protection.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-10-24 16:14:43 +02:00
Ivan Vecera f183fb3c8a rtnetlink: prepare nla_put_iflink() to run under RCU
JIRA: https://issues.redhat.com/browse/RHEL-62123

Conflicts:
* drivers/net/netkit.c
  - hunk omitted as the driver is not present in RHEL
* net/dsa/user.c
  - the hunk applied in dsa/slave.c due to absence of DSA deps

commit e353ea9ce471331c13edffd5977eadd602d1bb80
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 22 10:50:08 2024 +0000

    rtnetlink: prepare nla_put_iflink() to run under RCU

    We want to be able to run rtnl_fill_ifinfo() under RCU protection
    instead of RTNL in the future.

    This patch prepares dev_get_iflink() and nla_put_iflink()
    to run either with RTNL or RCU held.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-10-24 16:14:43 +02:00
Rado Vrbovsky f177edd8c5 Merge: CNB96: netdev_features: start cleaning netdev_features_t up
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5362

JIRA: https://issues.redhat.com/browse/RHEL-59091

Explanation from the upstream cover letter by Alexander Lobakin:

> NETDEV_FEATURE_COUNT is currently 64, which means we can't add any new
> features as netdev_features_t is u64.
> As per several discussions, instead of converting netdev_features_t to
> a bitmap, which would mean A LOT of changes, we can try cleaning up
> netdev feature bits.
> There's a bunch of bits which don't really mean features, rather device
> attributes/properties that can't be changed via Ethtool in any of the
> drivers. Such attributes can be moved to netdev private flags without
> losing any functionality.
> 
> Start converting some read-only netdev features to private flags from
> the ones that are most obvious, like lockless Tx, inability to change
> network namespace etc. I was able to reduce NETDEV_FEATURE_COUNT from
> 64 to 60, which means 4 free slots for new features. There are obviously
> more read-only features to convert, such as highDMA, "challenged VLAN",
> HSR (4 bits) - this will be done in subsequent series.
> Please note that netdev features are not uAPI/ABI by any means. Ethtool
> passes their names and bits to the userspace separately and there are no
> hardcoded names/bits in the userspace, so that new Ethtool could work
> on older kernels and vice versa. Even shell scripts won't most likely
> break since the removed bits were always read-only, meaning nobody would
> try touching them from a script.

I proposed a Release Note Text in the Jira to document that "tx-lockless", "netns-local", "fcoe-mtu" will no longer appear in "ethtool -k". 

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-20 09:09:03 +00:00
Michal Schmidt 12a989692f netdevice: convert private flags > BIT(31) to bitfields
JIRA: https://issues.redhat.com/browse/RHEL-59091

commit beb5a9bea8239cdf4adf6b62672e30db3e9fa5ce
Author: Alexander Lobakin <aleksander.lobakin@intel.com>
Date:   Thu Aug 29 14:33:36 2024 +0200

    netdevice: convert private flags > BIT(31) to bitfields

    Make dev->priv_flags `u32` again and define bits higher than 31 as
    bitfield booleans as per Jakub's suggestion. This simplifies code
    which accesses these bits with no optimization loss (testb both
    before/after), avoids extending &netdev_priv_flags each time,
    and also scales better, as bits > 63 in the future would only add
    a new u64 to the structure with no complications, whereas
    extending ::priv_flags would require converting it to a bitmap.
    Note that I picked `unsigned long :1` to not lose any potential
    optimizations compared to `bool :1` etc.

    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Conflicts:
	drivers/net/ethernet/microchip/lan966x/lan966x_main.c
	- Driver not present in RHEL 9.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2024-10-03 17:59:39 +02:00
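
The layout change described in the private-flags commit above can be sketched as a 32-bit mask next to one-bit bitfields. This is an illustration only: `struct toy_netdev` and the names `flag_a` and `flag_b` are hypothetical, and `unsigned long` bitfields are a compiler extension accepted by GCC and Clang rather than strict ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device structure, not the kernel's struct net_device. */
struct toy_netdev {
	uint32_t priv_flags;     /* classic bitmask, limited to bits 0..31 */
	unsigned long flag_a:1;  /* a flag that would not fit in the mask */
	unsigned long flag_b:1;  /* adding another costs no new enum value */
};

/* Accessing a bitfield needs no mask or shift in the calling code. */
static int toy_flag_b_set(const struct toy_netdev *dev)
{
	return dev->flag_b;
}
```

Readers test `dev->flag_b` directly instead of writing `dev->priv_flags & SOME_BIT`, which is the simplification the commit message mentions.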
Michal Schmidt 8e7994801b netlink: introduce type-checking attribute iteration
JIRA: https://issues.redhat.com/browse/RHEL-57750

commit e8058a49e67fe7bc7e4a0308851a3ca3a6d2e45d
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Thu Mar 28 20:31:45 2024 +0100

    netlink: introduce type-checking attribute iteration

    There are, especially with multi-attr arrays, many cases
    of needing to iterate all attributes of a specific type
    in a netlink message or a nested attribute. Add specific
    macros to support that case.

    Also convert many instances using this spatch:

        @@
        iterator nla_for_each_attr;
        iterator name nla_for_each_attr_type;
        identifier nla;
        expression head, len, rem;
        expression ATTR;
        type T;
        identifier x;
        @@
        -nla_for_each_attr(nla, head, len, rem)
        +nla_for_each_attr_type(nla, ATTR, head, len, rem)
         {
        <... T x; ...>
        -if (nla_type(nla) == ATTR) {
         ...
        -}
         }

        @@
        identifier nla;
        iterator nla_for_each_nested;
        iterator name nla_for_each_nested_type;
        expression attr, rem;
        expression ATTR;
        type T;
        identifier x;
        @@
        -nla_for_each_nested(nla, attr, rem)
        +nla_for_each_nested_type(nla, ATTR, attr, rem)
         {
        <... T x; ...>
        -if (nla_type(nla) == ATTR) {
         ...
        -}
         }

        @@
        iterator nla_for_each_attr;
        iterator name nla_for_each_attr_type;
        identifier nla;
        expression head, len, rem;
        expression ATTR;
        type T;
        identifier x;
        @@
        -nla_for_each_attr(nla, head, len, rem)
        +nla_for_each_attr_type(nla, ATTR, head, len, rem)
         {
        <... T x; ...>
        -if (nla_type(nla) != ATTR) continue;
         ...
         }

        @@
        identifier nla;
        iterator nla_for_each_nested;
        iterator name nla_for_each_nested_type;
        expression attr, rem;
        expression ATTR;
        type T;
        identifier x;
        @@
        -nla_for_each_nested(nla, attr, rem)
        +nla_for_each_nested_type(nla, ATTR, attr, rem)
         {
        <... T x; ...>
        -if (nla_type(nla) != ATTR) continue;
         ...
         }

    Although I had to undo one bad change this made, and
    I also adjusted some other code for whitespace and to
    use direct variable initialization now.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Conflicts:
	drivers/net/ethernet/netronome/nfp/nfp_net_common.c
	- The driver lacks .ndo_bridge_setlink implementation in RHEL 9.
	net/core/bpf_sk_storage.c
	- Missing commit bcc29b7f5af6 ("bpf: Add length check for
	  SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing")

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
2024-10-01 12:19:13 +02:00
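
The pattern the spatch in the attribute-iteration commit above rewrites, a loop followed by a type check, can be shown with a toy iterator whose filter lives in the macro. This is a simplified sketch: real netlink attributes are variable-length TLVs, whereas the hypothetical `struct toy_attr` here is a fixed-size array element, and all `toy_*` names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size attribute; real netlink attributes are TLVs. */
struct toy_attr {
	uint16_t type;
	uint16_t value;
};

/* Visit every element of a toy attribute array. */
#define toy_for_each_attr(pos, head, n) \
	for ((pos) = (head); (pos) < (head) + (n); (pos)++)

/* Visit only elements of the wanted type; the loop body needs no check. */
#define toy_for_each_attr_type(pos, want, head, n) \
	toy_for_each_attr(pos, head, n) \
		if ((pos)->type == (want))

/* Sums the values of all attributes whose type equals 'want'. */
static unsigned int toy_sum_type(const struct toy_attr *head, size_t n,
				 uint16_t want)
{
	const struct toy_attr *pos;
	unsigned int sum = 0;

	toy_for_each_attr_type(pos, want, head, n)
		sum += pos->value;
	return sum;
}
```

Because the macro ends in an `if`, a body followed by `else` would bind to that hidden `if`; that hazard is a property of this toy macro and is worth keeping in mind for any iterator built this way.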
Ivan Vecera c1b0641934 net: remove dev_base_lock from do_setlink()
JIRA: https://issues.redhat.com/browse/RHEL-59100

commit 2dd4d828d648e101aaf19326afcdfee8667cb185
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 13 06:32:43 2024 +0000

    net: remove dev_base_lock from do_setlink()

    We hold RTNL here, and dev->link_mode readers already
    are using READ_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-17 12:17:18 +02:00
Ivan Vecera 2e6db4aa04 net: add netdev_set_operstate() helper
JIRA: https://issues.redhat.com/browse/RHEL-59100

commit 6a2968ee1ee2cc6fce30f6f5724442b34b1483b3
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 13 06:32:42 2024 +0000

    net: add netdev_set_operstate() helper

    dev_base_lock is going away, add netdev_set_operstate() helper
    so that hsr does not have to know core internals.

    Remove dev_base_lock acquisition from rfc2863_policy()

    v3: use an "unsigned int" for dev->operstate,
        so that try_cmpxchg() can work on all arches.
            ( https://lore.kernel.org/oe-kbuild-all/202402081918.OLyGaea3-lkp@intel.com/ )

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-17 12:17:17 +02:00
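
A lock-free state update of the kind the netdev_set_operstate() commit above alludes to with try_cmpxchg() can be sketched with C11 atomics. The commit message does not show the helper's body, so the loop shape, `struct toy_dev`, and `toy_set_operstate()` are assumptions, and `atomic_compare_exchange_weak()` stands in for the kernel primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device: an atomic state plus a change counter. */
struct toy_dev {
	_Atomic unsigned int operstate;
	unsigned int notifications;
};

/*
 * Sets the state without taking a lock. Returns true if the state changed.
 * On a failed exchange, 'old' is refreshed with the current value and the
 * loop re-checks whether an update is still needed.
 */
static bool toy_set_operstate(struct toy_dev *dev, unsigned int newstate)
{
	unsigned int old = atomic_load_explicit(&dev->operstate,
						memory_order_relaxed);

	do {
		if (old == newstate)
			return false;
	} while (!atomic_compare_exchange_weak(&dev->operstate, &old,
					       newstate));

	dev->notifications++; /* stand-in for a state-change notification */
	return true;
}
```

The v3 note about using an `unsigned int` fits this shape: a compare-and-exchange operates on a whole machine word rather than on a narrower field.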
Ivan Vecera 116bf3b894 net-sysfs: convert dev->operstate reads to lockless ones
JIRA: https://issues.redhat.com/browse/RHEL-59100

commit 004d138364fd10dd5ff8ceb54cfdc2d792a7b338
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 13 06:32:39 2024 +0000

    net-sysfs: convert dev->operstate reads to lockless ones

    operstate_show() can omit dev_base_lock acquisition only
    to read dev->operstate.

    Annotate accesses to dev->operstate.

    Writers still acquire dev_base_lock for mutual exclusion.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-17 12:17:15 +02:00
Ivan Vecera 218f188cbf dev: annotate accesses to dev->link
JIRA: https://issues.redhat.com/browse/RHEL-59100

commit a6473fe9b623f6667af72d972b87cd9a5ff87e21
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 13 06:32:35 2024 +0000

    dev: annotate accesses to dev->link

    The following patch will read dev->link locklessly,
    annotate the write from do_setlink().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-17 12:17:12 +02:00
Ivan Vecera 4863bafaf6 net: core: synchronize link-watch when carrier is queried
JIRA: https://issues.redhat.com/browse/RHEL-59100

commit facd15dfd69122042502d99ab8c9f888b48ee994
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Mon Dec 4 21:47:07 2023 +0100

    net: core: synchronize link-watch when carrier is queried

    There are multiple ways to query for the carrier state: through
    rtnetlink, sysfs, and (possibly) ethtool. Synchronize linkwatch
    work before these operations so that we don't have a situation
    where userspace queries the carrier state between the driver's
    carrier off->on transition and linkwatch running and expects it
    to work, when really (at least) TX cannot work until linkwatch
    has run.

    I previously posted a longer explanation of how this applies to
    wireless [1] but with this wireless can simply query the state
    before sending data, to ensure the kernel is ready for it.

    [1] https://lore.kernel.org/all/346b21d87c69f817ea3c37caceb34f1f56255884.camel@sipsolutions.net/

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20231204214706.303c62768415.I1caedccae72ee5a45c9085c5eb49c145ce1c0dd5@changeid
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-17 12:17:10 +02:00
Davide Caratti 1809a32ff2 rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validation
JIRA: https://issues.redhat.com/browse/RHEL-39715
CVE: CVE-2024-36017
Upstream Status: net.git commit 1aec77b2bb2ed1db0f5efc61c4c1ca3813307489

commit 1aec77b2bb2ed1db0f5efc61c4c1ca3813307489
Author: Roded Zats <rzats@paloaltonetworks.com>
Date:   Thu May 2 18:57:51 2024 +0300

    rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validation

    Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a
    struct ifla_vf_vlan_info so the size of such attribute needs to be at least
    of sizeof(struct ifla_vf_vlan_info) which is 14 bytes.
    The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes)
    which is less than sizeof(struct ifla_vf_vlan_info) so this validation
    is not enough and a too small attribute might be cast to a
    struct ifla_vf_vlan_info, this might result in an out of bounds
    read access when accessing the saved (casted) entry in ivvl.

    Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
    Signed-off-by: Roded Zats <rzats@paloaltonetworks.com>
    Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
    Link: https://lore.kernel.org/r/20240502155751.75705-1-rzats@paloaltonetworks.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-20 11:16:44 +02:00
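
The difference between the old and new length checks in the IFLA_VF_VLAN_LIST commit above can be illustrated with a small sketch. The constants take the sizes quoted in the commit message (4 and 14 bytes); the function names are hypothetical and this is not the code of do_setvfinfo().

```c
#include <stdbool.h>
#include <stddef.h>

/* Sizes as quoted in the commit message; names are illustrative only. */
#define DEMO_NLA_HDRLEN     4   /* netlink attribute header length */
#define DEMO_VLAN_INFO_SIZE 14  /* quoted size of struct ifla_vf_vlan_info */

/* Old check: accepts any payload of at least header size. */
bool demo_vlan_attr_ok_old(size_t payload_len)
{
    return payload_len >= DEMO_NLA_HDRLEN;
}

/* Fixed check: the payload must be able to hold the whole structure. */
bool demo_vlan_attr_ok_new(size_t payload_len)
{
    return payload_len >= DEMO_VLAN_INFO_SIZE;
}
```

An 8-byte payload passes the old check but fails the new one, which is the case that could otherwise be cast to the larger structure and read past its end.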
Lucas Zampieri a1c1d84297 Merge: rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4431

JIRA: https://issues.redhat.com/browse/RHEL-36874  
CVE: CVE-2024-27414  
Upstream Status: all mainline in net.git  
Conflicts: None  
Tested: boot-tested only  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-19 18:25:40 +00:00
Lucas Zampieri 1cc33b9d3b Merge: CNB95: bridge: update bridge core to upstream v6.8
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4261

JIRA: https://issues.redhat.com/browse/RHEL-36219  
Depends: !4249  
Tested: using existing bridge self-tests  

Commits:
```
29cfb2aaa442 ("bridge: Add backup nexthop ID support")
b408453053fb ("selftests: net: Add bridge backup port and backup nexthop ID test")
cbf51acbc5d5 ("net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry")
bdb4dfda3b41 ("net: bridge: Track and limit dynamically learned FDB entries")
ddd1ad68826d ("net: bridge: Add netlink knobs for number / max learned FDB entries")
19297c3ab23c ("net: bridge: Set strict_start_type for br_policy")
6f84090333bb ("selftests: forwarding: bridge_fdb_learning_limit: Add a new selftest")
ee6f05dcd672 ("br_netfilter: use single forward hook for ip and arp")
b9109b5b77f0 ("bridge: mcast: Dump MDB entries even when snooping is disabled")
1b6d993509c1 ("bridge: mcast: Account for missing attributes")
62ef9cba98a2 ("bridge: mcast: Factor out a helper for PG entry size calculation")
6d0259dd6c53 ("bridge: mcast: Rename MDB entry get function")
ff97d2a956a1 ("vxlan: mdb: Adjust function arguments")
14c32a46d992 ("vxlan: mdb: Factor out a helper for remote entry size calculation")
68b380a395a7 ("bridge: mcast: Add MDB get support")
32d9673e96dc ("vxlan: mdb: Add MDB get support")
ddd17a54e692 ("rtnetlink: Add MDB get support")
e8bba9e83c88 ("selftests: bridge_mdb: Use MDB get instead of dump")
0514dd05939a ("selftests: vxlan_mdb: Use MDB get instead of dump")
6808918343a8 ("net: bridge: fill in MODULE_DESCRIPTION()")
e8a4195d843f ("docs: bridge: update doc format to rst")
8ebe06611666 ("net: bridge: add document for IFLA_BR enum")
8c4bafdb01cc ("net: bridge: add document for IFLA_BRPORT enum")
bcc1f84e4d34 ("docs: bridge: Add kAPI/uAPI fields")
567d2608209f ("docs: bridge: add STP doc")
041a6ac4bf79 ("docs: bridge: add VLAN doc")
75ceac88efb8 ("docs: bridge: add multicast doc")
3c37f17d6ca9 ("docs: bridge: add switchdev doc")
1b1a4c7e82ae ("docs: bridge: add netfilter doc")
d2afc2cd7f1f ("docs: bridge: add other features")
25ae948b4478 ("selftests/net: add lib.sh")
4624a78c18c6 ("selftests/net: convert test_bridge_backup_port.sh to run it in unique namespace")
312abe3d93a3 ("selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace")
e37a11fca418 ("bridge: add MDB state mask uAPI attribute")
a6acb535afb2 ("bridge: mdb: Add MDB bulk deletion support")
4cde72fead4c ("vxlan: mdb: Add MDB bulk deletion support")
bd2dcb94c81e ("selftests: bridge_mdb: Add MDB bulk deletion test")
c3e87a7fcd0b ("selftests: vxlan_mdb: Add MDB bulk deletion test")
c2b2ee36250d ("bridge: cfm: fix enum typo in br_cc_ccm_tx_parse")
2114e83381d3 ("selftests: forwarding: Avoid failures to source net/lib.sh")
49078c1b80b6 ("selftests: forwarding: Remove executable bits from lib.sh")
fc836129f708 ("selftests/net/lib: update busywait timeout value")
f5c3eb4b7251 ("bridge: mcast: fix disabled snooping after long uptime")
b40f873a7c80 ("selftests: net: Add missing matchall classifier")
96cd5ac4c0e6 ("selftests: forwarding: List helper scripts in TEST_FILES Makefile variable")
38ee0cb2a2e2 ("selftests: net: Fix bridge backup port test flakiness")
93590849a05e ("selftests: forwarding: Fix layer 2 miss test flakiness")
7399e2ce4d42 ("selftests: forwarding: Fix bridge MDB test flakiness")
dd6b34589441 ("selftests: forwarding: Suppress grep warnings")
f97f1fcc9690 ("selftests: forwarding: Fix bridge locked port test flakiness")
dc489f86257c ("net: bridge: switchdev: Skip MDB replays of deferred events on offload")
f7a70d650b0b ("net: bridge: switchdev: Ensure deferred event delivery on unoffload")
9adcac650618 ("netlink: specs: Add missing bridge linkinfo attrs")
83e93942796d ("selftests/net/lib: no need to record ns name if it already exist")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-10 13:42:40 +00:00
Davide Caratti db1c39363a rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back
JIRA: https://issues.redhat.com/browse/RHEL-36874
CVE: CVE-2024-27414
Upstream Status: net.git commit 743ad091fb46e622f1b690385bb15e3cd3daf874

commit 743ad091fb46e622f1b690385bb15e3cd3daf874
Author: Lin Ma <linma@zju.edu.cn>
Date:   Tue Feb 27 20:11:28 2024 +0800

    rtnetlink: fix error logic of IFLA_BRIDGE_FLAGS writing back

    In the commit d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks
    IFLA_BRIDGE_MODE length"), an adjustment was made to the old loop logic
    in the function `rtnl_bridge_setlink` to enable the loop to also check
    the length of the IFLA_BRIDGE_MODE attribute. However, this adjustment
    removed the `break` statement and led to an error logic of the flags
    writing back at the end of this function.

    if (have_flags)
        memcpy(nla_data(attr), &flags, sizeof(flags));
        // attr should point to IFLA_BRIDGE_FLAGS NLA !!!

    Before the mentioned commit, the `attr` is guaranteed to be IFLA_BRIDGE_FLAGS.
    However, this is not necessarily true for now as the updated loop will let
    the attr point to the last NLA, even an invalid NLA which could cause
    overflow writes.

    This patch introduces a new variable `br_flag` to save the NLA pointer
    that points to IFLA_BRIDGE_FLAGS and uses it to resolve the mentioned
    error logic.

    Fixes: d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length")
    Signed-off-by: Lin Ma <linma@zju.edu.cn>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Link: https://lore.kernel.org/r/20240227121128.608110-1-linma@zju.edu.cn
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:14:39 +02:00
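
The fix described in the IFLA_BRIDGE_FLAGS commit above amounts to remembering the flags attribute in its own pointer instead of reusing the loop cursor. A minimal sketch under simplified assumptions: `struct demo_attr`, the `DEMO_*` types, and `demo_write_back_flags()` are hypothetical and replace real netlink attribute parsing with a plain array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types standing in for the IFLA_BRIDGE_* ones. */
enum demo_attr_type { DEMO_BRIDGE_FLAGS, DEMO_BRIDGE_MODE };

struct demo_attr {
    enum demo_attr_type type;
    uint16_t value;
};

/*
 * Walk every attribute (so each one can still be validated), but keep a
 * dedicated pointer to the first flags attribute. Returns 1 if the flags
 * were written back, 0 if no flags attribute was present.
 */
int demo_write_back_flags(struct demo_attr *attrs, size_t n,
                          uint16_t new_flags)
{
    struct demo_attr *br_flag = NULL;

    for (size_t i = 0; i < n; i++) {
        if (attrs[i].type == DEMO_BRIDGE_FLAGS && br_flag == NULL)
            br_flag = &attrs[i];
        /* The loop keeps going; the cursor ends on the last attribute. */
    }

    if (br_flag == NULL)
        return 0;

    /* Write through the saved pointer, never through the loop cursor. */
    memcpy(&br_flag->value, &new_flags, sizeof(new_flags));
    return 1;
}
```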
Lucas Zampieri 71cdbf2b82 Merge: net: dst: Improve concurrency performance of dst_entry
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3031

JIRA: https://issues.redhat.com/browse/RHEL-15695  
Tested: verified performance improvement with memcached/memtier_bench  
with one thread per core each.  
  
The patches improve the performance of parallel local connections.  
Because the receive side of the connection is handled on the same cpu as  
the data was sent for local connections, contention and false sharing  
was observed between the sending core and the receiving core.  
  
Signed-off-by: Felix Maurer <fmaurer@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-03 19:53:47 +00:00
Felix Maurer 934e8cc341 net: dst: Switch to rcuref_t reference counting
JIRA: https://issues.redhat.com/browse/RHEL-15695

commit bc9d3a9f2afca189a6ae40225b6985e3c775375e
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Mar 23 21:55:32 2023 +0100

    net: dst: Switch to rcuref_t reference counting

    Under high contention dst_entry::__refcnt becomes a significant bottleneck.

    atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
    high retry rates on contention.

    Switch the reference count to rcuref_t which results in a significant
    performance gain. Rename the reference count member to __rcuref to reflect
    the change.

    The gain depends on the micro-architecture and the number of concurrent
    operations and has been measured in the range of +25% to +130% with a
    localhost memtier/memcached benchmark which amplifies the problem
    massively.

    Running the memtier/memcached benchmark over a real (1Gb) network
    connection the conversion on top of the false sharing fix for struct
    dst_entry::__refcnt results in a total gain in the 2%-5% range over the
    upstream baseline.

    Reported-by: Wangyang Guo <wangyang.guo@intel.com>
    Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
    Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-05-21 17:19:20 +02:00
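
For context on the rcuref_t commit above, the compare-and-exchange retry loop behind an "increment unless zero" operation looks roughly like the following. This is a user-space sketch with C11 atomics and a hypothetical `demo_inc_not_zero()`; it is neither the kernel's atomic_inc_not_zero() nor rcuref_t itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is not already zero.
 * Every failed compare-and-exchange means another thread changed the
 * counter in between, so the loop retries with the refreshed value.
 * Under heavy contention these retries are the cost the commit refers to.
 */
bool demo_inc_not_zero(atomic_int *refcnt)
{
    int old = atomic_load_explicit(refcnt, memory_order_relaxed);

    do {
        if (old == 0)
            return false; /* object is already being released */
    } while (!atomic_compare_exchange_weak(refcnt, &old, old + 1));

    return true;
}
```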
Ivan Vecera f90088128e rtnetlink: Add MDB get support
JIRA: https://issues.redhat.com/browse/RHEL-36219

commit ddd17a54e692bef1b646febf5242db10982e1965
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Wed Oct 25 15:30:18 2023 +0300

    rtnetlink: Add MDB get support

    Now that both the bridge and VXLAN drivers implement the MDB get net
    device operation, expose the functionality to user space by registering
    a handler for RTM_GETMDB messages. Derive the net device from the
    ifindex specified in the ancillary header and invoke its MDB get NDO.

    Note that unlike other get handlers, the allocation of the skb
    containing the response is not performed in the common rtnetlink code as
    the size is variable and needs to be determined by the respective
    driver.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-17 13:48:17 +02:00
Ivan Vecera a83d6a5553 bridge: Add backup nexthop ID support
JIRA: https://issues.redhat.com/browse/RHEL-36219

commit 29cfb2aaa4425a608651a05b9b875bc445394443
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Mon Jul 17 11:12:28 2023 +0300

    bridge: Add backup nexthop ID support

    Add a new bridge port attribute that allows attaching a nexthop object
    ID to an skb that is redirected to a backup bridge port with VLAN
    tunneling enabled.

    Specifically, when redirecting a known unicast packet, read the backup
    nexthop ID from the bridge port that lost its carrier and set it in the
    bridge control block of the skb before forwarding it via the backup
    port. Note that reading the ID from the bridge port should not result in
    a cache miss as the ID is added next to the 'backup_port' field that was
    already accessed. After this change, the 'state' field still stays on
    the first cache line, together with other data path related fields such
    as 'flags' and 'vlgrp':

    struct net_bridge_port {
            struct net_bridge *        br;                   /*     0     8 */
            struct net_device *        dev;                  /*     8     8 */
            netdevice_tracker          dev_tracker;          /*    16     0 */
            struct list_head           list;                 /*    16    16 */
            long unsigned int          flags;                /*    32     8 */
            struct net_bridge_vlan_group * vlgrp;            /*    40     8 */
            struct net_bridge_port *   backup_port;          /*    48     8 */
            u32                        backup_nhid;          /*    56     4 */
            u8                         priority;             /*    60     1 */
            u8                         state;                /*    61     1 */
            u16                        port_no;              /*    62     2 */
            /* --- cacheline 1 boundary (64 bytes) --- */
    [...]
    } __attribute__((__aligned__(8)));

    When forwarding an skb via a bridge port that has VLAN tunneling
    enabled, check if the backup nexthop ID stored in the bridge control
    block is valid (i.e., not zero). If so, instead of attaching the
    pre-allocated metadata (that only has the tunnel key set), allocate a
    new metadata, set both the tunnel key and the nexthop object ID and
    attach it to the skb.

    By default, do not dump the new attribute to user space as a value of
    zero is an invalid nexthop object ID.

    The above is useful for EVPN multihoming. When one of the links
    composing an Ethernet Segment (ES) fails, traffic needs to be redirected
    towards the host via one of the other ES peers. For example, if a host
    is multihomed to three different VTEPs, the backup port of each ES link
    needs to be set to the VXLAN device and the backup nexthop ID needs to
    point to an FDB nexthop group that includes the IP addresses of the
    other two VTEPs. The VXLAN driver will extract the ID from the metadata
    of the redirected skb, calculate its flow hash and forward it towards
    one of the other VTEPs. If the ID does not exist, or represents an
    invalid nexthop object, the VXLAN driver will drop the skb. This
    relieves the bridge driver from the need to validate the ID.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-17 13:47:59 +02:00
Petr Oros 149ecfe407 dpll: move all dpll<>netdev helpers to dpll code
JIRA: https://issues.redhat.com/browse/RHEL-32098

Conflicts:
- drivers/net/ethernet/mellanox/mlx5/core/dpll.c: chunk omitted due
  to missing 496fd0a26bbf73 ("mlx5: Implement SyncE support using DPLL
  infrastructure")

Upstream commit(s):
commit 289e922582af5b4721ba02e86bde4d9ba918158a
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Mar 4 17:35:32 2024 -0800

    dpll: move all dpll<>netdev helpers to dpll code

    Older versions of GCC really want to know the full definition
    of the type involved in rcu_assign_pointer().

    struct dpll_pin is defined in a local header, net/core can't
    reach it. Move all the netdev <> dpll code into dpll, where
    the type is known. Otherwise we'd need multiple function calls
    to jump between the compilation units.

    This is the same problem the commit under fixes was trying to address,
    but with rcu_assign_pointer() not rcu_dereference().

    Some of the exports are not needed, networking core can't
    be a module, we only need exports for the helpers used by
    drivers.

    Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
    Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-05-16 20:47:06 +02:00
Petr Oros 866233764b net: add rcu safety to rtnl_prop_list_size()
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 9f30831390ede02d9fcd54fd9ea5a585ab649f4a
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 9 18:12:48 2024 +0000

    net: add rcu safety to rtnl_prop_list_size()

    rtnl_prop_list_size() can be called while alternative names
    are added or removed concurrently.

    if_nlmsg_size() / rtnl_calcit() can indeed be called
    without RTNL held.

    Use explicit RCU protection to avoid UAF.

    Fixes: 88f4fb0c74 ("net: rtnetlink: put alternative names to getlink message")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240209181248.96637-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:11 +02:00
Petr Oros 74ee6c85d9 rtnetlink: bridge: Enable MDB bulk deletion
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 2601e9c4b1176253e33025ca24e56ed67c8d434f
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Sun Dec 17 10:32:42 2023 +0200

    rtnetlink: bridge: Enable MDB bulk deletion

    Now that both the common code as well as individual drivers support MDB
    bulk deletion, allow user space to make such requests.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:10 +02:00
Petr Oros 0dc57cd5b3 rtnetlink: bridge: Invoke MDB bulk deletion when needed
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit d8e81f131178dad603c6817421056030ed2f4ac2
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Sun Dec 17 10:32:39 2023 +0200

    rtnetlink: bridge: Invoke MDB bulk deletion when needed

    Invoke the new MDB bulk deletion device operation when the 'NLM_F_BULK'
    flag is set in the netlink message header.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:10 +02:00
Petr Oros 3574764a85 rtnetlink: bridge: Use a different policy for MDB bulk delete
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit e0cd06f7fcb51b8acd6e68e64cc805be1283de9d
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Sun Dec 17 10:32:37 2023 +0200

    rtnetlink: bridge: Use a different policy for MDB bulk delete

    For MDB bulk delete we will need to validate 'MDBA_SET_ENTRY'
    differently compared to regular delete. Specifically, allow the ifindex
    to be zero (in case not filtering on bridge port) and force the address
    to be zero as bulk delete based on address is not supported.

    Do that by introducing a new policy and choosing the correct policy
    based on the presence of the 'NLM_F_BULK' flag in the netlink message
    header. Use nlmsg_parse() for strict validation.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:10 +02:00
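
The flag-dependent validation described in the MDB bulk delete policy commit above can be sketched as follows. The entry layout, the `DEMO_NLM_F_BULK` value, and the function names are assumptions made for illustration, and the rule applied to regular deletes (non-zero ifindex and non-zero address) is a simplification; the real code uses netlink policies rather than a hand-written check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag value for this sketch; the kernel defines NLM_F_BULK. */
#define DEMO_NLM_F_BULK 0x200u

/* Hypothetical, simplified MDB entry. */
struct demo_mdb_entry {
    int ifindex;
    unsigned char addr[16];
};

bool demo_addr_is_zero(const struct demo_mdb_entry *e)
{
    for (size_t i = 0; i < sizeof(e->addr); i++)
        if (e->addr[i] != 0)
            return false;
    return true;
}

/* Pick the validation rule from the presence of the bulk flag. */
int demo_validate_mdb_del(unsigned int nlmsg_flags,
                          const struct demo_mdb_entry *e)
{
    if (nlmsg_flags & DEMO_NLM_F_BULK) {
        /* Bulk: ifindex may be zero, the address must be zero. */
        return demo_addr_is_zero(e) ? 0 : -EINVAL;
    }

    /* Regular delete (simplified): both fields must be set. */
    if (e->ifindex == 0 || demo_addr_is_zero(e))
        return -EINVAL;
    return 0;
}
```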
Petr Oros 2e71d91d3c net: rtnl: use rcu_replace_pointer_rtnl in rtnl_unregister_*
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 174523479aae31b17c043de127c87ff2aef3d54e
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Fri Dec 15 14:57:11 2023 -0300

    net: rtnl: use rcu_replace_pointer_rtnl in rtnl_unregister_*

    With the introduction of the rcu_replace_pointer_rtnl helper,
    cleanup the rtnl_unregister_* functions to use the helper instead
    of open coding it.

    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:08 +02:00
Petr Oros 147089bf66 rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlink
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit ac40916a3f7243efbe6e129ebf495b5c33a3adfe
Author: Li RongQing <lirongqing@baidu.com>
Date:   Wed Nov 15 20:01:08 2023 +0800

    rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlink

    If a PF has 256 or more VFs, the ip link command will request an
    order-3 or larger allocation, which may trigger OOM due to memory
    fragmentation. The memory needed for the VFs is computed in
    rtnl_vfinfo_size.

    Introduce nlmsg_new_large, which calls netlink_alloc_large_skb, where
    vmalloc is used for large allocations, to avoid allocation failures.

        ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\
            __GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0
        CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P           OE
        Call Trace:
        dump_stack+0x57/0x6a
        dump_header+0x4a/0x210
        oom_kill_process+0xe4/0x140
        out_of_memory+0x3e8/0x790
        __alloc_pages_slowpath.constprop.116+0x953/0xc50
        __alloc_pages_nodemask+0x2af/0x310
        kmalloc_large_node+0x38/0xf0
        __kmalloc_node_track_caller+0x417/0x4d0
        __kmalloc_reserve.isra.61+0x2e/0x80
        __alloc_skb+0x82/0x1c0
        rtnl_getlink+0x24f/0x370
        rtnetlink_rcv_msg+0x12c/0x350
        netlink_rcv_skb+0x50/0x100
        netlink_unicast+0x1b2/0x280
        netlink_sendmsg+0x355/0x4a0
        sock_sendmsg+0x5b/0x60
        ____sys_sendmsg+0x1ea/0x250
        ___sys_sendmsg+0x88/0xd0
        __sys_sendmsg+0x5e/0xa0
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f95a65a5b70

    Cc: Yunsheng Lin <linyunsheng@huawei.com>
    Signed-off-by: Li RongQing <lirongqing@baidu.com>
    Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:06 +02:00
Petr Oros e45e8ed090 net: Handle bulk delete policy in bridge driver
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 38985e8c278b82e6d4d62d4acd57c761cc23ce63
Author: Amit Cohen <amcohen@nvidia.com>
Date:   Mon Oct 9 13:06:08 2023 +0300

    net: Handle bulk delete policy in bridge driver

    The merge commit 92716869375b ("Merge branch 'br-flush-filtering'")
    added support for FDB flushing in bridge driver. The following patches
    will extend VXLAN driver to support FDB flushing as well. The netlink
    message for bulk delete is shared between the drivers. With the existing
    implementation, there is no way to prevent user from flushing with
    attributes that are not supported per driver. For example, once VNI is
    added, a user will not get an error when flushing FDB entries in a bridge
    with VNI, although this attribute is not relevant to the bridge.

    As preparation for support of FDB flush in VXLAN driver, move the policy
    to be handled in bridge driver, later a new policy for VXLAN will be
    added in VXLAN driver. Do not pass 'vid' as part of ndo_fdb_del_bulk(),
    as this field is relevant only for bridge.

    Signed-off-by: Amit Cohen <amcohen@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:01 +02:00
Ivan Vecera 7d341ab9ea net: validate veth and vxcan peer ifindexes
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit f534f6581ec084fe94d6759f7672bd009794b07e
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Fri Aug 18 18:26:02 2023 -0700

    net: validate veth and vxcan peer ifindexes

    veth and vxcan need to make sure the ifindexes of the peer
    are not negative, core does not validate this.

    Using iproute2 with user-space-level checking removed:

    Before:

      # ./ip link add index 10 type veth peer index -1
      # ip link show
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
        link/ether 52:54:00:74:b2:03 brd ff:ff:ff:ff:ff:ff
      10: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether 8a:90:ff:57:6d:5d brd ff:ff:ff:ff:ff:ff
      -1: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether ae:ed:18:e6:fa:7f brd ff:ff:ff:ff:ff:ff

    Now:

      $ ./ip link add index 10 type veth peer index -1
      Error: ifindex can't be negative.

    This problem surfaced in net-next because an explicit WARN()
    was added, the root cause is older.

    Fixes: e6f8f1a739 ("veth: Allow to create peer link with given ifindex")
    Fixes: a8f820a380 ("can: add Virtual CAN Tunnel driver (vxcan)")
    Reported-by: syzbot+5ba06978f34abb058571@syzkaller.appspotmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:31 +02:00