Commit Graph

60 Commits

Author SHA1 Message Date
Ivan Vecera aca7726276 net: bridge: switchdev: Ensure deferred event delivery on unoffload
JIRA: https://issues.redhat.com/browse/RHEL-36219

commit f7a70d650b0b6b0134ccba763d672c8439d9f09b
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Wed Feb 14 22:40:04 2024 +0100

    net: bridge: switchdev: Ensure deferred event delivery on unoffload

    When unoffloading a device, it is important to ensure that all
    relevant deferred events are delivered to it before it disassociates
    itself from the bridge.

    Before this change, this was true for the normal case when a device
    maps 1:1 to a net_bridge_port, i.e.

       br0
       /
    swp0

    When swp0 leaves br0, the call to switchdev_deferred_process() in
    del_nbp() makes sure to process any outstanding events while the
    device is still associated with the bridge.

    In the case when the association is indirect though, i.e. when the
    device is attached to the bridge via an intermediate device, like a
    LAG...

        br0
        /
      lag0
      /
    swp0

    ...then detaching swp0 from lag0 does not cause any net_bridge_port to
    be deleted, so there was no guarantee that all events had been
    processed before the device disassociated itself from the bridge.

    Fix this by always synchronously processing all deferred events before
    signaling completion of unoffloading back to the driver.

    Fixes: 4e51bf44a03a ("net: bridge: move the switchdev object replay helpers to "push" mode")
    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-17 13:49:52 +02:00
Ivan Vecera e9948493f6 net: bridge: switchdev: Skip MDB replays of deferred events on offload
JIRA: https://issues.redhat.com/browse/RHEL-36219

commit dc489f86257cab5056e747344f17a164f63bff4b
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Wed Feb 14 22:40:03 2024 +0100

    net: bridge: switchdev: Skip MDB replays of deferred events on offload

    Before this change, generation of the list of MDB events to replay
    would race against the creation of new group memberships, either from
    the IGMP/MLD snooping logic or from user configuration.

    While new memberships are immediately visible to walkers of
    br->mdb_list, the notification of their existence to switchdev event
    subscribers is deferred until a later point in time. So if a replay
    list was generated during a time that overlapped with such a window,
    it would also contain a replay of the not-yet-delivered event.

    The driver would thus receive two copies of what the bridge internally
    considered to be one single event. On destruction of the bridge, only
    a single membership deletion event was therefore sent. As a
    consequence of this, drivers which reference count memberships (at
    least DSA), would be left with orphan groups in their hardware
    database when the bridge was destroyed.

    This is only an issue when replaying additions. While deletion events
    may still be pending on the deferred queue, they will already have
    been removed from br->mdb_list, so no duplicates can be generated in
    that scenario.

    To a user this meant that old group memberships, from a bridge in
    which a port was previously attached, could be reanimated (in
    hardware) when the port joined a new bridge, without the new bridge's
    knowledge.

    For example, on an mv88e6xxx system, create a snooping bridge and
    immediately add a port to it:

        root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
        > ip link set dev x3 up master br0

    And then destroy the bridge:

        root@infix-06-0b-00:~$ ip link del dev br0
        root@infix-06-0b-00:~$ mvls atu
        ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
        DEV:0 Marvell 88E6393X
        33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
        33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
        ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
        root@infix-06-0b-00:~$

    The two IPv6 groups remain in the hardware database because the
    port (x3) is notified of the host's membership twice: once via the
    original event and once via a replay. Since only a single delete
    notification is sent, the count remains at 1 when the bridge is
    destroyed.

    Then add the same port (or another port belonging to the same hardware
    domain) to a new bridge, this time with snooping disabled:

        root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
        > ip link set dev x3 up master br1

    All multicast, including the two IPv6 groups from br0, should now be
    flooded, according to the policy of br1. But instead the old
    memberships are still active in the hardware database, causing the
    switch to only forward traffic to those groups towards the CPU (port
    0).

    Eliminate the race in two steps:

    1. Grab the write-side lock of the MDB while generating the replay
       list.

    This prevents new memberships from showing up while we are generating
    the replay list. But it leaves the scenario in which a deferred event
    was already generated, but not delivered, before we grabbed the
    lock. Therefore:

    2. Make sure that no deferred version of a replay event is already
       enqueued to the switchdev deferred queue, before adding it to the
       replay list, when replaying additions.

    Fixes: 4f2673b3a2 ("net: bridge: add helper to replay port and host-joined mdb entries")
    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-17 13:49:51 +02:00
Ivan Vecera 0dbc141e0c net: switchdev: Add a helper to replay objects on a bridge port
JIRA: https://issues.redhat.com/browse/RHEL-862

commit f2e2857b352277a451e2f91409e461fa7ebf2d15
Author: Petr Machata <petrm@nvidia.com>
Date:   Wed Jul 19 13:01:17 2023 +0200

    net: switchdev: Add a helper to replay objects on a bridge port

    When a front panel joins a bridge via another netdevice (typically a LAG),
    the driver needs to learn about the objects configured on the bridge port.
    When the bridge port is offloaded by the driver for the first time, this
    can be achieved by passing a notifier to switchdev_bridge_port_offload().
    The notifier is then invoked for the individual objects (such as VLANs)
    configured on the bridge, and can look for the interesting ones.

    Calling switchdev_bridge_port_offload() when the second port joins the
    bridge lower is unnecessary, but the replay is still needed. To that end,
    add a new function, switchdev_bridge_port_replay(), which does only the
    replay part of the _offload() function in exactly the same way as that
    function.

    Cc: Jiri Pirko <jiri@resnulli.us>
    Cc: Ivan Vecera <ivecera@redhat.com>
    Cc: Roopa Prabhu <roopa@nvidia.com>
    Cc: Nikolay Aleksandrov <razor@blackwall.org>
    Cc: bridge@lists.linux-foundation.org
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Danielle Ratson <danieller@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-05 15:29:48 +01:00
Ivan Vecera a311f72e10 net: bridge: br_switchdev: Tolerate -EOPNOTSUPP when replaying MDB
JIRA: https://issues.redhat.com/browse/RHEL-862

commit 989280d6ea70b2d91b8b0e20ac11a6529a37ac08
Author: Petr Machata <petrm@nvidia.com>
Date:   Wed Jul 19 13:01:16 2023 +0200

    net: bridge: br_switchdev: Tolerate -EOPNOTSUPP when replaying MDB

    There are two kinds of MDB entries to be replayed: port MDB entries, and
    host MDB entries. They are both replayed by br_switchdev_mdb_replay(). If
    the driver supports one kind, but lacks the other, the first -EOPNOTSUPP
    returned terminates the whole replay, including any further still-supported
    objects in the list.

    For this to cause issues, there must be MDB entries for both the host and
    the port being replayed. In that case, if the driver bails out from
    handling the host entry, the port entries are never replayed. However, the
    replay is currently only done when a switchdev port joins a bridge. There
    would be no port memberships at that point. Thus despite being erroneous,
    the code does not cause observable bugs.

    This is not an issue with other object kinds either, because there, each
    function replays one object kind. If a driver does not support that kind,
    it makes sense to bail out early. -EOPNOTSUPP is then ignored in
    nbp_switchdev_sync_objs().

    For MDB, suppress the -EOPNOTSUPP error code in br_switchdev_mdb_replay()
    already, so that the whole list gets replayed.

    The reason we need this patch is that a future patch will introduce a
    replay that should be used when a front-panel port netdevice is enslaved to
    a bridge lower, in particular a LAG. The LAG netdevice can already have
    both host and port MDB entries. The port entries need to be replayed so
    that they are offloaded on the port that joins the LAG.

    Cc: Jiri Pirko <jiri@resnulli.us>
    Cc: Ivan Vecera <ivecera@redhat.com>
    Cc: Roopa Prabhu <roopa@nvidia.com>
    Cc: Nikolay Aleksandrov <razor@blackwall.org>
    Cc: bridge@lists.linux-foundation.org
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Danielle Ratson <danieller@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-12-05 15:29:48 +01:00
Jan Stancek eafcadf1e9 Merge: CNB: bridge and switchdev update to v6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2506

Bugzilla: https://bugzilla.redhat.com/2184372
Depends: !2472
Depends: !2473
Tested: selected net/forwarding selftests (see BZ for details)

Bridge and switchdev update to v6.3.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-09 13:58:14 +02:00
Íñigo Huguet c896e5c5d0 net: bridge: switchdev: don't notify FDB entries with "master dynamic"
Bugzilla: https://bugzilla.redhat.com/2184372

commit 927cdea5d2095287ddd5246e5aa68eb5d68db2be
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Apr 18 18:59:02 2023 +0300

    net: bridge: switchdev: don't notify FDB entries with "master dynamic"
    
    There is a structural problem in switchdev, where the flag bits in
    struct switchdev_notifier_fdb_info (added_by_user, is_local etc) only
    represent a simplified / denatured view of what's in struct
    net_bridge_fdb_entry :: flags (BR_FDB_ADDED_BY_USER, BR_FDB_LOCAL etc).
    Each time we want to pass more information about struct
    net_bridge_fdb_entry :: flags to struct switchdev_notifier_fdb_info
    (here, BR_FDB_STATIC), we find that FDB entries were already notified to
    switchdev with no regard to this flag, and thus, switchdev drivers had
    no indication whether the notified entries were static or not.
    
    For example, this command:
    
    ip link add br0 type bridge && ip link set swp0 master br0
    bridge fdb add dev swp0 00:01:02:03:04:05 master dynamic
    
    has never worked as intended with switchdev. It causes a struct
    net_bridge_fdb_entry to be passed to br_switchdev_fdb_notify() which has
    a single flag set: BR_FDB_ADDED_BY_USER.
    
    This is further passed to the switchdev notifier chain, where interested
    drivers have no choice but to assume this is a static (does not age) and
    sticky (does not migrate) FDB entry. So currently, all drivers offload
    it to hardware as such, as can be seen below ("offload" is set).
    
    bridge fdb get 00:01:02:03:04:05 dev swp0 master
    00:01:02:03:04:05 dev swp0 offload master br0
    
    The software FDB entry expires $ageing_time centiseconds after the
    kernel last sees a packet with this MAC SA, and the bridge notifies its
    deletion as well, so it eventually disappears from hardware too.
    
    This is a problem, because it is actually desirable to start offloading
    "master dynamic" FDB entries correctly - they should expire $ageing_time
    centiseconds after the *hardware* port last sees a packet with this
    MAC SA - and this is how the current incorrect behavior was discovered.
    With an offloaded data plane, it can be expected that software only sees
    exception path packets, so an otherwise active dynamic FDB entry would
    be aged out by software sooner than it should.
    
    With the change in place, these FDB entries are no longer offloaded:
    
    bridge fdb get 00:01:02:03:04:05 dev swp0 master
    00:01:02:03:04:05 dev swp0 master br0
    
    and this also constitutes a better way (assuming a backport to stable
    kernels) for user space to determine whether the kernel has the
    capability of doing something sane with these or not.
    
    As opposed to "master dynamic" FDB entries, on the current behavior of
    which no one currently depends on (which can be deduced from the lack of
    kselftests), Ido Schimmel explains that entries with the "extern_learn"
    flag (BR_FDB_ADDED_BY_EXT_LEARN) should still be notified to switchdev,
    since the spectrum driver listens to them (and this is kind of okay,
    because although they are treated identically to "static", they are
    expected to not age, and to roam).
    
    Fixes: 6b26b51b1d ("net: bridge: Add support for notifying devices about FDB add/del")
    Link: https://lore.kernel.org/netdev/20230327115206.jk5q5l753aoelwus@skbuf/
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Tested-by: Ido Schimmel <idosch@nvidia.com>
    Link: https://lore.kernel.org/r/20230418155902.898627-1-vladimir.oltean@nxp.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-05-17 12:54:08 +02:00
Petr Oros dbe46df4e4 netlink: provide an ability to set default extack message
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193453

Conflicts:
- net/dsa/master.c: chunk skipped because the patched feature is not in rhel
  due to missing acc43b7bf52a01 ("net: dsa: allow masters to join a LAG")
- net/dsa/slave.c: chunk skipped due to missing 855a28f9c96c80 ("net:
  dsa: don't dereference NULL extack in dsa_slave_changeupper()")
- net/xfrm/xfrm_device.c: chunk skipped due to missing 3089386db0901a ("xfrm:
  extend add policy callback to set failure reason")

Upstream commit(s):
commit 028fb19c6ba743ed308ba99ac325afa968795e0f
Author: Leon Romanovsky <leon@kernel.org>
Date:   Tue Jan 31 15:31:57 2023 +0200

    netlink: provide an ability to set default extack message

    In netdev common pattern, extack pointer is forwarded to the drivers
    to be filled with error message. However, the caller can easily
    overwrite the filled message.

    Instead of adding multiple "if (!extack->_msg)" checks before any
    NL_SET_ERR_MSG() call, which appears after call to the driver, let's
    add new macro to common code.

    [1] https://lore.kernel.org/all/Y9Irgrgf3uxOjwUm@unreal
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/6993fac557a40a1973dfa0095107c3d03d40bec1.1675171790.git.leon@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2023-05-16 13:28:48 +02:00
Íñigo Huguet 8a27c70e7b bridge: switchdev: Reflect MAB bridge port flag to device drivers
Bugzilla: https://bugzilla.redhat.com/2184372

commit 9c0ca02bace4837d123e1a5a30f6f44dbdc5fb92
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Tue Nov 8 11:47:09 2022 +0100

    bridge: switchdev: Reflect MAB bridge port flag to device drivers
    
    Reflect the 'BR_PORT_MAB' flag to device drivers so that:
    
    * Drivers that support MAB could act upon the flag being toggled.
    * Drivers that do not support MAB will prevent MAB from being enabled.
    
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-05-11 09:39:16 +02:00
Íñigo Huguet ab6e9c38e4 bridge: switchdev: Allow device drivers to install locked FDB entries
Bugzilla: https://bugzilla.redhat.com/2184372

commit 27fabd02abf30a9df9899f92d467591c7eabb1ba
Author: Hans J. Schultz <netdev@kapio-technology.com>
Date:   Tue Nov 8 11:47:08 2022 +0100

    bridge: switchdev: Allow device drivers to install locked FDB entries
    
    When the bridge is offloaded to hardware, FDB entries are learned and
    aged-out by the hardware. Some device drivers synchronize the hardware
    and software FDBs by generating switchdev events towards the bridge.
    
    When a port is locked, the hardware must not learn autonomously, as
    otherwise any host will blindly gain authorization. Instead, the
    hardware should generate events regarding hosts that are trying to gain
    authorization and their MAC addresses should be notified by the device
    driver as locked FDB entries towards the bridge driver.
    
    Allow device drivers to notify the bridge driver about such entries by
    extending the 'switchdev_notifier_fdb_info' structure with the 'locked'
    bit. The bit can only be set by device drivers and not by the bridge
    driver.
    
    Prevent a locked entry from being installed if MAB is not enabled on the
    bridge port.
    
    If an entry already exists in the bridge driver, reject the locked entry
    if the current entry does not have the "locked" flag set or if it points
    to a different port. The same semantics are implemented in the software
    data path.
    
    Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-05-11 09:39:16 +02:00
Íñigo Huguet 99bb87e08c net: bridge: offload BR_HAIRPIN_MODE, BR_ISOLATED, BR_MULTICAST_TO_UNICAST
Bugzilla: https://bugzilla.redhat.com/2184372

commit c3976a3f84451ca05ea5be013af6071bf9acab2c
Author: Arınç ÜNAL <arinc.unal@arinc9.com>
Date:   Sun Apr 10 16:42:27 2022 +0300

    net: bridge: offload BR_HAIRPIN_MODE, BR_ISOLATED, BR_MULTICAST_TO_UNICAST
    
    Add BR_HAIRPIN_MODE, BR_ISOLATED and BR_MULTICAST_TO_UNICAST port flags to
    BR_PORT_FLAGS_HW_OFFLOAD so that switchdev drivers which have an offloaded
    data plane have a chance to reject these bridge port flags if they don't
    support them yet.
    
    It makes the code path go through the
    SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS driver handlers, which return
    -EINVAL for everything they don't recognize.
    
    For drivers that don't catch SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS at
    all, switchdev will return -EOPNOTSUPP for those which is then ignored, but
    those are in the minority.
    
    Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
    Link: https://lore.kernel.org/r/20220410134227.18810-1-arinc.unal@arinc9.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-05-11 09:20:06 +02:00
Ivan Vecera a76d2f44bc net: bridge: switchdev: check br_vlan_group() return value
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 7f40ea2145d926510b27b785562d2c92df1b0d91
Author: Clément Léger <clement.leger@bootlin.com>
Date:   Thu Apr 21 12:12:47 2022 +0200

    net: bridge: switchdev: check br_vlan_group() return value

    br_vlan_group() can return NULL and thus return value must be checked
    to avoid dereferencing a NULL pointer.

    Fixes: 6284c723d9b9 ("net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations")
    Signed-off-by: Clément Léger <clement.leger@bootlin.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Link: https://lore.kernel.org/r/20220421101247.121896-1-clement.leger@bootlin.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:36 +02:00
Ivan Vecera 49e2fa557d net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 6284c723d9b9995cc27ab3c6368a9d95d67111ff
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Wed Mar 16 16:08:47 2022 +0100

    net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations

    Whenever a VLAN moves to a new MSTI, send a switchdev notification so
    that switchdevs can track a bridge's VID to MSTI mappings.

    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:36 +02:00
Ivan Vecera b609941857 net: bridge: Add support for offloading of locked port flag
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit fa1c83342987d8b2fd246894295393e09b0af18f
Author: Hans Schultz <schultz.hans@gmail.com>
Date:   Wed Feb 23 11:16:47 2022 +0100

    net: bridge: Add support for offloading of locked port flag

    Various switchcores support setting ports in locked mode, so that
    clients behind locked ports cannot send traffic through the port
    unless a fdb entry is added with the clients MAC address.

    Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:35 +02:00
Ivan Vecera 3b32d4a261 net: bridge: switchdev: replay all VLAN groups
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit b28d580e2939544ea0c56ca7aed7aacef1da466e
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Feb 15 19:02:14 2022 +0200

    net: bridge: switchdev: replay all VLAN groups

    The major user of replayed switchdev objects is DSA, and so far it
    hasn't needed information about anything other than bridge port VLANs,
    so this is all that br_switchdev_vlan_replay() knows to handle.

    DSA has managed to get by through replicating every VLAN addition on a
    user port such that the same VLAN is also added on all DSA and CPU
    ports, but there is a corner case where this does not work.

    The mv88e6xxx DSA driver currently prints this error message as soon as
    the first port of a switch joins a bridge:

    mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95

    where a6:ef:77:c8:5f:3d vid 1 is a local FDB entry corresponding to the
    bridge MAC address in the default_pvid.

    The -EOPNOTSUPP is returned by mv88e6xxx_port_db_load_purge() because it
    tries to map VID 1 to a FID (the ATU is indexed by FID not VID), but
    fails to do so. This is because ->port_fdb_add() is called before
    ->port_vlan_add() for VID 1.

    The abridged timeline of the calls is:

    br_add_if
    -> netdev_master_upper_dev_link
       -> dsa_port_bridge_join
          -> switchdev_bridge_port_offload
             -> br_switchdev_vlan_replay (*)
             -> br_switchdev_fdb_replay
                -> mv88e6xxx_port_fdb_add
    -> nbp_vlan_init
       -> nbp_vlan_add
          -> mv88e6xxx_port_vlan_add

    and the issue is that at the time of (*), the bridge port isn't in VID 1
    (nbp_vlan_init hasn't been called), therefore br_switchdev_vlan_replay()
    won't have anything to replay, therefore VID 1 won't be in the VTU by
    the time mv88e6xxx_port_fdb_add() is called.

    This happens only when the first port of a switch joins. For further
    ports, the initial mv88e6xxx_port_vlan_add() is sufficient for VID 1 to
    be loaded in the VTU (which is switch-wide, not per port).

    The problem is somewhat unique to mv88e6xxx by chance, because most
    other drivers offload an FDB entry by VID, so FDBs and VLANs can be
    added asynchronously with respect to each other, but addressing the
    issue at the bridge layer makes sense, since what mv88e6xxx requires
    isn't absurd.

    To fix this problem, we need to recognize that it isn't the VLAN group
    of the port that we're interested in, but the VLAN group of the bridge
    itself (so it isn't a timing issue, but rather insufficient information
    being passed from switchdev to drivers).

    As mentioned, currently nbp_switchdev_sync_objs() only calls
    br_switchdev_vlan_replay() for VLANs corresponding to the port, but the
    VLANs corresponding to the bridge itself, for local termination, also
    need to be replayed. In this case, VID 1 is not (yet) present in the
    port's VLAN group but is present in the bridge's VLAN group.

    So to fix this bug, DSA is now obligated to explicitly handle VLANs
    pointing towards the bridge in order to "close this race" (which isn't
    really a race). As Tobias Waldekranz notices, this also implies that it
    must explicitly handle port VLANs on foreign interfaces, something that
    worked implicitly before:
    https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260

    So in the end, br_switchdev_vlan_replay() must replay all VLANs from all
    VLAN groups: all the ports, and the bridge itself.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:35 +02:00
Ivan Vecera e40ce431b4 net: bridge: make nbp_switchdev_unsync_objs() follow reverse order of sync()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 263029ae317298a3719d2cd88a818c3a29a80e15
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Feb 15 19:02:13 2022 +0200

    net: bridge: make nbp_switchdev_unsync_objs() follow reverse order of sync()

    There may be switchdev drivers that can add/remove a FDB or MDB entry
    only as long as the VLAN it's in has been notified and offloaded first.
    The nbp_switchdev_sync_objs() method satisfies this requirement on
    addition, but nbp_switchdev_unsync_objs() first deletes VLANs, then
    deletes MDBs and FDBs. Reverse the order of the function calls to cater
    to this requirement.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:35 +02:00
Ivan Vecera cf074ea318 net: bridge: switchdev: differentiate new VLANs from changed ones
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 8d23a54f5beea59b560855fb571e5d73d783e0b4
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Feb 15 19:02:12 2022 +0200

    net: bridge: switchdev: differentiate new VLANs from changed ones

    br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD
    event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases:

    - a struct net_bridge_vlan got created
    - an existing struct net_bridge_vlan was modified

    This makes it impossible for switchdev drivers to properly balance
    PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to
    happen, we must provide a way for drivers to distinguish between a
    VLAN with changed flags and a new one.

    Annotate struct switchdev_obj_port_vlan with a "bool changed" that
    distinguishes the 2 cases above.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:34 +02:00
Ivan Vecera dcdcf7b0b7 net: bridge: switchdev: consistent function naming
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 326b212e9cd67498841f3654a96d91718dd11f39
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Oct 27 19:21:19 2021 +0300

    net: bridge: switchdev: consistent function naming

    Rename all recently imported functions in br_switchdev.c to start with a
    br_switchdev_* prefix.

    br_fdb_replay_one() -> br_switchdev_fdb_replay_one()
    br_fdb_replay() -> br_switchdev_fdb_replay()
    br_vlan_replay_one() -> br_switchdev_vlan_replay_one()
    br_vlan_replay() -> br_switchdev_vlan_replay()
    struct br_mdb_complete_info -> struct br_switchdev_mdb_complete_info
    br_mdb_complete() -> br_switchdev_mdb_complete()
    br_mdb_switchdev_host_port() -> br_switchdev_host_mdb_one()
    br_mdb_switchdev_host() -> br_switchdev_host_mdb()
    br_mdb_replay_one() -> br_switchdev_mdb_replay_one()
    br_mdb_replay() -> br_switchdev_mdb_replay()
    br_mdb_queue_one() -> br_switchdev_mdb_queue_one()

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:33 +02:00
Ivan Vecera ffd8b80d1e net: bridge: mdb: move all switchdev logic to br_switchdev.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 9776457c784f6549d43f80eb96d4122b51558258
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Oct 27 19:21:18 2021 +0300

    net: bridge: mdb: move all switchdev logic to br_switchdev.c

    The following functions:

    br_mdb_complete
    br_switchdev_mdb_populate
    br_mdb_replay_one
    br_mdb_queue_one
    br_mdb_replay
    br_mdb_switchdev_host_port
    br_mdb_switchdev_host
    br_switchdev_mdb_notify

    are only accessible from code paths where CONFIG_NET_SWITCHDEV is
    enabled. So move them to br_switchdev.c, in order for that code to be
    compiled out if that config option is disabled.

    Note that br_switchdev.c gets build regardless of whether
    CONFIG_BRIDGE_IGMP_SNOOPING is enabled or not, whereas br_mdb.c only got
    built when CONFIG_BRIDGE_IGMP_SNOOPING was enabled. So to preserve
    correct compilation with CONFIG_BRIDGE_IGMP_SNOOPING being disabled, we
    must now place an #ifdef around these functions in br_switchdev.c.
    The offending bridge data structures that need this are
    br->multicast_lock and br->mdb_list, these are also compiled out of
    struct net_bridge when CONFIG_BRIDGE_IGMP_SNOOPING is turned off.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:33 +02:00
Ivan Vecera c09c5a8d8f net: bridge: move br_vlan_replay to br_switchdev.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 4a6849e4617309b7b5934f9ea761c02915b5332a
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Oct 27 19:21:16 2021 +0300

    net: bridge: move br_vlan_replay to br_switchdev.c

    br_vlan_replay() is relevant only if CONFIG_NET_SWITCHDEV is enabled, so
    move it to br_switchdev.c.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:33 +02:00
Ivan Vecera bc33e4ff45 net: bridge: create a common function for populating switchdev FDB entries
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit fab9eca88410ddd22e4c4bb275849b8327b49a60
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Oct 26 17:27:42 2021 +0300

    net: bridge: create a common function for populating switchdev FDB entries

    There are two places where a switchdev FDB entry is constructed, one is
    br_switchdev_fdb_notify() and the other is br_fdb_replay(). One uses a
    struct initializer, and the other declares the structure as
    uninitialized and populates the elements one by one.

    One problem when introducing new members of struct
    switchdev_notifier_fdb_info is that there is a risk for one of these
    functions to run with an uninitialized value.

    So centralize the logic of populating such structure into a dedicated
    function. Being the primary location where these structures are created,
    using an uninitialized variable and populating the members one by one
    should be fine, since this one function is supposed to assign values to
    all its members.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:33 +02:00
Ivan Vecera f5099a4cba net: bridge: move br_fdb_replay inside br_switchdev.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081601

commit 5cda5272a4605c34740859378bf7a7940229feca
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Oct 26 17:27:41 2021 +0300

    net: bridge: move br_fdb_replay inside br_switchdev.c

    br_fdb_replay is only called from switchdev code paths, so it makes
    sense to be disabled if switchdev is not enabled in the first place.

    As opposed to br_mdb_replay and br_vlan_replay which might be turned off
    depending on bridge support for multicast and VLANs, FDB support is
    always on. So moving br_mdb_replay and br_vlan_replay inside
    br_switchdev.c would mean adding some #ifdef's in br_switchdev.c, so we
    keep those where they are.

    The reason for the movement is that in future changes there will be some
    code reuse between br_switchdev_fdb_notify and br_fdb_replay.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-05-04 09:55:33 +02:00
Ivan Vecera 8fe6b3d464 net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 957e2235e5264c97cd6be8e2e17f2e11b41f2239
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Aug 3 23:34:08 2021 +0300

    net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridge

    With the introduction of explicit offloading API in switchdev in commit
    2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridge
    ports are offloaded"), we started having Ethernet switch drivers calling
    directly into a function exported by net/bridge/br_switchdev.c, which is
    a function exported by the bridge driver.

    This means that drivers that did not have an explicit dependency on the
    bridge before, like cpsw and am65-cpsw, now do - otherwise it is not
    possible to call a symbol exported by a driver that can be built as
    module unless you are a module too.

    There was an attempt to solve the dependency issue in the form of commit
    b0e81817629a ("net: build all switchdev drivers as modules when the
    bridge is a module"). Grygorii Strashko, however, says about it:

    | In my opinion, the problem is a bit bigger here than just fixing the
    | build :(
    |
    | In case, of ^cpsw the switchdev mode is kinda optional and in many
    | cases (especially for testing purposes, NFS) the multi-mac mode is
    | still preferable mode.
    |
    | There were no such tight dependency between switchdev drivers and
    | bridge core before and switchdev serviced as independent, notification
    | based layer between them, so ^cpsw still can be "Y" and bridge can be
    | "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE
    | will need to be set as "Y", or we will have to update drivers to
    | support build with BRIDGE=n and maintain separate builds for
    | networking vs non-networking testing.  But is this enough?  Wouldn't
    | it cause 'chain reaction' required to add more and more "Y" options
    | (like CONFIG_VLAN_8021Q)?
    |
    | PS. Just to be sure we on the same page - ARM builds will be forced
    | (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our
    | automation testing will just fail with omap2plus_defconfig.

    In the light of this, it would be desirable for some configurations to
    avoid dependencies between switchdev drivers and the bridge, and have
    the switchdev mode as completely optional within the driver.

    Arnd Bergmann also tried to write a patch which better expressed the
    build time dependency for Ethernet switch drivers where the switchdev
    support is optional, like cpsw/am65-cpsw, and this made the drivers
    follow the bridge (compile as module if the bridge is a module) only if
    the optional switchdev support in the driver was enabled in the first
    place:
    https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/

    but this still did not solve the fact that cpsw and am65-cpsw now must
    be built as modules when the bridge is a module - it just expressed
    correctly that optional dependency. But the new behavior is an apparent
    regression from Grygorii's perspective.

    So to support the use case where the Ethernet driver is built-in,
    NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we
    need a framework that can handle the possible absence of the bridge from
    the running system, i.e. runtime bloatware as opposed to build-time
    bloatware.

    Luckily we already have this framework, since switchdev has been using
    it extensively. Events from the bridge side are transmitted to the
    driver side using notifier chains - this was originally done so that
    unrelated drivers could snoop for events emitted by the bridge towards
    ports that are implemented by other drivers (think of a switch driver
    with LAG offload that listens for switchdev events on a bonding/team
    interface that it offloads).

    There are also events which are transmitted from the driver side to the
    bridge side, which again are modeled using notifiers.
    SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with
    notifying the bridge that a MAC address has been dynamically learned.
    So there is a precedent we can use for modeling the new framework.

    The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work
    that the bridge needs to do when a port becomes offloaded is blocking in
    its nature: replay VLANs, MDBs etc. The calling context is indeed
    blocking (we are under rtnl_mutex), but the existing switchdev
    notification chain that the bridge is subscribed to is only the atomic
    one. So we need to subscribe the bridge to the blocking switchdev
    notification chain too.

    This patch:
    - keeps the driver-side perception of the switchdev_bridge_port_{,un}offload
      unchanged
    - moves the implementation of switchdev_bridge_port_{,un}offload from
      the bridge module into the switchdev module.
    - makes everybody that is subscribed to the switchdev blocking notifier
      chain "hear" offload & unoffload events
    - makes the bridge driver subscribe and handle those events
    - moves the bridge driver's handling of those events into 2 new
      functions called br_switchdev_port_{,un}offload. These functions
      contain in fact the core of the logic that was previously in
      switchdev_bridge_port_{,un}offload, just that now we go through an
      extra indirection layer to reach them.

    Unlike all the other switchdev notification structures, the structure
    used to carry the bridge port information, struct
    switchdev_notifier_brport_info, does not contain a "bool handled".
    This is because in the current usage pattern, we always know that a
    switchdev bridge port offloading event will be handled by the bridge,
    because the switchdev_bridge_port_offload() call was initiated by a
    NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a
    bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event
    couldn't have happened.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:37 +01:00
Ivan Vecera 92d867d18f net: bridge: switchdev: fix incorrect use of FDB flags when picking the dst device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 2e19bb35ce15a8b49f4a809469163f668e2d539f
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Mon Aug 2 14:36:33 2021 +0300

    net: bridge: switchdev: fix incorrect use of FDB flags when picking the dst device

    Nikolay points out that it is incorrect to assume that it is impossible
    to have an fdb entry with fdb->dst == NULL and the BR_FDB_LOCAL bit in
    fdb->flags not set. This is because there are reader-side places that
    test_bit(BR_FDB_LOCAL, &fdb->flags) without the br->hash_lock, and if
    the updating of the FDB entry happens on another CPU, there are no
    memory barriers at writer or reader side which would ensure that the
    reader sees the updates to both fdb->flags and fdb->dst in the same
    order, i.e. the reader will not see an inconsistent FDB entry.

    So we must be prepared to deal with FDB entries where fdb->dst and
    fdb->flags are in a potentially inconsistent state, and that means that
    fdb->dst == NULL should remain a condition to pick the net_device that
    we report to switchdev as being the bridge device, which is what the
    code did prior to the blamed patch.

    Fixes: 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge")
    Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Link: https://lore.kernel.org/r/20210802113633.189831-1-vladimir.oltean@nxp.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:37 +01:00
Ivan Vecera dbbee5ba0b net: bridge: switchdev: treat local FDBs the same as entries towards the bridge
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 52e4bec15546d58f2a14dc74a6b8be128aa7db0a
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Jul 28 21:27:48 2021 +0300

    net: bridge: switchdev: treat local FDBs the same as entries towards the bridge

    Currently the following script:

    1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
    2. ip link set swp2 up && ip link set swp2 master br0
    3. ip link set swp3 up && ip link set swp3 master br0
    4. ip link set swp4 up && ip link set swp4 master br0
    5. bridge vlan del dev swp2 vid 1
    6. bridge vlan del dev swp3 vid 1
    7. ip link set swp4 nomaster
    8. ip link set swp3 nomaster

    produces the following output:

    [  641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2

    [ swp2, swp3 and br0 all have the same MAC address, the one listed above ]

    In short, this happens because the number of FDB entry additions
    notified to switchdev is unbalanced with the number of deletions.

    At step 1, the bridge has a random MAC address. At step 2, the
    br_fdb_replay of swp2 receives this initial MAC address. Then the bridge
    inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it
    notifies switchdev (only swp2 at this point) of the deletion of the
    random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB
    entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1).

    During step 7:

    del_nbp
    -> br_fdb_delete_by_port(br, p, vid=0, do_all=1);
       -> fdb_delete_local(br, p, f);

    br_fdb_delete_by_port() deletes all entries towards the ports,
    regardless of vid, because do_all is 1.

    fdb_delete_local() has logic to migrate local FDB entries deleted from
    one port to another port which shares the same MAC address and is in the
    same VLAN, or to the bridge device itself. This migration happens
    without notifying switchdev of the deletion on the old port and the
    addition on the new one, just fdb->dst is changed and the added_by_user
    flag is cleared.

    In the example above, the del_nbp(swp4) causes the
    "addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4
    that existed up until then to be migrated directly towards the bridge
    (fdb->dst == NULL). This is because it cannot be migrated to any of the
    other ports (swp2 and swp3 are not in VLAN 1).

    After the migration to br0 takes place, swp4 requests a deletion replay
    of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now
    point towards the bridge, a deletion of it is replayed. There was just
    a prior addition of this address, so the switchdev driver deletes this
    entry.

    Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and
    switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1".
    But it can't because it no longer has it, so it returns -ENOENT.

    There are other possibilities to trigger this issue, but this is by far
    the simplest to explain.

    To fix this, we must avoid the situation where the addition of an FDB
    entry is notified to switchdev as a local entry on a port, and the
    deletion is notified on the bridge itself.

    Considering that the 2 types of FDB entries are completely equivalent
    and we cannot have the same MAC address as a local entry on 2 bridge
    ports, or on a bridge port and pointing towards the bridge at the same
    time, it makes sense to hide away from switchdev completely the fact
    that a local FDB entry is associated with a given bridge port at all.
    Just say that it points towards the bridge, it should make no difference
    whatsoever to the switchdev driver and should even lead to a simpler
    overall implementation, will less cases to handle.

    This also avoids any modification at all to the core bridge driver, just
    what is reported to switchdev changes. With the local/permanent entries
    on bridge ports being already reported to user space, it is hard to
    believe that the bridge behavior can change in any backwards-incompatible
    way such as making all local FDB entries point towards the bridge.

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:37 +01:00
Ivan Vecera d115309428 net: bridge: switchdev: replay the entire FDB for each port
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit b4454bc6a0fbf2f9edcddd08862175085b990856
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Jul 28 21:27:47 2021 +0300

    net: bridge: switchdev: replay the entire FDB for each port

    Currently when a switchdev port joins a bridge, we replay all FDB
    entries pointing towards that port or towards the bridge.

    However, this is insufficient in certain situations:

    (a) DSA, through its assisted_learning_on_cpu_port logic, snoops
        dynamically learned FDB entries on foreign interfaces.
        These are FDB entries that are pointing neither towards the newly
        joined switchdev port, nor towards the bridge. So these addresses
        would be missed when joining a bridge where a foreign interface has
        already learned some addresses, and they would also linger on if the
        DSA port leaves the bridge before the foreign interface forgets them.
        None of this happens if we replay the entire FDB when the port joins.

    (b) There is a desire to treat local FDB entries on a port (i.e. the
        port's termination MAC address) identically to FDB entries pointing
        towards the bridge itself. More details on the reason behind this in
        the next patch. The point is that this cannot be done given the
        current structure of br_fdb_replay() in this situation:
          ip link set swp0 master br0  # br0 inherits its MAC address from swp0
          ip link set swp1 master br0
        What is desirable is that when swp1 joins the bridge, br_fdb_replay()
        also notifies swp1 of br0's MAC address, but this won't in fact
        happen because the MAC address of br0 does not have fdb->dst == NULL
        (it doesn't point towards the bridge), but it has fdb->dst == swp0.
        So our current logic makes it impossible for that address to be
        replayed. But if we dump the entire FDB instead of just the entries
        with fdb->dst == swp1 and fdb->dst == NULL, then the inherited MAC
        address of br0 will be replayed too, which is what we need.

    A natural question arises: say there is an FDB entry to be replayed,
    like a MAC address dynamically learned on a foreign interface that
    belongs to a bridge where no switchdev port has joined yet. If 10
    switchdev ports belonging to the same driver join this bridge, one by
    one, won't every port get notified 10 times of the foreign FDB entry,
    amounting to a total of 100 notifications for this FDB entry in the
    switchdev driver?

    Well, yes, but this is where the "void *ctx" argument for br_fdb_replay
    is useful: every port of the switchdev driver is notified whenever any
    other port requests an FDB replay, but because the replay was initiated
    by a different port, its context is different from the initiating port's
    context, so it ignores those replays.

    So the foreign FDB entry will be installed only 10 times, once per port.
    This is done so that the following 4 code paths are always well balanced:
    (a) addition of foreign FDB entry is replayed when port joins bridge
    (b) deletion of foreign FDB entry is replayed when port leaves bridge
    (c) addition of foreign FDB entry is notified to all ports currently in bridge
    (c) deletion of foreign FDB entry is notified to all ports currently in bridge

    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:36 +01:00
Ivan Vecera def6c352f2 net: bridge: fix build when setting skb->offload_fwd_mark with CONFIG_NET_SWITCHDEV=n
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit c5381154393d42e2edd77fc74f1493f9f93908cb
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Fri Jul 23 23:49:11 2021 +0300

    net: bridge: fix build when setting skb->offload_fwd_mark with CONFIG_NET_SWITCHDEV=n

    Switchdev support can be disabled at compile time, and in that case,
    struct sk_buff will not contain the offload_fwd_mark field.

    To make the code in br_forward.c work in both cases, we do what is done
    in other places and we create a helper function, with an empty shim
    definition, that is implemented by the br_switchdev.o translation module.
    This is always compiled if and only if CONFIG_NET_SWITCHDEV is y or m.

    Reported-by: kernel test robot <lkp@intel.com>
    Fixes: 472111920f1c ("net: bridge: switchdev: allow the TX data plane forwarding to be offloaded")
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:36 +01:00
Ivan Vecera 5bcf0c4d06 net: bridge: switchdev: allow the TX data plane forwarding to be offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 472111920f1c5fbe103022a4b05bfb37128a2a29
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Thu Jul 22 18:55:38 2021 +0300

    net: bridge: switchdev: allow the TX data plane forwarding to be offloaded

    Allow switchdevs to forward frames from the CPU in accordance with the
    bridge configuration in the same way as is done between bridge
    ports. This means that the bridge will only send a single skb towards
    one of the ports under the switchdev's control, and expects the driver
    to deliver the packet to all eligible ports in its domain.

    Primarily this improves the performance of multicast flows with
    multiple subscribers, as it allows the hardware to perform the frame
    replication.

    The basic flow between the driver and the bridge is as follows:

    - When joining a bridge port, the switchdev driver calls
      switchdev_bridge_port_offload() with tx_fwd_offload = true.

    - The bridge sends offloadable skbs to one of the ports under the
      switchdev's control using skb->offload_fwd_mark = true.

    - The switchdev driver checks the skb->offload_fwd_mark field and lets
      its FDB lookup select the destination port mask for this packet.

    v1->v2:
    - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
    - introduce a static key "br_switchdev_fwd_offload_used" to minimize the
      impact of the newly introduced feature on all the setups which don't
      have hardware that can make use of it
    - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
      line access
    - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
      __br_forward()
    - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
      is being used
    - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

    v2->v3:
    - replace the solution based on .ndo_dfwd_add_station with a solution
      based on switchdev_bridge_port_offload
    - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD
    v3->v4: rebase
    v4->v5:
    - make sure the static key is decremented on bridge port unoffload
    - more function and variable renaming and comments for them:
      br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload
      br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload
      nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom
      nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload
      fwd_accel to tx_fwd_offload

    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:34 +01:00
Ivan Vecera a6590fa257 net: bridge: move the switchdev object replay helpers to "push" mode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 4e51bf44a03af6fa19a39a36ea8fedfacb8ccadf
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Jul 21 19:24:03 2021 +0300

    net: bridge: move the switchdev object replay helpers to "push" mode

    Starting with commit 4f2673b3a2 ("net: bridge: add helper to replay
    port and host-joined mdb entries"), DSA has introduced some bridge
    helpers that replay switchdev events (FDB/MDB/VLAN additions and
    deletions) that can be lost by the switchdev drivers in a variety of
    circumstances:

    - an IP multicast group was host-joined on the bridge itself before any
      switchdev port joined the bridge, leading to the host MDB entries
      missing in the hardware database.
    - during the bridge creation process, the MAC address of the bridge was
      added to the FDB as an entry pointing towards the bridge device
      itself, but with no switchdev ports being part of the bridge yet, this
      local FDB entry would remain unknown to the switchdev hardware
      database.
    - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
      before any switchdev port joined that LAG, leading to the hardware
      database missing those entries.
    - a switchdev port left a LAG that is a bridge port, while the LAG
      remained part of the bridge, and all FDB/MDB/VLAN entries remained
      installed in the hardware database of the switchdev port.

    Also, since commit 0d2cfbd41c ("net: bridge: ignore switchdev events
    for LAG ports which didn't request replay"), DSA introduced a method,
    based on a const void *ctx, to ensure that two switchdev ports under the
    same LAG that is a bridge port do not see the same MDB/VLAN entry being
    replayed twice by the bridge, once for every bridge port that joins the
    LAG.

    With so many ordering corner cases being possible, it seems unreasonable
    to expect a switchdev driver writer to get it right from the first try.
    Therefore, now that DSA has experimented with the bridge replay helpers
    for a little bit, we can move the code to the bridge driver where it is
    more readily available to all switchdev drivers.

    To convert the switchdev object replay helpers from "pull mode" (where
    the driver asks for them) to a "push mode" (where the bridge offers them
    automatically), the biggest problem is that the bridge needs to be aware
    when a switchdev port joins and leaves, even when the switchdev is only
    indirectly a bridge port (for example when the bridge port is a LAG
    upper of the switchdev).

    Luckily, we already have a hook for that, in the form of the newly
    introduced switchdev_bridge_port_offload() and
    switchdev_bridge_port_unoffload() calls. These offer a natural place for
    hooking the object addition and deletion replays.

    Extend the above 2 functions with:
    - pointers to the switchdev atomic notifier (for FDB replays) and the
      blocking notifier (for MDB and VLAN replays).
    - the "const void *ctx" argument required for drivers to be able to
      disambiguate between which port is targeted, when multiple ports are
      lowers of the same LAG that is a bridge port. Most of the drivers pass
      NULL to this argument, except the ones that support LAG offload and have
      the proper context check already in place in the switchdev blocking
      notifier handler.

    Also unexport the replay helpers, since nobody except the bridge calls
    them directly now.

    Note that:
    (a) we abuse the terminology slightly, because FDB entries are not
        "switchdev objects", but we count them as objects nonetheless.
        With no direct way to prove it, I think they are not modeled as
        switchdev objects because those can only be installed by the bridge
        to the hardware (as opposed to FDB entries which can be propagated
        in the other direction too). This is merely an abuse of terms, FDB
        entries are replayed too, despite not being objects.
    (b) the bridge does not attempt to sync port attributes to newly joined
        ports, just the countable stuff (the objects). The reason for this
        is simple: no universal and symmetric way to sync and unsync them is
        known. For example, VLAN filtering: what to do on unsync, disable or
        leave it enabled? Similarly, STP state, ageing timer, etc etc. What
        a switchdev port does when it becomes standalone again is not really
        up to the bridge's competence, and the driver should deal with it.
        On the other hand, replaying deletions of switchdev objects can be
        seen a matter of cleanup and therefore be treated by the bridge,
        hence this patch.

    We make the replay helpers opt-in for drivers, because they might not
    bring immediate benefits for them:

    - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
      so br_vlan_replay() should not do anything for the new drivers on
      which we call it. The existing drivers where there was even a slight
      possibility for there to exist a VLAN on a bridge port before they
      join it are already guarded against this: mlxsw and prestera deny
      joining LAG interfaces that are members of a bridge.

    - br_fdb_replay() should now notify of local FDB entries, but I patched
      all drivers except DSA to ignore these new entries in commit
      2c4eca3ef7 ("net: bridge: switchdev: include local flag in FDB
      notifications"). Driver authors can lift this restriction as they
      wish, and when they do, they can also opt into the FDB replay
      functionality.

    - br_mdb_replay() should fix a real issue which is described in commit
      4f2673b3a2 ("net: bridge: add helper to replay port and host-joined
      mdb entries"). However most drivers do not offload the
      SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
      offload this switchdev object, and I don't completely understand the
      way in which they offload this switchdev object anyway. So I'll leave
      it up to these drivers' respective maintainers to opt into
      br_mdb_replay().

    So most of the drivers pass NULL notifier blocks for the replay helpers,
    except:
    - dpaa2-switch which was already acked/regression-tested with the
      helpers enabled (and there isn't much of a downside in having them)
    - ocelot which already had replay logic in "pull" mode
    - DSA which already had replay logic in "pull" mode

    An important observation is that the drivers which don't currently
    request bridge event replays don't even have the
    switchdev_bridge_port_{offload,unoffload} calls placed in proper places
    right now. This was done to avoid unnecessary rework for drivers which
    might never even add support for this. For driver writers who wish to
    add replay support, this can be used as a tentative placement guide:
    https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/

    Cc: Vadym Kochan <vkochan@marvell.com>
    Cc: Taras Chornyi <tchornyi@marvell.com>
    Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
    Cc: Lars Povlsen <lars.povlsen@microchip.com>
    Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
    Cc: UNGLinuxDriver@microchip.com
    Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
    Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
    Cc: Grygorii Strashko <grygorii.strashko@ti.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:33 +01:00
Ivan Vecera 54e450d80b net: bridge: switchdev: let drivers inform which bridge ports are offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 2f5dc00f7a3ea669fd387ce79ffca92bff361550
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Jul 21 19:24:01 2021 +0300

    net: bridge: switchdev: let drivers inform which bridge ports are offloaded

    On reception of an skb, the bridge checks if it was marked as 'already
    forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
    is, it assigns the source hardware domain of that skb based on the
    hardware domain of the ingress port. Then during forwarding, it enforces
    that the egress port must have a different hardware domain than the
    ingress one (this is done in nbp_switchdev_allowed_egress).

    Non-switchdev drivers don't report any physical switch id (neither
    through devlink nor .ndo_get_port_parent_id), therefore the bridge
    assigns them a hardware domain of 0, and packets coming from them will
    always have skb->offload_fwd_mark = 0. So there aren't any restrictions.

    Problems appear due to the fact that DSA would like to perform software
    fallback for bonding and team interfaces that the physical switch cannot
    offload.

           +-- br0 ---+
          / /   |      \
         / /    |       \
        /  |    |      bond0
       /   |    |     /    \
     swp0 swp1 swp2 swp3 swp4

    There, it is desirable that the presence of swp3 and swp4 under a
    non-offloaded LAG does not preclude us from doing hardware bridging
    beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high
    enough that software bridging between {swp0,swp1,swp2} and bond0 is not
    impractical.

    But this creates an impossible paradox given the current way in which
    port hardware domains are assigned. When the driver receives a packet
    from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
    something.

    - If we set it to 0, then the bridge will forward it towards swp1, swp2
      and bond0. But the switch has already forwarded it towards swp1 and
      swp2 (not to bond0, remember, that isn't offloaded, so as far as the
      switch is concerned, ports swp3 and swp4 are not looking up the FDB,
      and the entire bond0 is a destination that is strictly behind the
      CPU). But we don't want duplicated traffic towards swp1 and swp2, so
      it's not ok to set skb->offload_fwd_mark = 0.

    - If we set it to 1, then the bridge will not forward the skb towards
      the ports with the same switchdev mark, i.e. not to swp1, swp2 and
      bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
      have forwarded the skb there.

    So the real issue is that bond0 will be assigned the same hardware
    domain as {swp0,swp1,swp2}, because the function that assigns hardware
    domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
    lower interfaces until it finds something that implements devlink (calls
    dev_get_port_parent_id with bool recurse = true). This is a problem
    because the fact that bond0 can be offloaded by swp3 and swp4 in our
    example is merely an assumption.

    A solution is to give the bridge explicit hints as to what hardware
    domain it should use for each port.

    Currently, the bridging offload is very 'silent': a driver registers a
    netdevice notifier, which is put on the netns's notifier chain, and
    which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
    bridge, and the lower is an interface it knows about (one registered by
    this driver, normally). Then, from within that notifier, it does a bunch
    of stuff behind the bridge's back, without the bridge necessarily
    knowing that there's somebody offloading that port. It looks like this:

         ip link set swp0 master br0
                      |
                      v
     br_add_if() calls netdev_master_upper_dev_link()
                      |
                      v
            call_netdevice_notifiers
                      |
                      v
           dsa_slave_netdevice_event
                      |
                      v
            oh, hey! it's for me!
                      |
                      v
               .port_bridge_join

    What we do to solve the conundrum is to be less silent, and change the
    switchdev drivers to present themselves to the bridge. Something like this:

         ip link set swp0 master br0
                      |
                      v
     br_add_if() calls netdev_master_upper_dev_link()
                      |
                      v                    bridge: Aye! I'll use this
            call_netdevice_notifiers           ^  ppid as the
                      |                        |  hardware domain for
                      v                        |  this port, and zero
           dsa_slave_netdevice_event           |  if I got nothing.
                      |                        |
                      v                        |
            oh, hey! it's for me!              |
                      |                        |
                      v                        |
               .port_bridge_join               |
                      |                        |
                      +------------------------+
                 switchdev_bridge_port_offload(swp0, swp0)

    Then stacked interfaces (like bond0 on top of swp3/swp4) would be
    treated differently in DSA, depending on whether we can or cannot
    offload them.

    The offload case:

        ip link set bond0 master br0
                      |
                      v
     br_add_if() calls netdev_master_upper_dev_link()
                      |
                      v                    bridge: Aye! I'll use this
            call_netdevice_notifiers           ^  ppid as the
                      |                        |  switchdev mark for
                      v                        |        bond0.
           dsa_slave_netdevice_event           | Coincidentally (or not),
                      |                        | bond0 and swp0, swp1, swp2
                      v                        | all have the same switchdev
            hmm, it's not quite for me,        | mark now, since the ASIC
             but my driver has already         | is able to forward towards
               called .port_lag_join           | all these ports in hw.
              for it, because I have           |
          a port with dp->lag_dev == bond0.    |
                      |                        |
                      v                        |
               .port_bridge_join               |
               for swp3 and swp4               |
                      |                        |
                      +------------------------+
                switchdev_bridge_port_offload(bond0, swp3)
                switchdev_bridge_port_offload(bond0, swp4)

    And the non-offload case:

        ip link set bond0 master br0
                      |
                      v
     br_add_if() calls netdev_master_upper_dev_link()
                      |
                      v                    bridge waiting:
            call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                      |                        |  wasn't called, okay, I'll use a
                      v                        |  hwdom of zero for this one.
           dsa_slave_netdevice_event           :  Then packets received on swp0 will
                      |                        :  not be software-forwarded towards
                      v                        :  swp1, but they will towards bond0.
             it's not for me, but
           bond0 is an upper of swp3
          and swp4, but their dp->lag_dev
           is NULL because they couldn't
                offload it.

    Basically we can draw the conclusion that the lowers of a bridge port
    can come and go, so depending on the configuration of lowers for a
    bridge port, it can dynamically toggle between offloaded and unoffloaded.
    Therefore, we need an equivalent switchdev_bridge_port_unoffload too.

    This patch changes the way any switchdev driver interacts with the
    bridge. From now on, everybody needs to call switchdev_bridge_port_offload
    and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
    port as non-offloaded and allow software flooding to other ports from
    the same ASIC.

    Note that these functions lay the ground for a more complex handshake
    between switchdev drivers and the bridge in the future.

    For drivers that will request a replay of the switchdev objects when
    they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
    place the call to switchdev_bridge_port_unoffload() strategically inside
    the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
    NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
    need the netdev adjacency lists to be valid, and that is only true in
    NETDEV_PRECHANGEUPPER.

    Cc: Vadym Kochan <vkochan@marvell.com>
    Cc: Taras Chornyi <tchornyi@marvell.com>
    Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
    Cc: Lars Povlsen <lars.povlsen@microchip.com>
    Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
    Cc: UNGLinuxDriver@microchip.com
    Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
    Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
    Cc: Grygorii Strashko <grygorii.strashko@ti.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
    Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
    Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:31 +01:00
Ivan Vecera 591d981ef9 net: bridge: switchdev: recycle unused hwdoms
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit 8582661048eb64341edf73dd2ca828b4f039c5c2
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Wed Jul 21 19:24:00 2021 +0300

    net: bridge: switchdev: recycle unused hwdoms

    Since hwdoms have only been used thus far for equality comparisons, the
    bridge has used the simplest possible assignment policy; using a
    counter to keep track of the last value handed out.

    With the upcoming transmit offloading, we need to perform set
    operations efficiently based on hwdoms, e.g. we want to answer
    questions like "has this skb been forwarded to any port within this
    hwdom?"

    Move to a bitmap-based allocation scheme that recycles hwdoms once all
    members leaves the bridge. This means that we can use a single
    unsigned long to keep track of the hwdoms that have received an skb.

    v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX)
            into a plain unsigned long.
    v2->v6: none

    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:30 +01:00
Ivan Vecera af623cc506 net: bridge: disambiguate offload_fwd_mark
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037335

commit f7cf972f9375388838b0fbdaa007ce8494646990
Author: Tobias Waldekranz <tobias@waldekranz.com>
Date:   Wed Jul 21 19:23:59 2021 +0300

    net: bridge: disambiguate offload_fwd_mark

    Before this change, four related - but distinct - concepts where named
    offload_fwd_mark:

    - skb->offload_fwd_mark: Set by the switchdev driver if the underlying
      hardware has already forwarded this frame to the other ports in the
      same hardware domain.

    - nbp->offload_fwd_mark: An idetifier used to group ports that share
      the same hardware forwarding domain.

    - br->offload_fwd_mark: Counter used to make sure that unique IDs are
      used in cases where a bridge contains ports from multiple hardware
      domains.

    - skb->cb->offload_fwd_mark: The hardware domain on which the frame
      ingressed and was forwarded.

    Introduce the term "hardware forwarding domain" ("hwdom") in the
    bridge to denote a set of ports with the following property:

        If an skb with skb->offload_fwd_mark set, is received on a port
        belonging to hwdom N, that frame has already been forwarded to all
        other ports in hwdom N.

    By decoupling the name from "offload_fwd_mark", we can extend the
    term's definition in the future - e.g. to add constraints that
    describe expected egress behavior - without overloading the meaning of
    "offload_fwd_mark".

    - nbp->offload_fwd_mark thus becomes nbp->hwdom.

    - br->offload_fwd_mark becomes br->last_hwdom.

    - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
      change in naming here mandates a slight change in behavior of the
      nbp_switchdev_frame_mark() function. Previously, it only set this
      value in skb->cb for packets with skb->offload_fwd_mark true (ones
      which were forwarded in hardware). Whereas now we always track the
      incoming hwdom for all packets coming from a switchdev (even for the
      packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
      reports etc). As all uses of skb->cb->offload_fwd_mark were already
      gated behind checks of skb->offload_fwd_mark, this will not introduce
      any functional change, but it paves the way for future changes where
      the ingressing hwdom must be known for frames coming from a switchdev
      regardless of whether they were forwarded in hardware or not
      (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
      always tracks which one).

      A typical example where this is relevant: the switchdev has a fixed
      configuration to trap STP BPDUs, but STP is not running on the bridge
      and the group_fwd_mask allows them to be forwarded. Say we have this
      setup:

            br0
           / | \
          /  |  \
      swp0 swp1 swp2

      A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
      set skb->offload_fwd_mark. The bridge determines that the frame should
      be forwarded to swp{1,2}. It is imperative that forward offloading is
      _not_ allowed in this case, as the source hwdom is already "poisoned".

      Recording the source hwdom allows this case to be handled properly.

    v2->v3: added code comments
    v3->v6: none

    Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-01-05 16:14:30 +01:00
Tobias Waldekranz 6eb38bf8eb net: bridge: switchdev: send FDB notifications for host addresses
Treat addresses added to the bridge itself in the same way as regular
ports and send out a notification so that drivers may sync it down to
the hardware FDB.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 10:46:23 -07:00
Vladimir Oltean 3e19ae7c6f net: bridge: use READ_ONCE() and WRITE_ONCE() compiler barriers for fdb->dst
Annotate the writer side of fdb->dst:

- fdb_create()
- br_fdb_update()
- fdb_add_entry()
- br_fdb_external_learn_add()

with WRITE_ONCE() and the reader side:

- br_fdb_test_addr()
- br_fdb_update()
- fdb_fill_info()
- fdb_add_entry()
- fdb_delete_by_addr_and_port()
- br_fdb_external_learn_add()
- br_switchdev_fdb_notify()

with compiler barriers such that the readers do not attempt to reload
fdb->dst multiple times, leading to potentially different destination
ports when the fdb entry is updated concurrently.

This is especially important in read-side sections where fdb->dst is
used more than once, but let's convert all accesses for the sake of
uniformity.

Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 10:46:22 -07:00
Vladimir Oltean 2c4eca3ef7 net: bridge: switchdev: include local flag in FDB notifications
As explained in bugfix commit 6ab4c3117a ("net: bridge: don't notify
switchdev for local FDB addresses") as well as in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug,
which was that drivers would not know what to do with local FDB entries,
because they were not told that they are local. The bug fix was to
simply not notify them of those addresses.

Let us now add the 'is_local' bit to bridge FDB entries, and make all
drivers ignore these entries by their own choice.

Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Tobias Waldekranz e5b4b8988b net: bridge: switchdev: refactor br_switchdev_fdb_notify
Instead of having to add more and more arguments to
br_switchdev_fdb_call_notifiers, get rid of it and build the info
struct directly in br_switchdev_fdb_notify.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Vladimir Oltean 6ab4c3117a net: bridge: don't notify switchdev for local FDB addresses
As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:

ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local

and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.

There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result of adding a 'local' entry:

bridge fdb add dev swp0 00:01:02:03:04:05 master

I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443-1-olteanv@gmail.com/
If you're a user reading this and this is what you want, just use:

bridge fdb add dev swp0 00:01:02:03:04:05 master static

Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.

Fixes: 6b26b51b1d ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 14:39:41 -07:00
Vladimir Oltean dcbdf1350e net: bridge: propagate extack through switchdev_port_attr_set
The benefit is the ability to propagate errors from switchdev drivers
for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING and
SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL attributes.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14 17:38:11 -08:00
Vladimir Oltean e18f4c18ab net: switchdev: pass flags and mask to both {PRE_,}BRIDGE_FLAGS attributes
This switchdev attribute offers a counterproductive API for a driver
writer, because although br_switchdev_set_port_flag gets passed a
"flags" and a "mask", those are passed piecemeal to the driver, so while
the PRE_BRIDGE_FLAGS listener knows what changed because it has the
"mask", the BRIDGE_FLAGS listener doesn't, because it only has the final
value. But certain drivers can offload only certain combinations of
settings, like for example they cannot change unicast flooding
independently of multicast flooding - they must be both on or both off.
The way the information is passed to switchdev makes drivers not
expressive enough, and unable to reject this request ahead of time, in
the PRE_BRIDGE_FLAGS notifier, so they are forced to reject it during
the deferred BRIDGE_FLAGS attribute, where the rejection is currently
ignored.

This patch also changes drivers to make use of the "mask" field for edge
detection when possible.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:08:04 -08:00
Vladimir Oltean 078bbb851e net: bridge: don't print in br_switchdev_set_port_flag
For the netlink interface, propagate errors through extack rather than
simply printing them to the console. For the sysfs interface, we still
print to the console, but at least that's one layer higher than in
switchdev, which also allows us to silently ignore the offloading of
flags if that is ever needed in the future.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:08:04 -08:00
Vladimir Oltean 304ae3bf1c net: bridge: offload all port flags at once in br_setport
If for example this command:

ip link set swp0 type bridge_slave flood off mcast_flood off learning off

succeeded at configuring BR_FLOOD and BR_MCAST_FLOOD but not at
BR_LEARNING, there would be no attempt to revert the partial state in
any way. Arguably, if the user changes more than one flag through the
same netlink command, this one _should_ be all or nothing, which means
it should be passed through switchdev as all or nothing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:08:04 -08:00
Vladimir Oltean b7a9e0da2d net: switchdev: remove vid_begin -> vid_end range from VLAN objects
The call path of a switchdev VLAN addition to the bridge looks something
like this today:

        nbp_vlan_init
        |  __br_vlan_set_default_pvid
        |  |                       |
        |  |    br_afspec          |
        |  |        |              |
        |  |        v              |
        |  | br_process_vlan_info  |
        |  |        |              |
        |  |        v              |
        |  |   br_vlan_info        |
        |  |       / \            /
        |  |      /   \          /
        |  |     /     \        /
        |  |    /       \      /
        v  v   v         v    v
      nbp_vlan_add   br_vlan_add ------+
       |              ^      ^ |       |
       |             /       | |       |
       |            /       /  /       |
       \ br_vlan_get_master/  /        v
        \        ^        /  /  br_vlan_add_existing
         \       |       /  /          |
          \      |      /  /          /
           \     |     /  /          /
            \    |    /  /          /
             \   |   /  /          /
              v  |   | v          /
              __vlan_add         /
                 / |            /
                /  |           /
               v   |          /
   __vlan_vid_add  |         /
               \   |        /
                v  v        v
      br_switchdev_port_vlan_add

The ranges UAPI was introduced to the bridge in commit bdced7ef78
("bridge: support for multiple vlans and vlan ranges in setlink and
dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
have always been passed one by one, through struct bridge_vlan_info
tmp_vinfo, to br_vlan_info. So the range never went too far in depth.

Then Scott Feldman introduced the switchdev_port_bridge_setlink function
in commit 47f8328bb1 ("switchdev: add new switchdev bridge setlink").
That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
full use of the range. But switchdev_port_bridge_setlink was called like
this:

br_setlink
-> br_afspec
-> switchdev_port_bridge_setlink

Basically, the switchdev and the bridge code were not tightly integrated.
Then commit 41c498b935 ("bridge: restore br_setlink back to original")
came, and switchdev drivers were required to implement
.ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.

In the meantime, commits such as 0944d6b5a2 ("bridge: try switchdev op
first in __vlan_vid_add/del") finally made switchdev penetrate the
br_vlan_info() barrier and start to develop the call path we have today.
But remember, br_vlan_info() still receives VLANs one by one.

Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
29ab586c3d ("net: switchdev: Remove bridge bypass support from
switchdev") so that drivers would not implement .ndo_bridge_setlink any
longer. The switchdev_port_bridge_setlink also got deleted.
This refactoring removed the parallel bridge_setlink implementation from
switchdev, and left the only switchdev VLAN objects to be the ones
offloaded from __vlan_vid_add (basically RX filtering) and  __vlan_add
(the latter coming from commit 9c86ce2c1a ("net: bridge: Notify about
bridge VLANs")).

That is to say, today the switchdev VLAN object ranges are not used in
the kernel. Refactoring the above call path is a bit complicated, when
the bridge VLAN call path is already a bit complicated.

Let's go off and finish the job of commit 29ab586c3d by deleting the
bogus iteration through the VLAN ranges from the drivers. Some aspects
of this feature never made too much sense in the first place. For
example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
flag supposed to mean, when a port can obviously have a single pvid?
This particular configuration _is_ denied as of commit 6623c60dc2
("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
perspective, the driver still has to play pretend, and only offload the
vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
modify the flags of another, completely unrelated, switchdev VLAN
object! (a VLAN that is PVID will invalidate the PVID flag from whatever
other VLAN had previously been offloaded with switchdev and had that
flag. Yet switchdev never notifies about that change, drivers are
supposed to guess).

Nonetheless, having a VLAN range in the API makes error handling look
scarier than it really is - unwinding on errors and all of that.
When in reality, no one really calls this API with more than one VLAN.
It is all unnecessary complexity.

And despite appearing pretentious (two-phase transactional model and
all), the switchdev API is really sloppy because the VLAN addition and
removal operations are not paired with one another (you can add a VLAN
100 times and delete it just once). The bridge notifies through
switchdev of a VLAN addition not only when the flags of an existing VLAN
change, but also when nothing changes. There are switchdev drivers out
there who don't like adding a VLAN that has already been added, and
those checks don't really belong at driver level. But the fact that the
API contains ranges is yet another factor that prevents this from being
addressed in the future.

Of the existing switchdev pieces of hardware, it appears that only
Mellanox Spectrum supports offloading more than one VLAN at a time,
through mlxsw_sp_port_vlan_set. I have kept that code internal to the
driver, because there is some more bookkeeping that makes use of it, but
I deleted it from the switchdev API. But since the switchdev support for
ranges has already been de facto deleted by a Mellanox employee and
nobody noticed for 4 years, I'm going to assume it's not a biggie.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:56 -08:00
Nikolay Aleksandrov d38c6e3db0 net: bridge: fdb: convert offloaded to use bitops
Convert the offloaded field to a flag and use bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov ac3ca6af44 net: bridge: fdb: convert added_by_user to bitops
Straight-forward convert of the added_by_user field to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Florian Fainelli d45224d604 net: switchdev: Replace port attr set SDO with a notification
Drop switchdev_ops.switchdev_port_attr_set. Drop the uses of this field
from all clients, which were migrated to use switchdev notification in
the previous patches.

Add a new function switchdev_port_attr_notify() that sends the switchdev
notifications SWITCHDEV_PORT_ATTR_SET and calls the blocking (process)
notifier chain.

We have one odd case within net/bridge/br_switchdev.c with the
SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS attribute identifier that
requires executing from atomic context, we deal with that one
specifically.

Drop __switchdev_port_attr_set() and update switchdev_port_attr_set()
likewise.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27 12:39:56 -08:00
Florian Fainelli 1ef0764486 net: bridge: Stop calling switchdev_port_attr_get()
Now that all switchdev drivers have been converted to check the
SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS flags and report flags that they
do not support accordingly, we can migrate the bridge code to try to set
that attribute first, check the results and then do the actual setting.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 14:55:14 -08:00
Florian Fainelli bccb30254a net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID
Now that we have a dedicated NDO for getting a port's parent ID, get rid
of SWITCHDEV_ATTR_ID_PORT_PARENT_ID and convert all callers to use the
NDO exclusively. This is a preliminary change to getting rid of
switchdev_ops eventually.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:17:03 -08:00
Florian Fainelli d6abc59694 net: Introduce ndo_get_port_parent_id()
In preparation for getting rid of switchdev_ops, create a dedicated NDO
operation for getting the port's parent identifier. There are
essentially two classes of drivers that need to implement getting the
port's parent ID which are VF/PF drivers with a built-in switch, and
pure switchdev drivers such as mlxsw, ocelot, dsa etc.

We introduce a helper function: dev_get_port_parent_id() which supports
recursion into the lower devices to obtain the first port's parent ID.

Convert the bridge, core and ipv4 multicast routing code to check for
such ndo_get_port_parent_id() and call the helper function when valid
before falling back to switchdev_port_attr_get(). This will allow us to
convert all relevant drivers in one go instead of having to implement
both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
then get rid of switchdev_port_attr_get().

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:16:11 -08:00
Petr Machata 6685987c29 switchdev: Add extack argument to call_switchdev_notifiers()
A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 15:18:47 -08:00
Petr Machata 69b7320e14 net: switchdev: Add extack argument to switchdev_port_obj_add()
After the previous patch, bridge driver has extack argument available to
pass to switchdev. Therefore extend switchdev_port_obj_add() with this
argument, updating all callers, and passing the argument through to
switchdev_port_obj_notify().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12 16:34:21 -08:00
Petr Machata 169327d585 net: bridge: Propagate extack to switchdev
ndo_bridge_setlink has been updated in the previous patch to have extack
available, and changelink RTNL op has had this argument since the time
extack was added. Propagate both through the bridge driver to eventually
reach br_switchdev_port_vlan_add(), where it will be used by subsequent
patches.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12 16:34:21 -08:00