2021-02-13 22:37:56 +00:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0
|
2021-09-17 11:17:35 +00:00
|
|
|
* Copyright 2019-2021 NXP
|
2021-02-13 22:37:56 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef _NET_DSA_TAG_OCELOT_H
|
|
|
|
#define _NET_DSA_TAG_OCELOT_H
|
|
|
|
|
net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection
Problem description
-------------------
On an NXP LS1028A (felix DSA driver) with the following configuration:
- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700
we see that the ptp4l instances do not see each other's traffic,
and they all go to the grand master state due to the
ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.
Jumping to the conclusion for the impatient
-------------------------------------------
There is a zero-day bug in the ocelot switchdev driver in the way it
handles VLAN-tagged packet injection. The correct logic already exists in
the source code, in function ocelot_xmit_get_vlan_info() added by commit
5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
But it is used only for normal NPI-based injection with the DSA "ocelot"
tagging protocol. The other injection code paths (register-based and
FDMA-based) roll their own wrong logic. This affects and was noticed on
the DSA "ocelot-8021q" protocol because it uses register-based injection.
By moving ocelot_xmit_get_vlan_info() to a place that's common for both
the DSA tagger and the ocelot switch library, it can also be called from
ocelot_port_inject_frame() in ocelot.c.
We need to touch the lines with ocelot_ifh_port_set()'s prototype
anyway, so let's rename it to something clearer regarding what it does,
and add a kernel-doc. ocelot_ifh_set_basic() should do.
Investigation notes
-------------------
Debugging reveals that PTP event (aka those carrying timestamps, like
Sync) frames injected into swp0.700 (but also swp1.700) hit the wire
with two VLAN tags:
00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
~~~~~~~~~~~
00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
~~~~~~~~~~~
00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
00000040: 00 00
The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
ptp_classify_raw() fail to see these as PTP packets at the link
partner's receiving end, and return PTP_CLASS_NONE (because the BPF
classifier is not written to expect 2 VLAN tags).
The reason why packets have 2 VLAN tags is because the transmission
code treats VLAN incorrectly.
Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
feature. Therefore, at xmit time, all VLANs should be in the skb head,
and none should be in the hwaccel area. This is done by:
static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
netdev_features_t features)
{
if (skb_vlan_tag_present(skb) &&
!vlan_hw_offload_capable(features, skb->vlan_proto))
skb = __vlan_hwaccel_push_inside(skb);
return skb;
}
But ocelot_port_inject_frame() handles things incorrectly:
ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));
void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
{
(...)
if (vlan_tag)
ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
(...)
}
The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head
is by calling:
static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
{
skb->vlan_present = 0;
}
which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
This means that ocelot, when it calls skb_vlan_tag_get(), sees
(and uses) a residual skb->vlan_tci, while the same VLAN tag is
_already_ in the skb head.
The trivial fix for double VLAN headers is to replace the content of
ocelot_ifh_port_set() with:
if (skb_vlan_tag_present(skb))
ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));
but this would not be correct either, because, as mentioned,
vlan_hw_offload_capable() is false for us, so we'd be inserting dead
code and we'd always transmit packets with VID=0 in the injection frame
header.
I can't actually test the ocelot switchdev driver and rely exclusively
on code inspection, but I don't think traffic from 8021q uppers has ever
been injected properly, and not double-tagged. Thus I'm blaming the
introduction of VLAN fields in the injection header - early driver code.
As hinted at in the early conclusion, what we _want_ to happen for
VLAN transmission was already described once in commit 5ca721c54d86
("net: dsa: tag_ocelot: set the classified VLAN during xmit").
ocelot_xmit_get_vlan_info() intends to ensure that if the port through
which we're transmitting is under a VLAN-aware bridge, the outer VLAN
tag from the skb head is stripped from there and inserted into the
injection frame header (so that the packet is processed in hardware
through that actual VLAN). And in all other cases, the packet is sent
with VID=0 in the injection frame header, since the port is VLAN-unaware
and has logic to strip this VID on egress (making it invisible to the
wire).
Fixes: 08d02364b12f ("net: mscc: fix the injection header")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-15 00:07:02 +00:00
|
|
|
#include <linux/if_bridge.h>
|
|
|
|
#include <linux/if_vlan.h>
|
2021-10-12 11:40:41 +00:00
|
|
|
#include <linux/kthread.h>
|
2021-02-13 22:37:56 +00:00
|
|
|
#include <linux/packing.h>
|
2021-10-12 11:40:40 +00:00
|
|
|
#include <linux/skbuff.h>
|
2021-12-09 23:34:38 +00:00
|
|
|
#include <net/dsa.h>
|
2021-10-12 11:40:40 +00:00
|
|
|
|
|
|
|
struct ocelot_skb_cb {
|
|
|
|
struct sk_buff *clone;
|
|
|
|
unsigned int ptp_class; /* valid only for clones */
|
net: mscc: ocelot: be resilient to loss of PTP packets during transmission
The Felix DSA driver presents unique challenges that make the simplistic
ocelot PTP TX timestamping procedure unreliable: any transmitted packet
may be lost in hardware before it ever leaves our local system.
This may happen because there is congestion on the DSA conduit, the
switch CPU port or even user port (Qdiscs like taprio may delay packets
indefinitely by design).
The technical problem is that the kernel, i.e. ocelot_port_add_txtstamp_skb(),
runs out of timestamp IDs eventually, because it never detects that
packets are lost, and keeps the IDs of the lost packets on hold
indefinitely. The manifestation of the issue once the entire timestamp
ID range becomes busy looks like this in dmesg:
mscc_felix 0000:00:00.5: port 0 delivering skb without TX timestamp
mscc_felix 0000:00:00.5: port 1 delivering skb without TX timestamp
At the surface level, we need a timeout timer so that the kernel knows a
timestamp ID is available again. But there is a deeper problem with the
implementation, which is the monotonically increasing ocelot_port->ts_id.
In the presence of packet loss, it will be impossible to detect that and
reuse one of the holes created in the range of free timestamp IDs.
What we actually need is a bitmap of 63 timestamp IDs tracking which one
is available. That is able to use up holes caused by packet loss, but
also gives us a unique opportunity to not implement an actual timer_list
for the timeout timer (very complicated in terms of locking).
We could only declare a timestamp ID stale on demand (lazily), aka when
there's no other timestamp ID available. There are pros and cons to this
approach: the implementation is much more simple than per-packet timers
would be, but most of the stale packets would be quasi-leaked - not
really leaked, but blocked in driver memory, since this algorithm sees
no reason to free them.
An improved technique would be to check for stale timestamp IDs every
time we allocate a new one. Assuming a constant flux of PTP packets,
this avoids stale packets being blocked in memory, but of course,
packets lost at the end of the flux are still blocked until the flux
resumes (nobody left to kick them out).
Since implementing per-packet timers is way too complicated, this should
be good enough.
Testing procedure:
Persistently block traffic class 5 and try to run PTP on it:
$ tc qdisc replace dev swp3 parent root taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0xdf 100000 flags 0x2
[ 126.948141] mscc_felix 0000:00:00.5: port 3 tc 5 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
$ ptp4l -i swp3 -2 -P -m --socket_priority 5 --fault_reset_interval ASAP --logSyncInterval -3
ptp4l[70.351]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.354]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.358]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 70.394583] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[70.406]: timed out while polling for tx timestamp
ptp4l[70.406]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[70.406]: port 1 (swp3): send peer delay response failed
ptp4l[70.407]: port 1 (swp3): clearing fault immediately
ptp4l[70.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 71.394858] mscc_felix 0000:00:00.5: port 3 timestamp id 1
ptp4l[71.400]: timed out while polling for tx timestamp
ptp4l[71.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[71.401]: port 1 (swp3): send peer delay response failed
ptp4l[71.401]: port 1 (swp3): clearing fault immediately
[ 72.393616] mscc_felix 0000:00:00.5: port 3 timestamp id 2
ptp4l[72.401]: timed out while polling for tx timestamp
ptp4l[72.402]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[72.402]: port 1 (swp3): send peer delay response failed
ptp4l[72.402]: port 1 (swp3): clearing fault immediately
ptp4l[72.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 73.395291] mscc_felix 0000:00:00.5: port 3 timestamp id 3
ptp4l[73.400]: timed out while polling for tx timestamp
ptp4l[73.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[73.400]: port 1 (swp3): send peer delay response failed
ptp4l[73.400]: port 1 (swp3): clearing fault immediately
[ 74.394282] mscc_felix 0000:00:00.5: port 3 timestamp id 4
ptp4l[74.400]: timed out while polling for tx timestamp
ptp4l[74.401]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[74.401]: port 1 (swp3): send peer delay response failed
ptp4l[74.401]: port 1 (swp3): clearing fault immediately
ptp4l[74.953]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 75.396830] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 75.405760] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[75.410]: timed out while polling for tx timestamp
ptp4l[75.411]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[75.411]: port 1 (swp3): send peer delay response failed
ptp4l[75.411]: port 1 (swp3): clearing fault immediately
(...)
Remove the blocking condition and see that the port recovers:
$ same tc command as above, but use "sched-entry S 0xff" instead
$ same ptp4l command as above
ptp4l[99.489]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.490]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.492]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 100.403768] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 100.412545] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 1 which seems lost
[ 100.421283] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 2 which seems lost
[ 100.430015] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 3 which seems lost
[ 100.438744] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 4 which seems lost
[ 100.447470] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 100.505919] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[100.963]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 101.405077] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 101.507953] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.405405] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.509391] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.406003] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.510011] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.405601] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.510624] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[104.965]: selected best master clock d858d7.fffe.00ca6d
ptp4l[104.966]: port 1 (swp3): assuming the grand master role
ptp4l[104.967]: port 1 (swp3): LISTENING to GRAND_MASTER on RS_GRAND_MASTER
[ 105.106201] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.232420] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.359001] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.405500] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.485356] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.511220] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.610938] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.737237] mscc_felix 0000:00:00.5: port 3 timestamp id 0
(...)
Notice that in this new usage pattern, a non-congested port should
basically use timestamp ID 0 all the time, progressing to higher numbers
only if there are unacknowledged timestamps in flight. Compare this to
the old usage, where the timestamp ID used to monotonically increase
modulo OCELOT_MAX_PTP_ID.
In terms of implementation, this simplifies the bookkeeping of the
ocelot_port :: ts_id and ptp_skbs_in_flight. Since we need to traverse
the list of two-step timestampable skbs for each new packet anyway, the
information can already be computed and does not need to be stored.
Also, ocelot_port->tx_skbs is always accessed under the switch-wide
ocelot->ts_id_lock IRQ-unsafe spinlock, so we don't need the skb queue's
lock and can use the unlocked primitives safely.
This problem was actually detected using the tc-taprio offload, and is
causing trouble in TSN scenarios, which Felix (NXP LS1028A / VSC9959)
supports but Ocelot (VSC7514) does not. Thus, I've selected the commit
to blame as the one adding initial timestamping support for the Felix
switch.
Fixes: c0bcf537667c ("net: dsa: ocelot: add hardware timestamping support for Felix")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241205145519.1236778-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 14:55:18 +00:00
|
|
|
unsigned long ptp_tx_time; /* valid only for clones */
|
net: dsa: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
Normally it is expected that the dsa_device_ops :: rcv() method finishes
parsing the DSA tag and consumes it, then never looks at it again.
But commit c0bcf537667c ("net: dsa: ocelot: add hardware timestamping
support for Felix") added support for RX timestamping in a very
unconventional way. On this switch, a partial timestamp is available in
the DSA header, but the driver got away with not parsing that timestamp
right away, but instead delayed that parsing for a little longer:
dsa_switch_rcv():
nskb = cpu_dp->rcv(skb, dev); <------------- not here
-> ocelot_rcv()
...
skb = nskb;
skb_push(skb, ETH_HLEN);
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, skb->dev);
...
if (dsa_skb_defer_rx_timestamp(p, skb)) <--- but here
-> felix_rxtstamp()
return 0;
When in felix_rxtstamp(), this driver accounted for the fact that
eth_type_trans() happened in the meanwhile, so it got a hold of the
extraction header again by subtracting (ETH_HLEN + OCELOT_TAG_LEN) bytes
from the current skb->data.
This worked for quite some time but was quite fragile from the very
beginning. Not to mention that having DSA tag parsing split in two
different files, under different folders (net/dsa/tag_ocelot.c vs
drivers/net/dsa/ocelot/felix.c) made it quite non-obvious for patches to
come that they might break this.
Finally, the blamed commit does the following: at the end of
ocelot_rcv(), it checks whether the skb payload contains a VLAN header.
If it does, and this port is under a VLAN-aware bridge, that VLAN ID
might not be correct in the sense that the packet might have suffered
VLAN rewriting due to TCAM rules (VCAP IS1). So we consume the VLAN ID
from the skb payload using __skb_vlan_pop(), and take the classified
VLAN ID from the DSA tag, and construct a hwaccel VLAN tag with the
classified VLAN, and the skb payload is VLAN-untagged.
The big problem is that __skb_vlan_pop() does:
memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
__skb_pull(skb, VLAN_HLEN);
aka it moves the Ethernet header 4 bytes to the right, and pulls 4 bytes
from the skb headroom (effectively also moving skb->data, by definition).
So for felix_rxtstamp()'s fragile logic, all bets are off now.
Instead of having the "extraction" pointer point to the DSA header,
it actually points to 4 bytes _inside_ the extraction header.
Corollary, the last 4 bytes of the "extraction" header are in fact 4
stale bytes of the destination MAC address from the Ethernet header,
from prior to the __skb_vlan_pop() movement.
So of course, RX timestamps are completely bogus when the system is
configured in this way.
The fix is actually very simple: just don't structure the code like that.
For better or worse, the DSA PTP timestamping API does not offer a
straightforward way for drivers to present their RX timestamps, but
other drivers (sja1105) have established a simple mechanism to carry
their RX timestamp from dsa_device_ops :: rcv() all the way to
dsa_switch_ops :: port_rxtstamp() and even later. That mechanism is to
simply save the partial timestamp to the skb->cb, and complete it later.
Question: why don't we simply populate the skb's struct
skb_shared_hwtstamps from ocelot_rcv(), and bother with this
complication of propagating the timestamp to felix_rxtstamp()?
Answer: dsa_switch_ops :: port_rxtstamp() answers the question whether
PTP packets need sleepable context to retrieve the full RX timestamp.
Currently felix_rxtstamp() answers "no, thanks" to that question, and
calls ocelot_ptp_gettime64() from softirq atomic context. This is
understandable, since Felix VSC9959 is a PCIe memory-mapped switch, so
hardware access does not require sleeping. But the felix driver is
preparing for the introduction of other switches where hardware access
is over a slow bus like SPI or MDIO:
https://lore.kernel.org/lkml/20210814025003.2449143-1-colin.foster@in-advantage.com/
So I would like to keep this code structure, so the rework needed when
that driver will need PTP support will be minimal (answer "yes, I need
deferred context for this skb's RX timestamp", then the partial
timestamp will still be found in the skb->cb.
Fixes: ea440cd2d9b2 ("net: dsa: tag_ocelot: use VLAN information from tagging header when available")
Reported-by: Po Liu <po.liu@nxp.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-02 19:31:22 +00:00
|
|
|
u32 tstamp_lo;
|
2021-10-12 11:40:40 +00:00
|
|
|
u8 ptp_cmd;
|
|
|
|
u8 ts_id;
|
|
|
|
};
|
|
|
|
|
|
|
|
#define OCELOT_SKB_CB(skb) \
|
|
|
|
((struct ocelot_skb_cb *)((skb)->cb))
|
|
|
|
|
|
|
|
#define IFH_TAG_TYPE_C 0
|
|
|
|
#define IFH_TAG_TYPE_S 1
|
|
|
|
|
|
|
|
#define IFH_REW_OP_NOOP 0x0
|
|
|
|
#define IFH_REW_OP_DSCP 0x1
|
|
|
|
#define IFH_REW_OP_ONE_STEP_PTP 0x2
|
|
|
|
#define IFH_REW_OP_TWO_STEP_PTP 0x3
|
|
|
|
#define IFH_REW_OP_ORIGIN_PTP 0x5
|
2021-02-13 22:37:56 +00:00
|
|
|
|
|
|
|
#define OCELOT_TAG_LEN 16
|
|
|
|
#define OCELOT_SHORT_PREFIX_LEN 4
|
|
|
|
#define OCELOT_LONG_PREFIX_LEN 16
|
|
|
|
#define OCELOT_TOTAL_TAG_LEN (OCELOT_SHORT_PREFIX_LEN + OCELOT_TAG_LEN)
|
|
|
|
|
|
|
|
/* The CPU injection header and the CPU extraction header can have 3 types of
|
|
|
|
* prefixes: long, short and no prefix. The format of the header itself is the
|
|
|
|
* same in all 3 cases.
|
|
|
|
*
|
|
|
|
* Extraction with long prefix:
|
|
|
|
*
|
|
|
|
* +-------------------+-------------------+------+------+------------+-------+
|
|
|
|
* | ff:ff:ff:ff:ff:ff | fe:ff:ff:ff:ff:ff | 8880 | 000a | extraction | frame |
|
|
|
|
* | | | | | header | |
|
|
|
|
* +-------------------+-------------------+------+------+------------+-------+
|
|
|
|
* 48 bits 48 bits 16 bits 16 bits 128 bits
|
|
|
|
*
|
|
|
|
* Extraction with short prefix:
|
|
|
|
*
|
|
|
|
* +------+------+------------+-------+
|
|
|
|
* | 8880 | 000a | extraction | frame |
|
|
|
|
* | | | header | |
|
|
|
|
* +------+------+------------+-------+
|
|
|
|
* 16 bits 16 bits 128 bits
|
|
|
|
*
|
|
|
|
* Extraction with no prefix:
|
|
|
|
*
|
|
|
|
* +------------+-------+
|
|
|
|
* | extraction | frame |
|
|
|
|
* | header | |
|
|
|
|
* +------------+-------+
|
|
|
|
* 128 bits
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* Injection with long prefix:
|
|
|
|
*
|
|
|
|
* +-------------------+-------------------+------+------+------------+-------+
|
|
|
|
* | any dmac | any smac | 8880 | 000a | injection | frame |
|
|
|
|
* | | | | | header | |
|
|
|
|
* +-------------------+-------------------+------+------+------------+-------+
|
|
|
|
* 48 bits 48 bits 16 bits 16 bits 128 bits
|
|
|
|
*
|
|
|
|
* Injection with short prefix:
|
|
|
|
*
|
|
|
|
* +------+------+------------+-------+
|
|
|
|
* | 8880 | 000a | injection | frame |
|
|
|
|
* | | | header | |
|
|
|
|
* +------+------+------------+-------+
|
|
|
|
* 16 bits 16 bits 128 bits
|
|
|
|
*
|
|
|
|
* Injection with no prefix:
|
|
|
|
*
|
|
|
|
* +------------+-------+
|
|
|
|
* | injection | frame |
|
|
|
|
* | header | |
|
|
|
|
* +------------+-------+
|
|
|
|
* 128 bits
|
|
|
|
*
|
|
|
|
* The injection header looks like this (network byte order, bit 127
|
|
|
|
* is part of lowest address byte in memory, bit 0 is part of highest
|
|
|
|
* address byte):
|
|
|
|
*
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 127:120 |BYPASS| MASQ | MASQ_PORT |REW_OP|REW_OP|
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 119:112 | REW_OP |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 111:104 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 103: 96 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 95: 88 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 87: 80 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 79: 72 | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 71: 64 | RSV | DEST |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 63: 56 | DEST |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 55: 48 | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 47: 40 | RSV | SRC_PORT | RSV |TFRM_TIMER|
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 39: 32 | TFRM_TIMER | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 31: 24 | RSV | DP | POP_CNT | CPUQ |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 23: 16 | CPUQ | QOS_CLASS |TAG_TYPE|
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 15: 8 | PCP | DEI | VID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 7: 0 | VID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
*
|
|
|
|
* And the extraction header looks like this:
|
|
|
|
*
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 127:120 | RSV | REW_OP |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 119:112 | REW_OP | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 111:104 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 103: 96 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 95: 88 | REW_VAL |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 87: 80 | REW_VAL | LLEN |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 79: 72 | LLEN | WLEN |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 71: 64 | WLEN | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 63: 56 | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 55: 48 | RSV |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 47: 40 | RSV | SRC_PORT | ACL_ID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 39: 32 | ACL_ID | RSV | SFLOW_ID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 31: 24 |ACL_HIT| DP | LRN_FLAGS | CPUQ |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 23: 16 | CPUQ | QOS_CLASS |TAG_TYPE|
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 15: 8 | PCP | DEI | VID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
* 7: 0 | VID |
|
|
|
|
* +------+------+------+------+------+------+------+------+
|
|
|
|
*/
|
|
|
|
|
2021-10-12 11:40:41 +00:00
|
|
|
struct felix_deferred_xmit_work {
|
|
|
|
struct dsa_port *dp;
|
|
|
|
struct sk_buff *skb;
|
|
|
|
struct kthread_work work;
|
|
|
|
};
|
|
|
|
|
2021-12-09 23:34:38 +00:00
|
|
|
struct ocelot_8021q_tagger_data {
|
2021-10-12 11:40:41 +00:00
|
|
|
void (*xmit_work_fn)(struct kthread_work *work);
|
|
|
|
};
|
|
|
|
|
2021-12-09 23:34:38 +00:00
|
|
|
static inline struct ocelot_8021q_tagger_data *
|
|
|
|
ocelot_8021q_tagger_data(struct dsa_switch *ds)
|
|
|
|
{
|
|
|
|
BUG_ON(ds->dst->tag_ops->proto != DSA_TAG_PROTO_OCELOT_8021Q);
|
|
|
|
|
|
|
|
return ds->tagger_data;
|
|
|
|
}
|
|
|
|
|
2021-02-13 22:37:56 +00:00
|
|
|
static inline void ocelot_xfh_get_rew_val(void *extraction, u64 *rew_val)
|
|
|
|
{
|
|
|
|
packing(extraction, rew_val, 116, 85, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_xfh_get_len(void *extraction, u64 *len)
|
|
|
|
{
|
|
|
|
u64 llen, wlen;
|
|
|
|
|
|
|
|
packing(extraction, &llen, 84, 79, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
packing(extraction, &wlen, 78, 71, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
|
|
|
|
*len = 60 * wlen + llen - 80;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_xfh_get_src_port(void *extraction, u64 *src_port)
|
|
|
|
{
|
|
|
|
packing(extraction, src_port, 46, 43, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_xfh_get_qos_class(void *extraction, u64 *qos_class)
|
|
|
|
{
|
|
|
|
packing(extraction, qos_class, 19, 17, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_xfh_get_tag_type(void *extraction, u64 *tag_type)
|
|
|
|
{
|
|
|
|
packing(extraction, tag_type, 16, 16, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_xfh_get_vlan_tci(void *extraction, u64 *vlan_tci)
|
|
|
|
{
|
|
|
|
packing(extraction, vlan_tci, 15, 0, OCELOT_TAG_LEN, UNPACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_ifh_set_bypass(void *injection, u64 bypass)
|
|
|
|
{
|
|
|
|
packing(injection, &bypass, 127, 127, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_ifh_set_rew_op(void *injection, u64 rew_op)
|
|
|
|
{
|
|
|
|
packing(injection, &rew_op, 125, 117, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_ifh_set_dest(void *injection, u64 dest)
|
|
|
|
{
|
|
|
|
packing(injection, &dest, 67, 56, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_ifh_set_qos_class(void *injection, u64 qos_class)
|
|
|
|
{
|
|
|
|
packing(injection, &qos_class, 19, 17, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
net: dsa: tag_ocelot: create separate tagger for Seville
The ocelot tagger is a hot mess currently, it relies on memory
initialized by the attached driver for basic frame transmission.
This is against all that DSA tagging protocols stand for, which is that
the transmission and reception of a DSA-tagged frame, the data path,
should be independent from the switch control path, because the tag
protocol is in principle hot-pluggable and reusable across switches
(even if in practice it wasn't until very recently). But if another
driver like dsa_loop wants to make use of tag_ocelot, it couldn't.
This was done to have common code between Felix and Ocelot, which have
one bit difference in the frame header format. Quoting from commit
67c2404922c2 ("net: dsa: felix: create a template for the DSA tags on
xmit"):
Other alternatives have been analyzed, such as:
- Create a separate tag_seville.c: too much code duplication for just 1
bit field difference.
- Create a separate DSA_TAG_PROTO_SEVILLE under tag_ocelot.c, just like
tag_brcm.c, which would have a separate .xmit function. Again, too
much code duplication for just 1 bit field difference.
- Allocate the template from the init function of the tag_ocelot.c
module, instead of from the driver: couldn't figure out a method of
accessing the correct port template corresponding to the correct
tagger in the .xmit function.
The really interesting part is that Seville should have had its own
tagging protocol defined - it is not compatible on the wire with Ocelot,
even for that single bit. In principle, a packet generated by
DSA_TAG_PROTO_OCELOT when booted on NXP LS1028A would look in a certain
way, but when booted on NXP T1040 it would look differently. The reverse
is also true: a packet generated by a Seville switch would be
interpreted incorrectly by Wireshark if it was told it was generated by
an Ocelot switch.
Actually things are a bit more nuanced. If we concentrate only on the
DSA tag, what I said above is true, but Ocelot/Seville also support an
optional DSA tag prefix, which can be short or long, and it is possible
to distinguish the two taggers based on an integer constant put in that
prefix. Nonetheless, creating a separate tagger is still justified,
since the tag prefix is optional, and without it, there is again no way
to distinguish.
Claiming backwards binary compatibility is a bit more tough, since I've
already changed the format of tag_ocelot once, in commit 5124197ce58b
("net: dsa: tag_ocelot: use a short prefix on both ingress and egress").
Therefore I am not very concerned with treating this as a bugfix and
backporting it to stable kernels (which would be another mess due to the
fact that there would be lots of conflicts with the other DSA_TAG_PROTO*
definitions). It's just simpler to say that the string values of the
taggers have ABI value starting with kernel 5.12, which will be when the
changing of tag protocol via /sys/class/net/<dsa-master>/dsa/tagging
goes live.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 22:37:58 +00:00
|
|
|
static inline void seville_ifh_set_dest(void *injection, u64 dest)
|
|
|
|
{
|
|
|
|
packing(injection, &dest, 67, 57, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void ocelot_ifh_set_src(void *injection, u64 src)
|
|
|
|
{
|
|
|
|
packing(injection, &src, 46, 43, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
2021-02-13 22:37:56 +00:00
|
|
|
static inline void ocelot_ifh_set_tag_type(void *injection, u64 tag_type)
|
|
|
|
{
|
|
|
|
packing(injection, &tag_type, 16, 16, OCELOT_TAG_LEN, PACK, 0);
|
|
|
|
}
|
|
|
|
|
2021-10-01 15:15:27 +00:00
|
|
|
static inline void ocelot_ifh_set_vlan_tci(void *injection, u64 vlan_tci)
|
2021-02-13 22:37:56 +00:00
|
|
|
{
|
2021-10-01 15:15:27 +00:00
|
|
|
packing(injection, &vlan_tci, 15, 0, OCELOT_TAG_LEN, PACK, 0);
|
2021-02-13 22:37:56 +00:00
|
|
|
}
|
|
|
|
|
2021-10-12 11:40:40 +00:00
|
|
|
/* Determine the PTP REW_OP to use for injecting the given skb */
|
|
|
|
static inline u32 ocelot_ptp_rew_op(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
struct sk_buff *clone = OCELOT_SKB_CB(skb)->clone;
|
|
|
|
u8 ptp_cmd = OCELOT_SKB_CB(skb)->ptp_cmd;
|
|
|
|
u32 rew_op = 0;
|
|
|
|
|
|
|
|
if (ptp_cmd == IFH_REW_OP_TWO_STEP_PTP && clone) {
|
|
|
|
rew_op = ptp_cmd;
|
|
|
|
rew_op |= OCELOT_SKB_CB(clone)->ts_id << 3;
|
|
|
|
} else if (ptp_cmd == IFH_REW_OP_ORIGIN_PTP) {
|
|
|
|
rew_op = ptp_cmd;
|
|
|
|
}
|
|
|
|
|
|
|
|
return rew_op;
|
|
|
|
}
|
|
|
|
|
net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection
Problem description
-------------------
On an NXP LS1028A (felix DSA driver) with the following configuration:
- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700
we see that the ptp4l instances do not see each other's traffic,
and they all go to the grand master state due to the
ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.
Jumping to the conclusion for the impatient
-------------------------------------------
There is a zero-day bug in the ocelot switchdev driver in the way it
handles VLAN-tagged packet injection. The correct logic already exists in
the source code, in function ocelot_xmit_get_vlan_info() added by commit
5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
But it is used only for normal NPI-based injection with the DSA "ocelot"
tagging protocol. The other injection code paths (register-based and
FDMA-based) roll their own wrong logic. This affects and was noticed on
the DSA "ocelot-8021q" protocol because it uses register-based injection.
By moving ocelot_xmit_get_vlan_info() to a place that's common for both
the DSA tagger and the ocelot switch library, it can also be called from
ocelot_port_inject_frame() in ocelot.c.
We need to touch the lines with ocelot_ifh_port_set()'s prototype
anyway, so let's rename it to something clearer regarding what it does,
and add a kernel-doc. ocelot_ifh_set_basic() should do.
Investigation notes
-------------------
Debugging reveals that PTP event (aka those carrying timestamps, like
Sync) frames injected into swp0.700 (but also swp1.700) hit the wire
with two VLAN tags:
00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
~~~~~~~~~~~
00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
~~~~~~~~~~~
00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
00000040: 00 00
The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
ptp_classify_raw() fail to see these as PTP packets at the link
partner's receiving end, and return PTP_CLASS_NONE (because the BPF
classifier is not written to expect 2 VLAN tags).
The reason why packets have 2 VLAN tags is because the transmission
code treats VLAN incorrectly.
Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
feature. Therefore, at xmit time, all VLANs should be in the skb head,
and none should be in the hwaccel area. This is done by:
static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
netdev_features_t features)
{
if (skb_vlan_tag_present(skb) &&
!vlan_hw_offload_capable(features, skb->vlan_proto))
skb = __vlan_hwaccel_push_inside(skb);
return skb;
}
But ocelot_port_inject_frame() handles things incorrectly:
ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));
void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
{
(...)
if (vlan_tag)
ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
(...)
}
The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head
is by calling:
static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
{
skb->vlan_present = 0;
}
which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
This means that ocelot, when it calls skb_vlan_tag_get(), sees
(and uses) a residual skb->vlan_tci, while the same VLAN tag is
_already_ in the skb head.
The trivial fix for double VLAN headers is to replace the content of
ocelot_ifh_port_set() with:
if (skb_vlan_tag_present(skb))
ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));
but this would not be correct either, because, as mentioned,
vlan_hw_offload_capable() is false for us, so we'd be inserting dead
code and we'd always transmit packets with VID=0 in the injection frame
header.
I can't actually test the ocelot switchdev driver and rely exclusively
on code inspection, but I don't think traffic from 8021q uppers has ever
been injected properly, and not double-tagged. Thus I'm blaming the
introduction of VLAN fields in the injection header - early driver code.
As hinted at in the early conclusion, what we _want_ to happen for
VLAN transmission was already described once in commit 5ca721c54d86
("net: dsa: tag_ocelot: set the classified VLAN during xmit").
ocelot_xmit_get_vlan_info() intends to ensure that if the port through
which we're transmitting is under a VLAN-aware bridge, the outer VLAN
tag from the skb head is stripped from there and inserted into the
injection frame header (so that the packet is processed in hardware
through that actual VLAN). And in all other cases, the packet is sent
with VID=0 in the injection frame header, since the port is VLAN-unaware
and has logic to strip this VID on egress (making it invisible to the
wire).
Fixes: 08d02364b12f ("net: mscc: fix the injection header")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-15 00:07:02 +00:00
|
|
|
/**
|
|
|
|
* ocelot_xmit_get_vlan_info: Determine VLAN_TCI and TAG_TYPE for injected frame
|
|
|
|
* @skb: Pointer to socket buffer
|
|
|
|
* @br: Pointer to bridge device that the port is under, if any
|
|
|
|
* @vlan_tci:
|
|
|
|
* @tag_type:
|
|
|
|
*
|
|
|
|
* If the port is under a VLAN-aware bridge, remove the VLAN header from the
|
|
|
|
* payload and move it into the DSA tag, which will make the switch classify
|
|
|
|
* the packet to the bridge VLAN. Otherwise, leave the classified VLAN at zero,
|
|
|
|
* which is the pvid of standalone ports (OCELOT_STANDALONE_PVID), although not
|
|
|
|
* of VLAN-unaware bridge ports (that would be ocelot_vlan_unaware_pvid()).
|
|
|
|
* Anyway, VID 0 is fine because it is stripped on egress for these port modes,
|
|
|
|
* and source address learning is not performed for packets injected from the
|
|
|
|
* CPU anyway, so it doesn't matter that the VID is "wrong".
|
|
|
|
*/
|
|
|
|
static inline void ocelot_xmit_get_vlan_info(struct sk_buff *skb,
|
|
|
|
struct net_device *br,
|
|
|
|
u64 *vlan_tci, u64 *tag_type)
|
|
|
|
{
|
|
|
|
struct vlan_ethhdr *hdr;
|
|
|
|
u16 proto, tci;
|
|
|
|
|
|
|
|
if (!br || !br_vlan_enabled(br)) {
|
|
|
|
*vlan_tci = 0;
|
|
|
|
*tag_type = IFH_TAG_TYPE_C;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
hdr = (struct vlan_ethhdr *)skb_mac_header(skb);
|
|
|
|
br_vlan_get_proto(br, &proto);
|
|
|
|
|
|
|
|
if (ntohs(hdr->h_vlan_proto) == proto) {
|
|
|
|
vlan_remove_tag(skb, &tci);
|
|
|
|
*vlan_tci = tci;
|
|
|
|
} else {
|
|
|
|
rcu_read_lock();
|
|
|
|
br_vlan_get_pvid_rcu(br, &tci);
|
|
|
|
rcu_read_unlock();
|
|
|
|
*vlan_tci = tci;
|
|
|
|
}
|
|
|
|
|
|
|
|
*tag_type = (proto != ETH_P_8021Q) ? IFH_TAG_TYPE_S : IFH_TAG_TYPE_C;
|
|
|
|
}
|
|
|
|
|
2021-02-13 22:37:56 +00:00
|
|
|
#endif
|