linux-kernelorg-stable/net
Ilya Maximets 59f44c9ccc net: openvswitch: allow providing upcall pid for the 'execute' command
When a packet enters the OVS datapath and there is no flow to handle
it, the packet goes to userspace through a MISS upcall.  With the
per-CPU upcall dispatch mechanism, we use the current CPU id to select
the Netlink PID on which to send this packet.  This allows us to send
packets from the same traffic flow through the same handler.
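
The per-CPU selection described above can be sketched as a simple
table lookup.  This is a toy model in Python, not the kernel code:
the function name, the PID values, the wrap-around for an out-of-range
CPU id and the use of 0 for "no handler" are all assumptions made for
illustration.

```python
def select_upcall_pid(cpu_pids, cpu_id):
    """Return the Netlink PID for an upcall raised on `cpu_id`.

    `cpu_pids` is a hypothetical table holding one handler socket PID
    per CPU, as registered by userspace.
    """
    if not cpu_pids:
        return 0  # assumption: 0 stands for "no handler registered"
    # Assumption: wrap around when the table is shorter than the CPU count.
    return cpu_pids[cpu_id % len(cpu_pids)]


# Hypothetical PIDs of four handler sockets, indexed by CPU id.
cpu_pids = [1001, 1002, 1003, 1004]
```

In this model, packets of one traffic flow that are received on the
same CPU always reach the same handler socket, for example
select_upcall_pid(cpu_pids, 2) always yields 1003.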

The handler will process the packet, install the required flow into
the kernel and re-inject the original packet via
OVS_PACKET_CMD_EXECUTE.

While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again.  And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.

However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread.  So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.

The process continues as long as there are new recirculations: each
time, the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port.  In real setups the
number of recirculations can go up to 4 or 5, sometimes more.
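
The hopping between handlers can be traced with a small model.  This
is only a sketch under stated assumptions: one handler per CPU (handler
N receives upcalls raised on CPU N), and a hypothetical `thread_cpu`
map saying on which CPU each handler thread happens to run.

```python
def upcall_trace(rx_cpu, thread_cpu, n_misses):
    """List the handler that receives each MISS upcall for one packet."""
    trace = []
    cpu = rx_cpu
    for _ in range(n_misses):
        handler = cpu  # per-CPU dispatch: the current CPU picks the handler
        trace.append(handler)
        # The execute request runs in the syscall context of that handler
        # thread, so the next miss is raised on the thread's CPU.
        cpu = thread_cpu[handler]
    return trace


# Hypothetical placement: handler 0 runs on CPU 2, handler 1 on CPU 3, ...
thread_cpu = {0: 2, 1: 3, 2: 1, 3: 0}
```

With this placement, upcall_trace(0, thread_cpu, 4) returns
[0, 2, 1, 3]: a packet with four misses is processed by four different
handler threads.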

There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet.  So, there is a race window when the flow is already
installed and the second packet can match it and be forwarded to the
destination before the first packet is re-injected.  But the fact that
packets are going through multiple upcalls handled by different
userspace threads makes the reordering noticeably more likely, because
we not only have a race between the kernel and a userspace handler
(which is hard to avoid), but also between multiple userspace handlers.

For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1.  It will start processing them, install the flow
into the kernel and start re-injecting packets back, from where they
will go through another MISS to handler-2.  Handler-2 will install the
flow into the kernel and start re-injecting the packets.  Meanwhile,
handler-1 continues to re-inject the last of the 10 packets; they will
hit the flow installed by handler-2 and be forwarded without going to
handler-2, while handler-2 is still re-injecting the first of these 10
packets.  Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.
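
The example with handler-1 and handler-2 can be replayed with a
deterministic toy simulation.  It is a simplified model, not the OVS
implementation: each packet needs two flow lookups (stages 0 and 1),
a handler installs the flow and then re-injects one packet per step,
and the `schedule` list (an assumption of this sketch) fixes the order
in which the two handlers get to run.

```python
from collections import deque


def simulate(schedule, n_packets=10):
    """Return the order in which packets leave the modelled datapath."""
    queues = {"handler-1": deque(), "handler-2": deque()}
    owner = {0: "handler-1", 1: "handler-2"}  # lookup stage -> handler
    installed = set()  # stages that already have a flow in the kernel
    output = []

    def lookup(pkt, stage):
        # Pass through every stage whose flow is already installed.
        while stage < 2 and stage in installed:
            stage += 1
        if stage == 2:
            output.append(pkt)  # forwarded to the destination port
        else:
            queues[owner[stage]].append((pkt, stage))  # MISS upcall

    for pkt in range(n_packets):
        lookup(pkt, 0)  # all packets miss before any handler runs

    for handler in schedule:
        if queues[handler]:
            pkt, stage = queues[handler].popleft()
            installed.add(stage)    # install the flow first ...
            lookup(pkt, stage + 1)  # ... then re-inject the packet
    return output


# handler-1 re-injects three packets, then both handlers alternate.
schedule = ["handler-1"] * 3 + ["handler-2", "handler-1"] * 7
```

For this schedule simulate(schedule) returns
[0, 3, 1, 4, 2, 5, 6, 7, 8, 9]: packets 3 and 4 hit the flow installed
by handler-2 and overtake packets 1 and 2, which are still waiting in
the queue of handler-2.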

Let's allow userspace to specify on which Netlink PID the packets
should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
This makes it possible to ensure that all the packets are processed
by the same handler thread in userspace even if they are upcalled
multiple times in the process.  Packets will remain in order since
they will be enqueued to the same socket and re-injected in the same
order.  This doesn't eliminate re-ordering as stated above, since we
still have a race between the kernel and the userspace thread, but it
allows us to eliminate races between multiple userspace threads.
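
A minimal sketch of the resulting selection logic, again as a toy
model rather than the kernel code.  The function names, the PID values,
the `thread_cpu` placement map and the convention that 0 means "no PID
provided" are assumptions of this example.

```python
def pick_upcall_pid(cpu_pids, current_cpu, requested_pid=0):
    """Prefer the PID provided with the execute request, if any."""
    if requested_pid:
        return requested_pid
    return cpu_pids[current_cpu]  # fall back to per-CPU dispatch


def upcall_pids(rx_cpu, cpu_pids, thread_cpu, n_misses, provide_pid=True):
    """List the socket PID that receives each upcall for one packet."""
    pids = []
    cpu, requested = rx_cpu, 0
    for _ in range(n_misses):
        pid = pick_upcall_pid(cpu_pids, cpu, requested)
        pids.append(pid)
        cpu = thread_cpu[pid]  # the handler re-injects from its own CPU
        # Userspace knows the PID of the socket the upcall arrived on
        # and passes it back with the execute request.
        requested = pid if provide_pid else 0
    return pids


cpu_pids = [1001, 1002, 1003, 1004]                # hypothetical PIDs
thread_cpu = {1001: 2, 1002: 3, 1003: 1, 1004: 0}  # hypothetical placement
```

With the PID provided, upcall_pids(0, cpu_pids, thread_cpu, 4) returns
[1001, 1001, 1001, 1001].  With provide_pid=False the same call returns
[1001, 1003, 1002, 1004], which is the hopping behaviour described
earlier.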

Userspace knows the PID of the socket on which the original upcall is
received, so there is no need to send it up from the kernel.

The solution requires storing the value somewhere for the duration of
the packet processing.  There are two potential places for this: our
skb extension or the per-CPU storage.  It's not clear which is better,
so this change just follows the currently used scheme of storing this
kind of thing along with the skb.  We still have a decent amount of
space in the cb.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250702155043.2331772-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07 14:30:39 -07:00
6lowpan
9p netfs: Fix the request's work item to not require a ref 2025-05-21 14:35:20 +02:00
802 treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
8021q net: 802: Remove unused p8022 code 2025-04-22 07:04:02 -07:00
appletalk net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
atm atm: Release atm_dev_mutex after removing procfs in atm_dev_deregister(). 2025-06-25 16:43:39 -07:00
ax25 treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
batman-adv treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-07-04 08:03:18 +02:00
bpf selftests/bpf: Add test to access const void pointer argument in tracing program 2025-04-23 11:26:22 -07:00
bridge bridge: mcast: Fix use-after-free during router port configuration 2025-06-23 18:19:10 -07:00
caif caif: reduce stack size, again 2025-06-23 16:58:43 -07:00
can Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-06-12 10:09:10 -07:00
ceph A small CephFS encryption-related fix and a dead code cleanup. 2025-04-25 15:51:28 -07:00
core net: preserve MSG_ZEROCOPY with forwarding 2025-07-02 15:07:16 -07:00
dcb
devlink devlink: Extend devlink rate API with traffic classes bandwidth management 2025-07-02 15:39:05 -07:00
dns_resolver
dsa net: dsa: tag_brcm: add support for legacy FCS tags 2025-06-17 17:52:01 -07:00
ethernet
ethtool net: ethtool: fix leaking netdev ref if ethnl_default_parse() failed 2025-07-01 17:27:53 -07:00
handshake
hsr treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ieee802154 treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ife
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-07-04 08:03:18 +02:00
ipv6 ipv6: Cleanup fib6_drop_pcpu_from() 2025-07-03 16:00:50 +02:00
iucv
kcm
key net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
l2tp net: annotate races around sk->sk_uid 2025-06-23 17:04:03 -07:00
l3mdev net: fib_rules: Fix iif / oif matching on L3 master device 2025-04-15 17:54:56 -07:00
lapb treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
llc net: make sk->sk_rcvtimeo lockless 2025-06-23 17:05:12 -07:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-06-26 10:40:50 -07:00
mac802154
mctp net: mctp: use nlmsg_payload() for netlink message data extraction 2025-05-26 17:38:27 +02:00
mpls mpls: Use rcu_dereference_rtnl() in mpls_route_input_rcu(). 2025-06-17 18:21:59 -07:00
mptcp tcp: move tcp_memory_allocated into net_aligned_data 2025-07-02 14:22:02 -07:00
ncsi net: ncsi: Fix buffer overflow in fetching version id 2025-06-12 18:21:59 -07:00
netfilter net: dst: annotate data-races around dst->obsolete 2025-07-02 14:32:29 -07:00
netlabel calipso: unlock rcu before returning -EAFNOSUPPORT 2025-06-05 08:03:38 -07:00
netlink netlink: fix policy dump for int with validation callback 2025-05-12 18:50:09 -07:00
netrom treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-06-19 13:00:24 -07:00
nsh
openvswitch net: openvswitch: allow providing upcall pid for the 'execute' command 2025-07-07 14:30:39 -07:00
packet net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
phonet net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
psample
qrtr
rds rds: Correct spelling 2025-06-21 07:35:39 -07:00
rfkill
rose rose: fix dangling neighbour pointers in rose_rt_device_down() 2025-07-01 19:28:48 -07:00
rxrpc treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-07-04 08:03:18 +02:00
sctp net: dst: annotate data-races around dst->obsolete 2025-07-02 14:32:29 -07:00
shaper
smc net: make sk->sk_rcvtimeo lockless 2025-06-23 17:05:12 -07:00
strparser net: make sk->sk_rcvtimeo lockless 2025-06-23 17:05:12 -07:00
sunrpc sunrpc: fix loop in gss seqno cache 2025-06-23 11:01:15 -04:00
switchdev
tipc net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
tls bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls 2025-06-11 16:59:42 +02:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-06-26 10:40:50 -07:00
vmw_vsock vsock/vmci: Clear the vmci transport packet properly when initializing it 2025-07-03 12:52:52 +02:00
wireless wifi: cfg80211: support configuration of S1G station capabilities 2025-06-24 15:19:28 +02:00
x25 net: make sk->sk_rcvtimeo lockless 2025-06-23 17:05:12 -07:00
xdp net: remove sock_i_uid() 2025-06-23 17:04:03 -07:00
xfrm net: dst: annotate data-races around dst->obsolete 2025-07-02 14:32:29 -07:00
Kconfig net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR 2025-05-27 17:31:42 -07:00
Kconfig.debug
Makefile net: Retire DCCP socket. 2025-04-11 18:58:10 -07:00
compat.c
devres.c
socket.c net: annotate races around sk->sk_uid 2025-06-23 17:04:03 -07:00
sysctl_net.c