Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
mptcp_delegated_actions is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data
structure requires explicit locking.
Add a local_lock_t to the data structure and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Mat Martineau <martineau@kernel.org>
Cc: Geliang Tang <geliang@kernel.org>
Cc: mptcp@lists.linux.dev
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-13-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ever since commit f5f80e32de ("ipv6: remove hard coded limitation on
ipv6_pinfo") that protocols stopped using the old "obj_size -
sizeof(struct ipv6_pinfo)" way of grabbing ipv6_pinfo, that severely
restricted struct layout and caused fun, hard to see issues.
However, mptcp_inet6_sk wasn't fixed (unlike tcp_inet6_sk). Do so.
The non-cloned sockets already do the right thing using
ipv6_pinfo_offset + the generic IPv6 code.
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250430154541.1038561-1-pfalcato@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This counter is useful to understand why some paths are rejected, and
not created as expected.
It is incremented when receiving a connection request, if the PM didn't
allow the creation of new subflows.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-5-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icsk->icsk_ack.timeout can be replaced by icsk->csk_delack_timer.expires
This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250324203607.703850-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icsk->icsk_timeout can be replaced by icsk->icsk_retransmit_timer.expires
This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250324203607.703850-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When destroying an MPTCP socket, some userspace PM specific code was
called from mptcp_destroy_common() in protocol.c. That feels wrong, and
it is the only case.
Instead, the core now calls mptcp_pm_destroy() from pm.c which is now in
charge of cleaning the announced addresses list, and ask the different
PMs to do extra cleaning if needed, e.g. the userspace PM, if used, will
clean the local addresses list.
While at it, the userspace PM specific helper has been prefixed with
'mptcp_userspace_pm_' like the other ones.
No behavioural changes intended.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-9-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, in-kernel PM specific helpers are prefixed with
'mptcp_pm_nl_'. But here 'mptcp_pm_nl_work' is not specific to this PM:
it is called from the core to call helpers, some of them needed by both
the in-kernel and userspace PMs.
To avoid confusions, the '_nl' bit has been removed from the name.
Also used 'worker' instead of 'work', similar to protocol.c's worker.
No behavioural changes intended.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-4-abef20ada03b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of using sock_kmalloc() to allocate an ip_options and then
immediately duplicate another ip_options to the newly allocated one in
ipv6_dup_options(), mptcp_copy_ip_options() and sctp_v4_copy_ip_options(),
the newly added sock_kmemdup() helper can be used to simplify the code.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/91ae749d66600ec6fb679e0e518fda6acb5c3e6f.1740735165.git.tanggeliang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the RX path refactor the mentioned function is expected to run
frequently, let's optimize it a bit.
Scan for ready subflow from the last processed one, and stop after
traversing the list once or reaching the msk memory limit - instead of
looking for dubious per-subflow conditions.
Also re-order the memory limit checks, to avoid duplicate tests.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-7-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the RX path refactor, it become a wrapper for sk_rmem_alloc
access, with a slightly misleading name. Just drop it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-6-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the custom fwd mem allocations for rmem.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-4-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit c2e6048fa1 ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.
We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.
This will allow more cleanup in the next patch.
Some changes are worth specific mention:
The msk rcvbuf update now always happens under both the msk and the
subflow socket lock: we can drop a bunch of ONCE annotation and
consolidate the checks.
When the skbs move is delayed at msk release callback time, even the
msk rcvbuf update is delayed; additionally take care of such action in
__mptcp_move_skbs().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-3-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we will move the whole RX path under the msk socket lock, updating
the already queued skb for passive fastopen socket at 3rd ack time will
be extremely painful and race prone
The map_seq for already enqueued skbs is used only to allow correct
coalescing with later data; preventing collapsing to the first skb of
a fastopen connect we can completely remove the
__mptcp_fastopen_gen_msk_ackseq() helper.
Before dropping this helper, a new item had to be added to the
mptcp_skb_cb structure. Because this item will be frequently tested in
the fast path -- almost on every packet -- and because there is free
space there, a single byte is used instead of a bitfield. This micro
optimisation slightly reduces the number of CPU operations to do the
associated check.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-2-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
init_dummy_netdev() can initialize statically declared or embedded
net_devices. Such netdevs did not come from alloc_netdev_mqs().
After recent work by Breno, there are the only two cases where
we have do that.
Switch those cases to alloc_netdev_mqs() and delete init_dummy_netdev().
Dealing with static netdevs is not worth the maintenance burden.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the skb size after coalescing is only limited by the skb
layout (the skb must not carry frag_list). A single coalesced skb
covering several MSS can potentially fill completely the receive
buffer. In such a case, the snd win will zero until the receive buffer
will be empty again, affecting tput badly.
Fixes: 8268ed4c9d ("mptcp: introduce and use mptcp_try_coalesce()")
Cc: stable@vger.kernel.org # please delay 2 weeks after 6.13-final release
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-3-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under some corner cases the MPTCP protocol can end-up invoking
mptcp_cleanup_rbuf() when no data has been copied, but such helper
assumes the opposite condition.
Explicitly drop such assumption and performs the costly call only
when strictly needed - before releasing the msk socket lock.
Fixes: fd8976790a ("mptcp: be careful on MPTCP-level ack.")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-2-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the recvmsg() blocks after receiving some data - i.e. due to
SO_RCVLOWAT - the MPTCP code will attempt multiple times to
adjust the receive buffer size, wrongly accounting every time the
cumulative of received data - instead of accounting only for the
delta.
Address the issue moving mptcp_rcv_space_adjust just after the
data reception and passing it only the just received bytes.
This also removes an unneeded difference between the TCP and MPTCP
RX code path implementation.
Fixes: 5813022985 ("mptcp: error out earlier on disconnect")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-1-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In 'mptcp_reset_tout_timer', promote 'probe_timestamp' to unsigned long
to avoid possible integer overflow. Compile tested only.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Dmitry Kandybka <d.kandybka@gmail.com>
Link: https://patch.msgid.link/20241107103657.1560536-1-d.kandybka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Additional active subflows - i.e. created by the in kernel path
manager - are included into the subflow list before starting the
3whs.
A racing recvmsg() spooling data received on an already established
subflow would unconditionally call tcp_cleanup_rbuf() on all the
current subflows, potentially hitting a divide by zero error on
the newly created ones.
Explicitly check that the subflow is in a suitable state before
invoking tcp_cleanup_rbuf().
Fixes: c76c695656 ("mptcp: call tcp_cleanup_rbuf on subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MPTCP path manager event handler mptcp_pm_connection_closed
interface has been added in the commit 1b1c7a0ef7 ("mptcp: Add path
manager interface") but it was an empty function from then on.
With such name, it sounds good to invoke mptcp_event with the
MPTCP_EVENT_CLOSED event type from it. It also removes a bit of
duplicated code.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241021-net-next-mptcp-misc-6-13-v1-3-1ef02746504a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We introduce the same handling for potential data races with the
'fully_established' flag in subflow as previously done for
msk->fully_established.
Additionally, we make a crucial change: convert the subflow's
'fully_established' from 'bit_field' to 'bool' type. This is
necessary because methods for avoiding data races don't work well
with 'bit_field'. Specifically, the 'READ_ONCE' needs to know
the size of the variable being accessed, which is not supported in
'bit_field'. Also, 'test_bit' expect the address of 'bit_field'.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/516
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241021-net-next-mptcp-misc-6-13-v1-2-1ef02746504a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bugged peer implementation can send corrupted DSS options, consistently
hitting a few warning in the data path. Use DEBUG_NET assertions, to
avoid the splat on some builds and handle consistently the error, dumping
related MIBs and performing fallback and/or reset according to the
subflow type.
Fixes: 6771bfd9ee ("mptcp: update mptcp ack sequence from work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
delack timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice : inet_csk_clear_xmit_timer()
is often called from another cpu. Calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_delack_timer() is called
at the timer expiration, while there is no ACK to transmit.
This can be detected very early, avoiding the socket spinlock.
Notes:
- test about tp->compressed_ack is racy,
but in the unlikely case there is a race, the dedicated
compressed_ack_timer hrtimer would close it.
- Even if the fast path is not taken, reading
icsk->icsk_ack.pending and tp->compressed_ack
before acquiring the socket spinlock reduces
acquisition time and chances of contention.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An MPTCP firewall blackhole can be detected if the following SYN
retransmission after a fallback to "plain" TCP is accepted.
In case of blackhole, a similar technique to the one in place with TFO
is now used: MPTCP can be disabled for a certain period of time, 1h by
default. This time period will grow exponentially when more blackhole
issues get detected right after MPTCP is re-enabled and will reset to
the initial value when the blackhole issue goes away.
The blackhole period can be modified thanks to a new sysctl knob:
blackhole_timeout. Two new MIB counters help understanding what's
happening:
- 'Blackhole', incremented when a blackhole is detected.
- 'MPCapableSYNTXDisabled', incremented when an MPTCP connection
directly falls back to TCP during the blackhole period.
Because the technique is inspired by the one used by TFO, an important
part of the new code is similar to what can find in tcp_fastopen.c, with
some adaptations to the MPTCP case.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/57
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-3-da7ebb4cd2a3@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial subflow might have already been closed, but still in the
connection list. When the worker is instructed to close the subflows
that have been marked as closed, it might then try to close the initial
subflow again.
A consequence of that is that the SUB_CLOSED event can be seen twice:
# ip mptcp endpoint
1.1.1.1 id 1 subflow dev eth0
2.2.2.2 id 2 subflow dev eth1
# ip mptcp monitor &
[ CREATED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ ESTABLISHED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_ESTABLISHED] remid=0 locid=2 saddr4=2.2.2.2 daddr4=9.9.9.9
# ip mptcp endpoint delete id 1
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
[ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9
The first one is coming from mptcp_pm_nl_rm_subflow_received(), and the
second one from __mptcp_close_subflow().
To avoid doing the post-closed processing twice, the subflow is now
marked as closed the first time.
Note that it is not enough to check if we are dealing with the first
subflow and check its sk_state: the subflow might have been reset or
closed before calling mptcp_close_ssk().
Fixes: b911c97c7d ("mptcp: add netlink event support")
Cc: stable@vger.kernel.org
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pr_debug() have been added in various places in MPTCP code to help
developers to debug some situations. With the dynamic debug feature, it
is easy to enable all or some of them, and asks users to reproduce
issues with extra debug.
Many of these pr_debug() don't end with a new line, while no 'pr_cont()'
are used in MPTCP code. So the goal was not to display multiple debug
messages on one line: they were then not missing the '\n' on purpose.
Not having the new line at the end causes these messages to be printed
with a delay, when something else needs to be printed. This issue is not
visible when many messages need to be printed, but it is annoying and
confusing when only specific messages are expected, e.g.
# echo "func mptcp_pm_add_addr_echoed +fmp" \
> /sys/kernel/debug/dynamic_debug/control
# ./mptcp_join.sh "signal address"; \
echo "$(awk '{print $1}' /proc/uptime) - end"; \
sleep 5s; \
echo "$(awk '{print $1}' /proc/uptime) - restart"; \
./mptcp_join.sh "signal address"
013 signal address
(...)
10.75 - end
15.76 - restart
013 signal address
[ 10.367935] mptcp:mptcp_pm_add_addr_echoed: MPTCP: msk=(...)
(...)
=> a delay of 5 seconds: printed with a 10.36 ts, but after 'restart'
which was printed at the 15.76 ts.
The 'Fixes' tag here below points to the first pr_debug() used without
'\n' in net/mptcp. This patch could be split in many small ones, with
different Fixes tag, but it doesn't seem worth it, because it is easy to
re-generate this patch with this simple 'sed' command:
git grep -l pr_debug -- net/mptcp |
xargs sed -i "s/\(pr_debug(\".*[^n]\)\(\"[,)]\)/\1\\\n\2/g"
So in case of conflicts, simply drop the modifications, and launch this
command.
Fixes: f870fa0b57 ("mptcp: Add MPTCP socket stubs")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-4-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'mptcp_subflow_context' structure has two items related to the
backup flags:
- 'backup': the subflow has been marked as backup by the other peer
- 'request_bkup': the backup flag has been set by the host
Looking only at the 'backup' flag can make sense in some cases, but it
is not the behaviour of the default packet scheduler when selecting
paths.
As explained in the commit b6a66e521a ("mptcp: sched: check both
directions for backup"), the packet scheduler should look at both flags,
because that was the behaviour from the beginning: the 'backup' flag was
set by accident instead of the 'request_bkup' one. Now that the latter
has been fixed, get_retrans() needs to be adapted as well.
Fixes: b6a66e521a ("mptcp: sched: check both directions for backup")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-3-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a peer decides to close one subflow in the middle of a connection
having multiple subflows, the receiver of the first FIN should accept
that, and close the subflow on its side as well. If not, the subflow
will stay half closed, and would even continue to be used until the end
of the MPTCP connection or a reset from the network.
The issue has not been seen before, probably because the in-kernel
path-manager always sends a RM_ADDR before closing the subflow. Upon the
reception of this RM_ADDR, the other peer will initiate the closure on
its side as well. On the other hand, if the RM_ADDR is lost, or if the
path-manager of the other peer only closes the subflow without sending a
RM_ADDR, the subflow would switch to TCP_CLOSE_WAIT, but that's it,
leaving the subflow half-closed.
So now, when the subflow switches to the TCP_CLOSE_WAIT state, and if
the MPTCP connection has not been closed before with a DATA_FIN, the
kernel owning the subflow schedules its worker to initiate the closure
on its side as well.
This issue can be easily reproduced with packetdrill, as visible in [1],
by creating an additional subflow, injecting a FIN+ACK before sending
the DATA_FIN, and expecting a FIN+ACK in return.
Fixes: 40947e1399 ("mptcp: schedule worker when subflow is closed")
Cc: stable@vger.kernel.org
Link: https://github.com/multipath-tcp/packetdrill/pull/154 [1]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-1-905199fe1172@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction, the mentioned MIB accounted for the wrong
event: wake-up being skipped as not-needed on some edge condition
instead of incoming skb being dropped after landing in the (subflow)
receive queue.
Move the increment in the correct location.
Fixes: ce599c5163 ("mptcp: properly account bulk freed memory")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The 'mptcp_subflow_context' structure has two items related to the
backup flags:
- 'backup': the subflow has been marked as backup by the other peer
- 'request_bkup': the backup flag has been set by the host
Before this patch, the scheduler was only looking at the 'backup' flag.
That can make sense in some cases, but it looks like that's not what we
wanted for the general use, because either the path-manager was setting
both of them when sending an MP_PRIO, or the receiver was duplicating
the 'backup' flag in the subflow request.
Note that the use of these two flags in the path-manager are going to be
fixed in the next commits, but this change here is needed not to modify
the behaviour.
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This is strictly related to commit fb7a0d3348 ("mptcp: ensure snd_nxt
is properly initialized on connect"). It turns out that syzkaller can
trigger the retransmit after fallback and before processing any other
incoming packet - so that snd_una is still left uninitialized.
Address the issue explicitly initializing snd_una together with snd_nxt
and write_seq.
Suggested-by: Mat Martineau <martineau@kernel.org>
Fixes: 8fd738049a ("mptcp: fallback in case of simultaneous connect")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-1-1ab9ddfa3d00@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
491aee894a ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391 ("net: use unrcu_pointer() helper")
b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We used to call it 'master' socket at the early stages of MPTCP
development, but the correct wording is 'MPTCP' socket opposed to 'TCP
subflows': convert the last 3 comments to use a more appropriate term.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As a wrapper of __tcp_space_from_win(), this patch adds a MPTCP dedicated
space_from_win helper mptcp_space_from_win() in protocol.h to paired with
mptcp_win_from_space().
Use it instead of __tcp_space_from_win() in both mptcp_rcv_space_adjust()
and mptcp_set_rcvlowat().
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The MPTCP dedicated win_from_space helper mptcp_win_from_space() is defined
in protocol.h, use it in mptcp_rcv_space_adjust() instead of using the TCP
one. Here scaling_ratio is the same as msk->scaling_ratio.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Like previous patch does in TCP, we need to adhere to RFC 1213:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
So let's consider CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) Only if we change the state to ESTABLISHED.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Fixes: d9cd27b8cd ("mptcp: add CurrEstab MIB counter support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+
4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8
UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s
i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq
6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV
t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9
8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5
c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe
0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj
qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ
4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x
xLq2iq+ehw==
=S6Xt
-----END PGP SIGNATURE-----
Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept
requests.
This is very similar to previous work that enabled the same hint for
doing receives on sockets. By far the majority of the work here is
refactoring to enable the networking side to pass back whether or not
the socket had more pending requests after accepting the current one,
the last patch just wires it up for io_uring.
Not only does this enable applications to know whether there are more
connections to accept right now, it also enables smarter logic for
io_uring multishot accept on whether to retry immediately or wait for
a poll trigger"
* tag 'net-accept-more-20240515' of git://git.kernel.dk/linux:
io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept
net: pass back whether socket was empty post accept
net: have do_accept() take a struct proto_accept_arg argument
net: change proto and proto_ops accept type
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors. The safe replacement is strscpy() [1].
This is in preparation of a possible future step where all strcpy() uses
will be removed in favour of strscpy() [2].
This fixes CheckPatch warnings:
WARNING: Prefer strscpy over strcpy
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1]
Link: https://github.com/KSPP/linux/issues/88 [2]
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20240514011335.176158-6-martineau@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rather than pass in flags, error pointer, and whether this is a kernel
invocation or not, add a struct proto_accept_arg struct as the argument.
This then holds all of these arguments, and prepares accept for being
able to pass back more information.
No functional changes in this patch.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cross-merge networking fixes after downstream PR.
Conflicts:
include/linux/filter.h
kernel/bpf/core.c
66e13b615a ("bpf: verifier: prevent userspace memory access")
d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>