JIRA: https://issues.redhat.com/browse/RHEL-65787
Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag:
add support for TIME_WAIT sockets to tcp_abort()") and out-of-order
backport of bac76cf89816 ("tcp: fix forever orphan socket caused by
tcp_abort")
commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:55 2023 +0000
bpf: Add bpf_sock_destroy kfunc
The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.
The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, `tracing_iter_filter` callback filter is added so
that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.
The kfunc takes `sock_common` type argument, even though it expects, and
casts them to a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.
Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
eg. getting a sk pointer (may be even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit 2242fd537fab52d5f4d2fbb1845f047c01fad0cf
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date: Fri Jan 12 11:05:29 2024 -0800
bpf: Avoid iter->offset making backward progress in bpf_iter_udp
There is a bug in the bpf_iter_udp_batch() function that stops
the userspace from making forward progress.
The case that triggers the bug is the userspace passed in
a very small read buffer. When the bpf prog does bpf_seq_printf,
the userspace read buffer is not enough to capture the whole bucket.
When the read buffer is not large enough, the kernel will remember
the offset of the bucket in iter->offset such that the next userspace
read() can continue from where it left off.
The kernel will skip the number (== "iter->offset") of sockets in
the next read(). However, the code directly decrements the
"--iter->offset". This is incorrect because the next read() may
not consume the whole bucket either and then the next-next read()
will start from offset 0. The net effect is the userspace will
keep reading from the beginning of a bucket and the process will
never finish. "iter->offset" must always go forward until the
whole bucket is consumed.
This patch fixes it by using a local variable "resume_offset"
and "resume_bucket". "iter->offset" is always reset to 0 before
it may be used. "iter->offset" will be advanced to the
"resume_offset" when it continues from the "resume_bucket" (i.e.
"state->bucket == resume_bucket"). This brings it closer to
the bpf_iter_tcp's offset handling which does not suffer
the same bug.
Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit 19ca0823f6eaad01d18f664a00550abe912c034c
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date: Fri Jan 12 11:05:28 2024 -0800
bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
The current logic is to use a default size 16 to batch the whole bucket.
If it is too small, it will retry with a larger batch size.
The current code accidentally does a state->bucket-- before retrying.
This goes back to retry with the previous bucket which has already
been done. This patch fixed it.
It is hard to create a selftest. I added a WARN_ON(state->bucket < 0),
forced a particular port to be hashed to the first bucket,
created >16 sockets, and observed the for-loop went back
to the "-1" bucket.
Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit c96dac8d369ffd713a45f4e5c30f23c47a1671f0
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:53 2023 +0000
bpf: udp: Implement batching for sockets iterator
Batch UDP sockets from BPF iterator that allows for overlapping locking
semantics in BPF/kernel helpers executed in BPF programs. This facilitates
BPF socket destroy kfunc (introduced by follow-up patches) to execute from
BPF iterator programs.
Previously, BPF iterators acquired the sock lock and sockets hash table
bucket lock while executing BPF programs. This prevented BPF helpers that
again acquire these locks to be executed from BPF iterators. With the
batching approach, we acquire a bucket lock, batch all the bucket sockets,
and then release the bucket lock. This enables BPF or kernel helpers to
skip sock locking when invoked in the supported BPF contexts.
The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-6-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit e4fe1bf13e09019578b9b93b942fff3d76ed5793
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:52 2023 +0000
udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
This is a preparatory commit to remove the field. The field was
previously shared between proc fs and BPF UDP socket iterators. As the
follow-up commits will decouple the implementation for the iterators,
remove the field. As for BPF socket iterator, filtering of sockets is
exepected to be done in BPF programs.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-5-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit 7625d2e9741c1f6e08ee79c28a1e27bbb5071805
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:51 2023 +0000
bpf: udp: Encapsulate logic to get udp table
This is a preparatory commit that encapsulates the logic
to get udp table in iterator inside udp_get_table_afinfo, and
renames the function to `udp_get_table_seq` accordingly.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-4-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit f44b1c515833c59701c86f92d47b4edd478fb0f3
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date: Fri May 19 22:51:50 2023 +0000
udp: seq_file: Helper function to match socket attributes
This is a preparatory commit to refactor code that matches socket
attributes in iterators to a helper function, and use it in the
proc fs iterator.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-3-aditi.ghag@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit ba6aac1516779dd0ced22c136a2c2c4a9c70cf29
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon Nov 14 13:57:56 2022 -0800
udp: Access &udp_table via net.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use udp_table directly in most places.
Instead, access it via net->ipv4.udp_table.
The access will be valid only while initialising udp_table
itself and creating/destroying each netns.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
commit 478aee5d6bf617c932f4e9c2981f17e86e093fc5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon Nov 14 13:57:55 2022 -0800
udp: Set NULL to udp_seq_afinfo.udp_table.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global udp_seq_afinfo.udp_table
to fetch a UDP hash table.
Instead, set NULL to udp_seq_afinfo.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need udp_seq_afinfo.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
Conflicts: Context difference due to already backported
7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().")
commit 67fb43308f4b354f13aabcc66dd5d99bfbb7e838
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon Nov 14 13:57:54 2022 -0800
udp: Set NULL to sk->sk_prot->h.udp_table.
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.
Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need sk->sk_prot->h.udp_table for UDP LITE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-65787
Conflicts: Context difference due to already backported
7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().")
commit 919dfa0b20ae56060dce0436eb710717f8987d18
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Mon Nov 14 13:57:53 2022 -0800
udp: Clean up some functions.
This patch adds no functional change and cleans up some functions
that the following patches touch around so that we make them tidy
and easy to review/revert. The change is mainly to keep reverse
christmas tree order.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765
JIRA: https://issues.redhat.com/browse/RHEL-48648
Various visibility improvements; mainly around drop reasons, reset reason and improved tracepoints this time.
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git
commit fc0cc9248843b37243fa5fd3287a121ec41d291f
Author: Yan Zhai <yan@cloudflare.com>
Date: Mon Jun 17 11:09:24 2024 -0700
udp: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011751.NpVN0sSk-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
commit e9669a00bba79442dd4862c57761333d6a020c24
Author: Balazs Scheidler <bazsi77@gmail.com>
Date: Tue Mar 26 19:05:47 2024 +0100
net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
and destination IP/port whereas this information can be critical in case
of UDP/syslog.
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/0c8b3e33dbf679e190be6f4c6736603a76988a20.1711475011.git.balazs.scheidler@axoflow.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-19729
Upstream Status: net.git
commit 3d010c8031e39f5fa1e8b13ada77e0321091011f
Author: Antoine Tenart <atenart@kernel.org>
Date: Tue Mar 26 12:33:58 2024 +0100
udp: do not accept non-tunnel GSO skbs landing in a tunnel
When rx-udp-gro-forwarding is enabled UDP packets might be GROed when
being forwarded. If such packets might land in a tunnel this can cause
various issues and udp_gro_receive makes sure this isn't the case by
looking for a matching socket. This is performed in
udp4/6_gro_lookup_skb but only in the current netns. This is an issue
with tunneled packets when the endpoint is in another netns. In such
cases the packets will be GROed at the UDP level, which leads to various
issues later on. The same thing can happen with rx-gro-list.
We saw this with geneve packets being GROed at the UDP level. In such
case gso_size is set; later the packet goes through the geneve rx path,
the geneve header is pulled, the offset are adjusted and frag_list skbs
are not adjusted with regard to geneve. When those skbs hit
skb_fragment, it will misbehave. Different outcomes are possible
depending on what the GROed skbs look like; from corrupted packets to
kernel crashes.
One example is a BUG_ON[1] triggered in skb_segment while processing the
frag_list. Because gso_size is wrong (geneve header was pulled)
skb_segment thinks there is "geneve header size" of data in frag_list,
although it's in fact the next packet. The BUG_ON itself has nothing to
do with the issue. This is only one of the potential issues.
Looking up for a matching socket in udp_gro_receive is fragile: the
lookup could be extended to all netns (not speaking about performances)
but nothing prevents those packets from being modified in between and we
could still not find a matching socket. It's OK to keep the current
logic there as it should cover most cases but we also need to make sure
we handle tunnel packets being GROed too early.
This is done by extending the checks in udp_unexpected_gso: GSO packets
lacking the SKB_GSO_UDP_TUNNEL/_CSUM bits and landing in a tunnel must
be segmented.
[1] kernel BUG at net/core/skbuff.c:4408!
RIP: 0010:skb_segment+0xd2a/0xf70
__udp_gso_segment+0xaa/0x560
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Fixes: 36707061d6 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
559260fd9d9a ("ipmr: do not acquire mrt_lock in
ioctl(SIOCGETVIFCNT)"). I also pulled in header changes from commit
949d6b405e61 ("net: add missing includes and forward declarations
under net/") to address a build failure with this patch applied.
commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date: Fri Jun 9 08:27:42 2023 -0700
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace
argument (arg). Usually doing get_user()/put_user() directly in the
ioctl callback. This is not flexible, because it is hard to reuse these
functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and
operate on kernel buffers, i.e., all protocol's ioctl callbacks is
adapted to operate on a kernel memory other than on userspace (so, no
more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd,
- unsigned long arg);
+ int *karg);
(Important to say that this patch does not touch the "struct proto_ops"
protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a
pointer allocated to kernel space memory (inside a function wrapper).
This buffer (karg) may contain input argument (copied from userspace in
a prep function) and it might return a value/buffer, which is copied
back to userspace if necessary. There is not one-size-fits-all format
(that is I am using 'may' above), but basically, there are three type of
ioctls:
1) Do not read from userspace, returns a result to userspace
2) Read an input parameter from userspace, and does not return anything
to userspace
3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is
returned to userspace) encompasses more than 90% of the cases, but there
are two other exceptions. Here is a list of exceptions:
* Protocol RAW:
* cmd = SIOCGETVIFCNT:
* input and output = struct sioc_vif_req
* cmd = SIOCGETSGCNT
* input and output = struct sioc_sg_req
* Explanation: for the SIOCGETVIFCNT case, userspace passes the input
argument, which is struct sioc_vif_req. Then the callback populates
the struct, which is copied back to userspace.
* Protocol RAW6:
* cmd = SIOCGETMIFCNT_IN6
* input and output = struct sioc_mif_req6
* cmd = SIOCGETSGCNT_IN6
* input and output = struct sioc_sg_req6
* Protocol PHONET:
* cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
* input int (4 bytes)
* Nothing is copied back to userspace.
For the exception cases, functions sock_sk_ioctl_inout() will
copy the userspace input, and copy it back to kernel space.
The wrapper that prepare the buffer and put the buffer back to user is
sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now
calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14356
Tested: LNST, Tier1
Upstream commit:
commit f0ea27e7bfe1c34e1f451a63eb68faa1d4c3a86d
Author: Lorenz Bauer <lmb@isovalent.com>
Date: Thu Jul 20 17:30:05 2023 +0200
udp: re-score reuseport groups when connected sockets are present
Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
sockets. To support these properly we remember whether a group has
a connected socket and skip the fast reuseport early-return. In
effect we continue scoring all reuseport sockets and then choose the
one with the highest score.
The current code fails to re-calculate the score for the result of
lookup_reuseport. According to Kuniyuki Iwashima:
1) SO_INCOMING_CPU is set
-> selected sk might have +1 score
2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
-> selected sk will have more than 8
Using the old score could trigger more lookups depending on the
order that sockets are created.
sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
| |
`-> select the next SO_INCOMING_CPU sk
|
`-> select itself (We should save this lookup)
Fixes: efc6b6f6c3 ("udp: Improve load balancing for SO_REUSEPORT.")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-1-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Conflicts:
drivers/gpu/drm/tests/drm_buddy_test.c
drivers/gpu/drm/tests/drm_mm_test.c - We already have
ce28ab1380e8 ("drm/tests: Add back seed value information")
so keep calls to kunit_info.
drop changes to drivers/misc/habanalabs/gaudi2/gaudi2.c
fs/ntfs3/fslog.c - files not in CS9
net/sunrpc/auth_gss/gss_krb5_wrap.c - We already have
7f675ca7757b ("SUNRPC: Improve Kerberos confounder generation")
so code to change is gone.
drivers/gpu/drm/i915/i915_gem_gtt.c
drivers/gpu/drm/i915/selftests/i915_selftest.c
drivers/gpu/drm/tests/drm_buddy_test.c
drivers/gpu/drm/tests/drm_mm_test.c
change added under
4cb818386e ("Merge DRM changes from upstream v6.0.8..v6.1")
JIRA: https://issues.redhat.com/browse/RHEL-1848
commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date: Wed Oct 5 17:43:22 2022 +0200
treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbol
t
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12679
commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jun 8 19:17:37 2023 +0000
net: move gso declarations and functions to their own files
Move declarations into include/net/gso.h and code into net/core/gso.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Context difference due to missing ec095263a965 ("net:
remove noblock parameter from recvmsg() entities") and db39dfdc1c3b
("udp: Use WARN_ON_ONCE() in udp_read_skb()"); 31f1fbcb346c ("udp:
Refactor udp_read_skb()") was adapted to reflect this
- net/vmw_vsock/virtio_transport_common.c: Skipped, because the relevant
code is not there, missing 634f1a7110b4 ("vsock: support sockmap")
commit 78fa0d61d97a728d306b0c23d353c0e340756437
Author: John Fastabend <john.fastabend@gmail.com>
Date: Mon May 22 19:56:05 2023 -0700
bpf, sockmap: Pass skb ownership through read_skb
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
skb_linearize()
__pskb_pull_tail()
pskb_expand_head()
BUG_ON(skb_shared(skb))
Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.
To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
[ 106.536188] ------------[ cut here ]------------
[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.542255] Call Trace:
[ 106.542383] <IRQ>
[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
[ 106.542681] skb_ensure_writable+0x85/0xa0
[ 106.542882] sk_skb_pull_data+0x18/0x20
[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[ 106.543536] ? migrate_disable+0x66/0x80
[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
[ 106.544561] tcp_read_skb+0x7b/0x120
[ 106.544740] tcp_data_queue+0x904/0xee0
[ 106.544931] tcp_rcv_established+0x212/0x7c0
[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
[ 106.545326] tcp_v4_rcv+0xe70/0xf60
[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
[ 106.545744] ip_local_deliver_finish+0xa7/0x150
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Code differece due to missing ec095263a965 ("net: remove
noblock parameter from recvmsg() entities"): keep the existing parameters
to skb_recv_udp(); and missing db39dfdc1c3b ("udp: Use WARN_ON_ONCE() in
udp_read_skb()"): keep WARN_ON
commit 31f1fbcb346c9342f6860c322b3f33b2acbc640b
Author: Peilin Ye <peilin.ye@bytedance.com>
Date: Thu Sep 22 21:59:13 2022 -0700
udp: Refactor udp_read_skb()
Delete the unnecessary while loop in udp_read_skb() for readability.
Additionally, since recv_actor() cannot return a value greater than
skb->len (see sk_psock_verdict_recv()), remove the redundant check.
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Link: https://lore.kernel.org/r/343b5d8090a3eb764068e9f1d392939e2b423747.1663909008.git.peilin.ye@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2193330
Upstream Status: net.git commit 91b6d3256356
commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Nov 15 11:02:39 2021 -0800
net: cache align tcp_memory_allocated, tcp_sockets_allocated
tcp_memory_allocated and tcp_sockets_allocated often share
a common cache line, source of false sharing.
Also take care of udp_memory_allocated and mptcp_sockets_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Conflicts: include/linux/net.h - upstream there was a conflict between
SOCK_CUSTOM_SOCKOPT and SOCK_SUPPORT_ZC. There, it was resolved
with the former getting defined as 6, and the latter as 5. However,
in the RHEL backport of a5ef058dc4d9 ("net: introduce and use custom
sockopt socket flag"), 5 was chosen for SOCK_CUSTOM_SOCKOPT. I
could renumber it to 6 to match upstream, but that risks introducing
unnecessary incompatibilities for 3rd party modules, so I opted to
differ from upstream. net/ipv4/udp.c - RHEL has a backport of
commit 8a3854c7b8e4 ("udp: track the forward memory release
threshold in an hot cacheline") out of order with this commit. It's
a simple fixup.
commit e993ffe3da4bcddea0536b03be1031bf35cd8d85
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Fri Oct 21 11:16:39 2022 +0100
net: flag sockets supporting msghdr originated zerocopy
We need an efficient way in io_uring to check whether a socket supports
zerocopy with msghdr provided ubuf_info. Add a new flag into the struct
socket flags fields.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2166482
Tested: vs bz reproducer
Conflicts: different context in inet_csk_get_port and udp_lib_get_port,\
as rhel-9 lacks the upstream commit 08eaef904031 ("tcp: Clean up \
some functions.") and upstream commit 919dfa0b20ae ("udp: Clean up \
some functions.")
Upstream commit:
commit 7a7160edf1bfde25422262fb26851cef65f695d3
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri Nov 18 10:25:06 2022 -0800
net: Return errno in sk->sk_prot->get_port().
We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
fails, so some ->get_port() functions return just 1 on failure and the
callers return -EADDRINUSE instead.
However, mptcp_get_port() can return -EINVAL. Let's not ignore the error.
Note the only exception is inet_autobind(), all of whose callers return
-EAGAIN instead.
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2137876
commit 57452d767feaeab405de3bff0d240c3ac84bfe0d
Author: Cong Wang <cong.wang@bytedance.com>
Date: Wed Jun 15 09:20:13 2022 -0700
skmsg: Get rid of skb_clone()
With ->read_skb() now we have an entire skb dequeued from
receive queue, now we just need to grab an addtional refcnt
before passing its ownership to recv actors.
And we should not touch them any more, particularly for
skb->sk. Fortunately, skb->sk is already set for most of
the protocols except UDP where skb->sk has been stolen,
so we have to fix it up for UDP case.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2137876
Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")
commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date: Wed Jun 15 09:20:12 2022 -0700
net: Introduce a new proto_ops ->read_skb()
Currently both splice() and sockmap use ->read_sock() to
read skb from receive queue, but for sockmap we only read
one entire skb at a time, so ->read_sock() is too conservative
to use. Introduce a new proto_ops ->read_skb() which supports
this sematic, with this we can finally pass the ownership of
skb to recv actors.
For non-TCP protocols, all ->read_sock() can be simply
converted to ->read_skb().
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
commit eda090c31fe923ab9463b884469744ec903ab0cc
Author: Eric Dumazet <edumazet@google.com>
Date: Fri May 13 11:55:50 2022 -0700
inet: rename INET_MATCH()
This is no longer a macro, but an inlined function.
INET_MATCH() -> inet_match()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
commit 4915d50e300e96929d2462041d6f6c6f061167fd
Author: Eric Dumazet <edumazet@google.com>
Date: Thu May 12 09:56:01 2022 -0700
inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()
INET_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET_MATCH() an inline function
to ease our changes.
v2:
We remove the 32bit version of it, as modern compilers
should generate the same code really, no need to
try to be smarter.
Also make 'struct net *net' the first argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test
This series improves UDP protocol RX tput, to keep it on equal footing with rhel-8 one.
Patches 1,3,4 are there just to reduces the conflicts, and patch 4 is a very partial
backport, to avoid pulling unrelated features.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0defbb0af775
Conflicts:
- net/core/sock.c: context mismatch because of missing backport of
upstream commit f20cfd662a62 ("net: add sanity check in proto_register()")
commit 0defbb0af775ef037913786048d099bbe8b9a2c2
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jun 8 23:34:08 2022 -0700
net: add per_cpu_fw_alloc field to struct proto
Each protocol having a ->memory_allocated pointer gets a corresponding
per-cpu reserve, that following patches will use.
Instead of having reserved bytes per socket,
we want to have per-cpu reserves.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 100fdd1faf50
commit 100fdd1faf50557558e2911af4be32e515cb8036
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Jun 8 23:34:07 2022 -0700
net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFT
Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.
This might change in the future, but it seems better to avoid the
confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1
Conflicts: use lock_sock()/release_sock() instead of \
sockopt_lock_sock()/sockopt_release_sock as rhel-9 lacks the
upstream commit 24426654ed3a ("bpf: net: Avoid sk_setsockopt() taking
sk lock when called from bpf")\
Upstream commit:
commit 8a3854c7b8e4532063b14bed34115079b7d0cb36
Author: Paolo Abeni <pabeni@redhat.com>
Date: Thu Oct 20 19:48:52 2022 +0200
udp: track the forward memory release threshold in an hot cacheline
When the receiver process and the BH runs on different cores,
udp_rmem_release() experience a cache miss while accessing sk_rcvbuf,
as the latter shares the same cacheline with sk_forward_alloc, written
by the BH.
With this patch, UDP tracks the rcvbuf value and its update via custom
SOL_SOCKET socket options, and copies the forward memory threshold value
used by udp_rmem_release() in a different cacheline, already accessed by
the above function and uncontended.
Since the UDP socket init operation grown a bit, factor out the common
code between v4 and v6 in a shared helper.
Overall the above give a 10% peek throughput increase under UDP flood.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.
One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:
```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
swapper 0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
swapper 0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only
commit 69421bf98482d089e50799f45e48b25ce4a8d154
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri Oct 14 11:26:25 2022 -0700
udp: Update reuse->has_conns under reuseport_lock.
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel
could select a unconnected socket wrongly for packets sent to the
connected socket.
However, the current way to set has_conns is illegal and possible to
trigger that problem. reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater. Then,
it must do the update under the updater's lock, reuseport_lock, but
it doesn't for now.
For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection. To avoid the race, let's split
the reader and updater with proper locking.
cpu1 cpu2
+----+ +----+
__ip[46]_datagram_connect() reuseport_grow()
. .
|- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size)
| . |
| |- rcu_read_lock()
| |- reuse = rcu_dereference(sk->sk_reuseport_cb)
| |
| | | /* reuse->has_conns == 0 here */
| | |- more_reuse->has_conns = reuse->has_conns
| |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */
| | |
| | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
| | | more_reuse)
| `- rcu_read_unlock() `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of
reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array
and update its sk->sk_reuseport_cb as NULL without lock_sock().
shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but, reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) never be false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135958
Tested: compile only
commit 02a7cb2866dd6e3ac7645b594289e1c308b68c4e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu Jul 28 20:21:37 2022 -0700
udp: Remove redundant __udp_sysctl_init() call from udp_init().
__udp_sysctl_init() is called for init_net via udp_sysctl_ops.
While at it, we can rename __udp_sysctl_init() to udp_sysctl_init().
Fixes: 1e80295158 ("udp: Move the udp sysctl to namespace.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319
Upstream Status: net.git commit d38afeec26ed
commit d38afeec26ed4739c640bf286c270559aab2ba5f
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Thu Oct 6 11:53:47 2022 -0700
tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were
able to clean them up by calling inet6_destroy_sock() during the IPv6 ->
IPv4 conversion by IPV6_ADDRFORM. However, commit 03485f2adc ("udpv6:
Add lockless sendmsg() support") added a lockless memory allocation path,
which could cause a memory leak:
setsockopt(IPV6_ADDRFORM) sendmsg()
+-----------------------+ +-------+
- do_ipv6_setsockopt(sk, ...) - udpv6_sendmsg(sk, ...)
- sockopt_lock_sock(sk) ^._ called via udpv6_prot
- lock_sock(sk) before WRITE_ONCE()
- WRITE_ONCE(sk->sk_prot, &tcp_prot)
- inet6_destroy_sock() - if (!corkreq)
- sockopt_release_sock(sk) - ip6_make_skb(sk, ...)
- release_sock(sk) ^._ lockless fast path for
the non-corking case
- __ip6_append_data(sk, ...)
- ipv6_local_rxpmtu(sk, ...)
- xchg(&np->rxpmtu, skb)
^._ rxpmtu is never freed.
- goto out_no_dst;
- lock_sock(sk)
For now, rxpmtu is only the case, but not to miss the future change
and a similar bug fixed in commit e27326009a3d ("net: ping6: Fix
memleak in ipv6_renew_options()."), let's set a new function to IPv6
sk->sk_destruct() and call inet6_cleanup_sock() there. Since the
conversion does not change sk->sk_destruct(), we can guarantee that
we can clean up IPv6 resources finally.
We can now remove all inet6_destroy_sock() calls from IPv6 protocol
specific ->destroy() functions, but such changes are invasive to
backport. So they can be posted as a follow-up later for net-next.
Fixes: 03485f2adc ("udpv6: Add lockless sendmsg() support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hangbin Liu <haliu@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 08d4c0370c400fa6ef2194f9ee2e8dccc4a7ab39
Author: Menglong Dong <imagedong@tencent.com>
Date: Sat Feb 5 15:47:39 2022 +0800
net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb()
Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
commit 1379a92d38e31132e87d9b653e9343c7841a7348
Author: Menglong Dong <imagedong@tencent.com>
Date: Sat Feb 5 15:47:38 2022 +0800
net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb()
Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit a1554c002699cbc9ced2e9f44f9c1357181bead3
Author: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Date: Fri Nov 5 13:45:21 2021 -0700
include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h
nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
The advantage of this change is that it can reduce the obsolete
includes. For example, net/ipv4/tcp.c wouldn't need swap.h any more
since it has already included mm.h. Similarly, after checking all the
other files, it comes that tcp.c, udp.c meter.c ,... follow the same
rule, so these files can have swap.h removed too.
Moreover, after preprocessing all the files that use
nr_free_buffer_pages, it turns out that those files have already
included mm.h.Thus, we can move nr_free_buffer_pages from swap.h to mm.h
safely. This change will not affect the compilation of other files.
Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Cc: Jakub Kicinski <kuba@kernel.org>
CC: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620
commit 91a760b26926265a60c77ddf016529bcf3e17a04
Author: Menglong Dong <imagedong@tencent.com>
Date: Thu Jan 6 21:20:20 2022 +0800
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
if (err) {
inet->inet_saddr = inet->inet_rcv_saddr = 0;
goto out_release_sock;
}
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2069046
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
commit aef2feda97b840ec38e9fa53d0065188453304e8
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Dec 15 18:55:37 2021 -0800
add missing bpf-cgroup.h includes
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
make sure those who actually need more than the definition of
struct cgroup_bpf include bpf-cgroup.h explicitly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2069046
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
commit f89315650ba34ec6c91a8bded72796980bee2a4d
Author: Mark Pashmfouroush <markpash@cloudflare.com>
Date: Wed Nov 10 11:10:15 2021 +0000
bpf: Add ingress_ifindex to bpf_sk_lookup
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.
Add the ifindex of the arriving connection to the bpf_sk_lookup API.
Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
Signed-off-by: Artem Savkov <asavkov@redhat.com>