JIRA: https://issues.redhat.com/browse/RHEL-73723
Conflicts: the __netns_tracker_alloc interface was updated upstream in
b6d7c0eb2dcbd, but in RHEL the hunk for notrefcnt_tracker was not included
(see RHEL commit 3b0a87ad0e, RHEL-24101). We merge it in here. Also,
we've dropped the rds hunk, as rds seems unmaintained in RHEL and is missing
the path where that hunk should operate.
commit 0cafd77dcd032d1687efaba5598cf07bce85997f
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Oct 20 23:20:18 2022 +0000
net: add a refcount tracker for kernel sockets
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.
We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.
This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.
Normally, each layer is responsible for properly releasing its
kernel sockets before the last call to net_free().
This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y
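The tracking idea can be sketched in user space. This is a minimal model only, not the kernel ref_tracker API: struct tracker, struct tracker_dir, tracker_alloc(), tracker_free() and tracker_dir_print() are hypothetical names. Every reference is recorded in a directory, and anything still recorded when the owner is about to be freed is reported.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for one tracked reference. */
struct tracker {
	const char *owner;      /* who took the reference */
	struct tracker *next;
};

/* Hypothetical directory of references that are still outstanding. */
struct tracker_dir {
	struct tracker *head;
};

/* Record a new reference and return its handle (NULL on allocation failure). */
static struct tracker *tracker_alloc(struct tracker_dir *dir, const char *owner)
{
	struct tracker *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->owner = owner;
	t->next = dir->head;
	dir->head = t;
	return t;
}

/* Drop a reference: unlink its tracker from the directory and free it. */
static void tracker_free(struct tracker_dir *dir, struct tracker *t)
{
	struct tracker **pp;

	for (pp = &dir->head; *pp; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			free(t);
			return;
		}
	}
}

/* Report every outstanding reference; returns how many were found. */
static int tracker_dir_print(const struct tracker_dir *dir)
{
	const struct tracker *t;
	int leaked = 0;

	for (t = dir->head; t; t = t->next) {
		fprintf(stderr, "leaked reference held by %s\n", t->owner);
		leaked++;
	}
	return leaked;
}
```

Calling tracker_dir_print() right before the owning object is freed mirrors the placement described above: a non-zero result means some layer did not release its reference.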
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: RHEL is missing commit 1ded5e5a5931 ("net: annotate
data-races around sock->ops"), which accounts for the differences in
ops structure dereferencing.
commit 92ef0fd55ac80dfc2e4654edfe5d1ddfa6e070fe
Author: Jens Axboe <axboe@kernel.dk>
Date: Thu May 9 09:20:08 2024 -0600
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel
invocation or not, add a struct proto_accept_arg as the argument.
This then holds all of these arguments, and prepares accept for being
able to pass back more information.
No functional changes in this patch.
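The shape of the change can be sketched with a user-space model. The names below (struct accept_arg, accept_old(), accept_new(), the backlog parameter) are hypothetical and are not the kernel definitions; the sketch only shows loose parameters being folded into one struct that also carries the error back.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical argument bundle for an accept-style callback. */
struct accept_arg {
	int flags;   /* in: caller-supplied flags */
	int err;     /* out: 0 on success, negative errno on failure */
	bool kern;   /* in: true when the caller is in-kernel */
};

/* Old style: flags, an error out-pointer and the kern flag passed separately. */
static int accept_old(int backlog, int flags, int *err, bool kern)
{
	(void)flags;
	(void)kern;
	if (backlog <= 0) {
		*err = -EAGAIN;
		return -1;
	}
	*err = 0;
	return backlog - 1;   /* index of the accepted connection */
}

/* New style: same behaviour, all three values travel in one struct. */
static int accept_new(int backlog, struct accept_arg *arg)
{
	if (backlog <= 0) {
		arg->err = -EAGAIN;
		return -1;
	}
	arg->err = 0;
	return backlog - 1;   /* index of the accepted connection */
}
```

With the struct in place, a later change can add a field without touching every implementation's signature again.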
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-41185
Tested: compile only
commit eb44ad4e635132754bfbcb18103f1dcb7058aedd
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Sep 21 20:28:18 2023 +0000
net: annotate data-races around sk->sk_dst_pending_confirm
This field can be read or written without socket lock being held.
Add annotations to avoid load-store tearing.
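A minimal user-space sketch of the annotation pattern follows. The READ_ONCE()/WRITE_ONCE() macros here are simplified volatile-cast approximations of the kernel ones (they rely on the GCC/Clang __typeof__ extension), and struct fake_sock with its helpers is hypothetical.

```c
/* Simplified approximations of the kernel macros, for illustration only. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the socket field touched by the patch. */
struct fake_sock {
	unsigned char dst_pending_confirm;
};

/* Set the flag; the annotated load lets us skip the store when already set. */
static void fake_dst_confirm(struct fake_sock *sk)
{
	if (!READ_ONCE(sk->dst_pending_confirm))
		WRITE_ONCE(sk->dst_pending_confirm, 1);
}

/*
 * Consume the flag. This is not an atomic read-modify-write: the
 * annotations only force single, untorn accesses, which is all a
 * lockless hint like this one needs.
 */
static int fake_dst_consume_confirm(struct fake_sock *sk)
{
	if (!READ_ONCE(sk->dst_pending_confirm))
		return 0;
	WRITE_ONCE(sk->dst_pending_confirm, 0);
	return 1;
}
```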
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Xin Long <lxin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27755
Conflicts: RHEL does not include commit 1ded5e5a5931 ("net: annotate
data-races around sock->ops"), which converts proto_ops to a const
accessed with READ_ONCE. Fix up the patch to apply, but keep the
READ_ONCE from 1ded5e5a5931.
commit 0b05b0cd78c92371fdde6333d006f39eaf9e0860
Author: Breno Leitao <leitao@debian.org>
Date: Mon Oct 16 06:47:42 2023 -0700
net/socket: Break down __sys_getsockopt
Split __sys_getsockopt() into two functions by moving the core
logic into a sub-function (do_sock_getsockopt()). This will avoid
code duplication when doing the same operation in other callers, for
instance.
do_sock_getsockopt() will be called by io_uring getsockopt() command
operation in the following patch.
The same was done for the setsockopt pair.
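The refactoring pattern can be sketched as follows. All names (struct obj, table, do_obj_getopt(), sys_getopt(), ring_cmd_getopt()) are hypothetical; the point is only that the entry point resolves the descriptor while the core helper works on an already-resolved object, so a second caller can reuse it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical object table standing in for file-descriptor lookup. */
struct obj {
	int value;
};

static struct obj table[4] = { {10}, {20}, {30}, {40} };

/* Core logic: operates on a resolved object, knows nothing about fds. */
static int do_obj_getopt(const struct obj *o, int *out)
{
	if (!out)
		return -EFAULT;
	*out = o->value;
	return 0;
}

/* Syscall-style entry point: resolve the fd, then defer to the core helper. */
static int sys_getopt(int fd, int *out)
{
	if (fd < 0 || fd >= 4)
		return -EBADF;
	return do_obj_getopt(&table[fd], out);
}

/* A second caller that already holds the object reuses the same helper. */
static int ring_cmd_getopt(const struct obj *o, int *out)
{
	return do_obj_getopt(o, out);
}
```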
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-5-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4195
JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1
The issue is addressed by the last commit in the merge. The previous 2 changes are pre-requisites to avoid any conflict in the last one.
Technically neither of the pre-reqs is strictly needed: we could adapt the fix with moderate effort and limited risk. However, both pre-reqs are small enough and nice enough to be a valuable alternative. Additionally, the first one provides a new sysctl that will very likely be used or requested by the users affected by this issue.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit b192812905e4b134f7b7994b079eb647e9d2d37e
commit b192812905e4b134f7b7994b079eb647e9d2d37e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri Sep 1 17:27:08 2023 -0700
af_unix: Fix data race around sk->sk_err.
As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
read locklessly by unix_dgram_sendmsg().
Let's use READ_ONCE() for sk_err as well.
Note that the writer side is marked by commit cc04410af7de ("af_unix:
annotate lockless accesses to sk->sk_err").
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36217
commit b534dc46c8ae0165b1b2509be24dbea4fa9c4011
Author: Willem de Bruijn <willemb@google.com>
Date: Wed Dec 7 09:37:01 2022 -0500
net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP sockets
from write_seq instead of snd_una.
This should have been the behavior from the start. Because processes
may now exist that rely on the established behavior, do not change
behavior of the existing option, but add the right behavior with a new
flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.
Intuitively the contract is that the counter is zero after the
setsockopt, so that the next write N results in a notification for
the last byte N - 1.
On idle sockets snd_una == write_seq and this holds for both. But on
sockets with data in transmission, snd_una records the unacked offset
in the stream. This depends on the ACK response from the peer. A
process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
racy approach).
write_seq records the offset at the last byte written by the process.
This is a better starting point. It matches the intuitive contract in
all circumstances, unaffected by external behavior.
The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
Move the field in struct sock to avoid growing the socket (for some
common CONFIG variants). The UAPI interface so_timestamping.flags is
already int, so 32 bits wide.
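From user space, opting in could look like the sketch below. setsockopt() with SO_TIMESTAMPING and the SOF_TIMESTAMPING_* flags are the existing UAPI; the helper names tx_tstamp_flags() and enable_tx_tstamp_ids() are hypothetical, and the choice of software TX timestamps is an assumption made for the example. The socket is assumed to be an already connected TCP socket.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/*
 * Older UAPI headers lack the new flag. The header declares it as an enum
 * member, so #ifndef cannot detect it; defining a macro with the same
 * value (1 << 16, as used upstream) is harmless either way.
 */
#ifndef SOF_TIMESTAMPING_OPT_ID_TCP
#define SOF_TIMESTAMPING_OPT_ID_TCP (1 << 16)
#endif

/* Software TX timestamps whose IDs start counting from write_seq. */
static unsigned int tx_tstamp_flags(void)
{
	return SOF_TIMESTAMPING_TX_SOFTWARE |
	       SOF_TIMESTAMPING_SOFTWARE |
	       SOF_TIMESTAMPING_OPT_ID |
	       SOF_TIMESTAMPING_OPT_ID_TCP;
}

/* Apply the flags to a socket; returns 0, or -1 with errno set. */
static int enable_tx_tstamp_ids(int fd)
{
	unsigned int flags = tx_tstamp_flags();

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

On a kernel without this patch the call is expected to fail, so a caller may want to retry without SOF_TIMESTAMPING_OPT_ID_TCP.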
Reported-by: Sotirios Delimanolis <sotodel@meta.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1
Upstream commit:
commit 219160be496f7f9cd105c5708e37cf22ab4ce0c7
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Jun 10 20:30:16 2022 -0700
tcp: sk_forced_mem_schedule() optimization
sk_memory_allocated_add() has three callers, and returns
to them @memory_allocated.
sk_forced_mem_schedule() is one of them, and ignores
the returned value.
Change sk_memory_allocated_add() to return void.
Change sock_reserve_memory() and __sk_mem_raise_allocated()
to call sk_memory_allocated().
This removes one cache line miss [1] for RPC workloads,
as first skbs in TCP write queue and receive queue go through
sk_forced_mem_schedule().
[1] Cache line holding tcp_memory_allocated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-34070
Tested: LNST, Tier1
Conflicts: different context in sysctl_net_core.c, as rhel-9 lacks
the upstream series cb636b3e372b ("Merge branch 'use-standard-sysctl-macro'")
Upstream commit:
commit 12a686c2e761f1f1f6e6e2117a9ab9c6de2ac8a7
Author: Adam Li <adamli@os.amperecomputing.com>
Date: Mon Feb 26 02:24:52 2024 +0000
net: make SK_MEMORY_PCPU_RESERV tunable
This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file,
to make SK_MEMORY_PCPU_RESERV tunable.
Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
memory_allocated") introduced per-cpu forward alloc cache:
"Implement a per-cpu cache of +1/-1 MB, to reduce number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be cause of false sharing."
sk_prot->memory_allocated points to global atomic variable:
atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
If increasing the per-cpu cache size from 1MB to e.g. 16MB,
changes to sk->sk_prot->memory_allocated can be further reduced.
Performance may be improved on systems with many cores.
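The effect of a larger reserve can be modelled in user space. This is a single-"CPU" sketch under stated assumptions: struct mem_acct, mem_acct_add() and mem_acct_sub() are hypothetical names, amounts are in pages, and the real code uses per-cpu variables rather than a plain field.

```c
#include <stdatomic.h>

/* Hypothetical model of one CPU's view of the accounting. */
struct mem_acct {
	atomic_long global;    /* shared counter, costly to touch */
	long local;            /* this CPU's pending delta */
	long reserve;          /* tunable threshold (the sysctl in the patch) */
	long global_updates;   /* how often the shared counter was written */
};

/* Charge amt pages; spill into the shared counter once the reserve is hit. */
static void mem_acct_add(struct mem_acct *m, long amt)
{
	m->local += amt;
	if (m->local >= m->reserve) {
		atomic_fetch_add(&m->global, m->local);
		m->global_updates++;
		m->local = 0;
	}
}

/* Uncharge amt pages; spill once the local delta reaches -reserve. */
static void mem_acct_sub(struct mem_acct *m, long amt)
{
	m->local -= amt;
	if (m->local <= -m->reserve) {
		atomic_fetch_add(&m->global, m->local);
		m->global_updates++;
		m->local = 0;
	}
}
```

With a larger reserve the same sequence of charges touches the shared counter less often, at the cost of the shared value lagging further behind the true total.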
Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29306
commit 9f06f87fef689d28588cde8c7ebb00a67da34026
Author: Jakub Kicinski <kuba@kernel.org>
Date: Wed Apr 3 13:21:39 2024 -0700
net: skbuff: generalize the skb->decrypted bit
The ->decrypted bit can be reused for other crypto protocols.
Remove the direct dependency on TLS, add helpers to clean up
the ifdefs leaking out everywhere.
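A sketch of the helper approach, using hypothetical names throughout (PKT_DECRYPTED, struct pkt, pkt_is_decrypted(), pkt_copy_decrypted(), pkt_can_merge()): the conditional compilation lives inside two small accessors, so callers need no #ifdef of their own.

```c
#include <stdbool.h>

/* Hypothetical build switch; the flag only exists when it is defined. */
#define PKT_DECRYPTED 1

struct pkt {
	int len;
#ifdef PKT_DECRYPTED
	bool decrypted;
#endif
};

/* Read the flag, or report false when the feature is compiled out. */
static inline bool pkt_is_decrypted(const struct pkt *p)
{
#ifdef PKT_DECRYPTED
	return p->decrypted;
#else
	(void)p;
	return false;
#endif
}

/* Propagate the flag between packets; a no-op when compiled out. */
static inline void pkt_copy_decrypted(struct pkt *to, const struct pkt *from)
{
#ifdef PKT_DECRYPTED
	to->decrypted = from->decrypted;
#else
	(void)to;
	(void)from;
#endif
}

/* Caller code stays free of #ifdef: merge only when the flags match. */
static bool pkt_can_merge(const struct pkt *a, const struct pkt *b)
{
	return pkt_is_decrypted(a) == pkt_is_decrypted(b);
}
```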
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3939
JIRA: https://issues.redhat.com/browse/RHEL-30656
Tested: LNST
Depends: !3918
The series updates netlink and devlink core to upstream version v6.6. Both have to be updated at once due to circular dependencies.
Omitted-fix: 83f2df9d66bc
The fix needs additional devlink dependencies and will be applied in the next rebase series, covered by RHEL-30145.
Commits:
```
6978052448f9 ("netlink: remove unused 'compare' function")
74bf6477c18b ("netlink-specs: add partial specification for devlink")
82b3297009b6 ("netlink: specs: allow uapi-header in genetlink")
56c874f7dbca ("tools: ynl: skip the explicit op array size when not needed")
8da3a5598f75 ("ynl: allow to encode u8 attr")
bc77f7318da8 ("tools: ynl: add the Python requirements.txt file")
dd3a7d58dcc2 ("tools: ynl: Add missing types to encode/decode")
4c6170d1ae2c ("tools: ynl: default to treating enums as flags for mask generation")
bec0b7a2db35 ("tools: ynl: Add struct parsing to nlspec")
b423c3c86325 ("tools: ynl: Add C array attribute decoding to ynl")
2607191395bd ("tools: ynl: Add struct attr decoding to ynl")
f036d936ca57 ("tools: ynl: Add fixed-header support to ynl")
643ef4a676e3 ("netlink: specs: add partial specification for openvswitch")
88e288968412 ("docs: netlink: document struct support for genetlink-legacy")
04eac39361d3 ("docs: netlink: document the sub-type attribute property")
9f7cc57fe550 ("tools: ynl: support byte-order in cli")
a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
48993e22d23a ("tools: ynl: replace print with NlError")
f3d07b02b2b8 ("tools: ynl: ethtool testing tool")
ebe3bdc4359e ("tools: ynl: throw a more meaningful exception if family not supported")
3ea31e66644b ("tools: ynl: Remove absolute paths to yaml files from ethtool testing tool")
85a4abed1554 ("tools: ynl: Rename ethtool to ethtool.py")
d913d32cc270 ("netlink: Use copy_to_user() for optval in netlink_getsockopt().")
a939d14919b7 ("netlink: annotate accesses to nlk->cb_running")
7c2435ef76e5 ("tools: ynl: Use dict of predefined Structs to decode scalar types")
bddd2e561b0a ("tools: ynl: Handle byte-order in struct members")
081e8df68199 ("tools: ynl: avoid dict errors on older Python versions")
9b66ee06e5ca ("net: ynl: prefix uAPI header include with uapi/")
0684f29a89e5 ("netlink: specs: correct types of legacy arrays")
6d6bae63053d ("doc: ynl: Add doc attr to struct members in genetlink-legacy spec")
5ac18889bde0 ("tools: ynl: Initialise fixed headers to 0 in genetlink-legacy")
313a7a808ca8 ("tools: ynl: Support enums in struct members in genetlink-legacy")
93b230b549bc ("netlink: specs: add ynl spec for ovs_flow")
f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
91dfaef243cd ("tools: ynl-gen: add extra headers for user space")
6ad49839ba9b ("tools: ynl-gen: fix unused / pad attribute handling")
67c65ce762ad ("tools: ynl-gen: don't override pure nested struct")
5605f102378f ("tools: ynl-gen: loosen type consistency check for events")
eef9b794eac8 ("tools: ynl-gen: add error checking for nested structs")
21b6e302789c ("tools: ynl-gen: generate enum-to-string helpers")
dc0956c98f11 ("tools: ynl-gen: move the response reading logic into YNL")
5d58f911c755 ("tools: ynl-gen: generate alloc and free helpers for req")
8cb6afb33541 ("tools: ynl-gen: switch to family struct")
59d814f0f285 ("tools: ynl-gen: generate static descriptions of notifications")
a99bfdf64795 ("tools: ynl-gen: clean up stray new lines at the end of reply-less requests")
86878f14d71a ("tools: ynl: user space helpers")
d75fdfbc6f26 ("tools: ynl: support fou and netdev in C")
ee0202e2e731 ("tools: ynl: add sample for netdev")
f6ca5baf2a86 ("netlink: specs: ethtool: fix random typos")
2cc9671a82e3 ("tools: ynl-gen: fill in support for MultiAttr scalars")
58da455b31ba ("tools: ynl-gen: improve unwind on parsing errors")
7a11f70ce882 ("tools: ynl: generate code for the handshake family")
8947e5037371 ("netlink: specs: devlink: fill in some details important for C")
9858bfc271de ("tools: ynl-gen: use enum names in op strmap more carefully")
6f115d4575ab ("tools: ynl-gen: refactor strmap helper generation")
ff6db4b58c93 ("tools: ynl-gen: enable code gen for directional specs")
6afaa0ef9b0e ("tools: ynl-gen: try to sort the types more intelligently")
37487f93b125 ("tools: ynl-gen: inherit struct use info")
eae7af21bdb9 ("tools: ynl-gen: walk nested types in depth")
168dea20ecef ("tools: ynl-gen: don't generate forward declarations for policies")
0a9471219672 ("tools: ynl-gen: don't generate forward declarations for policies - regen")
5d1a30eb989a ("tools: ynl: generate code for the devlink family")
fff8660b5425 ("tools: ynl: add sample for devlink")
30b5c720e1a9 ("tools: ynl-gen: cleanup user space header includes")
9b52fd4b6305 ("tools: ynl: regen: cleanup user space header includes")
820343ccbb2e ("tools: ynl-gen: complete the C keyword list")
2c0f1466867c ("tools: ynl-gen: combine else with closing bracket")
e4ea3cc68472 ("tools: ynl-gen: get attr type outside of if()")
7234415b8f86 ("tools: ynl: regen: regenerate the if ladders")
f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
d0915d64c3a6 ("tools: ynl: regen: stop generating common notification handlers")
ced1568862bd ("tools: ynl-gen: sanitize notification tracking")
6da3424fd629 ("tools: ynl-gen: support code gen for events")
6f96ec73cb5a ("tools: ynl-gen: don't pass op_name to RenderInfo")
76abff37f0d7 ("tools: ynl-gen: support / skip pads on the way to kernel")
008bcd6835a2 ("tools: ynl-gen: support excluding tricky ops")
33eedb0071c8 ("tools: ynl-gen: record extra args for regen")
ed2042cc77f1 ("netlink: specs: support setting prefix-name per attribute")
d4813b11d679 ("netlink: specs: ethtool: add C render hints")
dddc9f53da3e ("tools: ynl-gen: don't generate enum types if unnamed")
2c9d47a095f7 ("tools: ynl-gen: resolve enum vs struct name conflicts")
180ad455273a ("netlink: specs: ethtool: add empty enum stringset")
37c852222712 ("netlink: specs: ethtool: untangle UDP tunnels and cable test a bit")
709d0c3b3d4c ("netlink: specs: ethtool: untangle stats-get")
68335713d2ea ("netlink: specs: ethtool: mark pads as pads")
2d7be507d65e ("tools: ynl: generate code for the ethtool family")
f561ff232a6b ("tools: ynl: add sample for ethtool")
10c4d2a7b88d ("tools: ynl-gen: correct enum policies")
be093a80dff0 ("tools: ynl-gen: inherit policy in multi-attr")
fa0e21fa4443 ("rtnetlink: extend RTEXT_FILTER_SKIP_STATS to IFLA_VF_INFO")
89da780aa4c7 ("rtnetlink: move validate_linkmsg out of do_setlink")
f0ec58d557d6 ("tools: ynl: work around stale system headers")
6907217a8054 ("netlink: specs: fixup openvswitch specs for code generation")
8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
0c3d6fd4b89c ("tools: ynl: improve the direct-include header guard logic")
737eab775d36 ("netlink: specs: add display-hint to schema definitions")
d8eea68d913c ("tools: ynl: add display-hint support to ynl")
334f39ce17ef ("netlink: specs: add display hints to ovs_flow")
25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().")
b8e39b38487e ("netlink: Make use of __assign_bit() API")
633d76ad01ad ("devlink: remove reload failed checks in params get/set callbacks")
4a59cdfd6699 ("rtnetlink: Move nesting cancellation rollback to proper function")
5766946ea511 ("genetlink: add explicit ordering break check for split ops")
a3377386b564 ("netlink: Reverse the patch which removed filtering")
a4c9a56e6a2c ("netlink: Add new netlink_release function")
d7ddf5f4269f ("tools: ynl-gen: fix enum index in _decode_enum(..)")
df15c15e6c98 ("tools: ynl-gen: fix parse multi-attr enum attribute")
5fac9b7c16c5 ("netlink: allow be16 and be32 types in all uint policy checks")
e5c157f081ab ("ynl: expose xdp-zc-max-segs")
37844828d290 ("ynl: mark max/mask as private for kdoc")
25b5a2a1905f ("ynl: regenerate all headers")
26fdb67e8b4a ("ynl: print xdp-zc-max-segs in the sample")
759ab1edb56c ("net: store netdevs in an xarray")
84e00d9bd4e4 ("net: convert some netlink netdev iterators to depend on the xarray")
2628d40899d1 ("devlink: Remove unused extern declaration devlink_port_region_destroy()")
78c96d7b7c9a ("netlink: specs: add dump-strict flag for dont-validate property")
dc7b81a828db ("ynl-gen-c.py: filter rendering of validate field values for split ops")
eab7be688b44 ("ynl-gen-c.py: allow directional model for kernel mode")
fa8ba3502ade ("ynl-gen-c.py: render netlink policies static for split ops")
ba0f66c95fa6 ("devlink: rename devlink_nl_ops to devlink_nl_small_ops")
d61aedcf628e ("devlink: rename couple of doit netlink callbacks to match generated names")
491a24872a64 ("devlink: introduce couple of dumpit callbacks for split ops")
8300dce542e4 ("devlink: un-static devlink_nl_pre/post_doit()")
759f661012d1 ("netlink: specs: devlink: add info-get dump op")
6b7c486cae81 ("devlink: add split ops generated according to spec")
b2551b1517d8 ("devlink: include the generated netlink header")
6e067d0cab68 ("devlink: use generated split ops and remove duplicated commands from small ops")
b876b71a6ac2 ("devlink: Remove unused devlink_dpipe_table_resource_set() declaration")
2c0e9f3806c4 ("tools: ynl-gen: avoid rendering empty validate field")
832140804e3b ("devlink: clear flag on port register error path")
cd3112ebbaf4 ("tools: ynl-gen: add missing empty line between policies")
8fe08d70a2b6 ("netlink: convert nlk->flags to atomic flags")
63618463cb94 ("devlink: parse linecard attr in doit() callbacks")
41a1d4d1399a ("devlink: parse rate attrs in doit() callbacks")
ee6d78ac28c7 ("devlink: introduce devlink_nl_pre_doit_port*() helper functions")
8fa995ad1f7f ("devlink: rename doit callbacks for per-instance dump commands")
24c8e56d4f98 ("devlink: introduce dumpit callbacks for split ops")
7d3c6fec6135 ("devlink: pass flags as an arg of dump_one() callback")
7199c86247e9 ("netlink: specs: devlink: add commands that do per-instance dump")
ddff283280ba ("devlink: remove duplicate temporary netlink callback prototypes")
833e479d330c ("devlink: remove converted commands from small ops")
4a1b5aa8b5c7 ("devlink: allow user to narrow per-instance dumps by passing handle attrs")
34493336e7d3 ("netlink: specs: devlink: extend per-instance dump commands to accept instance attributes")
b03f13cb67a5 ("devlink: extend health reporter dump selector by port index")
0149bca17262 ("netlink: specs: devlink: extend health reporter dump attributes by port index")
84817d8c6042 ("genetlink: push conditional locking into dumpit/done")
fde9bd4a4d41 ("genetlink: make genl_info->nlhdr const")
bffcc6882a1b ("genetlink: remove userhdr from struct genl_info")
9272af109fe6 ("genetlink: add struct genl_info to struct genl_dumpit_info")
7288dd2fd488 ("genetlink: use attrs from struct genl_info")
5c670a010de4 ("genetlink: add a family pointer to struct genl_info")
5aa51d9f889c ("genetlink: add genlmsg_iput() API")
0e19d3108aea ("netdev-genl: use struct genl_info for reply construction")
ec0e5b09b834 ("ethtool: netlink: simplify arguments to ethnl_default_parse()")
f946270d05c2 ("ethtool: netlink: always pass genl_info to .prepare_data")
956db0a13b47 ("net: warn about attempts to register negative ifindex")
ded67d90815a ("netlink: specs: add ovs_vport new command")
7582113c6917 ("tools: ynl: add more info to KeyErrors on missing attrs")
d56b699d76d1 ("Documentation: Fix typos")
f65f305ae008 ("tools: ynl-gen: use temporary file for rendering")
f534f6581ec0 ("net: validate veth and vxcan peer ifindexes")
649bde9004ac ("tools: ynl: allow passing binary data")
a149a3a13bbc ("tools: ynl-gen: set length of binary fields")
dc2ef94d8926 ("tools: ynl-gen: fix collecting global policy attrs")
4c8c24e801e6 ("tools: ynl-gen: support empty attribute lists")
e83d4e9b2d0f ("netlink: specs: fix indent in fou")
a02430c06f56 ("tools: ynl-gen: fix uAPI generation after tempfile changes")
52d08fda3516 ("doc/netlink: Add delete operation to ovs_vport spec")
ed68c58c0eb4 ("doc/netlink: Add a schema for netlink-raw families")
294f37fc8772 ("doc/netlink: Update genetlink-legacy documentation")
2db8abf0b455 ("doc/netlink: Document the netlink-raw schema extensions")
88901b967958 ("tools/ynl: Add mcast-group schema parsing to ynl")
fb0a06d455d6 ("tools/net/ynl: Fix extack parsing with fixed header genlmsg")
e46dd903efe3 ("tools/net/ynl: Add support for netlink-raw families")
0493e56d021d ("tools/net/ynl: Implement nlattr array-nest decoding in ynl")
1768d8a767f8 ("tools/net/ynl: Add support for create flags")
dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
b2f63d904e72 ("doc/netlink: Add spec for rt link messages")
023289b4f582 ("doc/netlink: Add spec for rt route messages")
56e65312830e ("devlink: push object register/unregister notifications into separate helpers")
eec1e5ea1d71 ("devlink: push port related code into separate file")
2b4d8bb08889 ("devlink: push shared buffer related code into separate file")
2475ed158c47 ("devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper")
a9fd44b15fc5 ("devlink: push dpipe related code into separate file")
a9f960074ecd ("devlink: push resource related code into separate file")
830c41e1e987 ("devlink: push param related code into separate file")
1aa47ca1f52e ("devlink: push region related code into separate file")
85facf94fd80 ("devlink: use tracepoint_enabled() helper")
4bbdec80ff27 ("devlink: push trap related code into separate file")
7cc7194e85ca ("devlink: push rate related code into separate file")
9edbe6f36c5f ("devlink: push linecard related code into separate file")
890c55667437 ("devlink: move tracepoint definitions into core.c")
29a390d17748 ("devlink: move small_ops definition into netlink.c")
71179ac5c211 ("devlink: move devlink_notify_register/unregister() to dev.c")
ee940b57a929 ("doc/netlink: Fix missing classic_netlink doc reference")
d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
0f4d44f6ee04 ("netlink: specs: devlink: fix reply command values")
69844e335d8c ("selftests/bpf: Fix sockopt_sk selftest")
e4fe082c38cd ("tools: ynl: make sure we always pass yarg to mnl_cb_run")
5d78b73e8514 ("tools: ynl: don't leak mcast_groups on init error")
b6c65eb20ffa ("tools: ynl: fix handling of multiple mcast groups")
ceaac91dcd06 ("net: make sure we never create ifindex = 0")
0e0939c0adf9 ("net-procfs: use xarray iterator to implement /proc/net/dev")
```
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Merged-by: Lucas Zampieri <lzampier@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32270
Tested: LNST, Tier1
Upstream commit:
commit a54d51fb2dfb846aedf3751af501e9688db447f5
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jan 18 20:17:49 2024 +0000
udp: fix busy polling
Generic sk_busy_loop_end() only looks at sk->sk_receive_queue
for presence of packets.
Problem is that for UDP sockets after blamed commit, some packets
could be present in another queue: udp_sk(sk)->reader_queue
In some cases, a busy poller could spin until timeout expiration,
even if some packets are available in udp_sk(sk)->reader_queue.
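The corrected end-of-loop condition can be modelled like this. struct poll_sock and busy_loop_end() are hypothetical user-space stand-ins, with queue lengths replacing the kernel's lockless queue checks and a plain deadline replacing the timeout helper.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a socket with two receive-side queues. */
struct poll_sock {
	bool is_udp;
	int receive_queue_len;   /* generic queue every socket has */
	int reader_queue_len;    /* extra queue used by UDP only */
};

/* Return true when the busy poller should stop spinning. */
static bool busy_loop_end(const struct poll_sock *sk,
			  unsigned long now, unsigned long deadline)
{
	if (sk->receive_queue_len > 0)
		return true;

	/* The case the generic check missed: data parked in the UDP queue. */
	if (sk->is_udp && sk->reader_queue_len > 0)
		return true;

	return now >= deadline;
}
```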
v3: - make sk_busy_loop_end() nicer (Willem)
v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
- add a sk_is_inet() check in sk_is_udp() (Willem feedback)
- add a sk_is_inet() check in sk_is_tcp().
Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32270
Tested: LNST, Tier1
Upstream commit:
commit 448a5ce1120c5bdbce1f1ccdabcd31c7d029f328
Author: Vladislav Efanov <VEfanov@ispras.ru>
Date: Tue May 30 14:39:41 2023 +0300
udp6: Fix race condition in udp6_sendmsg & connect
Syzkaller got the following report:
BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255
The function sk_setup_caps (called by ip6_sk_dst_store_flow->
ip6_dst_store) referenced already freed memory as this memory was
freed by parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
sk_dst_check.
task1 (connect)              task2 (udp6_sendmsg)
sk_setup_caps->sk_dst_set  |
                           |  sk_dst_check->
                           |      sk_dst_set
                           |      dst_release
sk_setup_caps references   |
to already freed dst_entry |
The reason for this race condition is: sk_setup_caps() keeps using
the dst after transferring the ownership to the dst cache.
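The ordering rule behind the fix (finish reading the object, then hand it over) can be sketched in a single-threaded model. struct route, struct conn, conn_set_route() and conn_setup_caps() are hypothetical names, and free() stands in for dropping the last reference.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a cached route entry. */
struct route {
	unsigned int features;
};

/* Hypothetical connection that owns at most one cached route. */
struct conn {
	struct route *cached;
	unsigned int caps;
};

/*
 * Publish a new route and release the previous one. After this call the
 * cache owns 'r', and another publisher may release it at any time.
 */
static void conn_set_route(struct conn *c, struct route *r)
{
	struct route *old = c->cached;

	c->cached = r;
	free(old);
}

/* Read everything needed from 'r' first, transfer ownership last. */
static void conn_setup_caps(struct conn *c, struct route *r)
{
	c->caps = r->features;   /* last use of r in this function */
	conn_set_route(c, r);    /* r must not be dereferenced after this */
}
```

Swapping the two statements in conn_setup_caps() would reproduce the pattern from the report: the object is published first and read afterwards, when a parallel publisher may already have released it.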
Found by Linux Verification Center (linuxtesting.org) with syzkaller.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-17138
commit bc1fb82ae11753c5dec53c667a055dc37796dbd2
Author: Eric Dumazet <edumazet@google.com>
Date: Sat Aug 19 04:06:46 2023 +0000
net: annotate data-races around sk->sk_lingertime
sk_getsockopt() runs locklessly. This means sk->sk_lingertime
can be read while other threads are changing its value.
Other reads also happen without socket lock being held,
and must be annotated.
Remove preprocessor logic using BITS_PER_LONG, compilers
are smart enough to figure this out by themselves.
v2: fixed a clang W=1 (-Wtautological-constant-out-of-range-compare) warning
(Jakub)
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Bastien Nocera <bnocera@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3318
Update io_uring and its dependencies to upstream kernel version 6.6.
JIRA: https://issues.redhat.com/browse/RHEL-12076
JIRA: https://issues.redhat.com/browse/RHEL-14998
JIRA: https://issues.redhat.com/browse/RHEL-4447
CVE: CVE-2023-46862
Omitted-Fix: ab69838e7c75 ("io_uring/kbuf: Fix check of BID wrapping in provided buffers")
Omitted-Fix: f74c746e476b ("io_uring/kbuf: Allow the full buffer id space for provided buffers")
This is the list of new features available (includes upstream kernel versions 6.3-6.6):
- User-specified ring buffer
- Provided buffers allocated by the kernel
- Ability to register the ring fd
- Multi-shot timeouts
- Ability to pass custom flags to the completion queue entry for ring messages
All of these features are covered by the liburing tests.
In my testing, no-mmap-inval.t failed because of a broken test. socket-uring-cmd.t also failed because of a missing SELinux policy rule. Try running audit2allow if you see a failure in that test.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
Conflicts:
- drivers/infiniband/hw/erdma/erdma_cm.c chunk missing due to missing
upstream commit 920d93eac8b9 ("RDMA/erdma: Add connection management
(CM) support") in c9s.
- Context diff in fs/dlm/lowcomms.c due to missing upstream commit
dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") in c9s.
- Context diff in net/core/net-traces.c as 8139dccd464a ("udp6: add a
missing call into udp_fail_queue_rcv_skb tracepoint") was backported
earlier in c9s.
- Context diff in net/tls/tls_sw.c as 74836ec828fe ("tls: rx: strp:
don't use GFP_KERNEL in softirq context") was backported earlier in
c9s.
- Context diff in net/sunrpc/svcsock.c as upstream commit fc80fc2d4e39
("SUNRPC: Fix UAF in svc_tcp_listen_data_ready()") was backported
before in c9s.
commit 40e0b09081420853542571c38875b48b60404ebb
Author: Peilin Ye <peilin.ye@bytedance.com>
Date: Thu Jan 19 16:45:16 2023 -0800
net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations. For example:
<...>
iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Antoine Tenart <atenart@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12076
commit 634236b34d7a8c9e11c12b0746b83b8942fc8f2e
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Jun 19 12:43:35 2023 +0000
net: remove sk_is_ipmr() and sk_is_icmpv6() helpers
Blamed commit added these helpers for sake of detecting RAW
sockets specific ioctl.
syzbot complained about it [1].
The issue here is that RAW sockets could pretend there was no need
to call ipmr_sk_ioctl().
Regardless of inet_sk(sk)->inet_num, we must be prepared
for ipmr_ioctl() being called later. This must happen
from ipmr_sk_ioctl() context only.
We could add a safety check in ipmr_ioctl() at the risk of breaking
applications.
Instead, remove sk_is_ipmr() and sk_is_icmpv6() because their
name would be misleading, once we change their implementation.
[1]
BUG: KASAN: stack-out-of-bounds in ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
Read of size 4 at addr ffffc90003aefae4 by task syz-executor105/5004
CPU: 0 PID: 5004 Comm: syz-executor105 Not tainted 6.4.0-rc6-syzkaller-01304-gc08afcdcf952 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
ipmr_ioctl+0xb12/0xbd0 net/ipv4/ipmr.c:1654
raw_ioctl+0x4e/0x1e0 net/ipv4/raw.c:881
sock_ioctl_out net/core/sock.c:4186 [inline]
sk_ioctl+0x151/0x440 net/core/sock.c:4214
inet_ioctl+0x18c/0x380 net/ipv4/af_inet.c:1001
sock_do_ioctl+0xcc/0x230 net/socket.c:1189
sock_ioctl+0x1f8/0x680 net/socket.c:1306
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2944bf6ad9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd8897a028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2944bf6ad9
RDX: 0000000000000000 RSI: 00000000000089e1 RDI: 0000000000000003
RBP: 00007f2944bbac80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2944bbad10
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The buggy address belongs to stack of task syz-executor105/5004
and is located at offset 36 in frame:
sk_ioctl+0x0/0x440 net/core/sock.c:4172
This frame has 2 objects:
[32, 36) 'karg'
[48, 88) 'buffer'
Fixes: e1d001fa5b47 ("net: ioctl: Use kernel memory on protocol ioctl callbacks")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230619124336.651528-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
559260fd9d9a ("ipmr: do not acquire mrt_lock in
ioctl(SIOCGETVIFCNT)"). I also pulled in header changes from commit
949d6b405e61 ("net: add missing includes and forward declarations
under net/") to address a build failure with this patch applied.
commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date: Fri Jun 9 08:27:42 2023 -0700
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operate directly on the userspace
argument (arg), usually doing get_user()/put_user() directly in the
ioctl callback. This is not flexible, because it is hard to reuse these
functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and
operate on kernel buffers, i.e., all protocols' ioctl callbacks are
adapted to operate on kernel memory rather than on userspace (so, no
more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd,
- unsigned long arg);
+ int *karg);
(Important to say that this patch does not touch the "struct proto_ops"
protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a
pointer to kernel space memory (allocated inside a function wrapper).
This buffer (karg) may contain an input argument (copied from userspace
in a prep function) and it might return a value/buffer, which is copied
back to userspace if necessary. There is no one-size-fits-all format
(that is why I am using 'may' above), but basically, there are three
types of ioctls:
1) Does not read from userspace, returns a result to userspace
2) Reads an input parameter from userspace, and does not return anything
to userspace
3) Reads an input from userspace, and returns a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is
returned to userspace) encompasses more than 90% of the cases, but there
are two other exceptions. Here is a list of exceptions:
* Protocol RAW:
* cmd = SIOCGETVIFCNT:
* input and output = struct sioc_vif_req
* cmd = SIOCGETSGCNT
* input and output = struct sioc_sg_req
* Explanation: for the SIOCGETVIFCNT case, userspace passes the input
argument, which is struct sioc_vif_req. Then the callback populates
the struct, which is copied back to userspace.
* Protocol RAW6:
* cmd = SIOCGETMIFCNT_IN6
* input and output = struct sioc_mif_req6
* cmd = SIOCGETSGCNT_IN6
* input and output = struct sioc_sg_req6
* Protocol PHONET:
* cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
* input int (4 bytes)
* Nothing is copied back to userspace.
For the exception cases, the function sock_sk_ioctl_inout() will
copy the userspace input into kernel space, and copy the result back
to userspace.
The wrapper that prepares the buffer and puts the buffer back to user is
sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the caller now
calls sk_ioctl(), which will handle all cases.
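The default case (1) described above can be sketched in userspace C.
This is only a rough model under stated assumptions, not the kernel
code: struct sock is reduced to two fields, model_put_user() stands in
for put_user() using memcpy(), and every model_*/demo_* name is
hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct sock;
typedef int (*proto_ioctl_t)(struct sock *sk, int cmd, int *karg);

struct sock {
	proto_ioctl_t ioctl;	/* models sk->sk_prot->ioctl */
	int queued_bytes;	/* made-up state the callback reports */
};

/* Models put_user(): a plain memcpy() in this userspace sketch. */
static int model_put_user(void *uptr, const void *kptr, size_t len)
{
	memcpy(uptr, kptr, len);
	return 0;
}

/* Case 1 callback: reads nothing from userspace, writes an int to karg. */
static int demo_proto_ioctl(struct sock *sk, int cmd, int *karg)
{
	(void)cmd;
	*karg = sk->queued_bytes;
	return 0;
}

/*
 * Models the wrapper for the default case: the callback only ever sees
 * a kernel buffer, and the wrapper copies the result back to the user.
 */
static int model_sk_ioctl(struct sock *sk, int cmd, void *uarg)
{
	int karg = 0;
	int ret;

	assert(sk && sk->ioctl);
	ret = sk->ioctl(sk, cmd, &karg);
	if (ret)
		return ret;
	return model_put_user(uarg, &karg, sizeof(karg));
}
```

Because the callback never touches the user pointer, it can be reused
by callers that only have a kernel buffer, which is the flexibility
the change is after.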
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14364
Tested: LNST, Tier1
Upstream commit:
commit 66d58f046c9d3a8f996b7138d02e965fd0617de0
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Aug 31 13:52:08 2023 +0000
net: use sk_forward_alloc_get() in sk_get_meminfo()
inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(),
but sk_get_meminfo() was forgotten.
Fixes: 292e6077b040 ("net: introduce sk_forward_alloc_get()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375
# Merge Request Required Information
## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits). The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.
## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
This is actually just an optimization, and it has non-trivial conflicts
which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
This fix is incorrectly tagged. The code that it applies to is not present in our tree.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1
Upstream commit:
commit 584f3742890e966d2f0a1f3c418c9ead70b2d99e
Author: Pietro Borrello <borrello@diag.uniroma1.it>
Date: Sat Feb 4 17:39:20 2023 +0000
net: add sock_init_data_uid()
Add sock_init_data_uid() to explicitly initialize the socket uid.
To initialise the socket uid, sock_init_data() assumes that the struct
socket *sock is always embedded in a struct socket_alloc, used to
access the corresponding inode uid. This may not be true.
Examples are sockets created in tun_chr_open() and tap_open().
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1
Upstream commit:
commit c11204c78d6966c5bda6dd05c3ac5cbb193f93e3
Author: Kevin Yang <yyd@google.com>
Date: Tue Feb 7 02:08:20 2023 +0000
txhash: fix sk->sk_txrehash default
This code fixes a bug where sk->sk_txrehash gets its default enable
value from sysctl_txrehash only when the socket is a TCP listener.
We should have sysctl_txrehash set the default sk->sk_txrehash,
whether or not the socket is TCP, and whether it is a listener or a
connector.
Tested by following packetdrill:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
// SO_TXREHASH == 74, default to sysctl_txrehash == 1
+0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
+0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0
Fixes: 26859240e4ee ("txhash: Add socket option to control TX hash rethink behavior")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1
Upstream commit:
commit b261eda84ec136240a9ca753389853a3a1bccca2
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri Oct 21 13:44:34 2022 -0700
soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.
With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.
setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket. Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.
The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow. However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.
This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one. We increment the counter when
* calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
* enabling SO_INCOMING_CPU if the socket is in a reuseport group
Also, we decrement it when
* detaching a socket out of the group to apply SO_INCOMING_CPU to
migrated TCP requests
* disabling SO_INCOMING_CPU if the socket is in a reuseport group
When the counter reaches 0, we can get back to the O(1) selection
algorithm.
The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incoming_cpu
under reuseport_lock. Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.
cpu1 (setsockopt)               cpu2 (listen)
+-----------------+             +-------------+
lock_sock(sk1)                  lock_sock(sk2)
reuseport_update_incoming_cpu(sk1, val)
.
|  /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
|                               spin_lock_bh(&reuseport_lock)
|                               reuseport_grow(sk2, reuse)
|                               .
|                               |- more_socks_size = reuse->max_socks * 2U;
|                               |- if (more_socks_size > U16_MAX &&
|                               |       reuse->num_closed_socks)
|                               |  .
|                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
|                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
|                               |     .
|                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
|                               |        .
|                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
|                               |        |   * without lock_sock().
|                               |        |   */
|                               |        `- if (sk1->sk_incoming_cpu >= 0)
|                               |           .
|                               |           |  /* decrement not-yet-incremented
|                               |           |   * count, which is never incremented.
|                               |           |   */
|                               |           `- __reuseport_put_incoming_cpu(reuse);
|                               |
|                               `- spin_unlock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
|  .
|  |  /* Cannot increment reuse->incoming_cpu. */
|  `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)
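The counter-based choice between the O(1) and O(n) selection paths
could be modelled roughly as below. This is a simplified userspace
sketch, not the kernel implementation: all model_* names are
hypothetical, locking is left out, and the O(n) scan starts at index 0
rather than at the hashed slot.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SOCKS 8

struct model_sock {
	int incoming_cpu;	/* -1 means SO_INCOMING_CPU is not set */
};

struct model_reuseport {
	struct model_sock *socks[MODEL_MAX_SOCKS];
	unsigned int num_socks;
	unsigned int incoming_cpu;	/* sockets with SO_INCOMING_CPU set */
};

/* Add a socket to the group and account for its SO_INCOMING_CPU state. */
static void model_add_sock(struct model_reuseport *reuse,
			   struct model_sock *sk)
{
	assert(reuse->num_socks < MODEL_MAX_SOCKS);
	reuse->socks[reuse->num_socks++] = sk;
	if (sk->incoming_cpu >= 0)
		reuse->incoming_cpu++;
}

/* Pick a socket for a flow with the given hash, processed on 'cpu'. */
static struct model_sock *model_select(const struct model_reuseport *reuse,
				       unsigned int hash, int cpu)
{
	struct model_sock *fallback;
	unsigned int i;

	if (reuse->num_socks == 0)
		return NULL;

	fallback = reuse->socks[hash % reuse->num_socks];

	/* No socket asked for a CPU: keep the O(1) hash-based selection. */
	if (reuse->incoming_cpu == 0)
		return fallback;

	/* Otherwise iterate over the group looking for a CPU match. */
	for (i = 0; i < reuse->num_socks; i++)
		if (reuse->socks[i]->incoming_cpu == cpu)
			return reuse->socks[i];

	/* No socket matches this CPU: fall back to the hashed socket. */
	return fallback;
}
```

The race shown in the diagram corresponds to the counter and a
socket's incoming_cpu being updated without a common lock, which would
leave this model on the O(1) path even though a socket has a CPU set.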
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: big tcp selftest
commit b1a78b9b98862cda167b643690e43662ea060625
Author: Xin Long <lucien.xin@gmail.com>
Date: Sat Jan 28 10:58:39 2023 -0500
net: add support for ipv4 big tcp
Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.
Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
in __ip_local_out() to allow to send BIG TCP packets, and this implies
that skb->len is the length of an IPv4 packet; On RX path, use skb->len
as the length of the IPv4 packet when the IP header tot_len is 0 and
skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
need to update these APIs.
Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
GRO complete, set IP header tot_len to 0 when the merged packet size is
greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
on RX path.
Note that by checking skb_is_gso_tcp() in API iph_totlen(), it is safe
for this implementation to use iph->tot_len == 0 to indicate IPv4 BIG
TCP packets.
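The tot_len == 0 convention can be illustrated with a small userspace
model. It is a sketch under assumptions and not the kernel helpers
themselves: the model_* structs are hypothetical reductions of struct
iphdr and struct sk_buff, MODEL_IP_MAX_MTU is assumed to be 0xFFFF,
and skb->len is taken to be the IPv4 packet length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

#define MODEL_IP_MAX_MTU 0xFFFFU	/* assumed value for this sketch */

struct model_iphdr {
	uint16_t tot_len;	/* network byte order, 0 marks BIG TCP */
};

struct model_skb {
	unsigned int len;	/* length of the IPv4 packet */
	bool is_gso_tcp;	/* models the skb_is_gso_tcp() check */
};

/* TX/GRO side: a length that does not fit in 16 bits is stored as 0. */
static void model_iph_set_totlen(struct model_iphdr *iph, unsigned int len)
{
	iph->tot_len = len <= MODEL_IP_MAX_MTU ? htons((uint16_t)len) : 0;
}

/* RX side: tot_len == 0 on a GSO TCP skb means "use skb->len instead". */
static unsigned int model_iph_totlen(const struct model_skb *skb,
				     const struct model_iphdr *iph)
{
	unsigned int len = ntohs(iph->tot_len);

	return (len || !skb->is_gso_tcp) ? len : skb->len;
}
```

A zero tot_len on a packet that is not GSO TCP is still reported as 0
here, so only BIG TCP packets get the skb->len fallback.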
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lxin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
commit de32bc6aad09131a30b4a9a738e2bf2ba5a9a5aa
Author: Pavel Begunkov <asml.silence@gmail.com>
Date: Thu Apr 28 11:58:44 2022 +0100
net: inline sock_alloc_send_skb
sock_alloc_send_skb() is simple and just proxying to another function,
so we can inline it and cut associated overhead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2183213
Upstream Status: linux.git
commit fb87bd47516d9a26b6d549231aa743b20fd4a569
Author: Guillaume Nault <gnault@redhat.com>
Date: Fri Dec 16 07:45:26 2022 -0500
net: Introduce sk_use_task_frag in struct sock.
Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.
To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.
This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.
This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.
Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).
The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).
Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2250
Bring in changes from 9.2 tag kernel-5.14.0-284.5.1.el9_2.
The change to Makefile.rhelver was dropped since it is not applicable to
centos stream 9.
The change to block/blk-mq.h was re-done based on current
centos-stream-9 tree content. Since c9s tree does have this:
80bd4a7aab4c blk-mq: move the srcu_struct used for quiescing to the tagset
Then I just applied the original upstream change instead*,
not using the 9.2 specific version anymore.
*blk-mq: fix "bad unlock balance detected" on q->srcu in __blk_mq_run_dispatch_ops
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
commit b10b9c342f7571f287fd422be5d5c0beb26ba974
Author: Paul Moore <paul@paul-moore.com>
Date: Mon Oct 10 12:31:21 2022 -0400
lsm: make security_socket_getpeersec_stream() sockptr_t safe
Commit 4ff09db1b79b ("bpf: net: Change sk_getsockopt() to take the
sockptr_t argument") made it possible to call sk_getsockopt()
with both user and kernel address space buffers through the use of
the sockptr_t type. Unfortunately at the time of conversion the
security_socket_getpeersec_stream() LSM hook was written to only
accept userspace buffers, and in a desire to avoid having to change
the LSM hook the commit author simply passed the sockptr_t's
userspace buffer pointer. Since the only sk_getsockopt() callers
at the time of conversion which used kernel sockptr_t buffers did
not allow SO_PEERSEC, and hence the
security_socket_getpeersec_stream() hook, this was acceptable but
also very fragile as future changes presented the possibility of
silently passing kernel space pointers to the LSM hook.
There are several ways to protect against this, including careful
code review of future commits, but since relying on code review to
catch bugs is a recipe for disaster and the upstream eBPF maintainer
is "strongly against defensive programming", this patch updates the
LSM hook, and all of the implementations to support sockptr_t and
safely handle both user and kernel space buffers.
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
commit 65ddc82d3b96be5555a36de4e2b4547433a00532
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date: Thu Sep 1 17:29:12 2022 -0700
bpf: Change bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt()
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse
sk_getsockopt(). It removes all duplicated code from
bpf_getsockopt(SOL_SOCKET).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET).
It surprises users from time to time. For example, SO_REUSEADDR,
SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch
automatically closes this gap without duplicating more code.
The only exception is SO_BINDTODEVICE because it needs to acquire a
blocking lock. Thus, SO_BINDTODEVICE is not supported.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
commit 4ff09db1b79b98b4a2a7511571c640b76cab3beb
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date: Thu Sep 1 17:28:02 2022 -0700
bpf: net: Change sk_getsockopt() to take the sockptr_t argument
This patch changes sk_getsockopt() to take the sockptr_t argument
such that it can be used by bpf_getsockopt(SOL_SOCKET) in a
later patch.
security_socket_getpeersec_stream() is not changed. It stays
with the __user ptr (optval.user and optlen.user) to avoid changes
to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not
support SO_PEERSEC.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
commit ba74a7608dc12fbbd8ea36e660087f08a81ef26a
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date: Thu Sep 1 17:27:56 2022 -0700
net: Change sock_getsockopt() to take the sk ptr instead of the sock ptr
A later patch refactors bpf_getsockopt(SOL_SOCKET) with the
sock_getsockopt() to avoid code duplication and code
drift between the two duplicates.
The current sock_getsockopt() takes sock ptr as the argument.
The very first thing of this function is to get back the sk ptr
by 'sk = sock->sk'.
bpf_getsockopt() could be called when the sk does not have
the sock ptr created, meaning sk->sk_socket is NULL. For example,
when a passive tcp connection has just been established but has not
yet been accept()-ed. Thus, it cannot use sock_getsockopt(sk->sk_socket)
or else it will pass a NULL ptr.
This patch moves all sock_getsockopt implementation to the newly
added sk_getsockopt(). The new sk_getsockopt() takes a sk ptr
and immediately gets the sock ptr by 'sock = sk->sk_socket'
The existing sock_getsockopt(sock) is changed to call
sk_getsockopt(sock->sk). All existing callers have both sock->sk
and sk->sk_socket pointer.
A later patch will make bpf_getsockopt(SOL_SOCKET) call
sk_getsockopt(sk) directly. The bpf_getsockopt(SOL_SOCKET) does
not use the optnames that require sk->sk_socket, so it will
be safe.
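The split between the sock-based entry point and the sk-based core
might look roughly like this in a userspace model. It is a sketch
only: the model_* types are hypothetical, and a single made-up option
(sk_rcvlowat) stands in for the whole optname switch.

```c
#include <assert.h>
#include <stddef.h>

struct model_socket;

struct model_sock {
	struct model_socket *sk_socket;	/* may be NULL before accept() */
	int sk_rcvlowat;
};

struct model_socket {
	struct model_sock *sk;
};

/* Core implementation: only needs the sk, never dereferences sk_socket. */
static int model_sk_getsockopt(const struct model_sock *sk, int *optval)
{
	assert(sk && optval);
	*optval = sk->sk_rcvlowat;
	return 0;
}

/* Existing entry point: keeps its signature and forwards sock->sk. */
static int model_sock_getsockopt(const struct model_socket *sock, int *optval)
{
	return model_sk_getsockopt(sock->sk, optval);
}
```

A caller that only has an sk, such as a bpf hook running before
accept(), can call the core directly without a NULL sk_socket ever
being dereferenced.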
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002756.2887884-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
Conflicts:
- Has been partially backported in df06f0d362 ("bpf: Change
bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()"): export of
sk_setsockopt was done there.
- net/core/filter.c: difference in removed code due to already applied
38e724189c ("net: Fix data-races around
sysctl_[rw]mem_(max|default)."). sk_setsockopt in net/core/sock.c does
already have the READ_ONCEs.
commit 29003875bd5bab262a29d1c6e76a2124bd07e4c2
Author: Martin KaFai Lau <kafai@fb.com>
Date: Tue Aug 16 23:18:04 2022 -0700
bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()
After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses them from sk_setsockopt().
The sock ptr test is added to the SO_RCVLOWAT because
the sk->sk_socket could be NULL in some of the bpf hooks.
The existing optname white-list is refactored into a new
function sol_socket_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
Conflicts:
- net/core/sock.c: Code difference because SO_PRIORITY and SO_MARK were
not allowed with CAP_NET_RAW, but only with CAP_NET_ADMIN. So change
only this check to use sockopt_capable(). Missing commits are
a1b519b74548 ("net: allow CAP_NET_RAW to setsockopt SO_PRIORITY") and
079925cce1d0 ("net: allow SO_MARK with CAP_NET_RAW").
- net/core/sock.c: We do not have SO_RCVMARK, so leave out this hunk.
commit e42c7beee71d0d84a6193357e3525d0cf2a3e168
Author: Martin KaFai Lau <kafai@fb.com>
Date: Tue Aug 16 23:17:23 2022 -0700
bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()
When a bpf program calls bpf_setsockopt(SOL_SOCKET),
it could be running in softirq, and it doesn't make sense to do the
capable check. There was a similar situation in
bpf_setsockopt(TCP_CONGESTION).
In commit 8d650cdeda ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for bpf prog.
This patch adds sockopt_ns_capable() and sockopt_capable() for
the sk_setsockopt() to use. They will consider the
has_current_bpf_ctx() before doing the ns_capable() and capable() test.
They are in EXPORT_SYMBOL for the ipv6 module to use in a later patch.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2166911
commit 24426654ed3ae83d1127511891fb782c54f49203
Author: Martin KaFai Lau <kafai@fb.com>
Date: Tue Aug 16 23:17:17 2022 -0700
bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf
Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sk_setsockopt(). The number of supported optnames keeps
increasing, and so does the duplicated code.
One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock. This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog. The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx
initialized. Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.
This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use. These helpers will test
has_current_bpf_ctx() before acquiring/releasing the lock. They are
in EXPORT_SYMBOL for the ipv6 module to use in a later patch.
Note on the change in sock_setbindtodevice(). sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
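The behaviour of the sockopt_{lock,release}_sock() helpers can be
modelled in userspace as follows. This is an illustrative sketch, not
the kernel code: model_in_bpf_ctx is a hypothetical flag standing in
for !!current->bpf_ctx, and lock_depth replaces the real socket lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for !!current->bpf_ctx. */
static bool model_in_bpf_ctx;

struct model_sock {
	int lock_depth;		/* 1 while the socket lock is held */
};

static bool model_has_current_bpf_ctx(void)
{
	return model_in_bpf_ctx;
}

/* Take the lock only when not called from a bpf prog, which holds it. */
static void model_sockopt_lock_sock(struct model_sock *sk)
{
	if (model_has_current_bpf_ctx())
		return;
	assert(sk->lock_depth == 0);
	sk->lock_depth++;
}

/* Mirror of the lock helper: release only what this path acquired. */
static void model_sockopt_release_sock(struct model_sock *sk)
{
	if (model_has_current_bpf_ctx())
		return;
	assert(sk->lock_depth == 1);
	sk->lock_depth--;
}
```

Both helpers test the same condition, so the lock and release calls
stay balanced whichever context sk_setsockopt() is entered from.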
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155182
Upstream Status: net commit 5c1ebbfabcd61
commit 5c1ebbfabcd61142a4551bfc0e51840f9bdae7af
Author: Brian Vazquez <brianvv@google.com>
Date: Wed Mar 1 13:32:47 2023 +0000
net: use indirect calls helpers for sk_exit_memory_pressure()
Florian reported a regression and sent a patch with the following
changelog:
<quote>
There is a noticeable tcp performance regression (loopback or cross-netns),
seen with iperf3 -Z (sendfile mode) when generic retpolines are needed.
With the SK_RECLAIM_THRESHOLD checks gone, calls to enter/leave
memory pressure happen much more often. For TCP, indirect calls are
used.
We can't remove the if-set-return short-circuit check in
tcp_enter_memory_pressure because there are callers other than
sk_enter_memory_pressure. Doing a check in the sk wrapper too
reduces the indirect calls enough to recover some performance.
Before,
0.00-60.00 sec 322 GBytes 46.1 Gbits/sec receiver
After:
0.00-60.04 sec 359 GBytes 51.4 Gbits/sec receiver
"iperf3 -c $peer -t 60 -Z -f g", connected via veth in another netns.
</quote>
It seems we forgot to upstream this indirect call mitigation we
had for years, lets do this instead.
[edumazet] - It seems we forgot to upstream this indirect call
mitigation we had for years, let's do this instead.
- Changed to INDIRECT_CALL_INET_1() to avoid bots reports.
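The idea behind the indirect call helpers can be sketched in userspace
C as below. This is a rough model only: MODEL_INDIRECT_CALL_1 and the
other model_* names are hypothetical, and the macro merely imitates
the pattern of comparing the pointer against a known target so the
common case becomes a direct call.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock;

struct model_proto {
	void (*leave_memory_pressure)(struct model_sock *sk);
};

struct model_sock {
	const struct model_proto *prot;
	int pressure;
};

/* The expected common target, standing in for the TCP implementation. */
static void model_tcp_leave_memory_pressure(struct model_sock *sk)
{
	sk->pressure = 0;
}

/*
 * If the pointer matches the expected target, call that function
 * directly (no retpoline needed); otherwise do the indirect call.
 */
#define MODEL_INDIRECT_CALL_1(f, f1, arg) \
	((f) == (f1) ? f1(arg) : (f)(arg))

static void model_sk_leave_memory_pressure(struct model_sock *sk)
{
	assert(sk && sk->prot);
	if (sk->prot->leave_memory_pressure)
		MODEL_INDIRECT_CALL_1(sk->prot->leave_memory_pressure,
				      model_tcp_leave_memory_pressure, sk);
}
```

The pointer comparison is cheap next to a retpoline, which is why
such wrappers help on hot paths like the one measured above.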
Fixes: 4890b686f408 ("net: keep sk->sk_forward_alloc as small as possible")
Reported-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/netdev/20230227152741.4a53634b@kernel.org/T/
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230301133247.2346111-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fwestpha@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1722
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
Conflicts: Few minor conflicts, see description in affected commits.
Properly mark concurrent reads and writes with READ_ONCE() and
WRITE_ONCE() in various parts of the networking stack. This is a
backport of the following upstream patch series:
* Patch set A: merge commit e97e68b56e78 ("Merge branch 'sk_bound_dev_if-annotations'")
* Patch set B: merge commit 32b3ad1418ea ("Merge branch 'sysctl-data-races'")
* Patch set C: merge commit 7d5424b26f17 ("Merge branch 'net-sysctl-races'")
* Patch set D: merge commit 782d86fe44e3 ("Merge branch 'net-sysctl-races-round2'")
* Patch set E: merge commit c9f21106d97b ("Merge branch 'net-ipv4-sysctl-races-part-3'")
Patch 1 is a standalone READ_ONCE() annotation for sk->sk_bound_dev_if.
It's a prerequisite for correctly backporting patch set A.
Patches 2-9 are backports of patch set A. The following upstream
patches have been omitted since they're already in Centos Stream:
* Upstream commit a20ea298071f ("sctp: read sk->sk_bound_dev_if once
in sctp_rcv()"), backported by Centos Stream commit 5d539b8523.
* Upstream commit 70f87de9fa0d ("net_sched: em_meta: add READ_ONCE()
in var_sk_bound_if()"), backported by Centos Stream commit
866ca288f3.
Patch 10 was in the original upstream series of patch set B, but was
resubmitted independently as that series was reworked before being
applied. Therefore, it doesn't strictly belong to patch set B, but is
closely related to it and is thus backported here.
Patches 11-21 are backports of patch set B. The following upstream
patch has been omitted since it's already in Centos Stream:
* Upstream commit 310731e2f161 ("net: Fix data-races around
sysctl_mem."), backported by Centos Stream commit a99b2cb4eb.
Patches 22-36 are backports corresponding to patch set C.
Patches 37-51 are backports corresponding to patch set D.
Patches 52-66 are backports corresponding to patch set E.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
commit e5fccaa1eb7f6116deab0f708a787e2de915869f
Author: Eric Dumazet <edumazet@google.com>
Date: Fri May 13 11:55:44 2022 -0700
net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
because other cpus/threads might locklessly read this field.
sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
because they run without socket lock held.
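A minimal userspace sketch of the pattern described above, for illustration
only. The READ_ONCE()/WRITE_ONCE() macros here are simplified volatile-cast
stand-ins (an assumption, not the kernel's definitions), and struct demo_sock
and the demo_* helpers are hypothetical names that do not exist in the kernel:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros (assumption: plain
 * volatile casts, not the kernel's actual definitions). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, trimmed-down socket holding only the field of interest. */
struct demo_sock {
	int sk_bound_dev_if;
};

/* Writer side: in the kernel this would run with the socket lock held,
 * but lockless readers may still load the field concurrently. */
static void demo_bindtoindex_locked(struct demo_sock *sk, int ifindex)
{
	assert(ifindex >= 0);
	WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
}

/* Reader side: may run without the socket lock, so load the field
 * exactly once into a local and use only the local afterwards. */
static int demo_get_bound_dev_if(const struct demo_sock *sk)
{
	int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);

	return bound_dev_if;
}
```

The point of the single load is that every later check in the reader sees the
same value, even if a writer updates the field in between.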
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1541
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1, tput test
This series improves UDP RX throughput, to keep it on par with RHEL 8.
Patches 1, 3 and 4 are there just to reduce conflicts, and patch 4 is a very partial
backport, to avoid pulling in unrelated features.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Tested: selftests
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2143700
commit c46b01839f7aad5889e23505bbfbeb5f4d7fde8e
Author: Jakub Kicinski <kuba@kernel.org>
Date: Tue Jul 5 16:59:26 2022 -0700
tls: rx: periodically flush socket backlog
We continuously hold the socket lock during large reads and writes.
This may inflate RTT and negatively impact TCP performance.
Flush the backlog periodically. I tried to pick a flush period (128kB)
which gives significant benefit but the max Bps rate is not yet visibly
impacted.
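The periodic flush can be pictured with the rough sketch below. It only models
the byte accounting, assuming a fixed 128kB period as stated above; the
demo_* names are hypothetical and the counter increment stands in for the
actual backlog flush, which is not reproduced here:

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_FLUSH_PERIOD (128 * 1024)	/* 128kB, the period named above */

/* Hypothetical bookkeeping for one long read or write. */
struct demo_flush_state {
	size_t done;		/* bytes processed so far */
	size_t flushed_at;	/* value of 'done' at the last flush */
	unsigned int flushes;	/* number of flushes triggered */
};

/* Account for 'len' more bytes and report whether a flush is due. */
static bool demo_account(struct demo_flush_state *st, size_t len)
{
	st->done += len;
	if (st->done - st->flushed_at < DEMO_FLUSH_PERIOD)
		return false;

	st->flushed_at = st->done;
	st->flushes++;		/* stand-in for flushing the socket backlog */
	return true;
}
```

With 16kB records, for example, this accounting asks for a flush on every
eighth record rather than holding the lock for the whole transfer.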
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
- bpf_arch_text_poke()
HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
Resolved in favour of !1464, but keep the return statement from !1477
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477
Bugzilla: https://bugzilla.redhat.com/2120966
Rebase BPF and XDP to the upstream kernel version 5.18
Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>