864 Commits

Author SHA1 Message Date
Antoine Tenart cff87c7d6f inet: annotate devconf data-races
JIRA: https://issues.redhat.com/browse/RHEL-62202
Upstream Status: linux.git

commit 0598f8f3bb77893a13105d47bb7dfe42f1dc1f4e
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 27 09:24:09 2024 +0000

    inet: annotate devconf data-races

    Add READ_ONCE() in ipv4_devconf_get() and corresponding
    WRITE_ONCE() in ipv4_devconf_set()

    Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros,
    and use them when reading devconf fields.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-11-14 10:16:47 +01:00
Rado Vrbovsky c3700614c3 Merge: ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5284

JIRA: https://issues.redhat.com/browse/RHEL-59754

RHEL-57748 backported `ec20b2830093 ("ipv4: Set scope explicitly in ip_route_output().")`, which sets the `ip_route_output` tos to 0. This breaks bonding ARP monitoring because `ip_rt_fix_tos` later resets the scope to RT_SCOPE_UNIVERSE, since `tos` is 0. The backported patch 16a28267774c ("ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().") fixes this issue, as the scope is no longer reset to RT_SCOPE_UNIVERSE.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-06 08:30:19 +00:00
Rado Vrbovsky f5817ff396 Merge: CNB96: net: Allow configuration of multipath hash seed
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5270

JIRA: https://issues.redhat.com/browse/RHEL-59087  
Tested: Using attached `net/forwarding/router_mpath_seed.sh` self-test  
Depends: !5198  

Commits:
```
3e453ca122d4 net: ipv4,ipv6: Pass multipath hash computation through a helper
4ee2a8cace3f net: ipv4: Add a sysctl to set multipath hash seed
6f51aed38a4f selftests: forwarding: lib: Split sysctl_save() out of sysctl_set()
5f90d93b6108 selftests: forwarding: router_mpath_hash: Add a new selftest
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-20 08:58:54 +00:00
Guillaume Nault 09cce09af2 ipv4: Fix incorrect source address in Record Route option
JIRA: https://issues.redhat.com/browse/RHEL-61380
Upstream Status: linux.git

commit cc73bbab4b1fb8a4f53a24645871dafa5f81266a
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Thu Jul 18 15:34:07 2024 +0300

    ipv4: Fix incorrect source address in Record Route option

    The Record Route IP option records the addresses of the routers that
    routed the packet. In the case of forwarded packets, the kernel performs
    a route lookup via fib_lookup() and fills in the preferred source
    address of the matched route.

    The lookup is performed with the DS field of the forwarded packet, but
    using the RT_TOS() macro which only masks one of the two ECN bits. If
    the packet is ECT(0) or CE, the matched route might be different than
    the route via which the packet was forwarded as the input path masks
    both of the ECN bits, resulting in the wrong address being filled in the
    Record Route option.

    Fix by masking both of the ECN bits.

    Fixes: 8e36360ae8 ("ipv4: Remove route key identity dependencies in ip_rt_get_source().")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Guillaume Nault <gnault@redhat.com>
    Link: https://patch.msgid.link/20240718123407.434778-1-idosch@nvidia.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-10-02 21:02:55 +02:00
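
The masking difference described in 09cce09af2 above can be sketched in userspace. This is a minimal model, not the kernel code: the mask values are the ones defined in the kernel headers, while the helper names are hypothetical and the exact expression used by the patch is not quoted in the commit message.

```python
# Masks as defined in the kernel headers (include/uapi/linux/ip.h and
# include/net/inet_ecn.h); the helper names below are hypothetical.
IPTOS_TOS_MASK = 0x1E                             # mask applied by RT_TOS()
INET_ECN_MASK = 0x03                              # the two ECN bits of the DS field
IPTOS_RT_MASK = IPTOS_TOS_MASK & ~INET_ECN_MASK   # 0x1C, both ECN bits cleared

ECN_CODEPOINTS = {"Not-ECT": 0b00, "ECT(1)": 0b01, "ECT(0)": 0b10, "CE": 0b11}


def rt_tos(ds_field):
    """Model of RT_TOS(): clears only the low ECN bit."""
    return ds_field & IPTOS_TOS_MASK


def mask_both_ecn_bits(ds_field):
    """Model of masking both ECN bits, as the input path is described to do."""
    return ds_field & IPTOS_RT_MASK


def ecn_leaks_into_lookup(ds_field):
    """True when the two maskings produce different lookup keys."""
    return rt_tos(ds_field) != mask_both_ecn_bits(ds_field)


if __name__ == "__main__":
    for name, ecn in ECN_CODEPOINTS.items():
        print(f"{name:8s} leaks={ecn_leaks_into_lookup(ecn)}")
```

Only ECT(0) and CE have the high ECN bit set, which is why those are the two codepoints the commit message singles out.
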
Guillaume Nault 2db4d3de45 ipv4: Fix incorrect TOS in fibmatch route get reply
JIRA: https://issues.redhat.com/browse/RHEL-61380
Upstream Status: linux.git

commit f036e68212c11e5a7edbb59b5e25299341829485
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Mon Jul 15 17:23:54 2024 +0300

    ipv4: Fix incorrect TOS in fibmatch route get reply

    The TOS value that is returned to user space in the route get reply is
    the one with which the lookup was performed ('fl4->flowi4_tos'). This is
    fine when the matched route is configured with a TOS as it would not
    match if its TOS value did not match the one with which the lookup was
    performed.

    However, matching on TOS is only performed when the route's TOS is not
    zero. It is therefore possible to have the kernel incorrectly return a
    non-zero TOS:

     # ip link add name dummy1 up type dummy
     # ip address add 192.0.2.1/24 dev dummy1
     # ip route get fibmatch 192.0.2.2 tos 0xfc
     192.0.2.0/24 tos 0x1c dev dummy1 proto kernel scope link src 192.0.2.1

    Fix by instead returning the DSCP field from the FIB result structure
    which was populated during the route lookup.

    Output after the patch:

     # ip link add name dummy1 up type dummy
     # ip address add 192.0.2.1/24 dev dummy1
     # ip route get fibmatch 192.0.2.2 tos 0xfc
     192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1

    Extend the existing selftests to not only verify that the correct route
    is returned, but that it is also returned with correct "tos" value (or
    without it).

    Fixes: b61798130f ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-10-02 21:02:52 +02:00
Guillaume Nault 781bdceaf3 ipv4: Fix incorrect TOS in route get reply
JIRA: https://issues.redhat.com/browse/RHEL-61380
Upstream Status: linux.git

commit 338bb57e4c2a1c2c6fc92f9c0bd35be7587adca7
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Mon Jul 15 17:23:53 2024 +0300

    ipv4: Fix incorrect TOS in route get reply

    The TOS value that is returned to user space in the route get reply is
    the one with which the lookup was performed ('fl4->flowi4_tos'). This is
    fine when the matched route is configured with a TOS as it would not
    match if its TOS value did not match the one with which the lookup was
    performed.

    However, matching on TOS is only performed when the route's TOS is not
    zero. It is therefore possible to have the kernel incorrectly return a
    non-zero TOS:

     # ip link add name dummy1 up type dummy
     # ip address add 192.0.2.1/24 dev dummy1
     # ip route get 192.0.2.2 tos 0xfc
     192.0.2.2 tos 0x1c dev dummy1 src 192.0.2.1 uid 0
         cache

    Fix by adding a DSCP field to the FIB result structure (inside an
    existing 4 bytes hole), populating it in the route lookup and using it
    when filling the route get reply.

    Output after the patch:

     # ip link add name dummy1 up type dummy
     # ip address add 192.0.2.1/24 dev dummy1
     # ip route get 192.0.2.2 tos 0xfc
     192.0.2.2 dev dummy1 src 192.0.2.1 uid 0
         cache

    Fixes: 1a00fee4ff ("ipv4: Remove rt_key_{src,dst,tos} from struct rtable.")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-10-02 21:02:48 +02:00
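
The behaviour fixed by 2db4d3de45 and 781bdceaf3 above can be modelled in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and masking the requested TOS with 0x1c is an assumption chosen because it reproduces the `tos 0x1c` shown in the pre-patch output for a request of `tos 0xfc`.

```python
IPTOS_RT_MASK = 0x1C  # assumed mask applied to the requested TOS


def route_matches(route_tos, lookup_tos):
    """TOS is compared only when the route is configured with a non-zero TOS."""
    return route_tos == 0 or route_tos == lookup_tos


def reply_tos_before(route_tos, lookup_tos):
    """Pre-patch behaviour: echo the TOS the lookup was performed with."""
    return lookup_tos


def reply_tos_after(route_tos, lookup_tos):
    """Post-patch behaviour: report the value stored for the matched route."""
    return route_tos


if __name__ == "__main__":
    connected_route_tos = 0             # 192.0.2.0/24 has no TOS configured
    lookup_tos = 0xFC & IPTOS_RT_MASK   # "tos 0xfc" on the command line
    assert route_matches(connected_route_tos, lookup_tos)
    print(hex(reply_tos_before(connected_route_tos, lookup_tos)))
    print(hex(reply_tos_after(connected_route_tos, lookup_tos)))
```

For a route that does carry a non-zero TOS, both functions agree, because such a route only matches when the lookup TOS is equal to it.
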
Guillaume Nault c9980db660 ipv4: check for NULL idev in ip_route_use_hint()
JIRA: https://issues.redhat.com/browse/RHEL-61380
Upstream Status: linux.git

commit 58a4c9b1e5a3e53c9148e80b90e1e43897ce77d1
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 21 18:43:26 2024 +0000

    ipv4: check for NULL idev in ip_route_use_hint()

    syzbot was able to trigger a NULL deref in fib_validate_source()
    in an old tree [1].

    It appears the bug exists in latest trees.

    All calls to __in_dev_get_rcu() must be checked for a NULL result.

    [1]
    general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    CPU: 2 PID: 3257 Comm: syz-executor.3 Not tainted 5.10.0-syzkaller #0
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
     RIP: 0010:fib_validate_source+0xbf/0x15a0 net/ipv4/fib_frontend.c:425
    Code: 18 f2 f2 f2 f2 42 c7 44 20 23 f3 f3 f3 f3 48 89 44 24 78 42 c6 44 20 27 f3 e8 5d 88 48 fc 4c 89 e8 48 c1 e8 03 48 89 44 24 18 <42> 80 3c 20 00 74 08 4c 89 ef e8 d2 15 98 fc 48 89 5c 24 10 41 bf
    RSP: 0018:ffffc900015fee40 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff88800f7a4000 RCX: ffff88800f4f90c0
    RDX: 0000000000000000 RSI: 0000000004001eac RDI: ffff8880160c64c0
    RBP: ffffc900015ff060 R08: 0000000000000000 R09: ffff88800f7a4000
    R10: 0000000000000002 R11: ffff88800f4f90c0 R12: dffffc0000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88800f7a4000
    FS:  00007f938acfe6c0(0000) GS:ffff888058c00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f938acddd58 CR3: 000000001248e000 CR4: 0000000000352ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
      ip_route_use_hint+0x410/0x9b0 net/ipv4/route.c:2231
      ip_rcv_finish_core+0x2c4/0x1a30 net/ipv4/ip_input.c:327
      ip_list_rcv_finish net/ipv4/ip_input.c:612 [inline]
      ip_sublist_rcv+0x3ed/0xe50 net/ipv4/ip_input.c:638
      ip_list_rcv+0x422/0x470 net/ipv4/ip_input.c:673
      __netif_receive_skb_list_ptype net/core/dev.c:5572 [inline]
      __netif_receive_skb_list_core+0x6b1/0x890 net/core/dev.c:5620
      __netif_receive_skb_list net/core/dev.c:5672 [inline]
      netif_receive_skb_list_internal+0x9f9/0xdc0 net/core/dev.c:5764
      netif_receive_skb_list+0x55/0x3e0 net/core/dev.c:5816
      xdp_recv_frames net/bpf/test_run.c:257 [inline]
      xdp_test_run_batch net/bpf/test_run.c:335 [inline]
      bpf_test_run_xdp_live+0x1818/0x1d00 net/bpf/test_run.c:363
      bpf_prog_test_run_xdp+0x81f/0x1170 net/bpf/test_run.c:1376
      bpf_prog_test_run+0x349/0x3c0 kernel/bpf/syscall.c:3736
      __sys_bpf+0x45c/0x710 kernel/bpf/syscall.c:5115
      __do_sys_bpf kernel/bpf/syscall.c:5201 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5199 [inline]
      __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5199

    Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20240421184326.1704930-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-10-02 21:02:36 +02:00
Guillaume Nault 3548d575b1 ipv4: ignore dst hint for multipath routes
JIRA: https://issues.redhat.com/browse/RHEL-61380
Upstream Status: linux.git
Conflicts: (context) Missing upstream commit e6175a2ed1f1 ("xfrm: fix
           "disable_policy" flag use when arriving from different
           devices"):
           Centos Stream 9 doesn't have the IPSKB_NOPOLICY flag in
           struct inet_skb_parm (include/net/ip.h).

commit 6ac66cb03ae306c2e288a9be18226310529f5b25
Author: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Date:   Thu Aug 31 10:03:30 2023 +0200

    ipv4: ignore dst hint for multipath routes

    Route hints when the nexthop is part of a multipath group causes packets

    in the same receive batch to be sent to the same nexthop irrespective of
    the multipath hash of the packet. So, do not extract route hint for
    packets whose destination is part of a multipath group.

    A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the
    flag when route is looked up in ip_mkroute_input() and use it in
    ip_extract_route_hint() to check for the existence of the flag.

    Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
    Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-10-02 21:02:30 +02:00
Hangbin Liu 6faaad54f0 ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().
JIRA: https://issues.redhat.com/browse/RHEL-59754
Upstream Status: net.git commit 16a28267774c

commit 16a28267774cd9f85405ef83d4afcbd0355e5817
Author: Guillaume Nault <gnault@redhat.com>
Date:   Thu Apr 21 01:21:24 2022 +0200

    ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().

    All callers already initialise ->flowi4_scope with RT_SCOPE_UNIVERSE,
    either by manual field assignment, memset(0) of the whole structure or
    implicit structure initialisation of on-stack variables
    (RT_SCOPE_UNIVERSE actually equals 0).

    Therefore, we don't need to always initialise ->flowi4_scope in
    ip_rt_fix_tos(). We only need to reduce the scope to RT_SCOPE_LINK when
    the special RTO_ONLINK flag is present in the tos.

    This will allow some code simplification, like removing
    ip_rt_fix_tos(). Also, the long term idea is to remove RTO_ONLINK
    entirely by properly initialising ->flowi4_scope, instead of
    overloading ->flowi4_tos with a special flag. Eventually, this will
    allow to convert ->flowi4_tos to dscp_t.

    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2024-09-29 15:04:21 +08:00
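
The scope handling changed by 6faaad54f0 above, and the regression described in merge c3700614c3, can be illustrated with a small model. It is only a sketch of the behaviour described in the commit messages, not the kernel source: the function names are hypothetical, and the constants are the values used by the kernel headers (the message itself confirms that RT_SCOPE_UNIVERSE equals 0).

```python
RT_SCOPE_UNIVERSE = 0
RT_SCOPE_LINK = 253
RTO_ONLINK = 0x01       # special flag carried in the tos field
IPTOS_RT_MASK = 0x1C


def fix_tos_before(tos, scope):
    """Old behaviour: the scope is always rewritten from the tos flag."""
    scope = RT_SCOPE_LINK if tos & RTO_ONLINK else RT_SCOPE_UNIVERSE
    return tos & IPTOS_RT_MASK, scope


def fix_tos_after(tos, scope):
    """New behaviour: the scope is only narrowed when RTO_ONLINK is set."""
    if tos & RTO_ONLINK:
        scope = RT_SCOPE_LINK
    return tos & IPTOS_RT_MASK, scope


if __name__ == "__main__":
    # A caller that sets the scope explicitly and passes tos == 0.
    print(fix_tos_before(0, RT_SCOPE_LINK))  # scope is reset to universe
    print(fix_tos_after(0, RT_SCOPE_LINK))   # scope chosen by the caller survives
```

Callers that still pass RTO_ONLINK in the tos get the same result from both versions, which is why the change is safe for them.
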
Ivan Vecera b97546c3a2 net: ipv4,ipv6: Pass multipath hash computation through a helper
JIRA: https://issues.redhat.com/browse/RHEL-59087

commit 3e453ca122d483eb519f934b6624215f0536301c
Author: Petr Machata <petrm@nvidia.com>
Date:   Fri Jun 7 17:13:53 2024 +0200

    net: ipv4,ipv6: Pass multipath hash computation through a helper

    The following patches will add a sysctl to control multipath hash
    seed. In order to centralize the hash computation, add a helper,
    fib_multipath_hash_from_keys(), and have all IPv4 and IPv6 route.c
    invocations of flow_hash_from_keys() go through this helper instead.

    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240607151357.421181-2-petrm@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-24 17:04:46 +02:00
Xin Long 82d5d527d9 net: fix __dst_negative_advice() race
JIRA: https://issues.redhat.com/browse/RHEL-41185
CVE: CVE-2024-36971
Tested: compile only

Conflicts:
  - context difference in include/net/dst_ops.h due to missing
    43c2817225fc from upstream.

commit 92f1655aa2b2294d0b49925f3b875a634bd3b59e
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 28 11:43:53 2024 +0000

    net: fix __dst_negative_advice() race

    __dst_negative_advice() does not enforce proper RCU rules when
    sk->dst_cache must be cleared, leading to possible UAF.

    RCU rules are that we must first clear sk->sk_dst_cache,
    then call dst_release(old_dst).

    Note that sk_dst_reset(sk) is implementing this protocol correctly,
    while __dst_negative_advice() uses the wrong order.

    Given that ip6_negative_advice() has special logic
    against RTF_CACHE, this means each of the three ->negative_advice()
    existing methods must perform the sk_dst_reset() themselves.

    Note the check against NULL dst is centralized in
    __dst_negative_advice(), there is no need to duplicate
    it in various callbacks.

    Many thanks to Clement Lecigne for tracking this issue.

    This old bug became visible after the blamed commit, using UDP sockets.

    Fixes: a87cb3e48e ("net: Facility to report route quality of connected sockets")
    Reported-by: Clement Lecigne <clecigne@google.com>
    Diagnosed-by: Clement Lecigne <clecigne@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Tom Herbert <tom@herbertland.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2024-07-10 16:54:52 -04:00
Ivan Vecera e2917e01d7 ipv4: rename and move ip_route_output_tunnel()
JIRA: https://issues.redhat.com/browse/RHEL-40130

commit bf3fcbf7e7a08015d3b169bad6281b29d45c272d
Author: Beniamino Galvani <b.galvani@gmail.com>
Date:   Mon Oct 16 09:15:20 2023 +0200

    ipv4: rename and move ip_route_output_tunnel()

    At the moment ip_route_output_tunnel() is used only by bareudp.
    Ideally, other UDP tunnel implementations should use it, but to do so
    the function needs to accept new parameters that are specific for UDP
    tunnels, such as the ports.

    Prepare for these changes by renaming the function to
    udp_tunnel_dst_lookup() and move it to file
    net/ipv4/udp_tunnel_core.c.

    Suggested-by: Guillaume Nault <gnault@redhat.com>
    Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-06-11 11:22:34 +02:00
Felix Maurer c1591b1d73 net: dst: fix missing initialization of rt_uncached
JIRA: https://issues.redhat.com/browse/RHEL-15695

commit 418a73074da9182f571e467eaded03ea501f3281
Author: Maxime Bizon <mbizon@freebox.fr>
Date:   Thu Apr 20 20:25:08 2023 +0200

    net: dst: fix missing initialization of rt_uncached

    xfrm_alloc_dst() followed by xfrm4_dst_destroy(), without a
    xfrm4_fill_dst() call in between, causes the following BUG:

     BUG: spinlock bad magic on CPU#0, fbxhostapd/732
      lock: 0x890b7668, .magic: 890b7668, .owner: <none>/-1, .owner_cpu: 0
     CPU: 0 PID: 732 Comm: fbxhostapd Not tainted 6.3.0-rc6-next-20230414-00613-ge8de66369925-dirty #9
     Hardware name: Marvell Kirkwood (Flattened Device Tree)
      unwind_backtrace from show_stack+0x10/0x14
      show_stack from dump_stack_lvl+0x28/0x30
      dump_stack_lvl from do_raw_spin_lock+0x20/0x80
      do_raw_spin_lock from rt_del_uncached_list+0x30/0x64
      rt_del_uncached_list from xfrm4_dst_destroy+0x3c/0xbc
      xfrm4_dst_destroy from dst_destroy+0x5c/0xb0
      dst_destroy from rcu_process_callbacks+0xc4/0xec
      rcu_process_callbacks from __do_softirq+0xb4/0x22c
      __do_softirq from call_with_stack+0x1c/0x24
      call_with_stack from do_softirq+0x60/0x6c
      do_softirq from __local_bh_enable_ip+0xa0/0xcc

    Patch "net: dst: Prevent false sharing vs. dst_entry:: __refcnt" moved
    rt_uncached and rt_uncached_list fields from rtable struct to dst
    struct, so they are no longer zeroed by memset_after(xdst, 0, u.dst) in
    xfrm_alloc_dst().

    Note that rt_uncached (list_head) was never properly initialized at
    alloc time, but xfrm[46]_dst_destroy() is written in such a way that
    it was not an issue thanks to the memset:

            if (xdst->u.rt.dst.rt_uncached_list)
                    rt_del_uncached_list(&xdst->u.rt);

    The route code does it the other way around: rt_uncached_list is
    assumed to be valid IIF rt_uncached list_head is not empty:

    void rt_del_uncached_list(struct rtable *rt)
    {
            if (!list_empty(&rt->dst.rt_uncached)) {
                    struct uncached_list *ul = rt->dst.rt_uncached_list;

                    spin_lock_bh(&ul->lock);
                    list_del_init(&rt->dst.rt_uncached);
                    spin_unlock_bh(&ul->lock);
            }
    }

    This patch adds mandatory rt_uncached list_head initialization in
    generic dst_init(), and adapt xfrm[46]_dst_destroy logic to match the
    rest of the code.

    Fixes: d288a162dd1c ("net: dst: Prevent false sharing vs. dst_entry:: __refcnt")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Link: https://lore.kernel.org/oe-lkp/202304162125.18b7bcdd-oliver.sang@intel.com
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    CC: Leon Romanovsky <leon@kernel.org>
    Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
    Link: https://lore.kernel.org/r/20230420182508.2417582-1-mbizon@freebox.fr
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-05-21 17:19:20 +02:00
Felix Maurer e383060306 net: dst: Prevent false sharing vs. dst_entry:: __refcnt
JIRA: https://issues.redhat.com/browse/RHEL-15695
Conflicts:
- include/net/dst.h: We have the kABI padding added at the end of dst_entry;
  keep it at the end.

commit d288a162dd1c73507da582966f17dd226e34a0c0
Author: Wangyang Guo <wangyang.guo@intel.com>
Date:   Thu Mar 23 21:55:29 2023 +0100

    net: dst: Prevent false sharing vs. dst_entry:: __refcnt

    dst_entry::__refcnt is highly contended in scenarios where many connections
    happen from and to the same IP. The reference count is an atomic_t, so the
    reference count operations have to take the cache-line exclusive.

    Aside of the unavoidable reference count contention there is another
    significant problem which is caused by that: False sharing.

    perf top identified two affected read accesses. dst_entry::lwtstate and
    rtable::rt_genid.

    dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into
    a separate cacheline vs. the read mostly members located at the beginning
    of the struct.

    That prevents false sharing vs. the struct members in the first 64
    bytes of the structure, but there is also

      dst_entry::lwtstate

    which is located after the reference count and in the same cache line. This
    member is read after a reference count has been acquired.

    struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
    size of 112 bytes, which means that the struct members of rtable which
    follow the dst member share the same cache line as dst_entry::__refcnt.
    Especially

      rtable::rt_genid

    is also read by the contexts which have a reference count acquired
    already.

    When dst_entry:__refcnt is incremented or decremented via an atomic
    operation these read accesses stall. This was found when analysing the
    memtier benchmark in 1:100 mode, which amplifies the problem extremely.

    Move the rt[6i]_uncached[_list] members out of struct rtable and struct
    rt6_info into struct dst_entry to provide padding and move the lwtstate
    member after that so it ends up in the same cache line.

    The resulting improvement depends on the micro-architecture and the number
    of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
    benchmark.

    [ tglx: Rearrange struct ]

    Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
    Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-05-21 17:19:19 +02:00
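
The cache-line reasoning in e383060306 above comes down to simple offset arithmetic. The sketch below uses only the two numbers given in the commit message (`__refcnt` at offset 64, `struct dst_entry` of 112 bytes) and assumes a 64-byte cache line; it describes the layout before the patch, and the helper name is hypothetical.

```python
CACHE_LINE = 64        # assumed cache-line size in bytes
REFCNT_OFFSET = 64     # offset of dst_entry::__refcnt (from the commit message)
DST_ENTRY_SIZE = 112   # size of struct dst_entry (from the commit message)


def cache_line_index(offset):
    """Index of the cache line that contains the given byte offset."""
    return offset // CACHE_LINE


# struct rtable embeds dst_entry at offset 0, so its next member starts at 112.
refcnt_line = cache_line_index(REFCNT_OFFSET)
first_rtable_member_line = cache_line_index(DST_ENTRY_SIZE)

# Bytes of the refcount's cache line that lie beyond the end of dst_entry.
bytes_shared_with_rtable = (refcnt_line + 1) * CACHE_LINE - DST_ENTRY_SIZE

if __name__ == "__main__":
    print(refcnt_line, first_rtable_member_line, bytes_shared_with_rtable)
```

Both indices come out equal, so any rtable member placed in the 16 bytes after the embedded dst_entry is read from the same line that every reference count update takes exclusive.
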
Guillaume Nault 53a1587c58 ipv4: Correct/silence an endian warning in __ip_do_redirect
JIRA: https://issues.redhat.com/browse/RHEL-22186
Upstream Status: linux.git

commit c0e2926266af3b5acf28df0a8fc6e4d90effe0bb
Author: Kunwu Chan <chentao@kylinos.cn>
Date:   Sun Nov 19 22:17:59 2023 +0800

    ipv4: Correct/silence an endian warning in __ip_do_redirect

    net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types)
    net/ipv4/route.c:783:46:    expected unsigned int [usertype] key
    net/ipv4/route.c:783:46:    got restricted __be32 [usertype] new_gw

    Fixes: 969447f226 ("ipv4: use new_gw for redirect neigh lookup")
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
    Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-01-19 14:08:50 +01:00
Ivan Vecera b167c4e0b9 neighbour: annotate lockless accesses to n->nud_state
JIRA: https://issues.redhat.com/browse/RHEL-16999

commit b071af523579df7341cabf0f16fc661125e9a13f
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Mar 13 20:17:31 2023 +0000

    neighbour: annotate lockless accesses to n->nud_state

    We have many lockless accesses to n->nud_state.

    Before adding another one in the following patch,
    add annotations to readers and writers.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-11-20 19:28:55 +01:00
Scott Weaver 8bbc89729a Merge: ipv4: First round of upstream fixes for RHEL 9.4.
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3242

JIRA: https://issues.redhat.com/browse/RHEL-14295
Upstream Status: linux.git

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-11-02 12:33:46 -04:00
Scott Weaver ad72d6de84 Merge: 9.4 mm changes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2843

JIRA: https://issues.redhat.com/browse/RHEL-1848

Already in CS9
Omitted-fix: 327b18b7aaed ("mm/kfence: select random number before taking raw lock")
Omitted-fix: bfbfb6182ad1 ("nfsd_splice_actor(): handle compound pages")
Omitted-fix: ac8db824ead0 ("NFSD: Fix reads with a non-zero offset that don't end on a page boundary")
Omitted-fix: b3719108ae60 ("perf kmem: Support legacy tracepoints")
Omitted-fix: dce088ab0d51 ("perf kmem: Support field "node" in evsel__process_alloc_event() coping with recent tracepoint restructuring")
Omitted-fix: c18c20f16219 ("mm, slab: remove duplicate kernel-doc comment for ksize()")
Omitted-fix: cfccd2e63e7e ("mm, compaction: finish pageblocks on complete migration failure")
Omitted-fix: 6342140db660 ("selftests/timens: add a test for vfork+exit")
Omitted-fix: be6667b0db97 ("selftests/vm: dedup hugepage allocation logic")
Omitted-fix: 9d0d94684007 ("selftests/vm: add selftest to verify multi THP collapse")
Omitted-fix: 1370a21fe470 ("selftests/vm: add selftest to verify recollapse of THPs")
Omitted-fix: b25806dcd3d5 ("mm: memcontrol: deprecate swapaccounting=0 mode")
Omitted-fix: b94c4e949c36 ("mm: memcontrol: use do_memsw_account() in a few more places")
Omitted-fix: e55b9f96860f ("mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol")
Omitted-fix: 6f777dcef774 ("docs: kmsan: fix formatting of "Example report"")
Omitted-fix: 26e1a0c3277d ("mm: use pmdp_get_lockless() without surplus barrier()")
Omitted-fix: 0cb8fd4d1416 ("mm/migrate: remove cruft from migration_entry_wait()s")

patches resulting in empty commits after conflict resolution
Omitted-fix: 4a7e922587d2 ("selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh")

patches that are functionally identical
Omitted-fix: 6f777dcef774 ("docs: kmsan: fix formatting of "Example report"")
   Is identical to 436fa4a699bc ("docs: kmsan: fix formatting of "Example report"")

Defer to crypto group
Omitted-fix: f900fde28883 ("crypto: testmgr - fix RNG performance in fuzz tests")

Not including since we're specifically excluding the Maple Tree VMA Iterator
Omitted-fix: 524e00b36e8c ("mm: remove rb tree.")

'series' patches that won't be addressed by this MR
Omitted-fix: 9905eed48e82 ("Merge branch 'af_unix-OOB-fixes'")
Omitted-fix: 2e4b231ac125 ("scsi: NCR5380: Use sc_data_direction instead of rq_data_dir()")
Omitted-fix: 40e16ce7b6fa ("scsi: advansys: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 11bf4ec58073 ("scsi: aha1542: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 3ada9c791b1d ("scsi: dpt_i2o: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 240ec1197786 ("scsi: ips: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: ce425dd7dbc9 ("scsi: mvumi: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 2fd8f23aae36 ("scsi: myrb: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 43b2d1b14ed0 ("scsi: myrs: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 0f8f3ea84a89 ("scsi: ncr53c8xx: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 3f5e62c5e074 ("scsi: qla1280: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: ba4baf0951bb ("scsi: qlogicpti: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: ec808ef9b838 ("scsi: snic: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: bbfa8d7d1283 ("scsi: stex: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 6c5d5422c533 ("scsi: sun3_scsi: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 77ff7756c73e ("scsi: sym53c8xx: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 80ca10b6052d ("scsi: xen-scsifront: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 332f606b32b6 ("ovl: enable RCU'd ->get_acl()")
Omitted-fix: b3b6f5b92255 ("btrfs: handle idmaps in btrfs_new_inode()")
Omitted-fix: ca07274c3da9 ("btrfs: allow idmapped rename inode op")
Omitted-fix: c020d2eaf1a8 ("btrfs: allow idmapped getattr inode op")
Omitted-fix: 72105277dcfc ("btrfs: allow idmapped mknod inode op")
Omitted-fix: e93ca491d03f ("btrfs: allow idmapped create inode op")
Omitted-fix: b0b3e44d346c ("btrfs: allow idmapped mkdir inode op")
Omitted-fix: 5a0521086e5f ("btrfs: allow idmapped symlink inode op")
Omitted-fix: 98b6ab5fc098 ("btrfs: allow idmapped tmpfile inode op")
Omitted-fix: d4d094646142 ("btrfs: allow idmapped setattr inode op")
Omitted-fix: 3bc71ba02cf5 ("btrfs: allow idmapped permission inode op")
Omitted-fix: 5474bf400f16 ("btrfs: check whether fsgid/fsuid are mapped during subvolume creation")
Omitted-fix: 4d4340c912cc ("btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls")
Omitted-fix: c4ed533bdc79 ("btrfs: allow idmapped SNAP_DESTROY ioctls")
Omitted-fix: aabb34e7a31c ("btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids")
Omitted-fix: e4fed17a32b6 ("btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls")
Omitted-fix: 39e1674ff035 ("btrfs: allow idmapped SUBVOL_SETFLAGS ioctl")
Omitted-fix: 6623d9a0b0ce ("btrfs: allow idmapped INO_LOOKUP_USER ioctl")
Omitted-fix: 4a8b34afa9c9 ("btrfs: handle ACLs on idmapped mounts")
Omitted-fix: 5b9b26f5d0b8 ("btrfs: allow idmapped mount")
Omitted-fix: 8cc5c54de44c ("docs: update mapping documentation")
Omitted-fix: 02e407991350 ("fs: remove unused low-level mapping helpers")
Omitted-fix: ce70fd9a551a ("scsi: core: Remove the cmd field from struct scsi_request")
Omitted-fix: 5b794f98074a ("scsi: core: Remove the sense and sense_len fields from struct scsi_request")
Omitted-fix: a9a4ea1166d6 ("scsi: core: Move the resid_len field from struct scsi_request to struct scsi_cmnd")
Omitted-fix: dbb4c84d87af ("scsi: core: Move the result field from struct scsi_request to struct scsi_cmnd")
Omitted-fix: 6aded12b10e0 ("scsi: core: Remove struct scsi_request")
Omitted-fix: 264403033105 ("scsi: core: Remove <scsi/scsi_request.h>")
Omitted-fix: cd4b46cdb491 ("scsi: 53c700: Use scsi_cmd_to_rq() instead of scsi_cmnd.request")
Omitted-fix: 417c434aa1b4 ("docs/zh_CN: core-api: Update the translation of cachetlb.rst to 5.19-rc3")
Omitted-fix: 1ebfae49fd44 ("docs/zh_CN: core-api: Update the translation of cpu_hotplug.rst to 5.19-rc3")
Omitted-fix: 722ecdbce68a ("docs/zh_CN: core-api: Update the translation of irq/irq-domain.rst to 5.19-rc3")
Omitted-fix: b2fdf7f080b4 ("docs/zh_CN: core-api: Update the translation of kernel-api.rst to 5.19-rc3")
Omitted-fix: e86a0e297f0b ("docs/zh_CN: core-api: Update the translation of printk-format.rst to 5.19-rc3")
Omitted-fix: c290f175e73f ("docs/zh_CN: core-api: Update the translation of workqueue.rst to 5.19-rc3")
Omitted-fix: 4a6d00a43ef7 ("docs/zh_CN: core-api: Update the translation of xarray.rst to 5.19-rc3")
Omitted-fix: e8f60cd7db24 ("Merge tag 'perf-tools-fixes-for-v6.2-2-2023-01-11' of git://git.kernel.org/pub/scm/linux/ker…")
Omitted-fix: 3a761d72fa62 ("exportfs: support idmapped mounts")
Omitted-fix: 22f289ce1f8b ("ovl: use ovl_lookup_upper() wrapper")
Omitted-fix: 50db8d027355 ("ovl: handle idmappings for layer fileattrs")
Omitted-fix: c85bcc912f4f ("kselftests: memcg: update the oom group leaf events test")
Omitted-fix: be74553f250f ("kselftests: memcg: speed up the memory.high test")
Omitted-fix: 1bd1a4dd3e8c ("MAINTAINERS: add corresponding kselftests to cgroup entry")
Omitted-fix: cdc69458a5f3 ("cgroup: account for memory_recursiveprot in test_memcg_low()")
Omitted-fix: 72b1e03aa725 ("cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events()")
Omitted-fix: 830316807e02 ("cgroup: remove racy check in test_memcg_sock()")
Omitted-fix: c1a31a2f7a9c ("cgroup: fix racy check in alloc_pagecache_max_30M() helper function")
Omitted-fix: c01d4d0a82b7 ("random: quiet urandom warning ratelimit suppression message")
Omitted-fix: 21873bd66b6e ("Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
Omitted-fix: ff3b72a5d614 ("selftests: memcg: fix compilation")
Omitted-fix: 1d09069f5313 ("selftests: memcg: expect no low events in unprotected sibling")
Omitted-fix: 63fbdd3c77ec ("net: use DEBUG_NET_WARN_ON_ONCE() in __release_sock()")
Omitted-fix: 76458faeb285 ("net: use DEBUG_NET_WARN_ON_ONCE() in dev_loopback_xmit()")
Omitted-fix: 3e7f2b8d3088 ("net: use WARN_ON_ONCE() in inet_sock_destruct()")
Omitted-fix: 7890e2f09d43 ("net: use DEBUG_NET_WARN_ON_ONCE() in skb_release_head_state()")
Omitted-fix: ee2640df2393 ("net: add debug checks in napi_consume_skb and __napi_alloc_skb()")
Omitted-fix: 39e0f991a62e ("random: mark bootloader randomness code as __init")
Omitted-fix: 6342140db660 ("selftests/timens: add a test for vfork+exit")
Omitted-fix: cf21b355ccb3 ("af_unix: Optimise hash table layout.")
Omitted-fix: c12db92d62bf ("ovl: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: 73db6a063c78 ("ovl: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: 1e8a9191ccc2 ("f2fs: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: a03a972b26da ("fuse: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: 00d369bc2de5 ("fuse: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: 276a3f7cf1d9 ("ksmbd: port to vfs{g,u}id_t and associated helpers")
Omitted-fix: 45c311501c77 ("fs: use mount types in iattr")
Omitted-fix: 1f36146a5a3d ("fs: introduce tiny iattr ownership update helpers")
Omitted-fix: 35faf3109a78 ("fs: port to iattr ownership update helpers")
Omitted-fix: 71e7b535b890 ("quota: port quota helpers mount ids")
Omitted-fix: b27c82e12965 ("attr: port attribute changes to new types")
Omitted-fix: e95ab1d85289 ("selftests: net: af_unix: Test connect() with different netns.")
Omitted-fix: 169005eae2af ("docs/zh_CN: Update the translation of mm-api to 6.1-rc8")
Omitted-fix: 659797dc4d64 ("Docs/zh_CN: Update the translation of iio_configfs to 5.19-rc8")
Omitted-fix: 6a5057e9dc13 ("Docs/zh_CN: Update the translation of sparse to 5.19-rc8")
Omitted-fix: 63c1d2516b05 ("Docs/zh_CN: Update the translation of testing-overview to 5.19-rc8")
Omitted-fix: 83b41bb27b25 ("Docs/zh_CN: Update the translation of usage to 5.19-rc8")
Omitted-fix: c78478e164d4 ("Docs/zh_CN: Update the translation of pci-iov-howto to 5.19-rc8")
Omitted-fix: ce1120076c53 ("Docs/zh_CN: Update the translation of pci to 5.19-rc8")
Omitted-fix: 4116ff79749d ("Docs/zh_CN: Update the translation of sched-stats to 5.19-rc8")
Omitted-fix: 7f02464739da ("9p: convert to advancing variant of iov_iter_get_pages_alloc()")
Omitted-fix: 5b09c9fec086 ("do_proc_readlink(): constify path")
Omitted-fix: ea4af4aa03c3 ("nd_jump_link(): constify path")
Omitted-fix: 20f45ad50d65 ("spufs: constify path")
Omitted-fix: 88569546e8a1 ("ecryptfs: constify path")
Omitted-fix: 9204a97f7ae8 ("sched: Change wait_task_inactive()s match_state")
Omitted-fix: 04c6b79ae4f0 ("btrfs: convert __process_pages_contig() to use filemap_get_folios_contig()")
Omitted-fix: a75b81c3f63b ("btrfs: convert end_compressed_writeback() to use filemap_get_folios()")
Omitted-fix: 47d554199513 ("btrfs: convert process_page_range() to use filemap_get_folios_contig()")
Omitted-fix: 24a1efb4a912 ("nilfs2: convert nilfs_find_uncommited_extent() to use filemap_get_folios_contig()")
Omitted-fix: 7c18b64bba3b ("mips: ralink: mt7621: do not use kzalloc too early")
Omitted-fix: 7d37539037c2 ("fuse: implement ->tmpfile()")
Omitted-fix: f743f16c548b ("treewide: use get_random_{u8,u16}() when possible, part 2")
Omitted-fix: 6ab587e8e8b4 ("docs/zh_CN: Update the translation of delay-accounting to 6.1-rc8")
Omitted-fix: cf306a26cb3a ("docs/zh_CN: Update the translation of kernel-api to 6.1-rc8")
Omitted-fix: e07e9f22259e ("docs/zh_CN: Update the translation of testing-overview to 6.1-rc8")
Omitted-fix: ffdd9bd7a278 ("docs/zh_CN: Update the translation of reclaim to 6.1-rc8")
Omitted-fix: 9a833802a04d ("docs/zh_CN: Update the translation of start to 6.1-rc8")
Omitted-fix: 7cb52d4b3724 ("docs/zh_CN: Update the translation of usage to 6.1-rc8")
Omitted-fix: 03474d581df3 ("docs/zh_CN: Update the translation of msi-howto to 6.1-rc8")
Omitted-fix: 7df047be4363 ("docs/zh_CN: Update the translation of energy-model to 6.1-rc8")
Omitted-fix: e0068090095c ("docs/zh_CN: Update the translation of highmem to 6.1-rc8")
Omitted-fix: 0f3d70cb01da ("docs/zh_CN: Update the translation of ksm to 6.1-rc8")
Omitted-fix: 11018ef90ce7 ("s390/checksum: remove not needed uaccess.h include")
Omitted-fix: 2ea3498980f5 ("mm/damon/core: split out DAMOS-charged region skip logic into a new function")
Omitted-fix: e63a30c51f84 ("mm/damon/core: split damos application logic into a new function")
Omitted-fix: d1cbbf621fc2 ("mm/damon/core: split out scheme stat update logic into a new function")
Omitted-fix: 898810e5ca54 ("mm/damon/core: split out scheme quota adjustment logic into a new function")
Omitted-fix: 789a230613c8 ("mm/damon/sysfs: use damon_addr_range for region's start and end values")
Omitted-fix: 1f71981408ef ("mm/damon/sysfs: remove parameters of damon_sysfs_region_alloc()")
Omitted-fix: 39240595917e ("mm/damon/sysfs: move sysfs_lock to common module")
Omitted-fix: d332fe11debe ("mm/damon/sysfs: move unsigned long range directory to common module")
Omitted-fix: 4acd715ff57f ("mm/damon/sysfs: split out kdamond-independent schemes stats update logic into a new function")
Omitted-fix: c8e7b4d0ba34 ("mm/damon/sysfs: split out schemes directory implementation to separate file")
Omitted fix: dfe843dce775 ("s390/checksum: support GENERIC_CSUM, enable it for KASAN")
Omitted fix: e42ac7789df6 ("s390/checksum: always use cksm instruction")
Omitted fix: 1a167ddd3c56 ("x86: kmsan: pgtable: reduce vmalloc space")
Omitted fix: 7cf8f44a5a1c ("x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS")
Omitted fix: 1468c6f4558b ("mm: fs: initialize fsdata passed to write_begin/write_end interface")
Omitted fix: 0aa8ea3c5d35 ("mm/compaction: correct comment of fast_find_migrateblock in isolate_migratepages")
Omitted fix: 42855f588e18 ("x86/purgatory: disable KMSAN instrumentation")
Omitted fix: 11385b261200 ("x86/uaccess: instrument copy_from_user_nmi()")
Omitted fix: f70da5ee8fe1 ("mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios")
Omitted fix: 5a9e34747c9f ("mm/swap: convert deactivate_page() to folio_deactivate()")
Omitted fix: de1f5055523e ("mm/mempolicy: convert queue_pages_pmd() to queue_folios_pmd()")
Omitted fix: 3dae02bbd07f ("mm/mempolicy: convert queue_pages_pte_range() to queue_folios_pte_range()")
Omitted fix: 0a2c1e818316 ("mm/mempolicy: convert queue_pages_hugetlb() to queue_folios_hugetlb()")
Omitted fix: d451b89dcd18 ("mm/mempolicy: convert queue_pages_required() to queue_folio_required()")
Omitted fix: 4a64981dfee9 ("mm/mempolicy: convert migrate_page_add() to migrate_folio_add()")
Omitted fix: 46c475bd676b ("mm/pgtable: kmap_local_page() instead of kmap_atomic()")
Omitted fix: 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail")
Omitted fix: 65747aaf42b7 ("mm/filemap: allow pte_offset_map_lock() to fail")
Omitted fix: 45fe85e9811e ("mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()")
Omitted fix: 90f43b0a13cd ("mm/page_vma_mapped: reformat map_pte() with less indentation")
Omitted fix: 2798bbe75b9c ("mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()")
Omitted fix: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails")
Omitted fix: be872f83bf57 ("mm/pagewalk: walk_pte_range() allow for pte_offset_map()")
Omitted fix: e5ad581c7f1c ("mm/vmwgfx: simplify pmd & pud mapping dirty helpers")
Omitted fix: 0d1c81edc61e ("mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()")
Omitted fix: 6ec1905f6ec7 ("mm/hmm: retry if pte_offset_map() fails")
Omitted fix: 2b683a4ff6ee ("mm/userfaultfd: retry if pte_offset_map() fails")
Omitted fix: 3622d3cde308 ("mm/userfaultfd: allow pte_offset_map_lock() to fail")
Omitted fix: 9f2bad096d2f ("mm/debug_vm_pgtable,page_table_check: warn pte map fails")
Omitted fix: 04dee9e85cf5 ("mm/various: give up if pte_offset_map[_lock]() fails")
Omitted fix: 670ddd8cdcbd ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()")
Omitted fix: a5be621ee292 ("mm/mremap: retry if either pte_offset_map_*lock() fails")
Omitted fix: 179d3e4f3bfa ("mm/madvise: clean up force_shm_swapin_readahead()")
Omitted fix: d850fa729873 ("mm/swapoff: allow pte_offset_map[_lock]() to fail")
Omitted fix: 52fc048320ad ("mm/mglru: allow pte_offset_map_nolock() to fail")
Omitted fix: 4b56069c95d6 ("mm/migrate_device: allow pte_offset_map_lock() to fail")
Omitted fix: 2378118bd9da ("mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()")
Omitted fix: c9c1ee20ee84 ("mm/huge_memory: split huge pmd under one pte_offset_map()")
Omitted fix: 895f5ee464cc ("mm/khugepaged: allow pte_offset_map[_lock]() to fail")
Omitted fix: 3db82b9374ca ("mm/memory: allow pte_offset_map[_lock]() to fail")
Omitted fix: c7ad08804fae ("mm/memory: handle_pte_fault() use pte_offset_map_nolock()")
Omitted fix: 20b18aada185 ("madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check")

Coming Soon:
Omitted-fix: 6f0df8e16eb5 ("memcontrol: ensure memcg acquired by id is properly set up")
Omitted-fix: ee40d543e97d ("mm/pagewalk: fix bootstopping regression from extra pte_unmap()")
Omitted-fix: ab048302026d ("ovl: fix failed copyup of fileattr on a symlink")
Omitted-fix: 92fe9dcbe4e1 ("hugetlbfs: clear resv_map pointer if mmap fails")
Omitted-fix: bf4916922c60 ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
Omitted-fix: 2820b0f09be9 ("hugetlbfs: close race between MADV_DONTNEED and page fault")

Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=56452800
Tested: KT1+mm regression: https://beaker.engineering.redhat.com/jobs/8467307
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Alex Gladkov <agladkov@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Dean Nelson <dnelson@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-25 11:39:20 -04:00
Chris von Recklinghausen 1f619343f6 treewide: use get_random_u32() when possible
Conflicts:
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c - We already have
		ce28ab1380e8 ("drm/tests: Add back seed value information")
		so keep calls to kunit_info.
	drop changes to drivers/misc/habanalabs/gaudi2/gaudi2.c
		fs/ntfs3/fslog.c - files not in CS9
	net/sunrpc/auth_gss/gss_krb5_wrap.c - We already have
		7f675ca7757b ("SUNRPC: Improve Kerberos confounder generation")
		so code to change is gone.
	drivers/gpu/drm/i915/i915_gem_gtt.c
	drivers/gpu/drm/i915/selftests/i915_selftest.c
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c
		change added under
		4cb818386e ("Merge DRM changes from upstream v6.0.8..v6.1")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 17:43:22 2022 +0200

    treewide: use get_random_u32() when possible

    The prandom_u32() function has been a deprecated inline wrapper around
    get_random_u32() for several releases now, and compiles down to the
    exact same code. Replace the deprecated wrapper with a direct call to
    the real function. The same also applies to get_random_int(), which is
    just a wrapper around get_random_u32(). This was done as a basic find
    and replace.
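
A minimal userspace sketch of the replacement described above. The function names prandom_u32() and get_random_u32() come from the commit message; the rand()-based body and the pick_slot_*() callers are hypothetical stand-ins so the example compiles outside the kernel, and rand() is not a secure random source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the kernel's get_random_u32(). Assumption: rand() is used
 * only so this sketch runs in userspace; it is not cryptographically safe. */
static inline uint32_t get_random_u32(void)
{
	return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* Deprecated inline wrapper: it simply forwards to the real function. */
static inline uint32_t prandom_u32(void)
{
	return get_random_u32();
}

/* Before the patch: a (hypothetical) caller goes through the wrapper. */
static uint32_t pick_slot_old(uint32_t nslots)
{
	return prandom_u32() % nslots;
}

/* After the patch: the same caller uses the real function directly. */
static uint32_t pick_slot_new(uint32_t nslots)
{
	return get_random_u32() % nslots;
}
```

Because the wrapper only forwards the call, both callers behave identically for the same generator state, which is why a plain find and replace is sufficient.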

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
    Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
    Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Acked-by: Helge Deller <deller@gmx.de> # for parisc
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:03 -04:00
Guillaume Nault e1700f521f ipv4: Set offload_failed flag in fibmatch results
JIRA: https://issues.redhat.com/browse/RHEL-14295
Upstream Status: linux.git

commit 0add5c597f3253a9c6108a0a81d57f44ab0d9d30
Author: Benjamin Poirier <bpoirier@nvidia.com>
Date:   Tue Sep 26 14:27:30 2023 -0400

    ipv4: Set offload_failed flag in fibmatch results

    Due to a small omission, the offload_failed flag is missing from ipv4
    fibmatch results. Make sure it is set correctly.

    The issue can be witnessed using the following commands:
    echo "1 1" > /sys/bus/netdevsim/new_device
    ip link add dummy1 up type dummy
    ip route add 192.0.2.0/24 dev dummy1
    echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
    ip route add 198.51.100.0/24 dev dummy1
    ip route
    	# 192.0.2.0/24 has rt_trap
    	# 198.51.100.0/24 has rt_offload_failed
    ip route get 192.0.2.1 fibmatch
    	# Result has rt_trap
    ip route get 198.51.100.1 fibmatch
    	# Result differs from the route shown by `ip route`, it is missing
    	# rt_offload_failed
    ip link del dev dummy1
    echo 1 > /sys/bus/netdevsim/del_device

    Fixes: 36c5100e85 ("IPv4: Add "offload failed" indication to routes")
    Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-10-19 23:57:05 +02:00
Felix Maurer d45fbb04c1 ipv4: fix null-deref in ipv4_link_failure
JIRA: https://issues.redhat.com/browse/RHEL-5426
CVE: CVE-2023-42754

commit 0113d9c9d1ccc07f5a3710dac4aa24b6d711278c
Author: Kyle Zeng <zengyhkyle@gmail.com>
Date:   Thu Sep 14 22:12:57 2023 -0700

    ipv4: fix null-deref in ipv4_link_failure

    Currently, we assume the skb is associated with a device before calling
    __ip_options_compile, which is not always the case if it is re-routed by
    ipvs.
    When skb->dev is NULL, dev_net(skb->dev) dereferences a NULL pointer.
    This patch adds a check for this edge case and switches to the net_device
    from the rtable when skb->dev is NULL.
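
The fallback can be sketched with simplified userspace types. The structs below are hypothetical, cut-down models of the kernel ones (the real rtable reaches its device through dst.dev), and link_failure_net() is an illustrative helper name, not a kernel function.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified models of the kernel structures. */
struct net { int id; };
struct net_device { struct net *nd_net; };
struct rtable { struct net_device *dev; };
struct sk_buff {
	struct net_device *dev;	/* may be NULL after re-routing by ipvs */
	struct rtable *rt;	/* route attached to the packet */
};

static struct net *dev_net(const struct net_device *dev)
{
	return dev->nd_net;
}

/* Pick the network namespace without dereferencing a NULL skb->dev:
 * prefer the skb's own device, otherwise fall back to the route's device. */
static struct net *link_failure_net(const struct sk_buff *skb)
{
	const struct net_device *dev = skb->dev;

	if (!dev && skb->rt)
		dev = skb->rt->dev;

	return dev ? dev_net(dev) : NULL;
}
```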

    Fixes: ed0de45a10 ("ipv4: recompile ip options in ipv4_link_failure")
    Suggested-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com>
    Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
    Cc: Vadim Fedorenko <vfedorenko@novek.ru>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-19 17:18:41 +02:00
Íñigo Huguet 3a91b473a8 net: rename reference+tracking helpers
Bugzilla: https://bugzilla.redhat.com/2175258

Conflicts:
 - Removed chunks of unsupported protocol AX.25
 - Renamed the functions also in ipvlan. Commit 40b9d1ab63f5 ("ipvlan: hold lower
   dev to avoid possible use-after-free") was backported out of order so it had
   to use the old function names.

commit d62607c3fe45911b2331fac073355a8c914bbde2
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Jun 7 21:39:55 2022 -0700

    net: rename reference+tracking helpers

    Netdev reference helpers have a dev_ prefix for historic
    reasons. Renaming the old helpers would be too much churn
    but we can rename the tracking ones which are relatively
    recent and should be the default for new code.

    Rename:
     dev_hold_track()    -> netdev_hold()
     dev_put_track()     -> netdev_put()
     dev_replace_track() -> netdev_ref_replace()

    Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2023-03-23 16:19:21 +01:00
Xin Long cca0e4b8c1 ipv4: add (struct uncached_list)->quarantine list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2180612
Tested: compile only

commit 29e5375d7fcb5f88b438d74d537bbfd67ac75a64
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 10 13:42:31 2022 -0800

    ipv4: add (struct uncached_list)->quarantine list

    This is an optimization to keep the per-cpu lists as short as possible:

    Whenever rt_flush_dev() changes one rtable dst.dev
    matching the disappearing device, it can transfer the object
    to a quarantine list, waiting for a final rt_del_uncached_list().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-03-21 17:37:26 -04:00
Guillaume Nault 2655aa458c ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 8895a9c2ac76fb9d3922fed4fe092c8ec5e5cccc
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:41 2022 -0700

    ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.

    While reading sysctl_fib_multipath_hash_fields, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.
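
A minimal userspace sketch of the READ_ONCE() pattern, assuming GCC or Clang for __typeof__. The volatile-cast macros are a simplification of the kernel's definitions, and sysctl_hash_fields, hash_uses_field() and set_hash_fields() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros: the volatile access forces
 * exactly one load or store, so the compiler cannot re-read or tear it. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical sysctl-backed bitmask that a writer may change at any time. */
static uint32_t sysctl_hash_fields = 0x0007;

/* Reader: take one snapshot, then make every decision on that snapshot. */
static int hash_uses_field(uint32_t field_bit)
{
	uint32_t fields = READ_ONCE(sysctl_hash_fields);

	return (fields & field_bit) != 0;
}

/* Writer side, paired with the annotated reader. */
static void set_hash_fields(uint32_t v)
{
	WRITE_ONCE(sysctl_hash_fields, v);
}
```

Reading the value once into a local keeps a reader consistent even if the sysctl is rewritten between two uses of the field.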

    Fixes: ce5c9c20d3 ("ipv4: Add a sysctl to control multipath hash fields")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:13 +01:00
Guillaume Nault e21edcab5f ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 7998c12a08c97cc26660532c9f90a34bd7d8da5a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jul 18 10:26:40 2022 -0700

    ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.

    While reading sysctl_fib_multipath_hash_policy, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: bf4e0a3db9 ("net: ipv4: add support for ECMP hash policy choice")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:13 +01:00
Guillaume Nault 113e70bc57 ip: Fix data-races around sysctl_ip_fwd_use_pmtu.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git
Conflicts: (context) Missing upstream commit ac6627a28dbf ("net: ipv4:
           Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward"):
           Centos Stream returns immediately in the if condition.

commit 60c158dc7b1f0558f6cadd5b50d0386da0000d50
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 13:51:53 2022 -0700

    ip: Fix data-races around sysctl_ip_fwd_use_pmtu.

    While reading sysctl_ip_fwd_use_pmtu, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: f87c10a8aa ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:53 +01:00
Frantisek Hrbata a992b2c2a7 Merge: netfilter: nft_fib: Fix for rpath check with VRF devices
Merge conflicts:
-----------------
net/ipv4/netfilter/nft_fib_ipv4.c
        - nft_fib4_eval()
          HEAD(!1475) is missing upstream acc641ab95b6 ("netfilter: rpfilter/fib: Populate flowic_l3mdev field")
          Resolved in favor of !1548

net/ipv6/netfilter/nft_fib_ipv6.c
        - nft_fib6_flowi_init()
          HEAD(!1475) is missing upstream acc641ab95b6 ("netfilter: rpfilter/fib: Populate flowic_l3mdev field")
          Resolved in favor of !1548

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1548

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2129093

Signed-off-by: Phil Sutter <psutter@redhat.com>

Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-11 09:20:29 +01:00
Ivan Vecera 7feda53aba ipv4: Use dscp_t in struct fib_rt_info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140160

Conflicts:
- removed n/a hunk for unsupported prestera driver

commit 888ade8f90d7dbbdc8552ae9b23d311f9e61ab0e
Author: Guillaume Nault <gnault@redhat.com>
Date:   Fri Apr 8 22:08:37 2022 +0200

    ipv4: Use dscp_t in struct fib_rt_info

    Use the new dscp_t type to replace the tos field of struct fib_rt_info.
    This ensures ECN bits are ignored and makes it compatible with the
    fa_dscp field of struct fib_alias.

    This also allows sparse to flag potential incorrect uses of DSCP and
    ECN bits.
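
A userspace sketch of the idea behind dscp_t. In the kernel the type is a sparse-checked u8; here a one-member struct is an assumed stand-in so that an ordinary compiler rejects mixing raw TOS bytes with DSCP values. dscp_equal() is a hypothetical helper added for the example.

```c
#include <stdint.h>

/* Upper six bits of the DS field carry DSCP; the low two bits are ECN. */
#define INET_DSCP_MASK 0xfc

/* Assumption: struct wrapper instead of the kernel's sparse annotation. */
typedef struct { uint8_t v; } dscp_t;

/* Convert a raw DS field (TOS byte) to dscp_t, discarding the ECN bits. */
static inline dscp_t inet_dsfield_to_dscp(uint8_t dsfield)
{
	dscp_t d = { (uint8_t)(dsfield & INET_DSCP_MASK) };

	return d;
}

/* Convert back to a raw DS field; the ECN bits are always zero here. */
static inline uint8_t inet_dscp_to_dsfield(dscp_t dscp)
{
	return dscp.v;
}

/* Hypothetical helper: compare two DSCP values. */
static inline int dscp_equal(dscp_t a, dscp_t b)
{
	return a.v == b.v;
}
```

Two TOS bytes that differ only in their ECN bits therefore compare equal once converted, which is the behaviour the commit relies on.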

    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-04 18:01:45 +01:00
Ivan Vecera 0954ecd449 ipv4: Use dscp_t in struct fib_alias
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140160

commit 32ccf1107980e8ed5c62cf6666da7a47a4fc7ecf
Author: Guillaume Nault <gnault@redhat.com>
Date:   Fri Feb 4 14:58:19 2022 +0100

    ipv4: Use dscp_t in struct fib_alias

    Use the new dscp_t type to replace the fa_tos field of fib_alias. This
    ensures ECN bits are ignored and makes the field compatible with the
    fc_dscp field of struct fib_config.

    Converting old *tos variables and fields to dscp_t allows sparse to
    flag incorrect uses of DSCP and ECN bits. This patch is entirely about
    type annotation and shouldn't change any existing behaviour.

    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Acked-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-04 18:01:15 +01:00
Phil Sutter ace54a48e5 net: Add l3mdev index to flow struct and avoid oif reset for port devices
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2129093
Upstream Status: commit 40867d74c374b

commit 40867d74c374b235e14d839f3a77f26684feefe5
Author: David Ahern <dsahern@kernel.org>
Date:   Mon Mar 14 14:45:51 2022 -0600

    net: Add l3mdev index to flow struct and avoid oif reset for port devices

    The fundamental premise of VRF and l3mdev core code is binding a socket
    to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
    Legacy code resets flowi_oif to the l3mdev losing any original port
    device binding. Ben (among others) has demonstrated use cases where the
    original port device binding is important and needs to be retained.
    This patch handles that by adding a new entry to the common flow struct
    that can indicate the l3mdev index for later rule and table matching
    avoiding the need to reset flowi_oif.

    In addition to allowing more use cases that require port device binds,
    this patch brings a few datapath simplifications:

    1. l3mdev_fib_rule_match is only called when walking fib rules and
       always after l3mdev_update_flow. That allows an optimization to bail
       early for non-VRF type use cases when flowi_l3mdev is not set. Also,
       only that index needs to be checked for the FIB table id.

    2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
       (e.g., VRF) device. By resetting flowi_oif only for this case the
       FLOWI_FLAG_SKIP_NH_OIF flag is no longer needed and can be removed,
       removing several checks in the datapath. The flowi_iif path can be
       simplified to only be called if it is not loopback (loopback can
       not be assigned to an L3 domain) and the l3mdev index is not already
       set.

    3. Avoid another device lookup in the output path when the fib lookup
       returns a reject failure.

    Note: 2 functional tests for local traffic with reject fib rules are
    updated to reflect the new direct failure at FIB lookup time for ping
    rather than the failure on packet path. The current code fails like this:

        HINT: Fails since address on vrf device is out of device scope
        COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
        ping: Warning: source address might be selected on device other than: eth1
        PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.

        --- 172.16.3.1 ping statistics ---
        1 packets transmitted, 0 received, 100% packet loss, time 0ms

    where the test now directly fails:

        HINT: Fails since address on vrf device is out of device scope
        COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
        ping: connect: No route to host

    Signed-off-by: David Ahern <dsahern@kernel.org>
    Tested-by: Ben Greear <greearb@candelatech.com>
    Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Phil Sutter <psutter@redhat.com>
2022-10-28 22:35:32 +02:00
Antoine Tenart 1a9648b0cd net: ipv4: add skb drop reasons to ip_error()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit c4eb664191b4a5ff6856478f903924176697719e
Author: Menglong Dong <imagedong@tencent.com>
Date:   Wed Apr 13 16:15:53 2022 +0800

    net: ipv4: add skb drop reasons to ip_error()

    Eventually, I found the handler function for input route lookup
    failures: ip_error().

    The drop reasons used in ip_error() correspond closely to
    IPSTATS_MIB_*, and the following new reasons are introduced:

    SKB_DROP_REASON_IP_INADDRERRORS
    SKB_DROP_REASON_IP_INNOROUTES

    Would the names SKB_DROP_REASON_IP_HOSTUNREACH and
    SKB_DROP_REASON_IP_NETUNREACH be more accurate? To keep them
    corresponding to IPSTATS_MIB_*, the names are kept as they are.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Jiang Biao <benbjiang@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Florian Westphal 37df13d4bf net: align static siphash keys
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111270
Upstream Status: commit 49ecc2e9c3ab

commit 49ecc2e9c3abd269951972fa8b23a4d081111b80
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:23:03 2021 -0800

    net: align static siphash keys

    siphash keys use 16 bytes.

    Define siphash_aligned_key_t macro so that we can make sure they
    are not crossing a cache line boundary.
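
A userspace sketch of the alignment argument, assuming GCC or Clang attribute syntax and a 64-byte cache line. example_key and crosses_cache_line() are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* A siphash key is two 64-bit words, i.e. 16 bytes. */
typedef struct { uint64_t key[2]; } siphash_key_t;

/* Macro form, as described in the commit message: the key type plus a
 * 16-byte alignment attribute applied to the declared variable. */
#define siphash_aligned_key_t siphash_key_t __attribute__((aligned(16)))

/* Hypothetical static key declared with the aligned macro. */
static siphash_aligned_key_t example_key;

/* Returns nonzero if [p, p + len) spans two cache lines of size 'line'. */
static int crosses_cache_line(const void *p, size_t len, size_t line)
{
	uintptr_t start = (uintptr_t)p;

	return (start / line) != ((start + len - 1) / line);
}
```

A 16-byte object that starts on a 16-byte boundary cannot straddle a cache line whose size is a multiple of 16, so a lookup touches at most one line for the key.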

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2022-07-27 00:34:24 +02:00
Guillaume Nault 05a1121352 ipv4: drop dst in multicast routing path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2104124
Upstream Status: linux.git

commit 9e6c6d17d1d6a3f1515ce399f9a011629ec79aa0
Author: Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz>
Date:   Thu May 5 14:00:17 2022 +1200

    ipv4: drop dst in multicast routing path

    kmemleak reports the following when routing multicast traffic over an
    ipsec tunnel.

    Kmemleak output:
    unreferenced object 0x8000000044bebb00 (size 256):
      comm "softirq", pid 0, jiffies 4294985356 (age 126.810s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 80 00 00 00 05 13 74 80  ..............t.
        80 00 00 00 04 9b bf f9 00 00 00 00 00 00 00 00  ................
      backtrace:
        [<00000000f83947e0>] __kmalloc+0x1e8/0x300
        [<00000000b7ed8dca>] metadata_dst_alloc+0x24/0x58
        [<0000000081d32c20>] __ipgre_rcv+0x100/0x2b8
        [<00000000824f6cf1>] gre_rcv+0x178/0x540
        [<00000000ccd4e162>] gre_rcv+0x7c/0xd8
        [<00000000c024b148>] ip_protocol_deliver_rcu+0x124/0x350
        [<000000006a483377>] ip_local_deliver_finish+0x54/0x68
        [<00000000d9271b3a>] ip_local_deliver+0x128/0x168
        [<00000000bd4968ae>] xfrm_trans_reinject+0xb8/0xf8
        [<0000000071672a19>] tasklet_action_common.isra.16+0xc4/0x1b0
        [<0000000062e9c336>] __do_softirq+0x1fc/0x3e0
        [<00000000013d7914>] irq_exit+0xc4/0xe0
        [<00000000a4d73e90>] plat_irq_dispatch+0x7c/0x108
        [<000000000751eb8e>] handle_int+0x16c/0x178
        [<000000001668023b>] _raw_spin_unlock_irqrestore+0x1c/0x28

    The metadata dst is leaked when ip_route_input_mc() updates the dst for
    the skb. Commit f38a9eb1f7 ("dst: Metadata destinations") correctly
    handled dropping the dst in ip_route_input_slow() but missed the
    multicast case which is handled by ip_route_input_mc(). Drop the dst in
    ip_route_input_mc() avoiding the leak.

    Fixes: f38a9eb1f7 ("dst: Metadata destinations")
    Signed-off-by: Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz>
    Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20220505020017.3111846-1-chris.packham@alliedtelesis.co.nz
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-07-05 17:22:58 +02:00
Ivan Vecera e7babfdab7 net: dst: add net device refcount tracking to dst_entry
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit 9038c320001dd07f60736018edf608ac5baca0ab
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Dec 4 20:22:03 2021 -0800

    net: dst: add net device refcount tracking to dst_entry

    We want to track all dev_hold()/dev_put() to ease leak hunting.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:36:49 +02:00
Guillaume Nault 6df45ea29a ipv4: Fix route lookups when handling ICMP redirects and PMTU updates
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081383
Upstream Status: linux.git

commit 544b4dd568e3b09c1ab38a759d3187e7abda11a0
Author: Guillaume Nault <gnault@redhat.com>
Date:   Thu Mar 17 13:45:09 2022 +0100

    ipv4: Fix route lookups when handling ICMP redirects and PMTU updates

    The PMTU update and ICMP redirect helper functions initialise their fl4
    variable with either __build_flow_key() or build_sk_flow_key(). These
    initialisation functions always set ->flowi4_scope with
    RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
    not a problem when the route lookup is later done via
    ip_route_output_key_hash(), which properly clears the ECN bits from
    ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
    flag. However, some helpers call fib_lookup() directly, without
    sanitising the tos and scope fields, so the route lookup can fail and,
    as a result, the ICMP redirect or PMTU update aren't taken into
    account.

    Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
    code into ip_rt_fix_tos(), then use this function in handlers that call
    fib_lookup() directly.

    Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
    place (like __build_flow_key() or flowi4_init_output()), because
    ip_route_output_key_hash() expects non-sanitised values. When called
    with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
    RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
    sanitise the values only for those paths that don't call
    ip_route_output_key_hash().

    Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
    ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
    RT_SCOPE_LINK probably wasn't really a problem: sockets with the
    SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
    normally shouldn't receive ICMP redirects or PMTU updates.

    Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-05-03 17:05:27 +02:00
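
The sanitisation described in the "Fix route lookups when handling ICMP redirects and PMTU updates" entry above can be sketched in user space as follows. This is only an illustration under assumptions: `struct demo_flowi4` and `demo_fix_tos()` are hypothetical stand-ins for the kernel's flow key and helper, and the `DEMO_*` constants mirror the values the kernel headers are believed to use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IPTOS_RT_MASK     0x1C /* TOS bits used for routing, ECN bits excluded */
#define DEMO_RTO_ONLINK        0x01 /* flag carried in the low bit of the tos field */
#define DEMO_RT_SCOPE_UNIVERSE 0
#define DEMO_RT_SCOPE_LINK     253

/* Hypothetical, reduced flow key: only the two fields discussed above. */
struct demo_flowi4 {
	uint8_t tos;
	uint8_t scope;
};

/* Clear the ECN bits from tos and derive the scope from the ONLINK flag. */
static void demo_fix_tos(struct demo_flowi4 *fl4)
{
	uint8_t tos;

	assert(fl4 != NULL);
	tos = fl4->tos & (DEMO_IPTOS_RT_MASK | DEMO_RTO_ONLINK);
	fl4->tos = tos & DEMO_IPTOS_RT_MASK;
	fl4->scope = (tos & DEMO_RTO_ONLINK) ? DEMO_RT_SCOPE_LINK
					     : DEMO_RT_SCOPE_UNIVERSE;
}
```

Calling `demo_fix_tos()` a second time on the same key finds the ONLINK bit already cleared and resets the scope to universe, which is the hazard that Note 1 of the commit message describes for already-sanitised values.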
Guillaume Nault b0f81b7517 ipv4: fix data races in fib_alias_hw_flags_set
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081383
Upstream Status: linux.git

commit 9fcf986cc4bc6a3a39f23fbcbbc3a9e52d3c24fd
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Feb 16 09:32:16 2022 -0800

    ipv4: fix data races in fib_alias_hw_flags_set

    fib_alias_hw_flags_set() can be used by concurrent threads,
    and is only RCU protected.

    We need to annotate accesses to following fields of struct fib_alias:

        offload, trap, offload_failed

    Because of READ_ONCE()/WRITE_ONCE() limitations, make these
    fields u8.

    BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set

    read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
     fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
     nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
     nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
     nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
     nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
     nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
     nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
     process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
     process_scheduled_works kernel/workqueue.c:2370 [inline]
     worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
     kthread+0x1bf/0x1e0 kernel/kthread.c:377
     ret_from_fork+0x1f/0x30

    write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
     fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
     nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
     nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
     nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
     nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
     nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
     nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
     process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
     process_scheduled_works kernel/workqueue.c:2370 [inline]
     worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
     kthread+0x1bf/0x1e0 kernel/kthread.c:377
     ret_from_fork+0x1f/0x30

    value changed: 0x00 -> 0x02

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e82623-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Workqueue: events nsim_fib_event_work

    Fixes: 90b93f1b31 ("ipv4: Add "offload" and "trap" indications to routes")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-05-03 17:05:24 +02:00
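
The annotation pattern from the "fix data races in fib_alias_hw_flags_set" entry above might look roughly like this in user space. The `DEMO_READ_ONCE()`/`DEMO_WRITE_ONCE()` macros are simplified stand-ins (the kernel macros do more), `struct demo_fib_alias` and `demo_hw_flags_set()` are hypothetical, and `__typeof__` is a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: one volatile access through the object's address. */
#define DEMO_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/*
 * The flags are full bytes rather than bitfields: the macros take the
 * address of their argument, and a bitfield has no address.
 */
struct demo_fib_alias {
	uint8_t offload;
	uint8_t trap;
	uint8_t offload_failed;
};

/* Update the flags only when something changed; returns 1 on change. */
static int demo_hw_flags_set(struct demo_fib_alias *fa, uint8_t offload,
			     uint8_t trap, uint8_t offload_failed)
{
	assert(fa != NULL);

	if (DEMO_READ_ONCE(fa->offload) == offload &&
	    DEMO_READ_ONCE(fa->trap) == trap &&
	    DEMO_READ_ONCE(fa->offload_failed) == offload_failed)
		return 0;

	DEMO_WRITE_ONCE(fa->offload, offload);
	DEMO_WRITE_ONCE(fa->trap, trap);
	DEMO_WRITE_ONCE(fa->offload_failed, offload_failed);
	return 1;
}
```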
Herton R. Krzesinski adc4082e23 Merge: CNB: net: Remove redundant if statements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/328

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Series moving dev NULL check into dev_put()/dev_hold()

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-26 22:11:25 +00:00
Petr Oros ea6b084bc4 net: Remove redundant if statements
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Upstream commit(s):
commit 1160dfa178eb848327e9dec39960a735f4dc1685
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Thu Aug 5 19:55:27 2021 +0800

    net: Remove redundant if statements

    The 'if (dev)' check has already been moved into dev_{put,hold}(), so
    remove the redundant if statements.

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-01-10 16:20:08 +01:00
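
The cleanup in the "net: Remove redundant if statements" entry above follows a simple pattern, sketched here with hypothetical user-space stand-ins (`struct demo_dev`, `demo_dev_hold()`, `demo_dev_put()`, `demo_release()`); the real helpers operate on `struct net_device` and do more than adjust a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with a plain reference counter. */
struct demo_dev {
	int refcnt;
};

/* The NULL check lives inside the helpers ... */
static void demo_dev_hold(struct demo_dev *dev)
{
	if (dev)
		dev->refcnt++;
}

static void demo_dev_put(struct demo_dev *dev)
{
	if (dev)
		dev->refcnt--;
}

/* ... so callers no longer need "if (dev)" around each call. */
static void demo_release(struct demo_dev *dev)
{
	demo_dev_put(dev);
}
```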
Antoine Tenart 3e8afedacf ipv4: make exception cache less predictible
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2015112
Upstream Status: linux.git
Tested: LNST
CVE: CVE-2021-20322

commit 67d6d681e15b578c1725bad8ad079e05d1c48a8e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Aug 29 15:16:15 2021 -0700

    ipv4: make exception cache less predictible

    Even after commit 6457378fe7 ("ipv4: use siphash instead of Jenkins in
    fnhe_hashfun()"), an attacker can still use brute force to learn
    some secrets from a victim linux host.

    One way to defeat these attacks is to make the max depth of the hash
    table bucket a random value.

    Before this patch, each bucket of the hash table used to store exceptions
    could contain 6 items under attack.

    After the patch, each bucket will contain a random number of items,
    between 6 and 10. The attacker can no longer infer secrets.

    This slightly increases the memory used by the hash table,
    by 50% on average; we do not expect this to be a problem.

    This patch is more complex than the prior one (IPv6 equivalent),
    because IPv4 was reusing the oldest entry.
    Since we need to be able to evict more than one entry per
    update_or_create_fnhe() call, I had to replace
    fnhe_oldest() with fnhe_remove_oldest().

    Also note that we will queue extra kfree_rcu() calls under stress,
    which hopefully won't be too big an issue.

    Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Keyu Man <kman001@ucr.edu>
    Cc: Willy Tarreau <w@1wt.eu>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Tested-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-06 12:06:15 +01:00
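
The randomised bucket depth described in the "make exception cache less predictible" entry above can be sketched as a small user-space list. Everything here is an assumption made for illustration: the base depth of 5, the names `struct demo_fnhe`, `demo_remove_oldest()` and `demo_insert()`, and the caller-supplied random value `rnd` (the kernel draws its own randomness and frees entries via RCU).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BASE_DEPTH 5 /* assumed base limit, for illustration only */

/* Hypothetical exception entry: a timestamp and a link to the next entry. */
struct demo_fnhe {
	unsigned long stamp;
	struct demo_fnhe *next;
};

/* Unlink and free the entry with the smallest stamp, if any. */
static void demo_remove_oldest(struct demo_fnhe **head)
{
	struct demo_fnhe **oldest = head, **pp, *victim;

	if (!*head)
		return;
	for (pp = head; *pp; pp = &(*pp)->next)
		if ((*pp)->stamp < (*oldest)->stamp)
			oldest = pp;
	victim = *oldest;
	*oldest = victim->next;
	free(victim);
}

/*
 * Insert a new entry at the head. Before inserting, evict oldest entries
 * until the chain is no deeper than a randomised limit in [5, 9].
 * Returns the new chain length, or -1 on allocation failure.
 */
static int demo_insert(struct demo_fnhe **head, unsigned long stamp,
		       unsigned int rnd)
{
	int max_depth = DEMO_BASE_DEPTH + (int)(rnd % DEMO_BASE_DEPTH);
	struct demo_fnhe *e, *n;
	int depth = 0;

	assert(head != NULL);
	for (e = *head; e; e = e->next)
		depth++;
	while (depth > max_depth) {
		demo_remove_oldest(head);
		depth--;
	}
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	n->stamp = stamp;
	n->next = *head;
	*head = n;
	return depth + 1;
}
```

With a limit drawn from 5 to 9 before each insertion, a bucket under pressure ends up holding between 6 and 10 entries, and more than one old entry may be evicted in a single call.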
Guillaume Nault a47a054130 ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024572
Upstream Status: linux.git

commit 92548b0ee220e000d81c27ac9a80e0ede895a881
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Aug 30 19:02:10 2021 -0700

    ipv4: fix endianness issue in inet_rtm_getroute_build_skb()

    The UDP length field should be in network order.
    This removes the following sparse error:

    net/ipv4/route.c:3173:27: warning: incorrect type in assignment (different base types)
    net/ipv4/route.c:3173:27:    expected restricted __be16 [usertype] len
    net/ipv4/route.c:3173:27:    got unsigned long

    Fixes: 404eb77ea7 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Roopa Prabhu <roopa@nvidia.com>
    Cc: David Ahern <dsahern@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2021-11-18 14:50:37 +01:00
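
The byte-order point in the "fix endianness issue in inet_rtm_getroute_build_skb()" entry above can be shown with a user-space sketch. `struct demo_udphdr` and `demo_build_udp()` are hypothetical; the ports are assumed to be passed in already in network order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte UDP header; all fields are stored in network order. */
struct demo_udphdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

/* Fill in a header-only UDP datagram (no payload). */
static void demo_build_udp(struct demo_udphdr *udph, uint16_t sport_be,
			   uint16_t dport_be)
{
	assert(udph != NULL);
	memset(udph, 0, sizeof(*udph));
	udph->source = sport_be;
	udph->dest = dport_be;
	/* The length goes on the wire big-endian, hence htons(). */
	udph->len = htons(sizeof(*udph));
}
```

Assigning `sizeof(*udph)` directly would store the length in host order, which is wrong on little-endian machines and is the kind of type mismatch sparse flags.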
Eric Dumazet 6457378fe7 ipv4: use siphash instead of Jenkins in fnhe_hashfun()
A group of security researchers brought to our attention
the weakness of hash function used in fnhe_hashfun().

Let's use siphash instead of Jenkins Hash, to considerably
reduce security risks.

Also remove the inline keyword, this really is distracting.

Fixes: d546c62154 ("ipv4: harden fnhe_hashfun()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-26 10:20:34 +01:00
Jakub Kicinski b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Vadim Fedorenko fade56410c net: lwtunnel: handle MTU calculation in forwading
Commit 14972cbd34 ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carrying the encap headroom and
using it in the output MTU calculation. But the forwarding part was not
covered, which created a difference in MTU between output and forwarding
and led to silent drops on the IPv4 forwarding path. Fix it by taking
the lwtunnel encap headroom into account.

The same commit also introduced a difference in how RTAX_MTU is treated
in IPv4 and IPv6: the latter explicitly removes the lwtunnel encap
headroom from the route MTU. Make the IPv4 version do the same.

Fixes: 14972cbd34 ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 12:42:14 -07:00
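
The MTU adjustment described in the "handle MTU calculation in forwading" entry above amounts to subtracting the encap headroom from the route MTU. The sketch below is an assumption-laden illustration: `struct demo_lwt`, `demo_lwt_headroom()` and `demo_forward_mtu()` are hypothetical names, and the rule of ignoring a headroom that is not smaller than the MTU is a guess at a sensible guard rather than a statement about the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lightweight-tunnel state attached to a route. */
struct demo_lwt {
	int xmit_redirect;     /* non-zero if the tunnel encapsulates on xmit */
	unsigned int headroom; /* bytes the encapsulation will add */
};

/* Headroom to reserve, or 0 if there is no usable tunnel state. */
static unsigned int demo_lwt_headroom(const struct demo_lwt *lwt,
				      unsigned int mtu)
{
	if (lwt && lwt->xmit_redirect && lwt->headroom < mtu)
		return lwt->headroom;
	return 0;
}

/* MTU seen by the forwarding path once encap headroom is accounted for. */
static unsigned int demo_forward_mtu(unsigned int route_mtu,
				     const struct demo_lwt *lwt)
{
	return route_mtu - demo_lwt_headroom(lwt, route_mtu);
}
```

Using the same helper on both the output and the forwarding path keeps the two MTU values consistent, which is what avoids the silent drops.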
Jakub Kicinski adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
David Ahern b87b04f501 ipv4: Fix device used for dst_alloc with local routes
Oliver reported a use case where deleting a VRF device can hang
waiting for the refcnt to drop to 0. The root cause is that the dst
is allocated against the VRF device but cached on the loopback
device.

The use case (added to the selftests) has an implicit VRF crossing
due to the ordering of the FIB rules (lookup local is before the
l3mdev rule, but the problem occurs even if the FIB rules are
re-ordered with local after l3mdev because the VRF table does not
have a default route to terminate the lookup). The end result is
that the FIB lookup returns the loopback device as the nexthop,
but the ingress device is in a VRF. The mismatch causes the dst to be
allocated against the VRF device but then cached on the loopback.

The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu):
pick the dst alloc device based on the fib lookup result but with checks
that the result has a nexthop device (e.g., not an unreachable or
prohibit entry).

Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14 12:30:53 -07:00
Ido Schimmel 4253b4986f ipv4: Add custom multipath hash policy
Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.

The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.

In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In that case,
only outer fields are considered.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18 13:27:32 -07:00
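
The "skip unnecessary dissection" logic from the "Add custom multipath hash policy" entry above can be sketched with a field bitmask. The bit assignments, mask values and names (`DEMO_FIELD_*`, `struct demo_plan`, `demo_plan_dissection()`) are illustrative assumptions and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative field bits: six outer fields, then six inner fields. */
#define DEMO_FIELD_SRC_IP         (1u << 0)
#define DEMO_FIELD_DST_IP         (1u << 1)
#define DEMO_FIELD_SRC_PORT       (1u << 4)
#define DEMO_FIELD_INNER_SRC_IP   (1u << 6)
#define DEMO_FIELD_INNER_DST_IP   (1u << 7)
#define DEMO_OUTER_MASK           0x003fu
#define DEMO_INNER_MASK           0x0fc0u

/* Which flow-dissector passes are needed for a given configuration. */
struct demo_plan {
	int dissect_outer;
	int dissect_inner;
};

static struct demo_plan demo_plan_dissection(uint32_t hash_fields,
					     int have_skb)
{
	struct demo_plan plan;

	/* Dissect the outer flow only if some outer field is selected. */
	plan.dissect_outer = (hash_fields & DEMO_OUTER_MASK) != 0;
	/* Without an skb only the flow key exists, so inner fields are ignored. */
	plan.dissect_inner = have_skb &&
			     (hash_fields & DEMO_INNER_MASK) != 0;
	return plan;
}
```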
Ido Schimmel 2e68ea9268 ipv4: Calculate multipath hash inside switch statement
A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.

Prepare for this change by moving the multipath hash calculation inside
the switch statement.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18 13:27:32 -07:00
David S. Miller efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Eric Dumazet aa6dd211e4 inet: use bigger hash table for IP ID generation
In commit 73f156a6e8 ("inetpeer: get rid of ip_id_count")
I used a very small hash table that could be abused
by patient attackers to reveal sensitive information.

Switch to a dynamic sizing, depending on RAM size.

Typical big hosts will now use 128x more storage (2 MB)
to get a similar increase in security and reduction
of hash collisions.

As a bonus, use of alloc_large_system_hash() spreads
allocated memory among all NUMA nodes.

Fixes: 73f156a6e8 ("inetpeer: get rid of ip_id_count")
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 16:45:11 -07:00
Yejune Deng f105f26e45 net: ipv4: route.c: simplify procfs code
Use proc_create_seq(), which directly takes a struct seq_operations,
and deal with network namespaces in ->open.

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-16 14:43:49 -07:00