247 Commits

Author SHA1 Message Date
Antoine Tenart dff30fe84c net: raw: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit ce9a2424e9da2cd4e790f2498621bc2aa5e5d298
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:16 2024 -0700

    net: raw: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart c4af9aca79 ipv4: Fix uninit-value access in __ip_make_skb()
JIRA: https://issues.redhat.com/browse/RHEL-39786
Upstream Status: linux.git
CVE: CVE-2024-36927
Conflicts:
- Removed code differs due to missing upstream commit cafbe182a467
  ("inet: move inet->hdrincl to inet->inet_flags") in c9s.

commit fc1092f51567277509563800a3c56732070b6aa4
Author: Shigeru Yoshida <syoshida@redhat.com>
Date:   Tue Apr 30 21:39:45 2024 +0900

    ipv4: Fix uninit-value access in __ip_make_skb()

    KMSAN reported uninit-value access in __ip_make_skb() [1].  __ip_make_skb()
    tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
    race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
    while __ip_make_skb() is running, the function will access icmphdr in the
    skb even if it is not included. This causes the issue reported by KMSAN.

    Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
    on the socket.

    Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
    are union members in struct flowi4 and are implicitly initialized by
    flowi4_init_output(), but we should not rely on a specific union layout.

    Initialize these explicitly in raw_sendmsg().

    [1]
    BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
     __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
     ip_finish_skb include/net/ip.h:243 [inline]
     ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
     raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
     inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
     sock_sendmsg_nosec net/socket.c:730 [inline]
     __sock_sendmsg+0x274/0x3c0 net/socket.c:745
     __sys_sendto+0x62c/0x7b0 net/socket.c:2191
     __do_sys_sendto net/socket.c:2203 [inline]
     __se_sys_sendto net/socket.c:2199 [inline]
     __x64_sys_sendto+0x130/0x200 net/socket.c:2199
     do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x6d/0x75

    Uninit was created at:
     slab_post_alloc_hook mm/slub.c:3804 [inline]
     slab_alloc_node mm/slub.c:3845 [inline]
     kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
     kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
     __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
     alloc_skb include/linux/skbuff.h:1318 [inline]
     __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
     ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
     raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
     inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
     sock_sendmsg_nosec net/socket.c:730 [inline]
     __sock_sendmsg+0x274/0x3c0 net/socket.c:745
     __sys_sendto+0x62c/0x7b0 net/socket.c:2191
     __do_sys_sendto net/socket.c:2203 [inline]
     __se_sys_sendto net/socket.c:2199 [inline]
     __x64_sys_sendto+0x130/0x200 net/socket.c:2199
     do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x6d/0x75

    CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014

    Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
    Reported-by: syzkaller <syzkaller@googlegroups.com>
    Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
    Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-06-14 15:11:39 +02:00
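
The race described in the __ip_make_skb() commit above can be illustrated with a small userspace model. This is only a sketch: every `demo_` name is hypothetical, and it assumes the flow flag is derived from the socket option once, when the flow is set up, so that later changes to the socket cannot alter the decision for an in-flight send.

```c
#include <stdbool.h>

#define DEMO_FLAG_KNOWN_NH 0x1u   /* hypothetical stand-in for FLOWI_FLAG_KNOWN_NH */

struct demo_sock { bool hdrincl; };        /* may change at any time via setsockopt() */
struct demo_flow { unsigned int flags; };  /* private to one send operation */

/* Snapshot the socket option once, at the start of the send path. */
void demo_flow_init(struct demo_flow *fl, const struct demo_sock *sk)
{
	fl->flags = sk->hdrincl ? DEMO_FLAG_KNOWN_NH : 0;
}

/* Racy variant: re-reads socket state that may have changed meanwhile. */
bool demo_use_skb_header_racy(const struct demo_sock *sk)
{
	return sk->hdrincl;
}

/* Stable variant: relies only on the per-send snapshot. */
bool demo_use_skb_header_stable(const struct demo_flow *fl)
{
	return (fl->flags & DEMO_FLAG_KNOWN_NH) != 0;
}
```

If the option is flipped after `demo_flow_init()` has run, the racy check changes its answer mid-send while the stable check does not.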
Hangbin Liu f6d507a115 ipv{4,6}/raw: fix output xfrm lookup wrt protocol
JIRA: https://issues.redhat.com/browse/RHEL-31050
Upstream Status: net.git commit 3632679d9e4f

Conflicts: context conflicts due to missing commit
91d0b78c5177 ("inet: Add IP_LOCAL_PORT_RANGE socket option")

commit 3632679d9e4f879f49949bb5b050e0de553e4739
Author: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date:   Mon May 22 14:08:20 2023 +0200

    ipv{4,6}/raw: fix output xfrm lookup wrt protocol

    With a raw socket bound to IPPROTO_RAW (i.e. with hdrincl enabled), the
    protocol field of the flow structure, built by raw_sendmsg() /
    rawv6_sendmsg(), is set to IPPROTO_RAW. This breaks the IPsec policy
    lookup when some policies are defined with a protocol in the selector.

    For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
    specify the protocol. Just accept all values for an IPPROTO_RAW socket.

    For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
    without breaking backward compatibility (the value of this field was never
    checked). Let's add a new kind of control message, so that userland
    can specify which protocol is used.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    CC: stable@vger.kernel.org
    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2024-04-02 17:49:05 +08:00
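
For the IPv4 side of the xfrm lookup fix above, a userspace program would attach a control message to sendmsg(). The sketch below only builds such a message. It assumes the new control message is `IP_PROTOCOL` at level `IPPROTO_IP` carrying an `int`, and that its value is 52 when the libc headers do not define it; the commit text above does not name the option, so treat both as assumptions. `demo_set_protocol_cmsg()` is a hypothetical helper.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef IP_PROTOCOL
#define IP_PROTOCOL 52   /* assumed value; check your kernel's linux/in.h */
#endif

/*
 * Fill a caller-provided, suitably aligned buffer with one control message
 * that carries the protocol number. Returns 0 on success, -1 if the buffer
 * is too small.
 */
int demo_set_protocol_cmsg(struct msghdr *msg, void *buf, size_t buflen,
			   int protocol)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(int)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(int));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_PROTOCOL;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &protocol, sizeof(protocol));
	return 0;
}
```

The resulting `struct msghdr` would then be passed to sendmsg() on the IPPROTO_RAW socket, so that the policy lookup sees the real protocol instead of IPPROTO_RAW.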
Jeff Moyer 092f5d645a net: ioctl: Use kernel memory on protocol ioctl callbacks
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: There are contextual differences as we're missing commit
  559260fd9d9a ("ipmr: do not acquire mrt_lock in
  ioctl(SIOCGETVIFCNT)").  I also pulled in header changes from commit
  949d6b405e61 ("net: add missing includes and forward declarations
  under net/") to address a build failure with this patch applied.

commit e1d001fa5b477c4da46a29be1fcece91db7c7c6f
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Jun 9 08:27:42 2023 -0700

    net: ioctl: Use kernel memory on protocol ioctl callbacks
    
    Most of the ioctls to net protocols operate directly on the userspace
    argument (arg), usually doing get_user()/put_user() directly in the
    ioctl callback.  This is not flexible, because it is hard to reuse these
    functions without passing userspace buffers.
    
    Change the "struct proto" ioctls to avoid touching userspace memory and
    operate on kernel buffers, i.e., all protocols' ioctl callbacks are
    adapted to operate on kernel memory rather than on userspace (so, no
    more {put,get}_user() and friends being called in the ioctl callback).
    
    This changes the "struct proto" ioctl format in the following way:
    
        int                     (*ioctl)(struct sock *sk, int cmd,
    -                                        unsigned long arg);
    +                                        int *karg);
    
    (Important to say that this patch does not touch the "struct proto_ops"
    protocols)
    
    So, the "karg" argument, which is passed to the ioctl callback, is a
    pointer to kernel space memory (allocated inside a function wrapper).
    This buffer (karg) may contain an input argument (copied from userspace in
    a prep function) and it might return a value/buffer, which is copied
    back to userspace if necessary. There is no one-size-fits-all format
    (that is why I am using 'may' above), but basically, there are three
    types of ioctls:
    
    1) Do not read from userspace, return a result to userspace
    2) Read an input parameter from userspace, and do not return anything
      to userspace
    3) Read an input from userspace, and return a buffer to userspace.
    
    The default case (1) (where no input parameter is given, and an "int" is
    returned to userspace) encompasses more than 90% of the cases, but there
    are two other exceptions. Here is a list of exceptions:
    
    * Protocol RAW:
       * cmd = SIOCGETVIFCNT:
         * input and output = struct sioc_vif_req
       * cmd = SIOCGETSGCNT
         * input and output = struct sioc_sg_req
       * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
         argument, which is struct sioc_vif_req. Then the callback populates
         the struct, which is copied back to userspace.
    
    * Protocol RAW6:
       * cmd = SIOCGETMIFCNT_IN6
         * input and output = struct sioc_mif_req6
       * cmd = SIOCGETSGCNT_IN6
         * input and output = struct sioc_sg_req6
    
    * Protocol PHONET:
      * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
         * input int (4 bytes)
      * Nothing is copied back to userspace.
    
    For the exception cases, the function sock_sk_ioctl_inout() will
    copy the userspace input into kernel space, and copy the result back
    to userspace.
    
    The wrapper that prepares the buffer and puts the buffer back to user is
    sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the caller now
    calls sk_ioctl(), which will handle all cases.
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
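
The default case (1) of the ioctl rework above can be modelled in plain userspace C. This is a sketch under stated assumptions, not the kernel code: all `demo_`/`DEMO_` names are hypothetical, and memcpy() stands in for the copy back to userspace that the real wrapper performs.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a protocol's per-socket state. */
struct demo_sock {
	int queued_bytes;
};

#define DEMO_CMD_OUTQ 1   /* hypothetical command: report queued bytes */

/* New-style callback: it only ever touches the kernel buffer karg. */
int demo_proto_ioctl(struct demo_sock *sk, int cmd, int *karg)
{
	switch (cmd) {
	case DEMO_CMD_OUTQ:
		*karg = sk->queued_bytes;
		return 0;
	default:
		return -ENOTTY;
	}
}

/*
 * Wrapper for case (1): nothing is read from the caller, an int is
 * returned. memcpy() stands in for the copy to userspace.
 */
int demo_sk_ioctl(struct demo_sock *sk, int cmd, void *uarg)
{
	int karg = 0;
	int ret = demo_proto_ioctl(sk, cmd, &karg);

	if (ret)
		return ret;
	memcpy(uarg, &karg, sizeof(karg));
	return 0;
}
```

Because the callback no longer needs a userspace pointer, in-kernel users could call it directly with their own buffer, which is the reuse the commit message argues for.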
Guillaume Nault f640aa02b7 raw: Fix NULL deref in raw_get_next().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2221167
Upstream Status: linux.git

commit 0a78cf7264d29abeca098eae0b188a10aabc8a32
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Apr 3 12:49:58 2023 -0700

    raw: Fix NULL deref in raw_get_next().

    Dae R. Jeong reported a NULL deref in raw_get_next() [0].

    It seems that the repro was running these sequences in parallel so
    that one thread was iterating on a socket that was being freed in
    another netns.

      unshare(0x40060200)
      r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
      socket$inet_icmp_raw(0x2, 0x3, 0x1)
      pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)

    After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
    use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
    sockets.  However, we should use spinlock for slow paths to avoid
    the NULL deref.

    Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
    is not reused during iteration in the grace period.  In fact, the
    lockless readers do not check the nulls marker with get_nulls_value().
    So, SOCK_RAW should use hlist instead of hlist_nulls.

    Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
    let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
    fast paths and sk_for_each() and spinlock for /proc/net/raw.

    [0]:
    general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
    CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
    RIP: 0010:sock_net include/net/sock.h:649 [inline]
    RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
    RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
    RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
    Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
    RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
    RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
    RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
    RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
    R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
    R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
    FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225
     seq_read+0x224/0x320 fs/seq_file.c:162
     pde_read fs/proc/inode.c:316 [inline]
     proc_reg_read+0x23f/0x330 fs/proc/inode.c:328
     vfs_read+0x31e/0xd30 fs/read_write.c:468
     ksys_pread64 fs/read_write.c:665 [inline]
     __do_sys_pread64 fs/read_write.c:675 [inline]
     __se_sys_pread64 fs/read_write.c:672 [inline]
     __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672
     do_syscall_x64 arch/x86/entry/common.c:51 [inline]
     do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x478d29
    Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
    RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29
    RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003
    RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000
    R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740
    R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50
     </TASK>
    Modules linked in:
    ---[ end trace 0000000000000000 ]---
    RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
    RIP: 0010:sock_net include/net/sock.h:649 [inline]
    RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
    RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
    RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
    Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
    RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
    RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
    RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
    RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
    R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
    R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
    FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reported-by: Dae R. Jeong <threeearcat@gmail.com>
    Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-07-07 12:23:40 +02:00
Guillaume Nault 7060682f72 raw: use net_hash_mix() in hash function
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2221167
Upstream Status: linux.git

commit 6579f5bacc2c4cbc5ef6abb45352416939d1f844
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 2 09:41:00 2023 +0000

    raw: use net_hash_mix() in hash function

    Some applications seem to rely on RAW sockets.

    If they use private netns, we can avoid piling all RAW
    sockets bound to a given protocol into a single bucket.

    Also place (struct raw_hashinfo).lock into its own
    cache line to limit false sharing.

    Alternative would be to have per-netns hashtables,
    but this seems too expensive for most netns
    where RAW sockets are not used.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-07-07 12:23:23 +02:00
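
The effect of mixing a per-netns value into the raw socket hash can be sketched as follows. The table size, the multiplicative constant, and all `demo_` names are assumptions made for illustration; only the idea (protocol alone versus protocol combined with a per-namespace value) comes from the commit message above.

```c
#include <stdint.h>

#define DEMO_HTABLE_LOG      8                        /* assumed: 256 buckets */
#define DEMO_HTABLE_SIZE     (1u << DEMO_HTABLE_LOG)
#define DEMO_GOLDEN_RATIO_32 0x61C88647u              /* multiplicative hash constant */

/* Multiplicative hash that keeps the top 'bits' bits of the product. */
uint32_t demo_hash_32(uint32_t val, unsigned int bits)
{
	return (val * DEMO_GOLDEN_RATIO_32) >> (32 - bits);
}

/* Before: the bucket depends on the protocol only. */
uint32_t demo_hash_old(uint32_t proto)
{
	return proto & (DEMO_HTABLE_SIZE - 1);
}

/* After: a per-netns value is mixed in before hashing. */
uint32_t demo_hash_new(uint32_t netns_mix, uint32_t proto)
{
	return demo_hash_32(netns_mix ^ proto, DEMO_HTABLE_LOG);
}
```

With the old function, every namespace's ICMP raw sockets land in the same bucket; with the new one, two namespaces with different mix values usually map the same protocol to different buckets.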
Antoine Tenart 880df2fe23 ipv4: raw: add drop reasons
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git

commit 42186e6c00352ce9df9e3f12b1ff82e61978d40b
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 2 09:40:59 2023 +0000

    ipv4: raw: add drop reasons

    Use existing helpers and drop reason codes for RAW input path.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:25 +02:00
Frantisek Hrbata 34b02be423 Merge: CNB: net: remove noblock parameter from skb_recv_datagram()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1655

Bugzilla: https://bugzilla.redhat.com/2143360
Tested: build, boot

Conflicts:
 - isotp: missing many commits, such as:
   30ffd5332e06 ("can: isotp: return -EADDRNOTAVAIL when reading from unbound socket")
   42bf50a1795a ("can: isotp: support MSG_TRUNC flag when reading from socket")
   e382fea8ae54 ("can: isotp: restore accidentally removed MSG_PEEK feature")
 - removed chunks of nonexistent net/mctp

```
commit f4b41f062c424209e3939a81e6da022e049a45f2
Author: Oliver Hartkopp <socketcan@hartkopp.net>
Date:   Mon Apr 4 18:30:22 2022 +0200

    net: remove noblock parameter from skb_recv_datagram()

    skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
    merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'

    As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
    into 'flags' and 'noblock' with finally obsolete bit operations like this:

    skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);

    And this is not even done consistently with the 'flags' parameter.

    This patch removes the obsolete and costly splitting into two parameters
    and only performs bit operations when really needed on the caller side.

    One missing conversion was thankfully reported by the kernel test robot.
    I had failed to enable kunit tests to build the mctp code.

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
```

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-30 08:10:47 -05:00
Íñigo Huguet e24462420c net: remove noblock parameter from skb_recv_datagram()
Bugzilla: https://bugzilla.redhat.com/2143360

Conflicts:
 - isotp: missing many commits, such as:
   30ffd5332e06 ("can: isotp: return -EADDRNOTAVAIL when reading from unbound socket")
   42bf50a1795a ("can: isotp: support MSG_TRUNC flag when reading from socket")
   e382fea8ae54 ("can: isotp: restore accidentally removed MSG_PEEK feature")
 - removed chunks of nonexistent net/mctp

commit f4b41f062c424209e3939a81e6da022e049a45f2
Author: Oliver Hartkopp <socketcan@hartkopp.net>
Date:   Mon Apr 4 18:30:22 2022 +0200

    net: remove noblock parameter from skb_recv_datagram()
    
    skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
    merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'
    
    As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
    into 'flags' and 'noblock' with finally obsolete bit operations like this:
    
    skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);
    
    And this is not even done consistently with the 'flags' parameter.
    
    This patch removes the obsolete and costly splitting into two parameters
    and only performs bit operations when really needed on the caller side.
    
    One missing conversion was thankfully reported by the kernel test robot.
    I had failed to enable kunit tests to build the mctp code.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2022-11-18 11:18:14 +01:00
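
The redundancy removed by the skb_recv_datagram() change above can be checked with a small userspace model. The three helpers are hypothetical and only model the flag arithmetic quoted in the commit message; `MSG_DONTWAIT` and `MSG_PEEK` are the real constants from `<sys/socket.h>`.

```c
#include <sys/socket.h>   /* MSG_DONTWAIT, MSG_PEEK */

/* Model of the old interface: the callee merges both parameters again. */
unsigned int demo_old_effective_flags(unsigned int flags, int noblock)
{
	return flags | (noblock ? MSG_DONTWAIT : 0);
}

/* Model of the new interface: the flags are used exactly as passed. */
unsigned int demo_new_effective_flags(unsigned int flags)
{
	return flags;
}

/* What an old-style caller did: split the flags, then let the callee merge. */
unsigned int demo_old_caller(unsigned int flags)
{
	return demo_old_effective_flags(flags & ~MSG_DONTWAIT,
					flags & MSG_DONTWAIT);
}
```

For any input, `demo_old_caller(flags)` equals `demo_new_effective_flags(flags)`, which is why the split into two parameters carried no information.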
Guillaume Nault 04690bb611 raw: fix a typo in raw_icmp_error()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git

commit 97a4d46b1516250d640c1ae0c9e7129d160d6a1c
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 23 19:35:40 2022 +0000

    raw: fix a typo in raw_icmp_error()

    I accidentally broke IPv4 traceroute, by swapping iph->saddr
    and iph->daddr.

    Probably because raw_icmp_error() and raw_v4_input()
    use different order for iph->saddr and iph->daddr.

    Fixes: ba44f8182ec2 ("raw: use more conventional iterators")
    Reported-by: John Sperbeck <jsperbeck@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220623193540.2851799-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 16:12:51 +01:00
Guillaume Nault a1018ba58d raw: complete rcu conversion
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git

commit af185d8c76333daa877678e0166a7b45e63bf3c4
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jun 20 03:05:09 2022 -0700

    raw: complete rcu conversion

    raw_diag_dump() can use rcu_read_lock() instead of read_lock()

    Now the hashinfo lock is only used from process context,
    in write mode only, we can convert it to a spinlock,
    and we do not need to block BH anymore.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220620100509.3493504-1-eric.dumazet@gmail.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 16:12:51 +01:00
Guillaume Nault b96d4b4b49 raw: Use helpers for the hlist_nulls variant.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git

commit f289c02bf41b55fbfccf21d72c4ac44cd4a7a107
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Sun Jun 19 16:29:27 2022 -0700

    raw: Use helpers for the hlist_nulls variant.

    hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry() have dedicated
    macros for sk.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 16:01:26 +01:00
Guillaume Nault 0539bb57bb raw: Fix mixed declarations error in raw_icmp_error().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git

commit 5da39e31b1b0eb62b8ed369ad9615da850239e9e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Sun Jun 19 16:29:26 2022 -0700

    raw: Fix mixed declarations error in raw_icmp_error().

    The trailing semicolon causes a compiler error, so let's remove it.

    net/ipv4/raw.c: In function ‘raw_icmp_error’:
    net/ipv4/raw.c:266:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
      266 |  struct hlist_nulls_head *hlist;
          |  ^~~~~~

    Fixes: ba44f8182ec2 ("raw: use more conventional iterators")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 16:00:35 +01:00
Guillaume Nault 3689f27285 raw: convert raw sockets to RCU
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git
Conflicts: (context) Missing upstream commit 9ee11f0fff20 ("ipv6: ioam:
           Data plane support for Pre-allocated Trace"):
           We don't include net/ioam6.h in CentOS Stream 9.

commit 0daf07e527095e64ee8927ce297ab626643e9f51
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Jun 17 20:47:05 2022 -0700

    raw: convert raw sockets to RCU

    Using rwlock in networking code is extremely risky.
    Writers can starve if enough readers are constantly
    grabbing the rwlock.

    I thought rwlocks were at fault and sent this patch:

    https://lkml.org/lkml/2022/6/17/272

    But Peter and Linus essentially told me rwlock had to be unfair.

    We need to get rid of rwlock in networking code.

    Without this fix, the following script triggers soft lockups:

    for i in {1..48}
    do
     ping -f -n -q 127.0.0.1 &
     sleep 0.1
    done

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 15:45:03 +01:00
Guillaume Nault 024a558c88 raw: use more conventional iterators
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2138802
Upstream Status: linux.git

commit ba44f8182ec299c5d1c8a72fc0fde4ec127b5a6d
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Jun 17 20:47:04 2022 -0700

    raw: use more conventional iterators

    In order to prepare the following patch,
    I change raw v4 & v6 code to use more conventional
    iterators.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-07 15:38:56 +01:00
Ivan Vecera fa0c210030 net: drop nopreempt requirement on sock_prot_inuse_add()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit b3cb764aa1d753cf6a58858f9e2097ba71e8100b
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:11:50 2021 -0800

    net: drop nopreempt requirement on sock_prot_inuse_add()

    This is distracting really, let's make this simpler,
    because many callers had to take care of this
    by themselves, even if on x86 this adds more
    code than really needed.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:35:56 +02:00
Guillaume Nault 28674d5179 ipv4: raw: lock the socket in raw_bind()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081383
Upstream Status: linux.git

commit 153a0d187e767c68733b8e9f46218eb1f41ab902
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jan 26 16:51:16 2022 -0800

    ipv4: raw: lock the socket in raw_bind()

    For some reason, raw_bind() forgot to lock the socket.

    BUG: KCSAN: data-race in __ip4_datagram_connect / raw_bind

    write to 0xffff8881170d4308 of 4 bytes by task 5466 on cpu 0:
     raw_bind+0x1b0/0x250 net/ipv4/raw.c:739
     inet_bind+0x56/0xa0 net/ipv4/af_inet.c:443
     __sys_bind+0x14b/0x1b0 net/socket.c:1697
     __do_sys_bind net/socket.c:1708 [inline]
     __se_sys_bind net/socket.c:1706 [inline]
     __x64_sys_bind+0x3d/0x50 net/socket.c:1706
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    read to 0xffff8881170d4308 of 4 bytes by task 5468 on cpu 1:
     __ip4_datagram_connect+0xb7/0x7b0 net/ipv4/datagram.c:39
     ip4_datagram_connect+0x2a/0x40 net/ipv4/datagram.c:89
     inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
     __sys_connect_file net/socket.c:1900 [inline]
     __sys_connect+0x197/0x1b0 net/socket.c:1917
     __do_sys_connect net/socket.c:1927 [inline]
     __se_sys_connect net/socket.c:1924 [inline]
     __x64_sys_connect+0x3d/0x50 net/socket.c:1924
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x00000000 -> 0x0003007f

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 5468 Comm: syz-executor.5 Not tainted 5.17.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-05-03 17:05:14 +02:00
Ivan Vecera 87d6a33df9 proc: remove PDE_DATA() completely
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073492

Conflicts:
- Hunk for dell-smm-hwmon driver skipped as it is not applicable and
  does not use PDE_DATA()

commit 359745d78351c6f5442435f81549f0207ece28aa
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Fri Jan 21 22:14:23 2022 -0800

    proc: remove PDE_DATA() completely

    Remove PDE_DATA() completely and replace it with pde_data().

    [akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
    [akpm@linux-foundation.org: now fix it properly]

    Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-04-08 17:38:02 +02:00
Alexander Aring e3ae2365ef net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 11:28:21 -07:00
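
The wrapper idea from the sk_error_report commit above can be sketched in userspace C. All names here are hypothetical, and the counter merely stands in for whatever extra handling (for example tracing) is added later.

```c
/* Hypothetical stand-in for a socket with an error-report callback. */
struct demo_sock {
	int reported;
	void (*error_report)(struct demo_sock *sk);
};

/* Stand-in for the additional handling, e.g. a tracepoint. */
int demo_trace_count;

/* Example callback: records that the error was reported. */
void demo_default_report(struct demo_sock *sk)
{
	sk->reported++;
}

/* Wrapper: the single place where every error report passes through. */
void demo_sk_error_report(struct demo_sock *sk)
{
	demo_trace_count++;      /* extra handling added in one spot */
	sk->error_report(sk);    /* then invoke the per-socket callback */
}
```

Callers invoke `demo_sk_error_report(sk)` instead of calling `sk->error_report(sk)` directly, so new behaviour does not have to be added at every call site.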
Paul Moore 3df98d7921 lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely.  As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-11-23 18:36:21 -05:00
Linus Torvalds 9ff9b0d392 networking changes for the 5.10 merge window
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multipath TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Linus Torvalds c90578360c Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy_and_csum cleanups from Al Viro:
 "Saner calling conventions for csum_and_copy_..._user() and friends"

[ Removing 800+ lines of code and cleaning stuff up is good  - Linus ]

* 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ppc: propagate the calling conventions change down to csum_partial_copy_generic()
  amd64: switch csum_partial_copy_generic() to new calling conventions
  sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
  xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
  mips: propagate the calling convention change down into __csum_partial_copy_..._user()
  mips: __csum_partial_copy_kernel() has no users left
  mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
  sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
  i386: propagate the calling conventions change down to csum_partial_copy_generic()
  sh: propage the calling conventions change down to csum_partial_copy_generic()
  m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
  arm: propagate the calling convention changes down to csum_partial_copy_from_user()
  alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
  saner calling conventions for csum_and_copy_..._user()
  csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
  csum_partial_copy_nocheck(): drop the last argument
  unify generic instances of csum_partial_copy_nocheck()
  icmp_push_reply(): reorder adding the checksum up
  skb_copy_and_csum_bits(): don't bother with the last argument
2020-10-12 16:24:13 -07:00
Jakub Kicinski 44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Miaohe Lin 645f08975f net: Fix some comments
Fix some comments, including wrong function name, duplicated word and so
on.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 07:55:59 -07:00
Miaohe Lin 7551144978 net: Avoid access icmp_err_convert when icmp code is ICMP_FRAG_NEEDED
There is no need to fetch errno and fatal info from icmp_err_convert when
icmp code is ICMP_FRAG_NEEDED.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:11:43 -07:00
Randy Dunlap 2bdcc73c88 net: ipv4: delete repeated words
Drop duplicate words in comments in net/ipv4/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 17:31:20 -07:00
Al Viro cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
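The remark above that an initial sum of 0 or ~0U gives the same result modulo 0xffff can be checked with a small ones' complement checksum model. This is a sketch under stated assumptions: `csum_partial_model()` and `csum_fold_model()` are hypothetical user-space helpers that only mimic the arithmetic, not the kernel's architecture-specific routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum 16-bit big-endian words of buf into a 32-bit ones' complement value. */
static uint32_t csum_partial_model(const uint8_t *buf, size_t len, uint32_t sum)
{
	uint64_t acc = sum;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len) /* odd trailing byte is padded with zero */
		acc += (uint32_t)buf[i] << 8;

	/* Fold carries out of bit 31 back in (end-around carry). */
	while (acc >> 32)
		acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Fold the 32-bit sum to 16 bits and complement it. */
static uint16_t csum_fold_model(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

In this model, 0xffffffff behaves as a second representation of zero, so for data with a nonzero sum both seeds fold to the same 16-bit checksum. The two seeds can only differ in which representation of zero (0x0000 or 0xffff) comes out, which is what "the same modulo 0xffff" allows.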
Christoph Hellwig a7b75c5a8c net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer.  This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:54 -07:00
Christoph Hellwig b6238c04c0 net/ipv4: remove compat_ip_{get,set}sockopt
Handle the few cases that need special treatment in-line using
in_compat_syscall().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:41 -07:00
Jules Irenge 0d8a42c93a raw: Add missing annotations to raw_seq_start() and raw_seq_stop()
Sparse reports warnings at raw_seq_start() and raw_seq_stop()

warning: context imbalance in raw_seq_start() - wrong count at exit
warning: context imbalance in raw_seq_stop() - unexpected unlock

The root cause is the missing annotations on raw_seq_start() and
raw_seq_stop().
Add the missing __acquires(&h->lock) annotation to raw_seq_start().
Add the missing __releases(&h->lock) annotation to raw_seq_stop().

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:19:40 -07:00
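For readers unfamiliar with the sparse annotations named in the commit above, the following user-space sketch shows the general shape. The macro definitions mirror how the kernel makes the annotations vanish outside a sparse run; `struct toy_lock`, `seq_start_model()`, and `seq_stop_model()` are hypothetical illustrations, not the raw socket code.

```c
/* Only sparse (which defines __CHECKER__) sees the context attributes. */
#ifdef __CHECKER__
# define __acquires(x) __attribute__((context(x, 0, 1)))
# define __releases(x) __attribute__((context(x, 1, 0)))
#else
# define __acquires(x)
# define __releases(x)
#endif

/* Toy lock so the example needs no threading library. */
struct toy_lock { int held; };

static struct toy_lock table_lock;
static int cursor;

/* Returns with the lock held: the annotation tells sparse this is intended. */
static int *seq_start_model(void) __acquires(&table_lock)
{
	table_lock.held = 1;
	return &cursor;
}

/* Called with the lock held and drops it: annotated as a release. */
static void seq_stop_model(void) __releases(&table_lock)
{
	table_lock.held = 0;
}
```

Without the annotations, a checker that tracks lock balance per function would see one function that exits with an extra lock held and another that unlocks something it never locked, which matches the two warnings quoted in the commit message.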
Florian Westphal 895b5c9f20 netfilter: drop bridge nf reset from nf_reset
commit 174e23810c
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions.  The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.

Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.

This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().

In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.

I am submitting this for "net", because we're still early in the release
cycle.  The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-01 18:42:15 +02:00
Willem de Bruijn c6af0c227a ip: support SO_MARK cmsg
Enable setting skb->mark for UDP and RAW sockets using cmsg.

This is analogous to existing support for TOS, TTL, txtime, etc.

Packet sockets already support this as of commit c7d39e3263
("packet: support per-packet fwmark for af_packet sendmsg").

Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark

Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b ("ipv4: ipcm_cookie initializers").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13 21:44:19 +02:00
Stephen Suryaputra 38c73529de ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop
In commit 19e4e76806 ("ipv4: Fix raw socket lookup for local
traffic"), the dif argument to __raw_v4_lookup() is coming from the
returned value of inet_iif() but the change was done only for the first
lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex.

Fixes: 19e4e76806 ("ipv4: Fix raw socket lookup for local traffic")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-25 12:46:02 -07:00
Thomas Gleixner 2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Patrick Talbert ea9a03791a net: Treat sock->sk_drops as an unsigned int when printing
Currently, procfs socket stats format sk_drops as a signed int (%d). For large
values this will cause a negative number to be printed.

We know the drop count can never be negative, so change the format specifier to
%u.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-19 10:31:10 -07:00
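The effect of the format specifier in the commit above is easy to reproduce in user space. In this sketch `format_drops()` is a hypothetical helper; the explicit cast to int makes the old signed interpretation visible and assumes the usual two's complement wraparound.

```c
#include <stdio.h>

/*
 * Format a drop counter either the old way (%d) or the new way (%u).
 * Returns the number of characters snprintf() would have written.
 */
static int format_drops(char *out, size_t n, unsigned int drops,
			int as_unsigned)
{
	if (as_unsigned)
		return snprintf(out, n, "%u", drops);
	/* Old behaviour: values above INT_MAX show up as negative. */
	return snprintf(out, n, "%d", (int)drops);
}
```

With a counter of 3000000000, the %u variant prints the value as is, while the %d variant prints a negative number.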
David Ahern 19e4e76806 ipv4: Fix raw socket lookup for local traffic
inet_iif should be used for the raw socket lookup. inet_iif considers
rt_iif which handles the case of local traffic.

As it stands, ping to a local address with the '-I <dev>' option fails
ever since ping was changed to use SO_BINDTODEVICE instead of
cmsg + IP_PKTINFO.

IPv6 works fine.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 09:38:30 -07:00
David S. Miller 2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, but happily all cases of overlapping
changes, parallel adds, things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
Willem de Bruijn 8f932f762e net: add missing SOF_TIMESTAMPING_OPT_ID support
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets.
But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.

Add skb_setup_tx_timestamp that configures both tx_flags and tskey
for these paths that do not need corking or use bytestream keys.

Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-17 23:27:00 -08:00
David Ahern 86d1d8b72c net/ipv4: Fix missing raw_init when CONFIG_PROC_FS is disabled
Randy reported when CONFIG_PROC_FS is not enabled:
    ld: net/ipv4/af_inet.o: in function `inet_init':
    af_inet.c:(.init.text+0x42d): undefined reference to `raw_init'

Fix by moving the endif up to the end of the proc entries

Fixes: 6897445fb1 ("net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 20:58:02 -08:00
Duncan Eastoe 7055420fb6 net: fix raw socket lookup device bind matching with VRFs
When a pair of raw sockets exists, one unbound and one bound to a VRF
but otherwise identical, a packet received in the VRF context makes
__raw_v4_lookup() match both sockets.

This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.

In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().

Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.

Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:39 -08:00
Mike Manning 6897445fb1 net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs
Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
for datagram sockets. Have this default to enabled for reasons of
backwards compatibility. This is so as to specify the output device
with cmsg and IP_PKTINFO, but using a socket not bound to the
corresponding VRF. This allows e.g. older ping implementations to be
run with specifying the device but without executing it in the VRF.
If the option is disabled, packets received in a VRF context are only
handled by a raw socket bound to the VRF, and correspondingly packets
in the default VRF are only handled by a socket not bound to any VRF.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:38 -08:00
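The matching rule described in the two VRF lookup commits above can be modelled as a small predicate. This is a sketch of the described behaviour, not a copy of the kernel helper: `bound_dev_eq_model()` and its parameter names are assumptions, where `dif` is the receiving interface index and `sdif` is the index of the enslaving VRF device (0 when the packet is in the default VRF).

```c
#include <stdbool.h>

/*
 * Should a socket bound to `bound_dev_if` (0 = unbound) receive a packet
 * that arrived on `dif`, with `sdif` as the VRF device (0 = default VRF)?
 */
static bool bound_dev_eq_model(bool l3mdev_accept, int bound_dev_if,
			       int dif, int sdif)
{
	/* Unbound socket: matches default-VRF traffic, and VRF traffic
	 * only when the raw_l3mdev_accept style sysctl is enabled. */
	if (!bound_dev_if)
		return !sdif || l3mdev_accept;

	/* Bound socket: the binding must name the interface or its VRF. */
	return bound_dev_if == dif || bound_dev_if == sdif;
}
```

With the sysctl disabled, a packet received in a VRF matches only the socket bound to that VRF, and a socket bound to a VRF never matches default-VRF traffic, which is the behaviour both commit messages describe.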
Robert Shearman 854da99173 ipv4: Allow sending multicast packets on specific i/f using VRF socket
It is useful to be able to use the same socket for listening in a
specific VRF, as for sending multicast packets out of a specific
interface. However, the bound device on the socket currently takes
precedence and results in the packets not being sent.

Relax the condition on overriding the output interface to use for
sending packets out of UDP, raw and ping sockets to allow multicast
packets to be sent using the specified multicast interface.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02 22:28:17 -07:00
Willem de Bruijn 678ca42d68 ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.

There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, except that
the conversion happens only once per datagram, in ip(6)_setup_cork.

Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.

After this change the IPv4 and IPv6 logic is the same.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Willem de Bruijn 351782067b ipv4: ipcm_cookie initializers
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07 10:58:49 +09:00
Jesus Sanchez-Palencia bc969a9778 net: ipv4: Hook into time based transmission
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04 22:30:27 +09:00
Christoph Hellwig c350637227 proc: introduce proc_create_net{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release.  All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Christoph Hellwig 93cb5a1f58 ipv{4,6}/raw: simplify seq_file code
Pass the hashtable to the proc private data instead of copying
it into the per-file private data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Kirill Tkhai 2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Joe Perches d6444062f8 net: Use octal not symbolic permissions
Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:07:48 -04:00
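As an illustration of the conversion in the commit above, the sketch below compares a symbolic permission expression with its octal equivalent using the POSIX macros from <sys/stat.h>. `MODEL_S_IRUGO` is a hypothetical user-space stand-in for the kernel-only `S_IRUGO` shorthand, and the two helper functions exist only for the comparison.

```c
#include <sys/stat.h>

/* User-space stand-in for the kernel's "read for user, group, other". */
#define MODEL_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Symbolic spelling: the reader has to decode four macros. */
static unsigned int symbolic_mode(void)
{
	return S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
}

/* Octal spelling of the same permissions: rw-r--r--. */
static unsigned int octal_mode(void)
{
	return 0644;
}
```

Both functions return the same value, and `MODEL_S_IRUGO` equals 0444, so the change is purely about readability.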
Kirill Tkhai 128aaa98ad net: Revert "ipv4: fix a deadlock in ip_ra_control"
This reverts commit 1215e51eda.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51eda scale poorly. This is clearly
seen in an endless unshare(CLONE_NEWNET) test, where the cleanup_net()
kwork spends a lot of time waiting for the rtnl_lock() introduced
by that commit.

Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22 15:12:56 -04:00