449 Commits

Author SHA1 Message Date
Antoine Tenart a3c9cc5e4b inet: preserve const qualifier in inet_sk()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit abc17a11ed29b0471e428d86189acca8d1a213c6
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 16 15:31:55 2023 +0000

    inet: preserve const qualifier in inet_sk()

    We can change inet_sk() to propagate const qualifier of its argument.

    This should avoid some potential errors caused by accidental
    (const -> not_const) promotion.

    Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled
    in separate patch series.

    v2: use container_of_const() as advised by Jakub and Linus

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/
    Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart c4af9aca79 ipv4: Fix uninit-value access in __ip_make_skb()
JIRA: https://issues.redhat.com/browse/RHEL-39786
Upstream Status: linux.git
CVE: CVE-2024-36927
Conflicts:\
- Removed code differs due to missing upstream commit cafbe182a467
  ("inet: move inet->hdrincl to inet->inet_flags") in c9s.

commit fc1092f51567277509563800a3c56732070b6aa4
Author: Shigeru Yoshida <syoshida@redhat.com>
Date:   Tue Apr 30 21:39:45 2024 +0900

    ipv4: Fix uninit-value access in __ip_make_skb()

    KMSAN reported uninit-value access in __ip_make_skb() [1].  __ip_make_skb()
    tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
    race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
    while __ip_make_skb() is running, the function will access icmphdr in the
    skb even if it is not included. This causes the issue reported by KMSAN.

    Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
    on the socket.

    Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
    are members of a union in struct flowi4 and are implicitly initialized by
    flowi4_init_output(), but we should not rely on a specific union layout.

    Initialize these explicitly in raw_sendmsg().

    [1]
    BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
     __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
     ip_finish_skb include/net/ip.h:243 [inline]
     ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
     raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
     inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
     sock_sendmsg_nosec net/socket.c:730 [inline]
     __sock_sendmsg+0x274/0x3c0 net/socket.c:745
     __sys_sendto+0x62c/0x7b0 net/socket.c:2191
     __do_sys_sendto net/socket.c:2203 [inline]
     __se_sys_sendto net/socket.c:2199 [inline]
     __x64_sys_sendto+0x130/0x200 net/socket.c:2199
     do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x6d/0x75

    Uninit was created at:
     slab_post_alloc_hook mm/slub.c:3804 [inline]
     slab_alloc_node mm/slub.c:3845 [inline]
     kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
     kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
     __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
     alloc_skb include/linux/skbuff.h:1318 [inline]
     __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
     ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
     raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
     inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
     sock_sendmsg_nosec net/socket.c:730 [inline]
     __sock_sendmsg+0x274/0x3c0 net/socket.c:745
     __sys_sendto+0x62c/0x7b0 net/socket.c:2191
     __do_sys_sendto net/socket.c:2203 [inline]
     __se_sys_sendto net/socket.c:2199 [inline]
     __x64_sys_sendto+0x130/0x200 net/socket.c:2199
     do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x6d/0x75

    CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014

    Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
    Reported-by: syzkaller <syzkaller@googlegroups.com>
    Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
    Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-06-14 15:11:39 +02:00
Guillaume Nault 5af7c299c2 net: ipv4: fix a memleak in ip_setup_cork
JIRA: https://issues.redhat.com/browse/RHEL-31492
Upstream Status: linux.git

commit 5dee6d6923458e26966717f2a3eae7d09fc10bf6
Author: Zhipeng Lu <alexious@zju.edu.cn>
Date:   Mon Jan 29 17:10:17 2024 +0800

    net: ipv4: fix a memleak in ip_setup_cork

    When inetdev_valid_mtu fails, cork->opt should be freed if it is
    allocated in ip_setup_cork. Otherwise there could be a memleak.

    Fixes: 501a90c945 ("inet: protect against too small mtu values.")
    Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240129091017.2938835-1-alexious@zju.edu.cn
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-04-03 16:47:16 +02:00
Antoine Tenart 791e96333e net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
Conflicts:\
- Context diff due to missing upstream commit 09eed1192cec ("neighbour:
  switch to standard rcu, instead of rcu_bh") in c9s.
- Context diff due to missing upstream commit cd3c74807736 ("ipv6:
  optimise dst refcounting on skb init") in c9s.

commit b4a11b2033b7d3dfdd46592f7036a775b18cecd1
Author: Heng Guo <heng.guo@windriver.com>
Date:   Thu Oct 19 09:20:53 2023 +0800

    net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.

    Reproduce environment:
    A network of three Linux VMs is connected as below:
    VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3
    VM1: eth0 ip: 192.168.122.207 MTU 1500
    VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500
    VM3: eth0 ip: 192.168.123.240 MTU 1500

    Reproduce:
    VM1 sends 1400 bytes of UDP data to VM3 using scapy with flags=0.
    scapy command:
    send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1,
    inter=1.000000)

    Result:
    Before IP data is sent.
    ----------------------------------------------------------------------
    root@qemux86-64:~# cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
      ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
      OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
      FragOKs FragFails FragCreates
    Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0
    ......
    ----------------------------------------------------------------------
    After IP data is sent.
    ----------------------------------------------------------------------
    root@qemux86-64:~# cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
      ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
      OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
      FragOKs FragFails FragCreates
    Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0
    ......
    ----------------------------------------------------------------------
    "ForwDatagrams" increases from 4 to 5 and "OutRequests" also increases
    from 7 to 8.

    Issue description and patch:
    IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS
    ("OutOctets") in ip_finish_output2().
    According to RFC 4293, "OutOctets" is counted together with
    "OutTransmits", not with "OutRequests". "OutRequests" does not include
    any datagrams counted in "ForwDatagrams".
    ipSystemStatsOutOctets OBJECT-TYPE
        DESCRIPTION
               "The total number of octets in IP datagrams delivered to the
                lower layers for transmission.  Octets from datagrams
                counted in ipIfStatsOutTransmits MUST be counted here.
    ipSystemStatsOutRequests OBJECT-TYPE
        DESCRIPTION
               "The total number of IP datagrams that local IP user-
                protocols (including ICMP) supplied to IP in requests for
                transmission.  Note that this counter does not include any
                datagrams counted in ipSystemStatsOutForwDatagrams.
    So this patch defines IPSTATS_MIB_OUTPKTS as "OutTransmits" and adds
    IPSTATS_MIB_OUTREQUESTS for "OutRequests".
    Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add
    IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6.

    Test result with patch:
    Before IP data is sent.
    ----------------------------------------------------------------------
    root@qemux86-64:~# cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
      ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
      OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
      FragOKs FragFails FragCreates OutTransmits
    Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4
    ......
    root@qemux86-64:~# cat /proc/net/netstat
    ......
    IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
      OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
      InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
      InECT0Pkts InCEPkts ReasmOverlaps
    IpExt: 0 0 0 0 0 0 2976 1896 0 0 0 0 0 9 0 0 0 0
    ----------------------------------------------------------------------
    After IP data is sent.
    ----------------------------------------------------------------------
    root@qemux86-64:~# cat /proc/net/snmp
    Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
      ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
      OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
      FragOKs FragFails FragCreates OutTransmits
    Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5
    ......
    root@qemux86-64:~# cat /proc/net/netstat
    ......
    IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts
      OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets
      InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts
      InECT0Pkts InCEPkts ReasmOverlaps
    IpExt: 0 0 0 0 0 0 4404 3324 0 0 0 0 0 10 0 0 0 0
    ----------------------------------------------------------------------
    "ForwDatagrams" increases from 1 to 2 and "OutRequests" stays at 3.
    "OutTransmits" increases from 4 to 5 and "OutOctets" increases by 1428.

    Signed-off-by: Heng Guo <heng.guo@windriver.com>
    Reviewed-by: Kun Song <Kun.Song@windriver.com>
    Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:15:48 +01:00
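
The counter comparison in the commit message above can be reproduced mechanically. The following is a minimal sketch, not part of the patch: `parse_snmp` is a hypothetical helper, and the header is joined into a single line as it appears in the real /proc/net/snmp (the commit message wraps it for readability). The values are the patched-kernel "Ip:" lines quoted above.

```python
def parse_snmp(text, prefix="Ip"):
    """Pair the header line and the value line for one protocol prefix."""
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix + ":")]
    names, values = rows[0], rows[1]
    return dict(zip(names, map(int, values)))

# Header with the new OutTransmits column, as quoted in the commit message.
HEADER = ("Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors "
          "ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests "
          "OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails "
          "FragOKs FragFails FragCreates OutTransmits\n")

before = parse_snmp(HEADER + "Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4\n")
after = parse_snmp(HEADER + "Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5\n")

# Only the counters that changed while the packet was forwarded.
delta = {name: after[name] - before[name]
         for name in before if after[name] != before[name]}
print(delta)

# OutOctets from /proc/net/netstat: one forwarded datagram of
# IP header + UDP header + payload bytes.
IP_HEADER, UDP_HEADER, PAYLOAD = 20, 8, 1400
out_octets_delta = 3324 - 1896
print(out_octets_delta == IP_HEADER + UDP_HEADER + PAYLOAD)
```

With the patched counters, a forwarded datagram shows up in InReceives, ForwDatagrams and OutTransmits, while OutRequests does not appear in the delta.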
Antoine Tenart b7ff6e2ea3 net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicated
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git

commit e4da8c78973c1e307c0431e0b99a969ffb8aa3f1
Author: Heng Guo <heng.guo@windriver.com>
Date:   Fri Aug 25 15:55:05 2023 +0800

    net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicated

    commit edf391ff17 ("snmp: add missing counters for RFC 4293") had
    already added OutOctets for RFC 4293. In commit 2d8dbb04c6 ("snmp: fix
    OutOctets counter to include forwarded datagrams"), OutOctets was
    counted again, but not removed from ip_output().

    According to RFC 4293 "3.2.3. IP Statistics Tables",
    ipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So
    "IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not
    accurate. And IPSTATS_MIB_OUTOCTETS should be counted after
    fragmentation.

    This patch reverts commit 2d8dbb04c6 ("snmp: fix OutOctets counter to
    include forwarded datagrams") and moves IPSTATS_MIB_OUTOCTETS to
    ip_finish_output2 for ipv4.

    Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
    Signed-off-by: Heng Guo <heng.guo@windriver.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:15:48 +01:00
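
Why it matters whether octets are counted before or after fragmentation can be seen with a little arithmetic: every fragment carries its own IP header. The sketch below is an illustration only, under simplifying assumptions (no GSO, fixed header length, hypothetical sizes); `fragment_lengths` is not a kernel function.

```python
def fragment_lengths(total_len, mtu, header_len=20):
    """Return the on-wire length of each IPv4 fragment.

    total_len counts the original datagram including its IP header.
    Every fragment except the last carries a multiple of 8 payload bytes.
    """
    if total_len <= mtu:
        return [total_len]
    payload = total_len - header_len
    per_fragment = (mtu - header_len) // 8 * 8
    lengths = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        lengths.append(header_len + chunk)
        payload -= chunk
    return lengths

# Hypothetical datagram: 20-byte IP header + 8-byte UDP header + 2000 bytes.
original = 20 + 8 + 2000
fragments = fragment_lengths(original, mtu=1000)
print(fragments, sum(fragments), sum(fragments) - original)
```

In this example the octets seen after fragmentation exceed the unfragmented length by two extra IP headers, which is the difference a counter placed after the fragmentation step would pick up.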
Scott Weaver 14c9e1feb0 Merge: tunnels: First round of upstream fixes for RHEL 9.4.
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3250

JIRA: https://issues.redhat.com/browse/RHEL-14360
Upstream Status: linux.git

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-25 11:39:20 -04:00
Guillaume Nault 30e4857ce5 lwt: Check LWTUNNEL_XMIT_CONTINUE strictly
JIRA: https://issues.redhat.com/browse/RHEL-14360
Upstream Status: linux.git

commit a171fbec88a2c730b108c7147ac5e7b2f5a02b47
Author: Yan Zhai <yan@cloudflare.com>
Date:   Thu Aug 17 19:58:14 2023 -0700

    lwt: Check LWTUNNEL_XMIT_CONTINUE strictly

    LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2,
    such that any positive return value from a xmit hook could cause
    unexpected continue behavior, despite that related skb may have been
    freed. This could be error-prone for future xmit hook ops. One of the
    possible errors is to return statuses of dst_output directly.

    To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to
    distinguish from dst_output statuses and check the continue
    condition explicitly.

    Fixes: 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure")
    Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-10-20 13:25:18 +02:00
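
The ambiguity described in the commit above can be modelled in a few lines. This is a sketch under stated assumptions: the numeric values below (the dst_output() style statuses, the old continue value 1 and the new value 0x100) are assumed for illustration and should be checked against the kernel headers; the two predicate functions are hypothetical stand-ins for the checks in ip(6)_finish_output2.

```python
# Assumed values, for illustration only.
NET_XMIT_SUCCESS, NET_XMIT_DROP, NET_XMIT_CN = 0, 1, 2
LWTUNNEL_XMIT_DONE = 0
OLD_XMIT_CONTINUE = 1       # collides with NET_XMIT_DROP
NEW_XMIT_CONTINUE = 0x100   # distinct from every dst_output() status


def continues_implicit(res):
    """Old-style check: any positive value falls through to 'continue'."""
    return not (res < 0 or res == LWTUNNEL_XMIT_DONE)


def continues_strict(res):
    """New-style check: continue only on the dedicated value."""
    return res == NEW_XMIT_CONTINUE


# A hook that forwards the status of dst_output() after the skb was dropped.
res = NET_XMIT_DROP
print(continues_implicit(res), continues_strict(res))
```

Under the implicit check the caller would keep processing an skb that may already have been freed; the strict check returns the status to the caller instead.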
Ivan Vecera 497f645693 net: move gso declarations and functions to their own files
JIRA: https://issues.redhat.com/browse/RHEL-12679

commit d457a0e329b0bfd3a1450e0b1a18cd2b47a25a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 8 19:17:37 2023 +0000

    net: move gso declarations and functions to their own files

    Move declarations into include/net/gso.h and code into net/core/gso.c

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Stanislav Fomichev <sdf@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-11 13:35:27 +02:00
Guillaume Nault 03365a2072 ipv4: Fix potential uninit variable access bug in __ip_make_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2221167
Upstream Status: linux.git

commit 99e5acae193e369b71217efe6f1dad42f3f18815
Author: Ziyang Xuan <william.xuanziyang@huawei.com>
Date:   Thu Apr 20 20:40:35 2023 +0800

    ipv4: Fix potential uninit variable access bug in __ip_make_skb()

    Like commit ea30388baebc ("ipv6: Fix an uninit variable access bug in
    __ip6_make_skb()"). icmphdr does not in skb linear region under the
    scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will
    trigger the uninit variable access bug.

    Use a local variable icmp_type to carry the correct value in different
    scenarios.

    Fixes: 96793b4825 ("[IPV4]: Add ICMPMsgStats MIB (RFC 4293)")
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-07-07 12:23:59 +02:00
Antoine Tenart 1cfc972fac net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: net-next.git
Conflicts:\
- Context difference due to missing upstream commit e22aa1486668 ("net:
  Find dst with sk's xfrm policy not ctl_sk") in c9s.

commit c0a8966e2bc7d31f77a7246947ebc09c1ff06066
Author: Antoine Tenart <atenart@kernel.org>
Date:   Tue May 23 18:14:52 2023 +0200

    net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV

    When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in
    TIME_WAIT and SYN_RECV where it's not set in the reply skb from
    ip_send_unicast_reply. Those packets will have a mismatched hash with
    others from the same flow as their hashes will be 0. IPv6 does not have
    the same issue as the hash is set from the socket txhash in those cases.

    This commit sets the hash in the reply skb from ip_send_unicast_reply,
    which makes the IPv4 code behave like IPv6.

    Signed-off-by: Antoine Tenart <atenart@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:36 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer 5fbf8901c6 net: shrink struct ubuf_info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit e7d2b510165fff6bedc9cca88c071ad846850c74
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Sep 23 17:39:04 2022 +0100

    net: shrink struct ubuf_info
    
    We can benefit from a smaller struct ubuf_info, so leave only mandatory
    fields and let users decide how they want to extend it. Convert
    MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields.
    This reduces the size from 48 bytes to just 16.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:25:02 -04:00
Xin Long 5a01b46698 net: add support for ipv4 big tcp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2185290
Tested: big tcp selftest

commit b1a78b9b98862cda167b643690e43662ea060625
Author: Xin Long <lucien.xin@gmail.com>
Date:   Sat Jan 28 10:58:39 2023 -0500

    net: add support for ipv4 big tcp

    Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.

    Firstly, allow sk->sk_gso_max_size to be set to a value greater than
    GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
    for IPv4 TCP sockets.

    Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
    in __ip_local_out() to allow sending BIG TCP packets, and this implies
    that skb->len is the length of an IPv4 packet. On RX path, use skb->len
    as the length of the IPv4 packet when the IP header tot_len is 0 and
    skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs iph_set_totlen() and
    skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
    need to update these APIs.

    Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
    the merged packet size to be >= GRO_LEGACY_MAX_SIZE in skb_gro_receive().
    In GRO complete, set IP header tot_len to 0 when the merged packet size
    is greater than IP_MAX_MTU in iph_set_totlen() so that it can be
    processed on RX path.

    Note that checking skb_is_gso_tcp() in API iph_totlen() makes it safe
    for this implementation to treat iph->len == 0 as indicating IPv4 BIG
    TCP packets.

    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-05-02 10:36:11 -04:00
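
The tot_len convention described in the commit above can be sketched as follows. The function names mirror the helpers named in the message, but the bodies are a simplified model written for illustration: byte order, the network header offset and the real skb structure are left out, and `skb_len` is assumed to be the length of the IPv4 packet, as the message states.

```python
IP_MAX_MTU = 0xFFFF  # largest value the 16-bit tot_len field can hold


def iph_set_totlen(length):
    """Value stored in tot_len for a packet of `length` bytes (TX / GRO complete)."""
    return length if length <= IP_MAX_MTU else 0


def skb_ip_totlen(tot_len, skb_len, is_gso_tcp):
    """Packet length the RX path should use.

    tot_len == 0 is only trusted as a BIG TCP marker for GSO TCP packets;
    otherwise the header field is taken at face value.
    """
    if tot_len != 0 or not is_gso_tcp:
        return tot_len
    return skb_len


big = 100_000
print(iph_set_totlen(1500), iph_set_totlen(big))
print(skb_ip_totlen(iph_set_totlen(big), big, is_gso_tcp=True))
```

The `is_gso_tcp` guard is what keeps a malformed non-GSO packet with tot_len 0 from being treated as a BIG TCP packet.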
Jeff Moyer 1f1aaf023c ipv4/udp: support externally provided ubufs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit c445f31b3cfaa008e110bf548c3a1f0198d332d4
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:33 2022 +0100

    ipv4/udp: support externally provided ubufs
    
    Teach ipv4/udp how to use external ubuf_info provided in msghdr and
    also prepare it for managed frags by sprinkling
    skb_zcopy_downgrade_managed() when it could mix managed and not managed
    frags.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 08:08:02 -04:00
Jeff Moyer 7f1177108f ipv4: avoid partial copy for zc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 8eb77cc73977d88787b37c92831b1c242e035396
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Jul 12 21:52:25 2022 +0100

    ipv4: avoid partial copy for zc
    
    Even when zerocopy transmission is requested and possible,
    __ip_append_data() will still copy a small chunk of data just because it
    allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
    on copy and iter manipulations and also misaligns potentially aligned
    data. Avoid such copies. And as a bonus we can allocate smaller skb.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:59:02 -04:00
Eric Chanudet 2aac422bef ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs()
Bugzilla: https://bugzilla.redhat.com/2159468

commit 6321c7acb82872ef6576c520b0e178eaad3a25c0
Author: Gustavo A. R. Silva <gustavoars@kernel.org>
Date:   Mon Jul 26 14:52:51 2021 -0500

    ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs()

    Fix the following out-of-bounds warning:

        In function 'ip_copy_addrs',
            inlined from '__ip_queue_xmit' at net/ipv4/ip_output.c:517:2:
    net/ipv4/ip_output.c:449:2: warning: 'memcpy' offset [40, 43] from the object at 'fl' is out of the bounds of referenced subobject 'saddr' with type 'unsigned int' at offset 36 [-Warray-bounds]
          449 |  memcpy(&iph->saddr, &fl4->saddr,
              |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          450 |         sizeof(fl4->saddr) + sizeof(fl4->daddr));
              |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The problem is that the original code is trying to copy data into a
    couple of struct members adjacent to each other in a single call to
    memcpy(). This causes a legitimate compiler warning because memcpy()
    overruns the length of &iph->saddr and &fl4->saddr. As these are just
    a couple of struct members, fix this by using direct assignments,
    instead of memcpy().

    This helps with the ongoing efforts to globally enable -Warray-bounds
    and get us closer to being able to tighten the FORTIFY_SOURCE routines
    on memcpy().

    Link: https://github.com/KSPP/linux/issues/109
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/lkml/d5ae2e65-1f18-2577-246f-bada7eee6ccd@intel.com/
    Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Eric Chanudet <echanude@redhat.com>
2023-01-09 13:32:41 -05:00
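
The difference between the two copy styles in the commit above can be mimicked with Python's ctypes. This is only an analogy to the C code, not the kernel change itself: `Addrs` is a hypothetical two-member structure standing in for the adjacent saddr/daddr members, and `ctypes.memmove` plays the role of memcpy().

```python
import ctypes


class Addrs(ctypes.Structure):
    """Two adjacent 32-bit members, like saddr/daddr in the commit message."""
    _fields_ = [("saddr", ctypes.c_uint32), ("daddr", ctypes.c_uint32)]


src = Addrs(saddr=0x0A000001, daddr=0x0A000002)

# Old style: a single copy that starts at saddr but is sized for both
# members, so it runs past the end of the saddr subobject.
via_memcpy = Addrs()
ctypes.memmove(ctypes.addressof(via_memcpy) + Addrs.saddr.offset,
               ctypes.addressof(src) + Addrs.saddr.offset,
               ctypes.sizeof(ctypes.c_uint32) * 2)

# New style: one assignment per member, each access stays in bounds.
via_assign = Addrs()
via_assign.saddr = src.saddr
via_assign.daddr = src.daddr

print(hex(via_memcpy.saddr), hex(via_memcpy.daddr))
print(hex(via_assign.saddr), hex(via_assign.daddr))
```

Both styles end with the same bytes in place; the direct assignments simply avoid relying on the members being adjacent, which is what the compiler warning objects to.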
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Frantisek Hrbata e657290ddd Merge: ipv4: Backport upstream fixes.
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1475

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134815
Upstream Status: linux.git, net.git

Upstream fixes for net/ipv4. Patch order is slightly modified to
reflect the logical dependencies between prerequisites, fixes and
follow-ups.

There are 6 logical fixes:

  1- Patches 1-3 fix how IPv4 handles its options when fragmenting
     packets. Patches 1-2 are prerequisites to make patch 3 a clean
     backport from upstream.

  2- Patch 4 fixes an edge case where the recent TCP source port
     selection algorithm improvements broke some systems which aren't
     able to cope with the bigger table_perturb array.

  3- Patches 5-7 fix another side effect of the improved TCP source
     port selection algorithm, where initialising the table_perturb
     array incurred a latency spike due to its bigger size. Patch 5
     is a trivial prerequisite for making patch 6 a clean upstream
     backport. Patch 6 is the real fix and patch 7 a simple
     follow-up (function renaming).

  4- Patch 8 fixes an interaction problem between the original IPv4
     multipath mechanism and the more recent nexthop infrastructure.

  5- Patch 9 fixes an edge case for fragment handling in IPv6 ping
     sockets. This is IPv6-specific, but the affected function is
     called by both IPv4 and IPv6 and is defined in net/ipv4, hence
     the presence of this patch in the IPv4 backport series.

  6- Patch 10 fixes the interaction between netfilter rpfilter/fib
     modules with VRFs. It fixes nft_fib_ipv[46], by copying existing
     code from ip{,6}t_rpfilter. There's another patch upstream that
     refines this mechanism for all those modules (ipt_rpfilter,
     ip6t_rpfilter, nft_fib_ipv4 and nft_fib_ipv6) to better integrate
     with the VRF infrastructure (upstream commit acc641ab95b6
     ("netfilter: rpfilter/fib: Populate flowic_l3mdev field")). It's
     not backported in this series since centos-stream 9 currently
     misses upstream commit 40867d74c374 ("net: Add l3mdev index to flow
     struct and avoid oif reset for port devices"), which adds the
     flowic_l3mdev field to struct flowi_common.

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-08 02:38:14 -05:00
Guillaume Nault 16f615016a ipv4: fix ip option filtering for locally generated fragments
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134815
Upstream Status: linux.git

commit 27a8caa59babb96c5890569e131bc0eb6d45daee
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Fri Jan 21 16:57:31 2022 -0800

    ipv4: fix ip option filtering for locally generated fragments

    During IP fragmentation we sanitize IP options. This means overwriting
    options which should not be copied with NOPs. Only the first fragment
    has the original, full options.

    ip_fraglist_prepare() copies the IP header and options from previous
    fragment to the next one. Commit 19c3401a91 ("net: ipv4: place control
    buffer handling away from fragmentation iterators") moved sanitizing
    options before ip_fraglist_prepare() which means options are sanitized
    and then overwritten again with the old values.

    Fixing this is not enough, however, nor did the sanitization work
    prior to aforementioned commit.

    ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
    for the length of the options. ipcb->opt of fragments is not populated
    (it's 0), only the head skb has the state properly built. So even when
    called at the right time ip_options_fragment() does nothing. This seems
    to date back all the way to v2.5.44 when the fast path for pre-fragmented
    skbs had been introduced. Prior to that ip_options_build() would have been
    called for every fragment (in fact ever since v2.5.44 the fragmentation
    handling in ip_options_build() has been dead code, I'll clean it up in
    -next).

    In the original patch (see Link) caixf mentions fixing the handling
    for fragments other than the second one, but I'm not sure how _any_
    fragment could have had their options sanitized with the code
    as it stood.

    Tested with python (MTU on lo lowered to 1000 to force fragmentation):

      import socket
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
                   bytearray([7,4,5,192, 20|0x80,4,1,0]))
      s.sendto(b'1'*2000, ('127.0.0.1', 1234))

    Before:

    IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
        localhost.36500 > localhost.search-agent: UDP, length 2000
    IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
        localhost > localhost: udp
    IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
        localhost > localhost: udp

    After:

    IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
        localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
    IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
        localhost > localhost: udp
    IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
        localhost > localhost: udp

    RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".

    Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
    Fixes: 19c3401a91 ("net: ipv4: place control buffer handling away from fragmentation iterators")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: caixf <ooppublic@163.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
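
The sanitization described in the message above can be sketched in user space. This is a minimal model, not the kernel code: the function name is made up for illustration, and the only facts relied on are that option types carry a "copied" flag in bit 0x80 and that options without it are overwritten with NOPs in fragments after the first one.

```python
IPOPT_END = 0      # end of option list
IPOPT_NOOP = 1     # single-byte no-operation option
IPOPT_COPY = 0x80  # "copied" flag in the option type octet

def sanitize_options_for_fragment(options):
    """Return the option bytes a non-first fragment should carry.

    Illustrative helper (hypothetical name): options whose type lacks
    the copied flag are replaced byte-for-byte with NOPs.
    """
    out = bytearray(options)
    i = 0
    while i < len(out):
        opt_type = out[i]
        if opt_type == IPOPT_END:
            break
        if opt_type == IPOPT_NOOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # truncated option: stop rather than guess
        optlen = out[i + 1]
        if optlen < 2 or i + optlen > len(out):
            break  # malformed length: stop rather than guess
        if not (opt_type & IPOPT_COPY):
            out[i:i + optlen] = bytes([IPOPT_NOOP]) * optlen
        i += optlen
    return bytes(out)
```

Fed the option bytes from the python test above, RR (7) becomes four NOPs and RA (20 | 0x80) is kept, which is the shape of the "After" capture for the second and third fragments.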

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-02 09:25:06 +01:00
Guillaume Nault db0ac3b778 net: ipv4: Fix the warning for dereference
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134815
Upstream Status: linux.git

commit 1b9fbe813016b08e08b22ddba4ddbf9cb1b04b00
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Mon Aug 30 17:16:40 2021 +0800

    net: ipv4: Fix the warning for dereference

    Add an if statement to avoid the warning.

    Dan Carpenter reports:
    The patch faf482ca196a: "net: ipv4: Move ip_options_fragment() out of
    loop" from Aug 23, 2021, leads to the following Smatch complaint:

        net/ipv4/ip_output.c:833 ip_do_fragment()
        warn: variable dereferenced before check 'iter.frag' (see line 828)

    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Fixes: faf482ca196a ("net: ipv4: Move ip_options_fragment() out of loop")
    Link: https://lore.kernel.org/netdev/20210830073802.GR7722@kadam/T/#t
    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-02 09:25:06 +01:00
Guillaume Nault 287a6c8276 net: ipv4: Move ip_options_fragment() out of loop
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134815
Upstream Status: linux.git

commit faf482ca196a5b16007190529b3b2dd32ab3f761
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Mon Aug 23 11:17:59 2021 +0800

    net: ipv4: Move ip_options_fragment() out of loop

    ip_options_fragment() is only called when iter->offset is equal to zero,
    so move it out of the loop and inline 'Copy the flags to each fragment.'
    Also remove the unused parameter in ip_frag_ipcb().

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-11-02 09:25:05 +01:00
Jiri Benc e0f797236e net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit d98d58a002619b5c165f1eedcd731e2fe2c19088
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:50 2022 -0800

    net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()

    The previous patches handled the delivery_time before sch_handle_ingress().

    This patch can now set skb->mono_delivery_time to flag that skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp,
    and also clear it with skb_clear_delivery_time() after
    sch_handle_ingress().  This lets bpf_redirect_*() keep the mono
    delivery_time so that it can be used by a qdisc (fq) of the
    egress-ing interface.

    A later patch will postpone the skb_clear_delivery_time() until the
    stack learns that the skb is being delivered locally, which will
    let other kernel forwarding paths (ip[6]_forward) keep
    the delivery_time as well.  Thus, like the previous patches using
    the skb->mono_delivery_time bit, the call to skb_clear_delivery_time()
    is not confined to CONFIG_NET_INGRESS, to avoid too much code
    churn in this set.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc 6619cf0a37 net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] different context in tcp_fragment() due to missing
  a52fe46ef160 ("tcp: factorize ip_summed setting")

commit a1ac9c8acec1605c6b43af418f79facafdced680
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:25 2022 -0800

    net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp

    skb->tstamp was first used as the (rcv) timestamp.
    The major usage is to report it to the user (e.g. SO_TIMESTAMP).

    Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
    during egress and used by the qdisc (e.g. sch_fq) to make decision on when
    the skb can be passed to the dev.

    Currently, there is no way to tell whether skb->tstamp holds the (rcv)
    timestamp or the delivery_time, so it is always reset to 0 whenever
    forwarded between egress and ingress.

    While it makes sense to always clear the (rcv) timestamp in skb->tstamp
    to avoid confusing sch_fq that expects the delivery_time, it is a
    performance issue [0] to clear the delivery_time if the skb finally
    egresses to a fq@phy-dev.  For example, when forwarding from egress to
    ingress and then finally back to egress:

                tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                         ^              ^
                                         reset          reset

    This patch adds one bit skb->mono_delivery_time to flag that skb->tstamp
    is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

    The current use case is to keep the TCP mono delivery_time (EDT) so that
    it can be used with sch_fq.  A later patch will also allow tc-bpf@ingress
    to read and change the mono delivery_time.

    In the future, another bit (e.g. skb->user_delivery_time) can be added
    for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

    [ This patch is a prep work.  The following patches will
      get the other parts of the stack ready first.  Then another patch
      after that will finally set the skb->mono_delivery_time. ]

    skb_set_delivery_time() function is added.  It is used by the tcp_output.c
    and during ip[6] fragmentation to assign the delivery_time to
    the skb->tstamp and also set the skb->mono_delivery_time.

    A note on the change in ip_send_unicast_reply() in ip_output.c:
    it is only used by TCP to send reset/ack out of a ctl_sk.
    Like the new skb_set_delivery_time(), this patch sets
    skb->mono_delivery_time to 0 for now as a placeholder.
    It will be enabled in a later patch.
    A similar case in tcp_ipv6 can be done with
    skb_set_delivery_time() in tcp_v6_send_response().

    [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
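
The idea in the message above can be modelled in a few lines. This is only a sketch of the intended end state of the series, under the assumption that a forwarded skb keeps skb->tstamp when the new bit is set and loses it otherwise; the class and function names are illustrative and are not the kernel helpers.

```python
from dataclasses import dataclass

@dataclass
class Skb:
    """Toy stand-in for the two sk_buff fields discussed above."""
    tstamp: int = 0                   # rcv timestamp or delivery time
    mono_delivery_time: bool = False  # True: tstamp is a mono EDT

def set_delivery_time(skb, kt, mono):
    # Store the delivery time; the flag is only meaningful for a
    # non-zero timestamp.
    skb.tstamp = kt
    skb.mono_delivery_time = bool(kt) and mono

def clear_tstamp_on_forward(skb):
    # Forwarding between egress and ingress: drop an rcv timestamp so
    # sch_fq is not confused, but keep a mono delivery time.
    if skb.mono_delivery_time:
        return
    skb.tstamp = 0
```

With this model, an skb stamped by the TCP sender keeps its EDT across the veth hop, while a plain receive timestamp is still reset to 0.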

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:59 +02:00
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```
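
To get a quick per-reason tally out of that output, something like the sketch below may help. It assumes the line layout shown in the sample above (a `skb:kfree_skb:` event ending in `reason: NAME`); the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the trailing "reason: NAME" of a skb:kfree_skb event line,
# assuming the perf script layout shown in the sample output.
REASON_RE = re.compile(r"skb:kfree_skb:.*\breason: (\w+)")

def count_drop_reasons(lines):
    """Count drop reasons in an iterable of perf script output lines."""
    counts = Counter()
    for line in lines:
        match = REASON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing the lines of a saved `perf script` capture to `count_drop_reasons()` and printing `most_common()` gives the reasons sorted by frequency.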

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Antoine Tenart 620d4ff739 net: ip: add skb drop reasons for ip egress path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 5e187189ec324f78035d33a4bc123a9c4ca6f3e3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sat Feb 26 12:18:29 2022 +0800

    net: ip: add skb drop reasons for ip egress path

    Replace kfree_skb() which is used in the packet egress path of IP layer
    with kfree_skb_reason(). Functions that are involved include:

    __ip_queue_xmit()
    ip_finish_output()
    ip_mc_finish_output()
    ip6_output()
    ip6_finish_output()
    ip6_finish_output2()

    Following new drop reasons are introduced:

    SKB_DROP_REASON_IP_OUTNOROUTES
    SKB_DROP_REASON_BPF_CGROUP_EGRESS
    SKB_DROP_REASON_IPV6DISABLED
    SKB_DROP_REASON_NEIGH_CREATEFAIL

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Paolo Abeni 38e724189c net: Fix data-races around sysctl_[rw]mem_(max|default).
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:44 2022 -0700

    net: Fix data-races around sysctl_[rw]mem_(max|default).

    While reading sysctl_[rw]mem_(max|default), they can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its readers.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Patrick Talbert f311aab772 Merge: net: backport core fixes from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/832

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920

A bunch of fixes for net core path.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-18 10:58:56 +02:00
Hangbin Liu e333d6a1da net-timestamp: convert sk->sk_tskey to atomic_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit a1cdec57e03a

commit a1cdec57e03a1352e92fbbe7974039dda4efcec0
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 17 09:05:02 2022 -0800

    net-timestamp: convert sk->sk_tskey to atomic_t

    UDP sendmsg() can be lockless, this is causing all kinds
    of data races.

    This patch converts sk->sk_tskey to remove one of these races.

    BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

    read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
     __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
     ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
     udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
     inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
     __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
     ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
     udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
     inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x0000054d -> 0x0000054e

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Guillaume Nault 8a791ba8a6 ipv4: tcp: send zero IPID in SYNACK messages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081383
Upstream Status: linux.git

commit 970a5a3ea86da637471d3cd04d513a0755aba4bf
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jan 26 17:10:21 2022 -0800

    ipv4: tcp: send zero IPID in SYNACK messages

    In commit 431280eebe ("ipv4: tcp: send zero IPID for RST and
    ACK sent in SYN-RECV and TIME-WAIT state") we took care of some
    ctl packets sent by TCP.

    It turns out we need to use a similar strategy for SYNACK packets.

    By default, they carry IP_DF and IPID==0, but there are ways
    to ask them to use the hashed IP ident generator and thus
    be used to build off-path attacks.
    (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)

    One of these ways is to force (before listener is started)
    echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc

    Another way is using forged ICMP ICMP_FRAG_NEEDED
    with a very small MTU (like 68) to force a false return from
    ip_dont_fragment()

    In this patch, ip_build_and_send_pkt() uses the following
    heuristics.

    1) Most SYNACK packets are smaller than IPV4_MIN_MTU and therefore
    can use IP_DF regardless of the listener or route pmtu setting.

    2) In case the SYNACK packet is bigger than IPV4_MIN_MTU,
    we use prandom_u32() generator instead of the IPv4 hashed ident one.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Ray Che <xijiache@gmail.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Cc: Geoff Alexander <alexandg@cs.unm.edu>
    Cc: Willy Tarreau <w@1wt.eu>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
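
The two heuristics listed above can be sketched as a small decision function. This is a user-space model only: the function name and the `route_wants_df` parameter are illustrative, `secrets` stands in for the kernel's prandom_u32(), and treating a packet of exactly IPV4_MIN_MTU bytes as "small" is an assumption.

```python
import secrets

IPV4_MIN_MTU = 68  # smallest MTU an IPv4 link must support

def choose_df_and_ipid(pkt_len, route_wants_df):
    """Return (df_bit, ip_id) for a SYNACK under the heuristics above.

    Hypothetical helper, not the kernel function.
    """
    if route_wants_df or pkt_len <= IPV4_MIN_MTU:
        # Small packets cannot need fragmentation, so DF with IPID 0
        # is safe whatever the listener or route PMTU setting says.
        return True, 0
    # Larger SYNACK: avoid the hashed ident generator and use a
    # random 16-bit id instead.
    return False, secrets.randbelow(1 << 16)
```

The point of the random branch is that the id no longer depends on shared hashed state that an off-path attacker could probe.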

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-05-03 17:05:17 +02:00
Paolo Abeni e696647fbb ipv4: use skb_expand_head in ip_finish_output2
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 5678a59579647c4d9affe5e6544baf7645b41e4f
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Mon Aug 2 11:52:35 2021 +0300

    ipv4: use skb_expand_head in ip_finish_output2

    Unlike skb_realloc_headroom, new helper skb_expand_head
    does not allocate a new skb if possible.

    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:30 +01:00
Jakub Kicinski 6d123b81ac net: ip: avoid OOM kills with large UDP sends over loopback
Dave observed a number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has an MTU of 64k, the kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse, if
the message length is <32k the allocation may trigger an
OOM killer.

This is entirely avoidable, we can use an skb with page frags.

af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if the underlying device supports SG. If extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.

v4: pre-calculate all the additions to alloclen so
    we can be sure it won't go over order-2
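
A rough model of the resulting allocation choice, for illustration only. SKB_MAX_ALLOC is approximated as 16 kB from the figure quoted above, and the function name, the `alloc_extra` parameter and the `head_len` default are hypothetical values chosen for the sketch, not taken from the patch.

```python
SKB_MAX_ALLOC = 16 * 1024  # approximation of the "> 16kB" threshold above

def split_allocation(fraglen, alloc_extra, device_supports_sg, head_len=128):
    """Return (linear_len, paged_len) for one skb.

    head_len is a hypothetical size for the small linear head used
    when the payload goes to page frags.
    """
    if fraglen + alloc_extra < SKB_MAX_ALLOC or not device_supports_sg:
        # Small enough, or the device cannot do scatter-gather:
        # keep everything in the linear head.
        return fraglen, 0
    linear = min(fraglen, head_len)
    return linear, fraglen - linear
```

Adding `alloc_extra` before the comparison mirrors the v4 note: all additions are accounted for up front, so the linear allocation cannot creep past the limit.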

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24 11:17:21 -07:00
Bhaskar Chowdhury a66e04ce0e ipv4: ip_output.c: Couple of typo fixes
s/readibility/readability/
s/insufficent/insufficient/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28 17:31:13 -07:00
Brian Vazquez 6585d7dc49 net: use indirect call helpers for dst_output
This patch avoids the indirect call for the common case:
ip6_output and ip_output

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:39 -08:00
Jakub Kicinski 833d22f2f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN on file rename.

Conflicts:
	drivers/net/can/m_can/tcan4x5x-core.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 13:28:00 -08:00
Jonathan Lemon 8e04491724 skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon 8c793822c5 skbuff: rename sock_zerocopy_* to msg_zerocopy_*
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:35 -08:00
Jonathan Lemon 236a6b1cd5 skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation.  Add a wrapper
which checks the callback and dispatches appropriately.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Florian Westphal bb4cc1a188 net: ip: always refragment ip defragmented packets
Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.

In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.

This should work fine, but this breaks with ip tunnels.
Consider a client that sends a UDP datagram of size X to another host.

The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.

Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.

packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set.  At output time, ip refragmentation is skipped as well
because X is still smaller than the mtu of the output device.

If the transmit device is an ip tunnel, the packet size increases to
X+overhead.

Also, tunnel might be configured to force DF bit on outer header.

In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.

But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.

Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.

The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.
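
The behaviour change can be pictured with a toy decision function. This is a simplified model that ignores the DF handling entirely; the function name is made up, and `frag_max_size` stands for the largest fragment size recorded at reassembly, with 0 meaning the packet was never defragmented.

```python
def refragment_limit(skb_len, mtu, frag_max_size):
    """Return the fragment size limit, or None if the packet goes out whole.

    Hypothetical model of "always refragment defragmented packets".
    """
    if skb_len > mtu or frag_max_size:
        # A reassembled packet is split again along its original
        # fragment size, even when it would fit the output MTU.
        if frag_max_size:
            return min(mtu, frag_max_size)
        return mtu
    return None
```

In the tunnel scenario above, the size-X packet is split back into pieces no larger than the original ones before encapsulation, so each encapsulated fragment has room for the tunnel overhead.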

Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 14:42:36 -08:00
Paul Moore 3df98d7921 lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely.  As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-11-23 18:36:21 -05:00
Linus Torvalds 9ff9b0d392 networking changes for the 5.10 merge window
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
 traversal in common container configs and improving TCP back-pressure.
 Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
 
 Expand netlink policy support and improve policy export to user space.
 (Ge)netlink core performs request validation according to declared
 policies. Expand the expressiveness of those policies (min/max length
 and bitmasks). Allow dumping policies for particular commands.
 This is used for feature discovery by user space (instead of kernel
 version parsing or trial and error).
 
 Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
 
 Allow more than 255 IPv4 multicast interfaces.
 
 Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
 packets of TCPv6.
 
 In Multipath TCP (MPTCP) support concurrent transmission of data
 on multiple subflows in a load balancing scenario. Enhance advertising
 addresses via the RM_ADDR/ADD_ADDR options.
 
 Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
 
 Allow more calls to same peer in RxRPC.
 
 Support two new Controller Area Network (CAN) protocols -
 CAN-FD and ISO 15765-2:2016.
 
 Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
 kernel problem.
 
 Add TC actions for implementing MPLS L2 VPNs.
 
 Improve nexthop code - e.g. handle various corner cases when nexthop
 objects are removed from groups better, skip unnecessary notifications
 and make it easier to offload nexthops into HW by converting
 to a blocking notifier.
 
 Support adding and consuming TCP header options by BPF programs,
 opening the doors for easy experimental and deployment-specific
 TCP option use.
 
 Reorganize TCP congestion control (CC) initialization to simplify life
 of TCP CC implemented in BPF.
 
 Add support for shipping BPF programs with the kernel and loading them
 early on boot via the User Mode Driver mechanism, hence reusing all the
 user space infra we have.
 
 Support sleepable BPF programs, initially targeting LSM and tracing.
 
 Add bpf_d_path() helper for returning full path for given 'struct path'.
 
 Make bpf_tail_call compatible with bpf-to-bpf calls.
 
 Allow BPF programs to call map_update_elem on sockmaps.
 
 Add BPF Type Format (BTF) support for type and enum discovery, as
 well as support for using BTF within the kernel itself (current use
 is for pretty printing structures).
 
 Support listing and getting information about bpf_links via the bpf
 syscall.
 
 Enhance kernel interfaces around NIC firmware update. Allow specifying
 overwrite mask to control if settings etc. are reset during update;
 report expected max time operation may take to users; support firmware
 activation without machine reboot incl. limits of how much impact
 reset may have (e.g. dropping link or not).
 
 Extend ethtool configuration interface to report IEEE-standard
 counters, to limit the need for per-vendor logic in user space.
 
 Adopt or extend devlink use for debug, monitoring, fw update
 in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
 mv88e6xxx, dpaa2-eth).
 
 In mlxsw expose critical and emergency SFP module temperature alarms.
 Refactor port buffer handling to make the defaults more suitable and
 support setting these values explicitly via the DCBNL interface.
 
 Add XDP support for Intel's igb driver.
 
 Support offloading TC flower classification and filtering rules to
 mscc_ocelot switches.
 
 Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
 fixed interval period pulse generator and one-step timestamping in
 dpaa-eth.
 
 Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
 offload.
 
 Add Lynx PHY/PCS MDIO module, and convert various drivers which have
 this HW to use it. Convert mvpp2 to split PCS.
 
 Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
 7-port Mediatek MT7531 IP.
 
 Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
 and wcn3680 support in wcn36xx.
 
 Improve performance for packets which don't require much offloads
 on recent Mellanox NICs by 20% by making multiple packets share
 a descriptor entry.
 
 Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
 subtree to drivers/net. Move MDIO drivers out of the phy directory.
 
 Clean up a lot of W=1 warnings, reportedly the actively developed
 subsections of networking drivers should now build W=1 warning free.
 
 Make sure drivers don't use in_interrupt() to dynamically adapt their
 code. Convert tasklets to use new tasklet_setup API (sadly this
 conversion is not yet complete).
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multipath TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require many offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings; reportedly, the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Linus Torvalds c90578360c Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy_and_csum cleanups from Al Viro:
 "Saner calling conventions for csum_and_copy_..._user() and friends"

[ Removing 800+ lines of code and cleaning stuff up is good - Linus ]

* 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ppc: propagate the calling conventions change down to csum_partial_copy_generic()
  amd64: switch csum_partial_copy_generic() to new calling conventions
  sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
  xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
  mips: propagate the calling convention change down into __csum_partial_copy_..._user()
  mips: __csum_partial_copy_kernel() has no users left
  mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
  sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
  i386: propagate the calling conventions change down to csum_partial_copy_generic()
  sh: propage the calling conventions change down to csum_partial_copy_generic()
  m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
  arm: propagate the calling convention changes down to csum_partial_copy_from_user()
  alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
  saner calling conventions for csum_and_copy_..._user()
  csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
  csum_partial_copy_nocheck(): drop the last argument
  unify generic instances of csum_partial_copy_nocheck()
  icmp_push_reply(): reorder adding the checksum up
  skb_copy_and_csum_bits(): don't bother with the last argument
2020-10-12 16:24:13 -07:00
David S. Miller 3ab0a7a0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing its
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22 16:45:34 -07:00
Wei Wang de033b7d15 ip: pass tos into ip_build_and_send_pkt()
This commit adds tos as a new parameter to ip_build_and_send_pkt(),
which will be used in a later commit.
This is a pure restructuring and does not have any functional change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang ba9e04a7dd ip: fix tos reflection in ack and reset packets
Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
echo the TOS value of the received packets in the response.
However, we do not want to echo the lower 2 ECN bits in accordance
with RFC 3168 6.1.5 robustness principles.
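
As a rough illustration of the masking described above (not the code from
this commit), the sketch below clears the two low-order ECN bits of a
received TOS byte before it is echoed. The names EXAMPLE_ECN_MASK and
example_reflect_tos() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constant (assumption): ECN uses the two low-order bits of the TOS byte. */
#define EXAMPLE_ECN_MASK 0x03u

/* Hypothetical helper: the TOS value to echo in a reply, with the ECN bits cleared. */
static uint8_t example_reflect_tos(uint8_t rx_tos)
{
	uint8_t tos = (uint8_t)(rx_tos & ~EXAMPLE_ECN_MASK);

	/* The reflected value must never carry ECN bits. */
	assert((tos & EXAMPLE_ECN_MASK) == 0);
	return tos;
}
```

For instance, a received TOS of 0xBB (DSCP bits 0xB8 plus ECN bits 0x03)
would be reflected as 0xB8 under this sketch.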

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:19:08 -07:00
Miaohe Lin 5af68891dc net: clean up codestyle
This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:34 -07:00
Miaohe Lin cbc08a3312 net: Use helper macro IP_MAX_MTU in __ip_append_data()
The 0xFFFF here is actually the maximum MTU of an IP packet. Use the
helper macro IP_MAX_MTU instead.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:16 -07:00
Miaohe Lin 343d8c6014 net: clean up codestyle for net/ipv4
This is a pure codestyle cleanup patch. Also add a blank line after
declarations as warned by checkpatch.pl.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:28:02 -07:00
Al Viro cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.
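
The "same modulo 0xffff" remark can be checked with a small user-space
sketch of ones' complement summation. This is not the kernel
implementation; example_csum_words(), example_csum_fold16() and
example_check_initial_sum() are hypothetical names, and the sketch assumes
a correct end-around-carry sum (which, per the note above, not every
architecture-specific routine provided).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit ones' complement accumulator down to 16 bits. */
static uint16_t example_csum_fold16(uint32_t sum)
{
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Add 16-bit words to an initial sum, wrapping carries around (end-around carry). */
static uint32_t example_csum_words(const uint16_t *words, size_t n, uint32_t init)
{
	uint64_t acc = init;
	size_t i;

	for (i = 0; i < n; i++)
		acc += words[i];
	/* Fold 64 bits to 32 bits; each carry out of bit 31 is added back in. */
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Check that initial sums of 0 and ~0U agree modulo 0xffff for the given data. */
static void example_check_initial_sum(const uint16_t *words, size_t n)
{
	uint16_t a = example_csum_fold16(example_csum_words(words, n, 0));
	uint16_t b = example_csum_fold16(example_csum_words(words, n, 0xffffffffu));

	/* 0x0000 and 0xffff both represent zero in ones' complement arithmetic. */
	assert(a % 0xffffu == b % 0xffffu);
}
```

The folded results are compared modulo 0xffff rather than for strict
equality because all-zero data yields 0x0000 with an initial sum of 0 but
0xffff with an initial sum of ~0U.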

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Al Viro 8d5930dfb7 skb_copy_and_csum_bits(): don't bother with the last argument
it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
David S. Miller 71930d6102 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
All conflicts seemed rather trivial, with some guidance from
Saeed Mahameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11 00:46:00 -07:00