731 Commits

Author SHA1 Message Date
CKI Backport Bot 9bb50ec24d net: fix data-races around sk->sk_forward_alloc
JIRA: https://issues.redhat.com/browse/RHEL-69689
CVE: CVE-2024-53124

commit 073d89808c065ac4c672c0a613a71b27a80691cb
Author: Wang Liang <wangliang74@huawei.com>
Date:   Thu Nov 7 10:34:05 2024 +0800

    net: fix data-races around sk->sk_forward_alloc

    Syzkaller reported this warning:
     ------------[ cut here ]------------
     WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
     Modules linked in:
     CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
     RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
     Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
     RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
     RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
     RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
     RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
     R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
     R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
     FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      <TASK>
      ? __warn+0x88/0x130
      ? inet_sock_destruct+0x1c5/0x1e0
      ? report_bug+0x18e/0x1a0
      ? handle_bug+0x53/0x90
      ? exc_invalid_op+0x18/0x70
      ? asm_exc_invalid_op+0x1a/0x20
      ? inet_sock_destruct+0x1c5/0x1e0
      __sk_destruct+0x2a/0x200
      rcu_do_batch+0x1aa/0x530
      ? rcu_do_batch+0x13b/0x530
      rcu_core+0x159/0x2f0
      handle_softirqs+0xd3/0x2b0
      ? __pfx_smpboot_thread_fn+0x10/0x10
      run_ksoftirqd+0x25/0x30
      smpboot_thread_fn+0xdd/0x1d0
      kthread+0xd3/0x100
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x34/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1a/0x30
      </TASK>
     ---[ end trace 0000000000000000 ]---

    It's possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
    concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
    which triggers a data-race around sk->sk_forward_alloc:
    tcp_v6_rcv
        tcp_v6_do_rcv
            skb_clone_and_charge_r
                sk_rmem_schedule
                    __sk_mem_schedule
                        sk_forward_alloc_add()
                skb_set_owner_r
                    sk_mem_charge
                        sk_forward_alloc_add()
            __kfree_skb
                skb_release_all
                    skb_release_head_state
                        sock_rfree
                            sk_mem_uncharge
                                sk_forward_alloc_add()
                                sk_mem_reclaim
                                    // set local var reclaimable
                                    __sk_mem_reclaim
                                        sk_forward_alloc_add()

    In this syzkaller testcase, two threads call
    tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
    this:
     (cpu 1)             | (cpu 2)             | sk_forward_alloc
     ...                 | ...                 | 0
     __sk_mem_schedule() |                     | +4096 = 4096
                         | __sk_mem_schedule() | +4096 = 8192
     sk_mem_charge()     |                     | -768  = 7424
                         | sk_mem_charge()     | -768  = 6656
     ...                 |    ...              |
     sk_mem_uncharge()   |                     | +768  = 7424
     reclaimable=7424    |                     |
                         | sk_mem_uncharge()   | +768  = 8192
                         | reclaimable=8192    |
     __sk_mem_reclaim()  |                     | -4096 = 4096
                         | __sk_mem_reclaim()  | -8192 = -4096 != 0

    skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
    sk->sk_state is TCP_LISTEN; it happens later in tcp_v6_syn_recv_sock().
    Fix the same issue in dccp_v6_do_rcv().
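
The arithmetic in the table above can be replayed in a few lines of userspace C. This is only a sketch of the bookkeeping, not kernel code: the `demo_` names are hypothetical, the interleaving is hard-coded in the order the table shows, and each schedule step is assumed to add exactly one 4096-byte page (true when truesize fits in one page).

```c
#define DEMO_PAGE_SIZE 4096L

/* Round a byte count down to whole pages, as the reclaim step does. */
static long demo_whole_pages(long bytes)
{
	return (bytes / DEMO_PAGE_SIZE) * DEMO_PAGE_SIZE;
}

/*
 * Replay the unlocked interleaving from the commit message and return the
 * final forward-alloc value. Assumes truesize <= DEMO_PAGE_SIZE.
 */
long demo_replay_forward_alloc(long truesize)
{
	long fwd = 0;
	long reclaimable_cpu1, reclaimable_cpu2;

	fwd += DEMO_PAGE_SIZE;		/* cpu 1: schedule one page */
	fwd += DEMO_PAGE_SIZE;		/* cpu 2: schedule one page */
	fwd -= truesize;		/* cpu 1: charge */
	fwd -= truesize;		/* cpu 2: charge */

	fwd += truesize;		/* cpu 1: uncharge */
	reclaimable_cpu1 = fwd;		/* cpu 1: snapshot, soon stale */
	fwd += truesize;		/* cpu 2: uncharge */
	reclaimable_cpu2 = fwd;		/* cpu 2: snapshot includes cpu 1's page */

	fwd -= demo_whole_pages(reclaimable_cpu1);	/* cpu 1: reclaim */
	fwd -= demo_whole_pages(reclaimable_cpu2);	/* cpu 2: reclaim again */
	return fwd;
}
```

Calling demo_replay_forward_alloc(768) walks through the same intermediate values as the table. Both CPUs reclaim from snapshots that count the same pages, which is why the result ends up below zero.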

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Wang Liang <wangliang74@huawei.com>
    Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-12-02 14:40:01 +00:00
Paolo Abeni 14cb20c8b2 tcp: fix race in tcp_v6_syn_recv_sock()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit d37fe4255abe8e7b419b90c5847e8ec2b8debb08
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 6 15:46:51 2024 +0000

    tcp: fix race in tcp_v6_syn_recv_sock()

    tcp_v6_syn_recv_sock() calls ip6_dst_store() before
    inet_sk(newsk)->pinet6 has been set up.

    This means ip6_dst_store() writes over the parent (listener)
    np->dst_cookie.

    This is racy because multiple threads could share the same
    parent and their final np->dst_cookie could be wrong.

    Move ip6_dst_store() call after inet_sk(newsk)->pinet6
    has been changed and after the copy of parent ipv6_pinfo.

    Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:14 +02:00
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9 \
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;
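
The ISN computation quoted above relies on unsigned 32-bit wraparound. A minimal userspace sketch follows; the function name `demo_tw_isn` is hypothetical. Zero is skipped because, as the `!isn` test in tcp_conn_request() suggests, zero is read as "no ISN was supplied by a TIMEWAIT socket".

```c
#include <stdint.h>

/* Derive the SYNACK ISN from the TIMEWAIT socket's tw_snd_nxt value. */
uint32_t demo_tw_isn(uint32_t tw_snd_nxt)
{
	uint32_t isn = tw_snd_nxt + 65535 + 2;	/* wraps modulo 2^32 */

	if (isn == 0)		/* zero is reserved for "no TIMEWAIT ISN" */
		isn++;
	return isn;
}
```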

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }
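
The condition in the snippet above can be isolated as a pure predicate, which makes the TIMEWAIT bypass easier to see. This is a sketch with a hypothetical name (`demo_must_consider_syn_flood`); it models only the boolean test, not tcp_syn_flood_action() or the drop path.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the SYN is subject to the listen-queue/syncookie check.
 * A non-zero tw_isn means the SYN hit a TIMEWAIT socket and skips the check.
 */
bool demo_must_consider_syn_flood(int syncookies, bool queue_full,
				  uint32_t tw_isn)
{
	return (syncookies == 2 || queue_full) && tw_isn == 0;
}
```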

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
Paolo Abeni 4cd846284a tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b9e810405880c99baafd550ada7043e86465396e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:21 2024 +0000

    tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

    tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
    out if the request socket is created by a SYN hitting a TIMEWAIT socket.

    This has been buggy for a decade, let's directly pass the information
    from tcp_conn_request().

    This is a preparatory patch to make the following one easier to review.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:07:53 +02:00
Florian Westphal 23f780623d tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit 69e0b33a7fce

CS9 lacks both support for TCP Authentication option and usec
resolution for TCP timestamps.
Both features are out of scope, so do needed context fixups.
This change was added to reduce conflicts in the followup patch.

commit 69e0b33a7fce4d96649b9fa32e56b696921aa48e
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jun 3 15:51:06 2024 +0000

    tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp

    These fields can be read and written locklessly, add annotations
    around these minor races.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts: net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:55:25 +02:00
Lucas Zampieri 55f96777fb Merge: net: backport visibility improvements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4765

JIRA: https://issues.redhat.com/browse/RHEL-48648

Various visibility improvements; mainly around drop reasons, reset reason and improved tracepoints this time.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-12 16:18:50 +00:00
Antoine Tenart 3a0f9f0ce0 tcp: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit 46a02aa357529d7b038096955976b14f7c44aa23
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:20 2024 -0700

    tcp: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 0bc1f777a4 tcp: rstreason: handle timewait cases in the receive path
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit 22a32557758a7100e46dfa8f383a401125e60b16
Author: Jason Xing <kernelxing@tencent.com>
Date:   Fri May 10 20:25:01 2024 +0800

    tcp: rstreason: handle timewait cases in the receive path

    There are two possible cases where TCP layer can send an RST. Since they
    happen in the same place, I think using one independent reason is enough
    to identify this special situation.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 51c78f9a4a rstreason: make it work in trace world
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b533fb9cf4f7c6ca2aa255a5a1fdcde49fff2b24
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:40 2024 +0800

    rstreason: make it work in trace world

    At last, we should let it work by introducing this reset reason in
    trace world.

    One of the possible expected outputs is:
    ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
    state=TCP_ESTABLISHED reason=NOT_SPECIFIED

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8ea5cff87d tcp: support rstreason for passive reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit 120391ef9ca8fe8f82ea3f2961ad802043468226
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:37 2024 +0800

    tcp: support rstreason for passive reset

    Reuse the dropreason logic to show the exact reason of tcp reset,
    so we can finally display the corresponding item in enum sk_reset_reason
    instead of reinventing new reset reasons. This patch replaces all
    the prior NOT_SPECIFIED reasons.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 25344d90dd rstreason: prepare for passive reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context differences due to missing upstream commits ba7783ad45c8
  ("net/tcp: Add AO sign to RST packets") and d5dfbfa2f88e ("mptcp: drop
  duplicate header inclusions") in c9s.

commit 6be49deaa09576c141002a2e6f816a1709bc2c86
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:35 2024 +0800

    rstreason: prepare for passive reset

    Adjust the parameter and support passing reason of reset which
    is for now NOT_SPECIFIED. No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 528623fc31 trace: tcp: fully support trace_tcp_send_reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context differences due to missing upstream commits ba7783ad45c8
  ("net/tcp: Add AO sign to RST packets") and 3cccda8db2cf ("ipv6: move
  np->repflow to atomic flags") in c9s.

commit 19822a980e1956a6572998887a7df5a0607a32f6
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Apr 1 15:36:05 2024 +0800

    trace: tcp: fully support trace_tcp_send_reset

    Prior to this patch, enabling trace_tcp_send_reset only shows
    resets sent under two circumstances:
    1) active rst mode
    2) non-active rst mode and based on the full socket

    That means the inconsistency occurs if we use tcpdump and trace
    simultaneously to see how rst happens.

    It's necessary to take other cases into consideration,
    say:
    1) time-wait socket
    2) no socket
    ...

    By parsing the incoming skb and reversing its 4-tuple, we
    can identify the exact 'flow', which might not exist.

    Samples after applying this patch:
    1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
    state=TCP_ESTABLISHED
    2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
    state=UNKNOWN
    Note:
    1) UNKNOWN means we cannot extract the right information from skb.
    2) skbaddr/skaddr could be 0
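
Reversing the 4-tuple of an incoming packet, as described above, amounts to swapping source and destination. The following sketch uses a hypothetical IPv4-sized `struct demo_four_tuple`; the real tracepoint works on skb headers and also handles IPv6, which is not modelled here.

```c
#include <stdint.h>

/* Hypothetical flat representation of a flow's addresses and ports. */
struct demo_four_tuple {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Build the tuple of the reply (e.g. an RST) from the incoming packet. */
struct demo_four_tuple demo_reverse_tuple(struct demo_four_tuple in)
{
	struct demo_four_tuple out = {
		.saddr = in.daddr,
		.daddr = in.saddr,
		.sport = in.dport,
		.dport = in.sport,
	};

	return out;
}
```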

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Link: https://lore.kernel.org/r/20240401073605.37335-3-kerneljasonxing@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8e320d89a7 tcp: make dropreason in tcp_child_process() work
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit ee01defe25bad09a37b68dd051a7e931d1e4cd91
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:27 2024 +0800

    tcp: make dropreason in tcp_child_process() work

    It's time to let it work right now. We've already prepared for this:)

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 8f346a11e7 tcp: make the dropreason really work when calling tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b9825695930546af725b1e686b8eaf4c71201728
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:26 2024 +0800

    tcp: make the dropreason really work when calling tcp_rcv_state_process()

    Update three callers including both ipv4 and ipv6 and let the dropreason
    mechanism work in reality.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 1042a45152 tcp: directly drop skb in cookie check for ipv6
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commits efce3d1fdff5 ("tcp:
  Don't initialise tp->tsoffset in tcp_get_cookie_sock().") and
  8e7bab6b9652 ("tcp: Factorise cookie-dependent fields initialisation
  in cookie_v[46]_check()").

commit ed43e76cdcc497e2b27d84db27e7df5612be2643
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:21 2024 +0800

    tcp: directly drop skb in cookie check for ipv6

    As the previous patch does, only move the skb drop logic to
    cookie_v6_check() for later refinement.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Felix Maurer 8fc3cda22c net/tcp: refactor tcp_inet6_sk()
JIRA: https://issues.redhat.com/browse/RHEL-30902

commit fe79bd65c819cc520aa66de65caae8e4cea29c5a
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri May 19 14:30:36 2023 +0100

    net/tcp: refactor tcp_inet6_sk()
    
    Don't keep hand coded offset calculations and replace it with
    container_of(). It should be type safer and a bit less confusing.
    
    It also makes it with a macro instead of inline function to preserve
    constness, which was previously casted out like in case of
    tcp_v6_send_synack().
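
The container_of() idiom mentioned above can be sketched in userspace with offsetof(). All `demo_` types and names are hypothetical stand-ins, and this version is a plain function, so unlike the macro the commit describes it does not preserve constness.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_tcp_sock { int snd_nxt; };
struct demo_ipv6_pinfo { int dst_cookie; };

struct demo_tcp6_sock {
	struct demo_tcp_sock tcp;	/* embedded generic TCP part */
	struct demo_ipv6_pinfo inet6;	/* IPv6-specific part */
};

/* Go from the embedded member to its container, then to the IPv6 part. */
struct demo_ipv6_pinfo *demo_tcp_inet6_sk(struct demo_tcp_sock *tp)
{
	return &demo_container_of(tp, struct demo_tcp6_sock, tcp)->inet6;
}
```

Because the member name is spelled out, the compiler computes the offset, and a layout change cannot silently invalidate a hand-written constant.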
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2024-06-26 17:17:16 +02:00
Hangbin Liu 1446db33d3 ipv6: remove hard coded limitation on ipv6_pinfo
JIRA: https://issues.redhat.com/browse/RHEL-31050
Upstream Status: net.git commit f5f80e32de12

Conflicts: context conflicts due to missing commit
67fb43308f4b ("udp: Set NULL to sk->sk_prot->h.udp_table.").

commit f5f80e32de12fad2813d37270e8364a03e6d3ef0
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jul 20 11:09:01 2023 +0000

    ipv6: remove hard coded limitation on ipv6_pinfo

    IPv6 inet sockets are supposed to have a "struct ipv6_pinfo"
    field at the end of their definition, so that inet6_sk_generic()
    can derive from socket size the offset of the "struct ipv6_pinfo".

    This is very fragile, and prevents adding bigger alignment
    in sockets, because inet6_sk_generic() does not work
    if the compiler adds padding after the ipv6_pinfo component.

    We are currently working on a patch series to reorganize
    TCP structures for better data locality and found issues
    similar to the one fixed in commit f5d547676c
    ("tcp: fix tcp_inet6_sk() for 32bit kernels")

    Alternative would be to force an alignment on "struct ipv6_pinfo",
    greater or equal to __alignof__(any ipv6 sock) to ensure there is
    no padding. This does not look great.
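
The fragility described above can be reproduced in userspace: once a struct has tail padding, "size of the socket minus size of the last member" no longer equals the member's offset. The sketch below uses hypothetical `demo_` types, forces padding with C11 `_Alignas`, and assumes `int` is smaller than 16 bytes. Recording the offset explicitly with offsetof() is shown as the robust alternative.

```c
#include <stddef.h>

struct demo_pinfo { int dst_cookie; };

struct demo_sock6 {
	_Alignas(16) char base[16];	/* raises the struct's alignment to 16 */
	struct demo_pinfo inet6;	/* last member, followed by tail padding */
};

/* Fragile: derives the offset from the overall size. */
size_t demo_offset_from_size(void)
{
	return sizeof(struct demo_sock6) - sizeof(struct demo_pinfo);
}

/* Robust: asks the compiler where the member really is. */
size_t demo_offset_explicit(void)
{
	return offsetof(struct demo_sock6, inet6);
}
```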

    v2: fix typo in mptcp_proto_v6_init() (Paolo)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Chao Wu <wwchao@google.com>
    Cc: Wei Wang <weiwan@google.com>
    Cc: Coco Li <lixiaoyan@google.com>
    Cc: YiFei Zhu <zhuyifei@google.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2024-04-02 17:50:46 +08:00
Antoine Tenart a8adbce266 net: ipv6: fix skb hash for some RST packets
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: linux.git

commit dc6456e938e938d64ffb6383a286b2ac9790a37f
Author: Antoine Tenart <atenart@kernel.org>
Date:   Thu Apr 27 11:21:59 2023 +0200

    net: ipv6: fix skb hash for some RST packets

    The skb hash comes from sk->sk_txhash when using TCP, except for some
    IPv6 RST packets. This is because in tcp_v6_send_reset when not in
    TIME_WAIT the hash is taken from sk->sk_hash, while it should come from
    sk->sk_txhash as those two hashes are not computed the same way.

    Packetdrill script to test the above,

       0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
      +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

      +0 > (flowlabel 0x1) S 0:0(0) <...>

      // Wrong ack seq, trigger a rst.
      +0 < S. 0:0(0) ack 0 win 4000

      // Check the flowlabel matches prior one from SYN.
      +0 > (flowlabel 0x1) R 0:0(0) <...>

    Fixes: 9258b8b1be2e ("ipv6: tcp: send consistent autoflowlabel in RST packets")
    Signed-off-by: Antoine Tenart <atenart@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:22 +02:00
Antoine Tenart b4f1329917 ipv6: tcp: send consistent autoflowlabel in RST packets
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: linux.git

commit 9258b8b1be2e1e241baf8aa703aba1086069ee0f
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Sep 22 09:50:36 2022 -0700

    ipv6: tcp: send consistent autoflowlabel in RST packets

    Blamed commit added a txhash parameter to tcp_v6_send_response()
    but forgot to update tcp_v6_send_reset() accordingly.

    Fixes: aa51b80e1af4 ("ipv6: tcp: send consistent autoflowlabel in SYN_RECV state")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220922165036.1795862-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:21 +02:00
Antoine Tenart db80d3e170 ipv6: tcp: send consistent autoflowlabel in SYN_RECV state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: linux.git

commit aa51b80e1af47b3781abb1fb1666445a7616f0cd
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Aug 31 13:37:29 2022 -0700

    ipv6: tcp: send consistent autoflowlabel in SYN_RECV state

    This is a followup of commit c67b85558f ("ipv6: tcp: send consistent
    autoflowlabel in TIME_WAIT state"), but for SYN_RECV state.

    In some cases, TCP sends a challenge ACK on behalf of a SYN_RECV request.
    When this happens, we want to use the flow label that was used when
    the prior SYNACK packet was sent, instead of another one.

    After this patch, the following packetdrill script passes:

        0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
       +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
       +0 bind(3, ..., ...) = 0
       +0 listen(3, 1) = 0

      +.2 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
       +0 > (flowlabel 0x11) S. 0:0(0) ack 1 <...>
    // Test if a challenge ack is properly sent (same flowlabel as prior SYNACK)
       +.01 < . 4000000000:4000000000(0) ack 1 win 320
       +0  > (flowlabel 0x11) . 1:1(0) ack 1

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220831203729.458000-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:11 +02:00
Antoine Tenart 30b200a890 tcp: add TCP_MINTTL drop reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commits 020e71a3cf7f
  ("ipv4: guard IP_MINTTL with a static key") and 14834c4f4eb3 ("ipv4:
  annotate data races arount inet->min_ttl") in c9s.

commit 2798e36dc233a409a5d3f26f73029596dc504020
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Feb 1 17:43:45 2023 +0000

    tcp: add TCP_MINTTL drop reason

    In the unlikely case incoming packets are dropped because
    of IP_MINTTL / IPV6_MINHOPCOUNT constraints...

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230201174345.2708943-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:15 +02:00
Jan Stancek fa72082f2d Merge: net: core: stable backports for 9.3 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2408

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Depends: !2404

A bunch of fixes from upstream, affecting the core networking
implementation.

This also includes a couple of fixes for tun/tap, strictly tied to
commit "net: add sock_init_data_uid()"

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Sabrina Dubroca <sdubroca@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:41 +02:00
Jan Stancek cb3b1a532c Merge: tcp: stable backport for 9.3 phase 1
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2407

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561

A bunch of minor TCP fixes, with no behavioral changes intended.
The only exception is commit 8dda5cd012 ("tcp: minor optimization
in tcp_add_backlog()"), which is not a fix but a needed pre-req to
avoid conflict in the next one.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-16 11:49:40 +02:00
Paolo Abeni e46d2d95f0 dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188560
Tested: LNST, Tier1

Upstream commit:
commit ca43ccf41224b023fc290073d5603a755fd12eed
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu Feb 9 16:22:01 2023 -0800

    dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions.

    Eric Dumazet pointed out [0] that when we call skb_set_owner_r()
    for ipv6_pinfo.pktoptions, sk_rmem_schedule() has not been called,
    resulting in a negative sk_forward_alloc.

    We add a new helper which clones a skb and sets its owner only
    when sk_rmem_schedule() succeeds.

    Note that we move skb_set_owner_r() forward in (dccp|tcp)_v6_do_rcv()
    because tcp_send_synack() can make sk_forward_alloc negative before
    ipv6_opt_accepted() in the crossed SYN-ACK or self-connect() cases.
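
The "clone, then charge only if scheduling succeeds" pattern described above can be modelled in userspace as follows. This is a simplified sketch, not the kernel helper: the `demo_` structs, the `pages_left` budget and the 4096-byte page size are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_PAGE 4096

struct demo_sock {
	int forward_alloc;	/* bytes already reserved for this socket */
	int pages_left;		/* hypothetical global budget, in pages */
};

struct demo_skb {
	int truesize;
	struct demo_sock *owner;
};

/* Reserve enough whole pages to cover size, or fail without side effects. */
static bool demo_rmem_schedule(struct demo_sock *sk, int size)
{
	int missing, pages;

	if (size <= sk->forward_alloc)
		return true;
	missing = size - sk->forward_alloc;
	pages = (missing + DEMO_PAGE - 1) / DEMO_PAGE;
	if (pages > sk->pages_left)
		return false;
	sk->pages_left -= pages;
	sk->forward_alloc += pages * DEMO_PAGE;
	return true;
}

/* Clone skb and charge it to sk only when the reservation succeeded. */
struct demo_skb *demo_clone_and_charge_r(const struct demo_skb *skb,
					 struct demo_sock *sk)
{
	struct demo_skb *clone = malloc(sizeof(*clone));

	if (!clone)
		return NULL;
	*clone = *skb;
	if (!demo_rmem_schedule(sk, clone->truesize)) {
		free(clone);
		return NULL;
	}
	clone->owner = sk;
	sk->forward_alloc -= clone->truesize;	/* charge after scheduling */
	return clone;
}
```

Since the charge always follows a successful reservation, forward_alloc in this model cannot drop below zero on this path.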

    [0]: https://lore.kernel.org/netdev/CANn89iK9oc20Jdi_41jb9URdF210r7d1Y-+uypbMSbOfY6jqrg@mail.gmail.com/

    Fixes: 323fbd0edf ("net: dccp: Add handling of IPV6_PKTOPTIONS to dccp_v6_do_rcv()")
    Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-02 19:07:41 +02:00
Hangbin Liu 653e992bed ipv6: Fix tcp socket connection with DSCP.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186064
Upstream Status: net.git commit 8230680f36fd

commit 8230680f36fd1525303d1117768c8852314c488c
Author: Guillaume Nault <gnault@redhat.com>
Date:   Wed Feb 8 18:14:03 2023 +0100

    ipv6: Fix tcp socket connection with DSCP.

    Take into account the IPV6_TCLASS socket option (DSCP) in
    tcp_v6_connect(). Otherwise fib6_rule_match() can't properly
    match the DSCP value, resulting in invalid route lookup.

    For example:

      ip route add unreachable table main 2001:db8::10/124

      ip route add table 100 2001:db8::10/124 dev eth0
      ip -6 rule add dsfield 0x04 table 100

      echo test | socat - TCP6:[2001:db8::11]:54321,ipv6-tclass=0x04

    Without this patch, socat fails at connect() time ("No route to host")
    because the fib-rule doesn't jump to table 100 and the lookup ends up
    being done in the main table.
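
For reference, the socat option in the example corresponds to setting IPV6_TCLASS on the socket before connect(). A minimal sketch follows; the function name `demo_tclass_roundtrip` is hypothetical, and it only sets and reads back the option without connecting. It returns -1 if IPv6 sockets are unavailable or either call fails.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set IPV6_TCLASS on a fresh TCP/IPv6 socket and read the value back. */
int demo_tclass_roundtrip(int tclass)
{
	int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
	int readback = -1;
	socklen_t len = sizeof(readback);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS,
		       &tclass, sizeof(tclass)) < 0 ||
	    getsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS, &readback, &len) < 0)
		readback = -1;
	close(fd);
	return readback;
}
```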

    Fixes: 2cc67cc731 ("[IPV6] ROUTE: Routing by Traffic Class.")
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2023-04-27 10:04:56 +08:00
Paolo Abeni 220a990332 dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 77934dc6db0d2b111a8f2759e9ad2fb67f5cffa5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Nov 18 17:49:11 2022 -0800

    dccp/tcp: Reset saddr on failure after inet6?_hash_connect().

    When connect() is called on a socket bound to the wildcard address,
    we change the socket's saddr to a local address.  If the socket
    fails to connect() to the destination, we have to reset the saddr.

    However, when an error occurs after inet6?_hash_connect() in
    (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
    the socket bound to the address.

    From the user's point of view, whether saddr is reset or not varies
    with errno.  Let's fix this inconsistent behaviour.

    Note that after this patch, the repro [0] will trigger the WARN_ON()
    in inet_csk_get_port() again, but this patch is not buggy and rather
    fixes a bug papering over the bhash2's bug for which we need another
    fix.

    For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
    by this sequence:

      s1 = socket()
      s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
      s1.bind(('127.0.0.1', 10000))
      s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
      # or s1.connect(('127.0.0.1', 10000))

      s2 = socket()
      s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
      s2.bind(('0.0.0.0', 10000))
      s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL

      s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);

    [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09

    Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
    Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Joanne Koong <joannelkoong@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:57:10 +02:00
Guillaume Nault 9194a37d24 tcp/udp: Make early_demux back namespacified.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2186795
Upstream Status: linux.git

commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Jul 13 10:52:07 2022 -0700

    tcp/udp: Make early_demux back namespacified.

    Commit e21145a987 ("ipv4: namespacify ip_early_demux sysctl knob") made
    it possible to enable/disable early_demux on a per-netns basis.  Then, we
    introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
    TCP/UDP in commit dddb64bcb3 ("net: Add sysctl to toggle early demux for
    tcp and udp").  However, the .proc_handler() was wrong and actually
    prevented us from changing the behaviour in each netns.

    We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
    .early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
    the change itself is saved in each netns variable, but the .early_demux()
    handler is a global variable, so the handler is switched based on the
    init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
    nothing to do with the logic.  Whether we CAN execute proto .early_demux()
    is always decided by init_net's sysctl knob, and whether we DO it or not is
    by each netns ip_early_demux knob.

    This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
    of the .early_demux() handler are TCP and UDP only, and they are called
    directly to avoid retpoline.  So, we can remove the .early_demux() handler
    from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
    If another proto needs .early_demux(), we can restore it at that time.
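
    The difference between the old and the new decision can be sketched as a
    toy model in Python.  This is only an illustration of the logic described
    above, not kernel code: `Netns` and both function names are hypothetical,
    and only the knob names come from the text.

```python
from dataclasses import dataclass


@dataclass
class Netns:
    """Toy network namespace holding its own sysctl knobs."""
    ip_early_demux: bool = True
    tcp_early_demux: bool = True


def runs_tcp_early_demux_before(init_net: Netns, netns: Netns) -> bool:
    """Old behaviour: the global handler follows init_net's knob only."""
    handler_installed = init_net.tcp_early_demux
    return netns.ip_early_demux and handler_installed


def runs_tcp_early_demux_after(netns: Netns) -> bool:
    """Per-netns behaviour: only this namespace's knobs are consulted."""
    return netns.ip_early_demux and netns.tcp_early_demux
```

    With tcp_early_demux switched off in a child netns but left on in
    init_net, the first function still reports that early demux runs, while
    the second one honours the child's setting.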

    Fixes: dddb64bcb3 ("net: Add sysctl to toggle early demux for tcp and udp")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-04-14 16:19:36 +02:00
Herton R. Krzesinski f7d56c83b6 Merge: sctp: backports from upstream, 2nd phase
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1878

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160516
Tested: lksctp-tools func_tests

v1->v2:
  - add the whole patchset of "inet6: Remove inet6_destroy_sock() calls" instead of
    only 2 of its patches, as Davide suggested.
v2->v3:
  - drop patch "sctp: delete free member from struct sctp_sched_ops", which is a code
    improvement, as it may cause a hang.
v3->v4:
  - drop patch "sctp: fix memory leak in sctp_stream_outq_migrate()", which causes
    another hang.

Signed-off-by: Xin Long <lxin@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2023-02-23 12:33:18 +00:00
Xin Long af4a12dd88 inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160516
Tested: compile only

Conflicts:
  - context difference due to missing commit 0ffe2412531e from
    upstream.

commit b5fc29233d28be7a3322848ebe73ac327559cdb9
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Oct 19 15:35:59 2022 -0700

    inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().

    After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock()
    in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in
    sk->sk_destruct() by setting inet6_sock_destruct() to it to make
    sure we do not leak inet6-specific resources.

    Now we can remove unnecessary inet6_destroy_sock() calls in
    sk->sk_prot->destroy().

    DCCP and SCTP have their own sk->sk_destruct() function, so we
    change them separately in the following patches.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2023-02-01 15:05:51 -05:00
Guillaume Nault 04b96d8fcf tcp: Fix data-races around sysctl_tcp_reflect_tos.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 870e3a634b6a6cb1543b359007aca73fe6a03ac5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:04 2022 -0700

    tcp: Fix data-races around sysctl_tcp_reflect_tos.

    While reading sysctl_tcp_reflect_tos, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Wei Wang <weiwan@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Ivan Vecera 5dc8d666ee ipv6: Remove __ipv6_only_sock().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2144847

Since commit 9fe516ba3f ("inet: move ipv6only in sock_common"),
ipv6_only_sock() and __ipv6_only_sock() are the same macro.  Let's
remove the latter.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 89e9c7280075f6733b22dd0740daeddeb1256ebf)

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-24 09:54:19 +01:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Davide Caratti 728983215c tcp: Access &tcp_hashinfo via net.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 4461568aa4e5
Conflicts:
 - net/ipv4/tcp_ipv4.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address") and 08eaef904031 ("tcp: Clean up some functions.")
 - net/ipv6/tcp_ipv6.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address")
 - net/ipv4/tcp_minisocks.c: hunk applied manually to fix a build issue
   caused by missing upstream commit 08eaef904031 ("tcp: Clean up some
   functions.")

commit 4461568aa4e565de2c336f4875ddf912f26da8a5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Sep 7 18:10:20 2022 -0700

    tcp: Access &tcp_hashinfo via net.

    We will soon introduce an optional per-netns ehash.

    This means we cannot use tcp_hashinfo directly in most places.

    Instead, access it via net->ipv4.tcp_death_row.hashinfo.

    The access will be valid only while initialising tcp_hashinfo
    itself and creating/destroying each netns.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:59 +01:00
Davide Caratti 9aac6c4346 net: add per_cpu_fw_alloc field to struct proto
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0defbb0af775
Conflicts:
 - net/core/sock.c: context mismatch because of missing backport of
   upstream commit f20cfd662a62 ("net: add sanity check in proto_register()")

commit 0defbb0af775ef037913786048d099bbe8b9a2c2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:08 2022 -0700

    net: add per_cpu_fw_alloc field to struct proto

    Each protocol having a ->memory_allocated pointer gets a corresponding
    per-cpu reserve, that following patches will use.

    Instead of having reserved bytes per socket,
    we want to have per-cpu reserves.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Frantisek Hrbata 0c3a22328a Merge: IPv6: 9.2 P1 backport from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1488

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Sabrina Dubroca <sdubroca@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-27 08:26:02 -04:00
Jiri Benc 6619cf0a37 net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] different context in tcp_fragment() due to missing
  a52fe46ef160 ("tcp: factorize ip_summed setting")

commit a1ac9c8acec1605c6b43af418f79facafdced680
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:25 2022 -0800

    net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp

    skb->tstamp was first used as the (rcv) timestamp.
    The major usage is to report it to the user (e.g. SO_TIMESTAMP).

    Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
    during egress and used by the qdisc (e.g. sch_fq) to make decision on when
    the skb can be passed to the dev.

    Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
    or the delivery_time, so it is always reset to 0 whenever forwarded
    between egress and ingress.

    While it makes sense to always clear the (rcv) timestamp in skb->tstamp
    to avoid confusing sch_fq that expects the delivery_time, it is a
    performance issue [0] to clear the delivery_time if the skb finally
    egresses to an fq@phy-dev.  For example, when forwarding from egress to
    ingress and then finally back to egress:

                tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                         ^              ^
                                         reset          reset

    This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
    is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

    The current use case is to keep the TCP mono delivery_time (EDT) and
    to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
    to read and change the mono delivery_time.

    In the future, another bit (e.g. skb->user_delivery_time) can be added
    for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

    [ This patch is a prep work.  The following patches will
      get the other parts of the stack ready first.  Then another patch
      after that will finally set the skb->mono_delivery_time. ]

    skb_set_delivery_time() function is added.  It is used by the tcp_output.c
    and during ip[6] fragmentation to assign the delivery_time to
    the skb->tstamp and also set the skb->mono_delivery_time.

    A note on the change in ip_send_unicast_reply() in ip_output.c.
    It is only used by TCP to send reset/ack out of a ctl_sk.
    Like the new skb_set_delivery_time(), this patch sets
    the skb->mono_delivery_time to 0 for now as a
    placeholder.  It will be enabled in a later patch.
    A similar case in tcp_ipv6 can be done with
    skb_set_delivery_time() in tcp_v6_send_response().

    [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:59 +02:00
Hangbin Liu d109429414 tcp: Fix data races around icsk->icsk_af_ops.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2135319
Upstream Status: net.git commit f49cd2f4d617

Conflicts: context conflicts due to missing upstream commit
34704ef024ae ("bpf: net: Change do_tcp_getsockopt() to take the sockptr_t
argument").

commit f49cd2f4d6170d27a2c61f1fecb03d8a70c91f57
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu Oct 6 11:53:49 2022 -0700

    tcp: Fix data races around icsk->icsk_af_ops.

    setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
    under lock_sock(), but tcp_(get|set)sockopt() read it locklessly.  To
    avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
    for the reads and writes.

    Thanks to Eric Dumazet for providing the syzbot report:

    BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect

    write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
    tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
    __inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
    inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
    __sys_connect_file net/socket.c:1976 [inline]
    __sys_connect+0x197/0x1b0 net/socket.c:1993
    __do_sys_connect net/socket.c:2003 [inline]
    __se_sys_connect net/socket.c:2000 [inline]
    __x64_sys_connect+0x3d/0x50 net/socket.c:2000
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
    tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
    sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
    __sys_setsockopt+0x212/0x2b0 net/socket.c:2252
    __do_sys_setsockopt net/socket.c:2263 [inline]
    __se_sys_setsockopt net/socket.c:2260 [inline]
    __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
    6.0.0-rc4-syzkaller-00331-g4ed9c1e971b1-dirty #0

    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 08/26/2022

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Reported-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-10-18 11:41:13 +08:00
Antoine Tenart 4600abbb3e tcp_ipv6: set the drop_reason in the right place
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit dc7769244e03e932262a4f10eeab11657cb601c7
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 19 19:13:47 2022 -0700

    tcp_ipv6: set the drop_reason in the right place

    Looks like the IPv6 version of the patch under Fixes was
    a copy/paste of the IPv4 but hit the wrong spot.
    It is tcp_v6_rcv() which uses drop_reason as a boolean, and
    needs to be protected against reason == 0 before calling free.
    tcp_v6_do_rcv() has a pretty straightforward flow.

    The resulting warning looks like this:
      WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775
      Call Trace:
        tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767)
        ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438)
        ip6_input_finish (include/linux/rcupdate.h:726)
        ip6_input (include/linux/netfilter.h:307)

    Fixes: f8319dfd1b3b ("net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()")
    Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Link: https://lore.kernel.org/r/20220520021347.2270207-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart 626c678449 net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit f8319dfd1b3b3be6c08795017fc30f880f8bc861
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri May 13 11:03:39 2022 +0800

    net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()

    The 'drop_reason' that is passed to kfree_skb_reason() in tcp_v4_rcv()
    and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
    return value of tcp_inbound_md5_hash().

    And it can panic the kernel with a NULL pointer dereference in
    net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
    is NULL.
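
    The failure mode and the guard can be sketched as a toy model in Python.
    This is an illustration under stated assumptions, not the kernel change:
    the table contents, the numeric values other than 0, and the function
    names `report_size` and `free_with_reason` are all hypothetical.

```python
# Toy name table: index 0 has no entry, mirroring "drop_reasons[0] is NULL".
DROP_REASON_NAMES = [None, "NOT_SPECIFIED", "TCP_MD5FAILURE"]

NOT_DROPPED_YET = 0
NOT_SPECIFIED = 1  # arbitrary non-zero value chosen for this sketch


def report_size(reason: int) -> int:
    """Stand-in for a reporter that needs the reason's name."""
    name = DROP_REASON_NAMES[reason]
    return len(name)  # raises TypeError when name is None


def free_with_reason(reason: int) -> int:
    """Hypothetical free path: never hand reason 0 to the reporter."""
    if reason == NOT_DROPPED_YET:
        reason = NOT_SPECIFIED
    return report_size(reason)
```

    Calling report_size(0) directly fails on the missing table entry, whereas
    free_with_reason(0) falls back to the generic reason first.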

    Fixes: 1330b6ef3313 ("skb: make drop reason booleanable")
    Reviewed-by: Jiang Biao <benbjiang@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart 04f4917aca skb: make drop reason booleanable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 1330b6ef3313fcec577d2b020c290dc8b9f11f1a
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Mar 7 16:44:21 2022 -0800

    skb: make drop reason booleanable

    We have a number of cases where a function returns drop/no drop
    decision as a boolean. Now that we want to report the reason
    code as well we have to pass extra output arguments.

    We can make the reason code evaluate correctly as bool.

    I believe we're good to reorder the reasons as they are
    reported to user space as strings.
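
    The idea can be sketched in Python with an enum whose "not dropped" value
    is 0, so that the reason code itself works as the boolean.  This is a toy
    model, not the kernel enum: `DropReason`, `check_packet` and `receive`
    are hypothetical names, and the non-zero values are arbitrary.

```python
from enum import IntEnum


class DropReason(IntEnum):
    """Toy model of a drop-reason enum; only the value 0 matters here."""
    NOT_DROPPED_YET = 0   # falsy: "keep the packet"
    NOT_SPECIFIED = 1     # any non-zero value: "drop, and here is why"
    TCP_MD5FAILURE = 2


def check_packet(signature_ok: bool) -> DropReason:
    """Hypothetical check that returns a reason instead of a bare bool."""
    if signature_ok:
        return DropReason.NOT_DROPPED_YET
    return DropReason.TCP_MD5FAILURE


def receive(signature_ok: bool) -> str:
    reason = check_packet(signature_ok)
    if reason:  # truthy only for real drop reasons
        return f"dropped: {reason.name}"
    return "delivered"
```

    The caller gets both answers from one return value, so no extra output
    argument is needed.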

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 997d93a49f net/tcp: Merge TCP-MD5 inbound callbacks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7bbb765b73496699a165d505ecdce962f903b422
Author: Dmitry Safonov <0x7f454c46@gmail.com>
Date:   Wed Feb 23 17:57:40 2022 +0000

    net/tcp: Merge TCP-MD5 inbound callbacks

    The functions do essentially the same work to verify the TCP-MD5 signature.
    Code can be merged into one family-independent function in order to
    reduce copy'n'paste and generated code.
    Later with TCP-AO option added, this will allow creating one function
    that's responsible for segment verification, that will have all the
    different checks for MD5/AO/non-signed packets, which in turn will help
    to see checks for all corner-cases in one function, rather than spread
    around different families and functions.

    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
    Signed-off-by: Dmitry Safonov <dima@arista.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 7e7867a749 net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 8eba65fa5f06519042b98564089b942d795e3f8d
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:34 2022 +0800

    net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv()

    Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with
    kfree_skb_reason().

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart 0b99c6c861 net: tcp: add skb drop reasons to tcp_add_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflicts:
- In tcp.h due to missing commit f35f821935d8 ("tcp: defer skb freeing
  after socket lock is released") in C9S; which is fine btw as the chunk
  in tcp.h was later removed upstream by commit 68822bdf76f1 ("net:
  generalize skb freeing deferral to per-cpu lists").

commit 7a26dc9e7b43f5a24c4b843713e728582adf1c38
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:33 2022 +0800

    net: tcp: add skb drop reasons to tcp_add_backlog()

    Pass the address of drop_reason to tcp_add_backlog() to store the
    reason for the skb drop when it fails. The following drop reason is
    introduced:

    SKB_DROP_REASON_SOCKET_BACKLOG

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart de5f3d75e9 net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 643b622b51f1f0015e0a80f90b4ef9032e6ddb1b
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:32 2022 +0800

    net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()

    Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
    tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
    function fails. Therefore, the drop reason can be passed to
    kfree_skb_reason() when the skb needs to be freed.

    The following drop reasons are added:

    SKB_DROP_REASON_TCP_MD5NOTFOUND
    SKB_DROP_REASON_TCP_MD5UNEXPECTED
    SKB_DROP_REASON_TCP_MD5FAILURE

    SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart b55f02222f net: tcp: use kfree_skb_reason() for tcp_v6_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit c0e3154d9c889e1aa1af098f40301395f2e33d8a
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:31 2022 +0800

    net: tcp: use kfree_skb_reason() for tcp_v6_rcv()

    Replace kfree_skb() used in tcp_v6_rcv() with kfree_skb_reason().

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Felix Maurer de20724127 net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 91a760b26926265a60c77ddf016529bcf3e17a04
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Jan 6 21:20:20 2022 +0800

    net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()

    The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
    __inet_bind() is not handled properly. When the return value
    is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
    exit:

            err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
            if (err) {
                    inet->inet_saddr = inet->inet_rcv_saddr = 0;
                    goto out_release_sock;
            }

    Let's take UDP for example and see what will happen. A UDP
    socket will be added to 'udp_prot.h.udp_table->hash' and
    'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
    call succeeds. If 'inet->inet_rcv_saddr' is specified here,
    then 'sk' will be in an 'hslot2' of 'hash2' that it doesn't belong
    to (because inet_saddr is changed to 0), and UDP packets received
    will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
    specified here, the sock will work fine, as it can receive packets
    properly, which is weird, as the 'bind()' has already failed.

    To undo the get_port() operation, introduce the 'put_port' field
    for 'struct proto'. For the TCP proto, it is inet_put_port(); for the UDP
    proto, it is udp_lib_unhash(); for the icmp proto, it is
    ping_unhash().

    Therefore, after sys_bind() fails because of
    BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket will be unbound, which
    means that it can then be bound to another port.
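
    The rollback can be sketched as a toy model in Python.  This only
    illustrates the get_port()/put_port() pairing described above, it is not
    the kernel implementation: `PortTable`, `bind` and `post_bind_hook` are
    hypothetical names, and the errno values are used purely as examples.

```python
import errno
from typing import Callable, Set


class PortTable:
    """Toy stand-in for a protocol's port hash table."""

    def __init__(self) -> None:
        self.bound: Set[int] = set()

    def get_port(self, port: int) -> bool:
        """Reserve the port; returns False if it is already taken."""
        if port in self.bound:
            return False
        self.bound.add(port)
        return True

    def put_port(self, port: int) -> None:
        """Undo get_port()."""
        self.bound.discard(port)


def bind(table: PortTable, port: int, post_bind_hook: Callable[[int], int]) -> int:
    """Hypothetical bind path: returns 0 on success, a negative errno on failure."""
    if not table.get_port(port):
        return -errno.EADDRINUSE
    err = post_bind_hook(port)
    if err:
        table.put_port(port)  # roll back so the port is not left occupied
        return err
    return 0
```

    If the hook rejects the bind, the port is released again, so a later
    bind() to the same port can succeed.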

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 16:53:48 +02:00
Paolo Abeni 036c0e121e tcp: add accessors to read/set tp->snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1

Upstream commit:
commit 40570375356c874b1578e05c1dcc3ff7c1322dbe
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 5 16:35:38 2022 -0700

    tcp: add accessors to read/set tp->snd_cwnd

    We had various bugs over the years with code
    breaking the assumption that tp->snd_cwnd is greater
    than zero.

    Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
    in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
    can trigger, and without a repro we would have to spend
    considerable time finding the bug.

    Instead of complaining too late, we want to catch where
    and when tp->snd_cwnd is set to an illegal value.
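
    The accessor idea can be sketched in Python as follows.  This is a toy
    model, not the kernel helpers: the class `TcpSock` and the method names
    are hypothetical, and a Python warning stands in for a kernel warning.

```python
import warnings


class TcpSock:
    """Toy stand-in for the part of a TCP socket that holds snd_cwnd."""

    def __init__(self, snd_cwnd: int = 10) -> None:
        self._snd_cwnd = snd_cwnd

    def snd_cwnd(self) -> int:
        """Read accessor: the single place readers go through."""
        return self._snd_cwnd

    def snd_cwnd_set(self, val: int) -> None:
        """Write accessor: complain at the moment a bad value is stored."""
        if val <= 0:
            warnings.warn(f"illegal snd_cwnd value: {val}",
                          RuntimeWarning, stacklevel=2)
        self._snd_cwnd = val
```

    Because every write goes through one function, the warning points at the
    caller that stored the bad value rather than at a later reader.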

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27 16:43:55 +02:00
Paolo Abeni bae902a610 inet: fully convert sk->sk_rx_dst to RCU rules
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2079411
Tested: LNST, Tier1
Conflicts:
  - sk_rx_dst location inside struct sock is slightly different
  from upstream as rhel-9 already has commit 43f51df41729 ("net:
   move early demux fields close to sk_refcnt")

Upstream commit:
commit 8f905c0e7354ef261360fb7535ea079b1082c105
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Dec 20 06:33:30 2021 -0800

    inet: fully convert sk->sk_rx_dst to RCU rules

    syzbot reported various issues around early demux,
    one being included in this changelog [1]

    sk->sk_rx_dst is using RCU protection without clearly
    documenting it.

    And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
    are not following standard RCU rules.

    [a]    dst_release(dst);
    [b]    sk->sk_rx_dst = NULL;

    They look wrong because a delete operation of RCU protected
    pointer is supposed to clear the pointer before
    the call_rcu()/synchronize_rcu() guarding actual memory freeing.

    In some cases indeed, dst could be freed before [b] is done.

    We could cheat by clearing sk_rx_dst before calling
    dst_release(), but this seems the right time to stick
    to standard RCU annotations and debugging facilities.
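
    The ordering problem can be sketched with a single-threaded toy model in
    Python.  It does not model RCU grace periods or real concurrency, and it
    shows only the "clear the pointer first" ordering, not the RCU
    annotations this patch adopts; `Dst`, `Sock` and the method names are
    hypothetical.

```python
from typing import Optional


class Dst:
    """Toy refcounted route entry; 'freed' marks memory that must not be read."""

    def __init__(self) -> None:
        self.refcnt = 1
        self.freed = False

    def release(self) -> None:
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True


class Sock:
    def __init__(self, dst: Dst) -> None:
        self.rx_dst: Optional[Dst] = dst

    def drop_rx_dst_release_first(self) -> bool:
        """Order [a] then [b]: returns True if a freed dst was still published."""
        dst = self.rx_dst
        dst.release()                                   # [a]
        exposed = self.rx_dst is not None and self.rx_dst.freed
        self.rx_dst = None                              # [b]
        return exposed

    def drop_rx_dst_unpublish_first(self) -> bool:
        """Unpublish the pointer, then release: nothing freed stays visible."""
        dst, self.rx_dst = self.rx_dst, None
        dst.release()
        return self.rx_dst is not None
```

    In the first ordering there is a window in which a reader of rx_dst can
    still find the already-freed entry; in the second there is none.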

    [1]
    BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
    BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
    Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204

    CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
     print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
     __kasan_report mm/kasan/report.c:433 [inline]
     kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
     dst_check include/net/dst.h:470 [inline]
     tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
     ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
     invoke_softirq kernel/softirq.c:432 [inline]
     __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
     irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
     common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
     asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
    RIP: 0033:0x7f5e972bfd57
    Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
    RSP: 002b:00007fff8a413210 EFLAGS: 00000283
    RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
    RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
    RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
    R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
    R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
     </TASK>

    Allocated by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
     ip_route_input_rcu net/ipv4/route.c:2470 [inline]
     ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
     ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
     ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
     ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
     ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
     __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
     __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
     __netif_receive_skb_list net/core/dev.c:5608 [inline]
     netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
     gro_normal_list net/core/dev.c:5853 [inline]
     gro_normal_list net/core/dev.c:5849 [inline]
     napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
     virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
     virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
     __napi_poll+0xaf/0x440 net/core/dev.c:7023
     napi_poll net/core/dev.c:7090 [inline]
     net_rx_action+0x801/0xb40 net/core/dev.c:7177
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Freed by task 13:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     kasan_set_track+0x21/0x30 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:235 [inline]
     slab_free_hook mm/slub.c:1723 [inline]
     slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
     slab_free mm/slub.c:3513 [inline]
     kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
     dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
     rcu_do_batch kernel/rcu/tree.c:2506 [inline]
     rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
     kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
     __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
     __call_rcu kernel/rcu/tree.c:2985 [inline]
     call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
     dst_release net/core/dst.c:177 [inline]
     dst_release+0x79/0xe0 net/core/dst.c:167
     tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
     sk_backlog_rcv include/net/sock.h:1030 [inline]
     __release_sock+0x134/0x3b0 net/core/sock.c:2768
     release_sock+0x54/0x1b0 net/core/sock.c:3300
     tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
     inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
     sock_sendmsg_nosec net/socket.c:704 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:724
     sock_write_iter+0x289/0x3c0 net/socket.c:1057
     call_write_iter include/linux/fs.h:2162 [inline]
     new_sync_write+0x429/0x660 fs/read_write.c:503
     vfs_write+0x7cd/0xae0 fs/read_write.c:590
     ksys_write+0x1ee/0x250 fs/read_write.c:643
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88807f1cb700
     which belongs to the cache ip_dst_cache of size 176
    The buggy address is located 58 bytes inside of
     176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
    The buggy address belongs to the page:
    page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
    flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
    raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
     prep_new_page mm/page_alloc.c:2418 [inline]
     get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
     __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
     alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
     alloc_slab_page mm/slub.c:1793 [inline]
     allocate_slab mm/slub.c:1930 [inline]
     new_slab+0x32d/0x4a0 mm/slub.c:1993
     ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
     __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
     slab_alloc_node mm/slub.c:3200 [inline]
     slab_alloc mm/slub.c:3242 [inline]
     kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
     dst_alloc+0x146/0x1f0 net/core/dst.c:92
     rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
     __mkroute_output net/ipv4/route.c:2564 [inline]
     ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
     ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
     __ip_route_output_key include/net/route.h:126 [inline]
     ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
     ip_route_output_key include/net/route.h:142 [inline]
     geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
     geneve_xmit_skb drivers/net/geneve.c:899 [inline]
     geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
     __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
     netdev_start_xmit include/linux/netdevice.h:5008 [inline]
     xmit_one net/core/dev.c:3590 [inline]
     dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
     __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
    page last free stack trace:
     reset_page_owner include/linux/page_owner.h:24 [inline]
     free_pages_prepare mm/page_alloc.c:1338 [inline]
     free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
     free_unref_page_prepare mm/page_alloc.c:3309 [inline]
     free_unref_page+0x19/0x690 mm/page_alloc.c:3388
     qlink_free mm/kasan/quarantine.c:146 [inline]
     qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
     kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
     __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
     kasan_slab_alloc include/linux/kasan.h:259 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3234 [inline]
     kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
     __alloc_skb+0x215/0x340 net/core/skbuff.c:414
     alloc_skb include/linux/skbuff.h:1126 [inline]
     alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
     sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
     mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
     add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
     add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
     mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
     mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
     mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
     process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
     worker_thread+0x658/0x11f0 kernel/workqueue.c:2445

    Memory state around the buggy address:
     ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
    >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
     ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
     ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-12 16:55:33 +02:00
Antoine Tenart 496fd6c98c ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382
Upstream Status: linux.git
Tested: ENRT

commit ef57c1610dd8fba5031bf71e0db73356190de151
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Oct 25 09:48:17 2021 -0700

    ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie

    Increase cache locality by moving rx_dst_cookie next to sk->sk_rx_dst

    This removes one or two cache line misses in IPv6 early demux (TCP/UDP)
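
A small sketch of why adjacency matters, assuming 64-byte cache lines and a structure that starts on a cache-line boundary. The two layouts below are made up for illustration and do not reproduce the real struct sock or struct ipv6_pinfo; only the field names come from the commit.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical x86-64 line size */

/* Hypothetical "before" layout: the cookie sits far from rx_dst. */
struct layout_before {
	void *rx_dst;
	char other_hot_fields[56];
	char unrelated[128];
	uint32_t rx_dst_cookie;
};

/* Hypothetical "after" layout: the cookie sits right next to rx_dst. */
struct layout_after {
	void *rx_dst;
	int rx_dst_ifindex;
	uint32_t rx_dst_cookie;
	char unrelated[184];
};

/* Index of the cache line holding a byte offset within the struct. */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE_SIZE;
}

/* Returns 1 when both fields share a cache line in the old layout. */
static int same_line_before(void)
{
	return cache_line_of(offsetof(struct layout_before, rx_dst)) ==
	       cache_line_of(offsetof(struct layout_before, rx_dst_cookie));
}

/* Returns 1 when both fields share a cache line in the new layout. */
static int same_line_after(void)
{
	return cache_line_of(offsetof(struct layout_after, rx_dst)) ==
	       cache_line_of(offsetof(struct layout_after, rx_dst_cookie));
}
```

In the first layout a lookup that reads rx_dst and then the cookie touches two different lines; in the second, both reads are served by the line already fetched for rx_dst.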

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 11:10:05 +01:00
Antoine Tenart ffc4c3163b tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041382
Upstream Status: linux.git
Tested: ENRT

commit 0c0a5ef809f9150e9229e7b13e43183b681b7a39
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Oct 25 09:48:16 2021 -0700

    tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex

    Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst

    This is part of an effort to reduce cache line misses in TCP fast path.

    This removes one cache line miss in early demux.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 11:10:01 +01:00