Commit Graph

889 Commits

Author SHA1 Message Date
Sabrina Dubroca 1eacc871f3 tcp: drop secpath at the same time as we currently drop dst
JIRA: https://issues.redhat.com/browse/RHEL-69649
JIRA: https://issues.redhat.com/browse/RHEL-83224
CVE: CVE-2025-21864

commit 9b6412e6979f6f9e0632075f8f008937b5cd4efd
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Mon Feb 17 11:23:35 2025 +0100

    tcp: drop secpath at the same time as we currently drop dst

    Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
    running tests that boil down to:
     - create a pair of netns
     - run a basic TCP test over ipcomp6
     - delete the pair of netns

    The xfrm_state found on spi_byaddr was not deleted at the time we
    delete the netns, because we still have a reference on it. This
    lingering reference comes from a secpath (which holds a ref on the
    xfrm_state), which is still attached to an skb. This skb is not
    leaked, it ends up on sk_receive_queue and then gets defer-free'd by
    skb_attempt_defer_free.

    The problem happens when we defer freeing an skb (push it on one CPU's
    defer_list), and don't flush that list before the netns is deleted. In
    that case, we still have a reference on the xfrm_state that we don't
    expect at this point.

    We already drop the skb's dst in the TCP receive path when it's no
    longer needed, so let's also drop the secpath. At this point,
    tcp_filter has already called into the LSM hooks that may require the
    secpath, so it should not be needed anymore. However, in some of those
    places, the MPTCP extension has just been attached to the skb, so we
    cannot simply drop all extensions.

    Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
    Reported-by: Xiumei Mu <xmu@redhat.com>
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2025-03-20 10:11:45 +01:00
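
Note on 1eacc871f3: the message above explains that only the secpath can be dropped, because an MPTCP extension may just have been attached to the same skb. The following is a minimal userspace sketch of that idea, not the kernel's skb extension API. The `toy_skb` struct, the `EXT_*` bits and the `toy_ext_*` helpers are hypothetical names introduced for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension identifiers; the kernel uses its own enum. */
enum toy_ext {
    EXT_SEC_PATH = 1u << 0, /* pins a reference on an xfrm_state */
    EXT_MPTCP    = 1u << 1, /* MPTCP metadata attached on receive */
};

struct toy_skb {
    uint8_t active_ext; /* bitmask of attached extensions */
    int     xfrm_refs;  /* references pinned by the secpath (toy model) */
};

/* Drop exactly one extension and release whatever it pins. */
static void toy_ext_del(struct toy_skb *skb, enum toy_ext ext)
{
    if (!(skb->active_ext & ext))
        return;
    if (ext == EXT_SEC_PATH)
        skb->xfrm_refs = 0; /* the xfrm_state reference goes away here */
    skb->active_ext &= (uint8_t)~ext;
}

/* Report whether a given extension is still attached. */
static bool toy_ext_exist(const struct toy_skb *skb, enum toy_ext ext)
{
    return (skb->active_ext & ext) != 0;
}
```

Calling `toy_ext_del(&skb, EXT_SEC_PATH)` on an skb that carries both extensions releases the pinned reference while `toy_ext_exist(&skb, EXT_MPTCP)` stays true, which is the property a blanket reset of all extensions would not preserve.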
Toke Høiland-Jørgensen e9b6f5c14c bpf: Add bpf_sock_destroy kfunc
JIRA: https://issues.redhat.com/browse/RHEL-65787

Conflicts: Context difference due to missing af9784d007d8 ("tcp: diag:
add support for TIME_WAIT sockets to tcp_abort()") and out-of-order
backport of bac76cf89816 ("tcp: fix forever orphan socket caused by
tcp_abort")

commit 4ddbcb886268af8d12a23e6640b39d1d9c652b1b
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:55 2023 +0000

    bpf: Add bpf_sock_destroy kfunc

    The socket destroy kfunc is used to forcefully terminate sockets from
    certain BPF contexts. We plan to use the capability in Cilium
    load-balancing to terminate client sockets that continue to connect to
    deleted backends.  The other use case is on-the-fly policy enforcement
    where existing socket connections prevented by policies need to be
    forcefully terminated.  The kfunc also allows terminating sockets that may
    or may not be actively sending traffic.

    The kfunc can currently be called only from BPF TCP and UDP iterators
    where users can filter, and terminate selected sockets. More
    specifically, it can only be called from BPF contexts that ensure
    socket locking in order to allow synchronous execution of protocol
    specific `diag_destroy` handlers. The previous commit that batches UDP
    sockets during iteration facilitated a synchronous invocation of the UDP
    destroy callback from BPF context by skipping socket locks in
    `udp_abort`. TCP iterator already supported batching of sockets being
    iterated. To that end, `tracing_iter_filter` callback filter is added so
    that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
    attach type, and reject other programs.

    The kfunc takes a `sock_common` type argument, even though it expects, and
    casts it to, a `sock` pointer. This enables the verifier to allow the
    sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
    `sock` structs. Furthermore, as `sock_common` only has a subset of
    certain fields of `sock`, casting pointer to the latter type might not
    always be safe for certain sockets like request sockets, but these have a
    special handling in the diag_destroy handlers.

    Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
    cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
    e.g., getting an sk pointer (possibly even NULL) by following another sk
    pointer. The pointer socket argument passed in TCP and UDP iterators is
    tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info.  The TRUSTED arg changes
    are contributed by Martin KaFai Lau <martin.lau@kernel.org>.

    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:54 +01:00
Toke Høiland-Jørgensen 3dd3f94935 bpf: tcp: Avoid taking fast sock lock in iterator
JIRA: https://issues.redhat.com/browse/RHEL-65787

commit 9378096e8a656fb5c4099b26b1370c56f056eab9
Author: Aditi Ghag <aditi.ghag@isovalent.com>
Date:   Fri May 19 22:51:49 2023 +0000

    bpf: tcp: Avoid taking fast sock lock in iterator

    This is a preparatory commit to replace `lock_sock_fast` with
    `lock_sock`, and facilitate BPF programs executed from the TCP sockets
    iterator to be able to destroy TCP sockets using the bpf_sock_destroy
    kfunc (implemented in follow-up commits).

    Previously, the BPF TCP iterator was acquiring the sock lock with BH
    disabled. This led to scenarios where the sockets hash table bucket lock
    can be acquired with BH enabled in some paths and disabled in others.
    In such a situation, the kernel issued a warning since it thinks that in the
    BH enabled path the same bucket lock *might* be acquired again in the
    softirq context (BH disabled), which could lead to a deadlock.
    Since bpf_sock_destroy also happens in a process context, the potential
    deadlock warning is likely a false alarm.

    Here is a snippet of annotated stack trace that motivated this change:

    ```

    Possible interrupt unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&h->lhash2[i].lock);
                                  local_bh_disable();
                                  lock(&h->lhash2[i].lock);
    kernel imagined possible scenario:
      local_bh_disable();  /* Possible softirq */
      lock(&h->lhash2[i].lock);
    *** Potential Deadlock ***

    process context:

    lock_acquire+0xcd/0x330
    _raw_spin_lock+0x33/0x40
    ------> Acquire (bucket) lhash2.lock with BH enabled
    __inet_hash+0x4b/0x210
    inet_csk_listen_start+0xe6/0x100
    inet_listen+0x95/0x1d0
    __sys_listen+0x69/0xb0
    __x64_sys_listen+0x14/0x20
    do_syscall_64+0x3c/0x90
    entry_SYSCALL_64_after_hwframe+0x72/0xdc

    bpf_sock_destroy run from iterator:

    lock_acquire+0xcd/0x330
    _raw_spin_lock+0x33/0x40
    ------> Acquire (bucket) lhash2.lock with BH disabled
    inet_unhash+0x9a/0x110
    tcp_set_state+0x6a/0x210
    tcp_abort+0x10d/0x200
    bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
    bpf_iter_run_prog+0x1ff/0x340
    ------> lock_sock_fast that acquires sock lock with BH disabled
    bpf_iter_tcp_seq_show+0xca/0x190
    bpf_seq_read+0x177/0x450

    ```

    Also, Yonghong reported a deadlock for non-listening TCP sockets that
    this change resolves. Previously, `lock_sock_fast` held the sock spin
    lock with BH disabled, and the same lock was acquired again in `tcp_abort`:

    ```
    watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
    RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
    Call Trace:
     <TASK>
     _raw_spin_lock+0x84/0x90
     tcp_abort+0x13c/0x1f0
     bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
     bpf_iter_run_prog+0x1aa/0x2c0
     ? preempt_count_sub+0x1c/0xd0
     ? from_kuid_munged+0x1c8/0x210
     bpf_iter_tcp_seq_show+0x14e/0x1b0
     bpf_seq_read+0x36c/0x6a0

    bpf_iter_tcp_seq_show
       lock_sock_fast
         __lock_sock_fast
           spin_lock_bh(&sk->sk_lock.slock);
            /* Fast path return with bottom halves disabled and sock::sk_lock.slock held. */

     ...
     tcp_abort
       local_bh_disable();
       spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)

    ```

    With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:

    ```
    lock_sock
        lock_sock_nested
           spin_lock_bh(&sk->sk_lock.slock);
           :
           spin_unlock_bh(&sk->sk_lock.slock);
    ```

    Acked-by: Yonghong Song <yhs@meta.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
    Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@isovalent.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
2025-01-28 12:51:53 +01:00
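
Note on 3dd3f94935: the difference between the two locking helpers described above can be sketched in userspace. This is a simplified model under stated assumptions, not the kernel implementation: `toy_sock` and the `toy_*` functions are hypothetical, the spinlock is a C11 `atomic_flag`, bottom halves are not modelled, and the real `lock_sock` additionally waits when the socket is already owned.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sock {
    atomic_flag slock; /* stands in for sk->sk_lock.slock */
    bool owned;        /* stands in for sk->sk_lock.owned */
};

/* Try-lock rather than spin, so the model can report a would-be deadlock. */
static bool toy_slock_try(struct toy_sock *sk)
{
    return !atomic_flag_test_and_set(&sk->slock);
}

static void toy_slock_release(struct toy_sock *sk)
{
    atomic_flag_clear(&sk->slock);
}

/* Fast path: returns with the spinlock still held. */
static void toy_lock_sock_fast(struct toy_sock *sk)
{
    while (!toy_slock_try(sk))
        ;
}

/* Slow path: marks the socket owned, then releases the spinlock. */
static void toy_lock_sock(struct toy_sock *sk)
{
    while (!toy_slock_try(sk))
        ;
    sk->owned = true;
    toy_slock_release(sk);
}
```

After `toy_lock_sock_fast()`, a second `toy_slock_try()` on the same socket fails, which corresponds to the soft lockup in `tcp_abort` shown above. After `toy_lock_sock()`, the spinlock is free again and a nested acquire succeeds.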
Rado Vrbovsky 81ce48e690 Merge: mptcp: phase-1 backports for RHEL-9.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5449

JIRA: https://issues.redhat.com/browse/RHEL-62871  
JIRA: https://issues.redhat.com/browse/RHEL-58839  
JIRA: https://issues.redhat.com/browse/RHEL-66083  
JIRA: https://issues.redhat.com/browse/RHEL-66074  
CVE: CVE-2024-46711  
CVE: CVE-2024-45009  
CVE: CVE-2024-45010  
Upstream Status: All mainline in net.git  
Tested: kselftest  
Conflicts: see individual patches  
  
Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:18:31 +00:00
Davide Caratti 7bb06cd72e tcp: annotate data-races around tp->tsoffset
JIRA: https://issues.redhat.com/browse/RHEL-62871
Upstream Status: net.git commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Conflicts:
  - net/ipv4/tcp_ipv4.c: keep using sock_net(sk) as we don't have
    upstream commit 08eaef904031 ("tcp: Clean up some functions.")

commit dd23c9f1e8d5c1d2e3d29393412385ccb9c7a948
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 19 21:28:48 2023 +0000

    tcp: annotate data-races around tp->tsoffset

    do_tcp_getsockopt() reads tp->tsoffset while another cpu
    might change its value.

    Fixes: 93be6ce0e9 ("tcp: set and get per-socket timestamp")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230719212857.3943972-3-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-11-12 10:18:58 +01:00
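
Note on 7bb06cd72e: a data-race annotation of this kind marks a lockless read and its matching write so the compiler performs each access exactly once. The sketch below uses simplified stand-ins for the kernel's `READ_ONCE()`/`WRITE_ONCE()` macros. The `TOY_*` macros, `toy_tcp_sock` and the accessor functions are hypothetical, and `__typeof__` is a GCC/Clang extension.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption, not kernel code). */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_tcp_sock {
    uint32_t tsoffset;
};

/* Writer side: may run while another CPU reads the field. */
static void toy_set_tsoffset(struct toy_tcp_sock *tp, uint32_t off)
{
    TOY_WRITE_ONCE(tp->tsoffset, off);
}

/* Reader side: a getsockopt-style path that reads without the writer's lock. */
static uint32_t toy_get_tsoffset(struct toy_tcp_sock *tp)
{
    return TOY_READ_ONCE(tp->tsoffset);
}
```

The volatile access does not add ordering or atomicity beyond what the hardware gives for an aligned 32-bit field. It only stops the compiler from tearing, fusing or re-reading the access.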
Paolo Abeni 7e492a042f tcp: avoid premature drops in tcp_add_backlog()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit ec00ed472bdb7d0af840da68c8c11bff9f4d9caa
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 23 12:56:20 2024 +0000

    tcp: avoid premature drops in tcp_add_backlog()

    While testing TCP performance with latest trees,
    I saw suspect SOCKET_BACKLOG drops.

    tcp_add_backlog() computes its limit with :

        limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
                (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
        limit += 64 * 1024;

    This does not take into account that sk->sk_backlog.len
    is reset only at the very end of __release_sock().

    Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
    sk_rcvbuf in normal conditions.

    We should double sk->sk_rcvbuf contribution in the formula
    to absorb bubbles in the backlog, which happen more often
    for very fast flows.

    This change maintains decent protection against abuses.

    Fixes: c377411f24 ("net: sk_add_backlog() take rmem_alloc into account")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:09:13 +02:00
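
Note on 7e492a042f: the effect of doubling the `sk_rcvbuf` contribution can be checked with a small userspace calculation. The first function mirrors the formula quoted in the message above. The second is a sketch of the doubled variant, and its 64-bit arithmetic and clamp to `UINT32_MAX` are assumptions made here to keep the toy free of overflow. Both function names are hypothetical.

```c
#include <stdint.h>

/* Limit as quoted in the commit message (before the change). */
static uint32_t toy_backlog_limit_old(uint32_t rcvbuf, uint32_t sndbuf)
{
    uint32_t limit = rcvbuf + (sndbuf >> 1);

    limit += 64 * 1024;
    return limit;
}

/* Doubled rcvbuf contribution; the u64 math and clamp are assumptions. */
static uint32_t toy_backlog_limit_new(uint32_t rcvbuf, uint32_t sndbuf)
{
    uint64_t limit = (uint64_t)rcvbuf << 1;

    limit += sndbuf >> 1;
    limit += 64 * 1024;
    return limit > UINT32_MAX ? UINT32_MAX : (uint32_t)limit;
}
```

For a 128 KiB receive buffer and a 16 KiB send buffer, the old formula gives 204800 bytes and the doubled variant gives 335872 bytes, so the extra headroom equals one full `sk_rcvbuf`.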
Paolo Abeni fdad6e7a51 tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1
Conflicts: different context in tcp_conn_request(), as rhel-9
  lacks the TCP AO support.

Upstream commit:
commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:22 2024 +0000

    tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

    TCP can transform a TIMEWAIT socket into a SYN_RECV one from
    a SYN packet, and the ISN of the SYNACK packet is normally
    generated using TIMEWAIT tw_snd_nxt :

    tcp_timewait_state_process()
    ...
        u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
        if (isn == 0)
            isn++;
        TCP_SKB_CB(skb)->tcp_tw_isn = isn;
        return TCP_TW_SYN;

    This SYN packet also bypasses normal checks against listen queue
    being full or not.

    tcp_conn_request()
    ...
           __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
    ...
            /* TW buckets are converted to open requests without
             * limitations, they conserve resources and peer is
             * evidently real one.
             */
            if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                    want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                    if (!want_cookie)
                            goto drop;
            }

    This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.

    Unfortunately this field has been accidentally cleared
    after the call to tcp_timewait_state_process() returning
    TCP_TW_SYN.

    Using a field in TCP_SKB_CB(skb) for a temporary state
    is overkill.

    Switch instead to a per-cpu variable.

    As a bonus, we do not have to clear tcp_tw_isn in TCP receive
    fast path.
    It is temporarily set then cleared only in the TCP_TW_SYN dance.

    Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
    Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:08:41 +02:00
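
Note on fdad6e7a51: the ISN derivation quoted above, and the "set, then consume and clear" handling of the temporary value, can be sketched in userspace. The `toy_*` names are hypothetical, and the `_Thread_local` variable is only a stand-in for a per-cpu variable, which userspace C does not have.

```c
#include <stdint.h>

/* ISN derivation as quoted from tcp_timewait_state_process(). */
static uint32_t toy_tw_isn(uint32_t tw_snd_nxt)
{
    uint32_t isn = tw_snd_nxt + 65535 + 2; /* wraps modulo 2^32 */

    if (isn == 0)
        isn++; /* 0 is reserved: tcp_conn_request() tests !isn */
    return isn;
}

/* Thread-local stand-in for the per-cpu field (assumption). */
static _Thread_local uint32_t toy_tcp_tw_isn;

/* Set the temporary ISN when a SYN hits a TIMEWAIT socket. */
static void toy_set_tw_isn(uint32_t isn)
{
    toy_tcp_tw_isn = isn;
}

/* Consume the temporary ISN and clear it, so it cannot leak to a later SYN. */
static uint32_t toy_take_tw_isn(void)
{
    uint32_t isn = toy_tcp_tw_isn;

    toy_tcp_tw_isn = 0;
    return isn;
}
```

Because the value lives outside the skb, nothing on the receive fast path has to clear it. Only the path that sets it also consumes it.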
Paolo Abeni 4cd846284a tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
JIRA: https://issues.redhat.com/browse/RHEL-62865
Tested: LNST, Tier1

Upstream commit:
commit b9e810405880c99baafd550ada7043e86465396e
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 7 09:33:21 2024 +0000

    tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()

    tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
    out if the request socket is created by a SYN hitting a TIMEWAIT socket.

    This has been buggy for a decade, let's directly pass the information
    from tcp_conn_request().

    This is a preparatory patch to make the following one easier to review.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-16 19:07:53 +02:00
Florian Westphal e68e0d9b40 net: tcp/dccp: prepare for tw_timer un-pinning
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit b334b924c9b7

Conflicts: net/ipv4/tcp_minisocks.c

We lack a "struct net *net" in this function, earlier
conflict fixup used sock_net().  Resolve this by keeping
sock_net() usage as-is.

commit b334b924c9b709bc969644fb5c406f5c9d01dceb
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jun 6 17:11:37 2024 +0200

    net: tcp/dccp: prepare for tw_timer un-pinning

    The TCP timewait timer is proving to be problematic for setups where
    scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
    statically via isolcpus=domains).

    What happens there is a CPU goes through tcp_time_wait(), arming the
    time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
    fires, causing interference for the now-isolated CPU. This is conceptually
    similar to the issue described in commit e02b93124855 ("workqueue: Unbind
    kworkers before sending them to exit()")

    Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
    lock held. Expand the lock's critical section from inet_twsk_kill() to
    inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
    the timer. IOW, this prevents the following race:

                                 tcp_time_wait()
                                   inet_twsk_hashdance()
      inet_twsk_deschedule_put()
        del_timer_sync()
                                   inet_twsk_schedule()

    Thanks to Paolo Abeni for suggesting to leverage the ehash lock.

    This also restores a comment from commit ec94c2696f ("tcp/dccp: avoid
    one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
    a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.

    inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
    with inet_twsk_hashdance_schedule().

    To ease possible regression search, actual un-pin is done in next patch.

    Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Co-developed-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:56:29 +02:00
Florian Westphal 23f780623d tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp
JIRA: https://issues.redhat.com/browse/RHEL-9279
Upstream Status: commit 69e0b33a7fce

CS9 lacks both support for TCP Authentication option and usec
resolution for TCP timestamps.
Both features are out of scope, so do needed context fixups.
This change was added to reduce conflicts in the followup patch.

commit 69e0b33a7fce4d96649b9fa32e56b696921aa48e
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jun 3 15:51:06 2024 +0000

    tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stamp

    These fields can be read and written locklessly, add annotations
    around these minor races.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts: net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c

Signed-off-by: Florian Westphal <fwestpha@redhat.com>
2024-08-21 16:55:25 +02:00
Antoine Tenart 3a0f9f0ce0 tcp: use sk_skb_reason_drop to free rx packets
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: net-next.git

commit 46a02aa357529d7b038096955976b14f7c44aa23
Author: Yan Zhai <yan@cloudflare.com>
Date:   Mon Jun 17 11:09:20 2024 -0700

    tcp: use sk_skb_reason_drop to free rx packets

    Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
    socket to the tracepoint.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
    Signed-off-by: Yan Zhai <yan@cloudflare.com>
    Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 0bc1f777a4 tcp: rstreason: handle timewait cases in the receive path
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit 22a32557758a7100e46dfa8f383a401125e60b16
Author: Jason Xing <kernelxing@tencent.com>
Date:   Fri May 10 20:25:01 2024 +0800

    tcp: rstreason: handle timewait cases in the receive path

    There are two possible cases where TCP layer can send an RST. Since they
    happen in the same place, I think using one independent reason is enough
    to identify this special situation.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:42 +02:00
Antoine Tenart 51c78f9a4a rstreason: make it work in trace world
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b533fb9cf4f7c6ca2aa255a5a1fdcde49fff2b24
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:40 2024 +0800

    rstreason: make it work in trace world

    At last, we should let it work by introducing this reset reason in
    trace world.

    One of the possible expected outputs is:
    ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
    state=TCP_ESTABLISHED reason=NOT_SPECIFIED

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8ea5cff87d tcp: support rstreason for passive reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit 120391ef9ca8fe8f82ea3f2961ad802043468226
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:37 2024 +0800

    tcp: support rstreason for passive reset

    Reuse the dropreason logic to show the exact reason of tcp reset,
    so we can finally display the corresponding item in enum sk_reset_reason
    instead of reinventing new reset reasons. This patch replaces all
    the prior NOT_SPECIFIED reasons.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 25344d90dd rstreason: prepare for passive reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:
- Context differences due to missing upstream commits ba7783ad45c8
  ("net/tcp: Add AO sign to RST packets") and d5dfbfa2f88e ("mptcp: drop
  duplicate header inclusions") in c9s.

commit 6be49deaa09576c141002a2e6f816a1709bc2c86
Author: Jason Xing <kernelxing@tencent.com>
Date:   Thu Apr 25 11:13:35 2024 +0800

    rstreason: prepare for passive reset

    Adjust the parameter and support passing reason of reset which
    is for now NOT_SPECIFIED. No functional changes.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 528623fc31 trace: tcp: fully support trace_tcp_send_reset
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:
- Context differences due to missing upstream commits ba7783ad45c8
  ("net/tcp: Add AO sign to RST packets") and 3cccda8db2cf ("ipv6: move
  np->repflow to atomic flags") in c9s.

commit 19822a980e1956a6572998887a7df5a0607a32f6
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Apr 1 15:36:05 2024 +0800

    trace: tcp: fully support trace_tcp_send_reset

    Prior to this patch, enabling trace_tcp_send_reset only shows events
    under two circumstances:
    1) active rst mode
    2) non-active rst mode and based on the full socket

    That means an inconsistency appears if we use tcpdump and tracing
    simultaneously to see how an RST happens.

    We also need to take other cases into consideration,
    say:
    1) time-wait socket
    2) no socket
    ...

    By parsing the incoming skb and reversing its 4-tuple, we can
    identify the exact 'flow', which might not exist.

    Sample output after applying this patch:
    1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
    state=TCP_ESTABLISHED
    2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
    state=UNKNOWN
    Note:
    1) UNKNOWN means we cannot extract the right information from skb.
    2) skbaddr/skaddr could be 0

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Link: https://lore.kernel.org/r/20240401073605.37335-3-kerneljasonxing@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:41 +02:00
Antoine Tenart 8e320d89a7 tcp: make dropreason in tcp_child_process() work
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit ee01defe25bad09a37b68dd051a7e931d1e4cd91
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:27 2024 +0800

    tcp: make dropreason in tcp_child_process() work

    It's time to let it work right now. We've already prepared for this:)

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 8f346a11e7 tcp: make the dropreason really work when calling tcp_rcv_state_process()
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git

commit b9825695930546af725b1e686b8eaf4c71201728
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:26 2024 +0800

    tcp: make the dropreason really work when calling tcp_rcv_state_process()

    Update three callers including both ipv4 and ipv6 and let the dropreason
    mechanism work in reality.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Antoine Tenart 191443751b tcp: directly drop skb in cookie check for ipv4
JIRA: https://issues.redhat.com/browse/RHEL-48648
Upstream Status: linux.git
Conflicts:
- Context difference due to missing upstream commit 8e7bab6b9652
  ("tcp: Factorise cookie-dependent fields initialisation in
  cookie_v[46]_check()") in c9s.

commit 65be4393f363c4bd5c388ddf3e3eb4abee2b1f79
Author: Jason Xing <kernelxing@tencent.com>
Date:   Mon Feb 26 11:22:19 2024 +0800

    tcp: directly drop skb in cookie check for ipv4

    Only move the skb drop from tcp_v4_do_rcv() to cookie_v4_check() itself,
    no other changes made. It can help us refine the specific drop reasons
    later.

    Signed-off-by: Jason Xing <kernelxing@tencent.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2024-07-16 17:29:40 +02:00
Guillaume Nault b445907e34 tcp: Use refcount_inc_not_zero() in tcp_twsk_unique().
JIRA: https://issues.redhat.com/browse/RHEL-39837
Upstream Status: linux.git
CVE: CVE-2024-36904

commit f2db7230f73a80dbb179deab78f88a7947f0ab7e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed May 1 14:31:45 2024 -0700

    tcp: Use refcount_inc_not_zero() in tcp_twsk_unique().

    Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique()
    with nice analysis.

    Since commit ec94c2696f ("tcp/dccp: avoid one atomic operation for
    timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's
    sk_refcnt after putting it into ehash and releasing the bucket lock.

    Thus, there is a small race window where other threads could try to
    reuse the port during connect() and call sock_hold() in tcp_twsk_unique()
    for the TIME-WAIT socket with zero refcnt.

    If that happens, the refcnt taken by tcp_twsk_unique() is overwritten
    and sock_put() will cause underflow, triggering a real use-after-free
    somewhere else.

    To avoid the use-after-free, we need to use refcount_inc_not_zero() in
    tcp_twsk_unique() and give up on reusing the port if it returns false.

    [0]:
    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 0 PID: 1039313 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110
    CPU: 0 PID: 1039313 Comm: trigger Not tainted 6.8.6-200.fc39.x86_64 #1
    Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023
    RIP: 0010:refcount_warn_saturate+0xe5/0x110
    Code: 42 8e ff 0f 0b c3 cc cc cc cc 80 3d aa 13 ea 01 00 0f 85 5e ff ff ff 48 c7 c7 f8 8e b7 82 c6 05 96 13 ea 01 01 e8 7b 42 8e ff <0f> 0b c3 cc cc cc cc 48 c7 c7 50 8f b7 82 c6 05 7a 13 ea 01 01 e8
    RSP: 0018:ffffc90006b43b60 EFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff888009bb3ef0 RCX: 0000000000000027
    RDX: ffff88807be218c8 RSI: 0000000000000001 RDI: ffff88807be218c0
    RBP: 0000000000069d70 R08: 0000000000000000 R09: ffffc90006b439f0
    R10: ffffc90006b439e8 R11: 0000000000000003 R12: ffff8880029ede84
    R13: 0000000000004e20 R14: ffffffff84356dc0 R15: ffff888009bb3ef0
    FS:  00007f62c10926c0(0000) GS:ffff88807be00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020ccb000 CR3: 000000004628c005 CR4: 0000000000f70ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? refcount_warn_saturate+0xe5/0x110
     ? __warn+0x81/0x130
     ? refcount_warn_saturate+0xe5/0x110
     ? report_bug+0x171/0x1a0
     ? refcount_warn_saturate+0xe5/0x110
     ? handle_bug+0x3c/0x80
     ? exc_invalid_op+0x17/0x70
     ? asm_exc_invalid_op+0x1a/0x20
     ? refcount_warn_saturate+0xe5/0x110
     tcp_twsk_unique+0x186/0x190
     __inet_check_established+0x176/0x2d0
     __inet_hash_connect+0x74/0x7d0
     ? __pfx___inet_check_established+0x10/0x10
     tcp_v4_connect+0x278/0x530
     __inet_stream_connect+0x10f/0x3d0
     inet_stream_connect+0x3a/0x60
     __sys_connect+0xa8/0xd0
     __x64_sys_connect+0x18/0x20
     do_syscall_64+0x83/0x170
     entry_SYSCALL_64_after_hwframe+0x78/0x80
    RIP: 0033:0x7f62c11a885d
    Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a3 45 0c 00 f7 d8 64 89 01 48
    RSP: 002b:00007f62c1091e58 EFLAGS: 00000296 ORIG_RAX: 000000000000002a
    RAX: ffffffffffffffda RBX: 0000000020ccb004 RCX: 00007f62c11a885d
    RDX: 0000000000000010 RSI: 0000000020ccb000 RDI: 0000000000000003
    RBP: 00007f62c1091e90 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000296 R12: 00007f62c10926c0
    R13: ffffffffffffff88 R14: 0000000000000000 R15: 00007ffe237885b0
     </TASK>

    Fixes: ec94c2696f ("tcp/dccp: avoid one atomic operation for timewait hashdance")
    Reported-by: Anderson Nascimento <anderson@allelesecurity.com>
    Closes: https://lore.kernel.org/netdev/37a477a6-d39e-486b-9577-3463f655a6b7@allelesecurity.com/
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240501213145.62261-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-06-07 20:27:04 +02:00
Sabrina Dubroca 7bc5eeb384 net: skbuff: generalize the skb->decrypted bit
JIRA: https://issues.redhat.com/browse/RHEL-29306

commit 9f06f87fef689d28588cde8c7ebb00a67da34026
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 3 13:21:39 2024 -0700

    net: skbuff: generalize the skb->decrypted bit

    The ->decrypted bit can be reused for other crypto protocols.
    Remove the direct dependency on TLS, add helpers to clean up
    the ifdefs leaking out everywhere.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Sabrina Dubroca <sdubroca@redhat.com>
2024-05-01 17:48:16 +02:00
Paolo Abeni 026591e567 tcp: check mptcp-level constraints for backlog coalescing
JIRA: https://issues.redhat.com/browse/RHEL-21432
Tested: LNST, Tier1

Upstream commit:
commit 6db8a37dfc541e059851652cfd4f0bb13b8ff6af
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Oct 18 11:23:53 2023 -0700

    tcp: check mptcp-level constraints for backlog coalescing

    The MPTCP protocol can acquire the subflow-level socket lock, which
    causes the TCP backlog to be used. When inserting new skbs into the
    backlog, the stack will try to coalesce them.

    Currently, we have no check in place to ensure that such coalescing
    will respect the MPTCP-level DSS, and that may cause data stream
    corruption, as reported by Christoph.

    Address the issue by adding the relevant admission check for coalescing
    in tcp_add_backlog().

    Note the issue is not easy to reproduce, as the MPTCP protocol tries
    hard to avoid acquiring the subflow-level socket lock.

    Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
    Cc: stable@vger.kernel.org
    Reported-by: Christoph Paasch <cpaasch@apple.com>
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/420
    Reviewed-by: Mat Martineau <martineau@kernel.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Mat Martineau <martineau@kernel.org>
    Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-2-17ecb002e41d@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12 13:41:50 +01:00
Felix Maurer 130ad87ddc tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
JIRA: https://issues.redhat.com/browse/RHEL-11592
Conflicts:
- net/ipv4/sysctl_net_ipv4.c: context difference due to missing new sysctls
- net/ipv4/tcp_ipv4.c: context difference due to missing ccce324dabfe
  ("tcp: make the first N SYN RTO backoffs linear") and 37ba017dcc3b
  ("ipv4/tcp: do not use per netns ctl sockets")

commit b650d953cd391595e536153ce30b4aab385643ac
Author: mfreemon@cloudflare.com <mfreemon@cloudflare.com>
Date:   Sun Jun 11 22:05:24 2023 -0500

    tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

    Under certain circumstances, the tcp receive buffer memory limit
    set by autotuning (sk_rcvbuf) is increased due to incoming data
    packets as a result of the window not closing when it should be.
    This can result in the receive buffer growing all the way up to
    tcp_rmem[2], even for tcp sessions with a low BDP.

    To reproduce:  Connect a TCP session with the receiver doing
    nothing and the sender sending small packets (an infinite loop
    of socket send() with 4 bytes of payload with a sleep of 1 ms
    in between each send()).  This will cause the tcp receive buffer
    to grow all the way up to tcp_rmem[2].

    As a result, a host can have individual tcp sessions with receive
    buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
    limits, causing the host to go into tcp memory pressure mode.

    The fundamental issue is the relationship between the granularity
    of the window scaling factor and the number of bytes ACKed back
    to the sender.  This problem has previously been identified in
    RFC 7323, appendix F [1].
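
The granularity effect can be illustrated with a small user-space sketch. It assumes only what RFC 7323 specifies: the window field on the wire carries the receive window right-shifted by the scale factor. The helper name advertised_window() is hypothetical and is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): the window value a peer
 * effectively sees after the window has been right-shifted by the scale
 * factor for the wire and scaled back up on the other side.
 */
static uint32_t advertised_window(uint32_t window_bytes, unsigned int wscale)
{
	return (window_bytes >> wscale) << wscale;
}
```

With a scale factor of 7 the granularity is 128 bytes, so advertised_window(1000, 7) and advertised_window(996, 7) both evaluate to 896: consuming 4 bytes of window does not change the value that can be advertised.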

    The Linux kernel currently adheres to never shrinking the window.

    In addition to the overallocation of memory mentioned above, the
    current behavior is functionally incorrect, because once tcp_rmem[2]
    is reached when no remediations remain (i.e. tcp collapse fails to
    free up any more memory and there are no packets to prune from the
    out-of-order queue), the receiver will drop in-window packets
    resulting in retransmissions and an eventual timeout of the tcp
    session.  A receive buffer full condition should instead result
    in a zero window and an indefinite wait.

    In practice, this problem is largely hidden for most flows.  It
    is not applicable to mice flows.  Elephant flows can send data
    fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
    triggering a zero window.

    But this problem does show up for other types of flows.  Examples
    are websockets and other types of flows that send small amounts of
    data spaced apart slightly in time.  In these cases, we directly
    encounter the problem described in [1].

    RFC 7323, section 2.4 [2], says there are instances when a retracted
    window can be offered, and that TCP implementations MUST ensure
    that they handle a shrinking window, as specified in RFC 1122,
    section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
    management have made clear that the sender must accept a shrunk window
    from the receiver, including RFC 793 [4] and RFC 1323 [5].

    This patch implements the functionality to shrink the tcp window
    when necessary to keep the right edge within the memory limit set
    by autotuning (sk_rcvbuf).  This new functionality is enabled with
    the new sysctl: net.ipv4.tcp_shrink_window

    Additional information can be found at:
    https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

    [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
    [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
    [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
    [4] https://www.rfc-editor.org/rfc/rfc793
    [5] https://www.rfc-editor.org/rfc/rfc1323

    Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-10-31 16:20:06 +01:00
Jan Stancek 5a0d19aa9d Merge: net: improve skb hash stability when net.core.txrehash=0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2694

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966

As a side effect this also improves stability for IPv6 autoflowlabel.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-07-04 11:15:02 +02:00
Jan Stancek e341c7e709 Merge: bpf, xdp: update to 6.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2583

Rebase bpf and xdp to 6.3.

Bugzilla: https://bugzilla.redhat.com/2178930

Signed-off-by: Viktor Malik <vmalik@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jason Wang <jasowang@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-06-28 07:52:45 +02:00
Antoine Tenart 1cfc972fac net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: net-next.git
Conflicts:\
- Context difference due to missing upstream commit e22aa1486668 ("net:
  Find dst with sk's xfrm policy not ctl_sk") in c9s.

commit c0a8966e2bc7d31f77a7246947ebc09c1ff06066
Author: Antoine Tenart <atenart@kernel.org>
Date:   Tue May 23 18:14:52 2023 +0200

    net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV

    When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in
    TIME_WAIT and SYN_RECV where it's not set in the reply skb from
    ip_send_unicast_reply. Those packets will have a mismatched hash with
    others from the same flow as their hashes will be 0. IPv6 does not have
    the same issue as the hash is set from the socket txhash in those cases.

    This commit sets the hash in the reply skb from ip_send_unicast_reply,
    which makes the IPv4 code behave like IPv6.

    Signed-off-by: Antoine Tenart <atenart@kernel.org>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:36 +02:00
Antoine Tenart 6dd8976945 tcp: fix possible sk_priority leak in tcp_v4_send_reset()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2214966
Upstream Status: linux.git
Conflicts:\
- Context differences due to missing upstream commits e22aa1486668
  ("net: Find dst with sk's xfrm policy not ctl_sk") and 37ba017dcc3b
  ("ipv4/tcp: do not use per netns ctl sockets") in c9s.

commit 1e306ec49a1f206fd2cc89a42fac6e6f592a8cc1
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 11 11:47:49 2023 +0000

    tcp: fix possible sk_priority leak in tcp_v4_send_reset()

    When tcp_v4_send_reset() is called with @sk == NULL,
    we do not change ctl_sk->sk_priority, which could have been
    set from a prior invocation.

    Change tcp_v4_send_reset() to set sk_priority and sk_mark
    fields before calling ip_send_unicast_reply().

    This means tcp_v4_send_reset() and tcp_v4_send_ack()
    no longer have to clear ctl_sk->sk_mark after
    their call to ip_send_unicast_reply().

    Fixes: f6c0f5d209 ("tcp: honor SO_PRIORITY in TIME_WAIT state")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Antoine Tenart <atenart@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-16 10:55:24 +02:00
Felix Maurer eed4a49571 bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930

commit 580031ff9952b7dbf48dedba6b56a100ae002bef
Author: Martin KaFai Lau <martin.lau@kernel.org>
Date:   Mon Mar 27 17:42:32 2023 -0700

    bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp

    While reviewing the udp-iter batching patches, I noticed that
    bpf_iter_tcp calling sock_put() is incorrect. It should call
    sock_gen_put() instead, because bpf_iter_tcp iterates the ehash table,
    which also contains request (req) and timewait (tw) sockets. This patch
    replaces all sock_put() calls with sock_gen_put() in the bpf_iter_tcp
    codepath.
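
The difference between the two put operations can be modelled in user space. This is a minimal sketch under stated assumptions: the mini_* types and functions are hypothetical stand-ins, and the model only records which destructor would run once the last reference is dropped.

```c
#include <assert.h>

/* Hypothetical, simplified socket kinds; not the kernel's definitions. */
enum mini_kind { MINI_FULL, MINI_TIME_WAIT, MINI_REQUEST };
enum mini_freed { FREED_NONE, FREED_AS_FULL, FREED_AS_TIME_WAIT, FREED_AS_REQUEST };

struct mini_sock {
	enum mini_kind kind;
	int refcnt;
	enum mini_freed freed;	/* records which destructor ran */
};

/* Models a put that assumes every object is a full socket. */
static void mini_sock_put(struct mini_sock *sk)
{
	if (--sk->refcnt == 0)
		sk->freed = FREED_AS_FULL;
}

/* Models a put that picks the destructor matching the socket kind. */
static void mini_sock_gen_put(struct mini_sock *sk)
{
	if (--sk->refcnt != 0)
		return;

	switch (sk->kind) {
	case MINI_TIME_WAIT:
		sk->freed = FREED_AS_TIME_WAIT;
		break;
	case MINI_REQUEST:
		sk->freed = FREED_AS_REQUEST;
		break;
	default:
		sk->freed = FREED_AS_FULL;
		break;
	}
}
```

In this model, dropping the last reference on a MINI_TIME_WAIT object through mini_sock_put() runs the full-socket destructor on the wrong kind of object, while mini_sock_gen_put() selects the matching one.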

    Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20230328004232.2134233-1-martin.lau@linux.dev

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-14 10:44:30 +02:00
Antoine Tenart 30b200a890 tcp: add TCP_MINTTL drop reason
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184073
Upstream Status: linux.git
Conflicts:\
- Context difference due to missing upstream commits 020e71a3cf7f
  ("ipv4: guard IP_MINTTL with a static key") and 14834c4f4eb3 ("ipv4:
  annotate data races arount inet->min_ttl") in c9s.

commit 2798e36dc233a409a5d3f26f73029596dc504020
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Feb 1 17:43:45 2023 +0000

    tcp: add TCP_MINTTL drop reason

    In the unlikely case incoming packets are dropped because
    of IP_MINTTL / IPV6_MINHOPCOUNT constraints...

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230201174345.2708943-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-06-06 11:23:15 +02:00
Paolo Abeni 220a990332 dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit 77934dc6db0d2b111a8f2759e9ad2fb67f5cffa5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Nov 18 17:49:11 2022 -0800

    dccp/tcp: Reset saddr on failure after inet6?_hash_connect().

    When connect() is called on a socket bound to the wildcard address,
    we change the socket's saddr to a local address.  If the socket
    fails to connect() to the destination, we have to reset the saddr.

    However, when an error occurs after inet6?_hash_connect() in
    (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
    the socket bound to the address.

    From the user's point of view, whether saddr is reset or not varies
    with errno.  Let's fix this inconsistent behaviour.

    Note that after this patch, the repro [0] will trigger the WARN_ON()
    in inet_csk_get_port() again, but this patch is not buggy and rather
    fixes a bug papering over the bhash2's bug for which we need another
    fix.

    For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
    by this sequence:

      s1 = socket()
      s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
      s1.bind(('127.0.0.1', 10000))
      s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
      # or s1.connect(('127.0.0.1', 10000))

      s2 = socket()
      s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
      s2.bind(('0.0.0.0', 10000))
      s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL

      s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);

    [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09

    Fixes: 3df80d9320 ("[DCCP]: Introduce DCCPv6")
    Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Joanne Koong <joannelkoong@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:57:10 +02:00
Paolo Abeni 059dc63005 tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit ec791d8149ff60c40ad2074af3b92a39c916a03f
Author: Lu Wei <luwei32@huawei.com>
Date:   Fri Oct 21 12:06:22 2022 +0800

    tcp: fix a signed-integer-overflow bug in tcp_add_backlog()

    The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
    in tcp_add_backlog(), the variable limit is calculated by adding
    sk_rcvbuf, sk_sndbuf and 64 * 1024, which may exceed the max value
    of int and overflow. This patch reduces the limit budget by
    halving the sndbuf to solve this issue since ACK packets are much
    smaller than the payload.
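
The arithmetic can be checked in user space. The sketch below is an illustration, not a copy of the patch: struct fake_sock is a hypothetical stand-in for the two int fields named above, and doing the sum in unsigned 32-bit arithmetic is an assumption of this sketch about how the result is kept in range.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the two int fields of struct sock. */
struct fake_sock {
	int sk_rcvbuf;
	int sk_sndbuf;
};

/* The original sum, widened to 64 bits so the true value is visible. */
static int64_t naive_limit(const struct fake_sock *sk)
{
	return (int64_t)sk->sk_rcvbuf + sk->sk_sndbuf + 64 * 1024;
}

/* Halve sndbuf and add in unsigned 32-bit arithmetic (sketch assumption). */
static uint32_t backlog_limit(const struct fake_sock *sk)
{
	uint32_t limit = (uint32_t)sk->sk_rcvbuf + (uint32_t)(sk->sk_sndbuf >> 1);

	limit += 64 * 1024;
	return limit;
}
```

With both buffers at INT_MAX, naive_limit() is larger than INT_MAX, so the same sum held in an int would overflow, whereas backlog_limit() stays below UINT32_MAX.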

    Fixes: c9c3321257 ("tcp: add tcp_add_backlog()")
    Signed-off-by: Lu Wei <luwei32@huawei.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:56:46 +02:00
Paolo Abeni 8dda5cd012 tcp: minor optimization in tcp_add_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188561
Tested: LNST, Tier1

Upstream commit:
commit d519f350967a60b85a574ad8aeac43f2b4384746
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:30 2021 -0800

    tcp: minor optimization in tcp_add_backlog()

    If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
    are not used. Defer their access to the point we need them.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-21 09:56:38 +02:00
Guillaume Nault 04b96d8fcf tcp: Fix data-races around sysctl_tcp_reflect_tos.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160073
Upstream Status: linux.git

commit 870e3a634b6a6cb1543b359007aca73fe6a03ac5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 22 11:22:04 2022 -0700

    tcp: Fix data-races around sysctl_tcp_reflect_tos.

    While reading sysctl_tcp_reflect_tos, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.
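
The idea can be sketched in user space. READ_ONCE_INT() below is a simplified approximation written for this illustration; the kernel's READ_ONCE() is more general. The variable and function names are hypothetical.

```c
#include <assert.h>

/*
 * User-space approximation for illustration only. The volatile access
 * makes the compiler perform a single load instead of re-reading the
 * variable or merging the access with others.
 */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/* Hypothetical stand-in for a sysctl that another thread may rewrite. */
static int sysctl_reflect_tos;

static int pick_tos(int syn_tos, int default_tos)
{
	/* Take one snapshot and base the whole decision on it. */
	int reflect = READ_ONCE_INT(sysctl_reflect_tos);

	return reflect ? syn_tos : default_tos;
}
```

Reading the value into a local once means a concurrent write cannot make two parts of the same decision disagree with each other.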

    Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Wei Wang <weiwan@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-01-17 12:25:15 +01:00
Guillaume Nault 8114c29f71 tcp: Fix a data-race around sysctl_tcp_tw_reuse.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2149949
Upstream Status: linux.git

commit cbfc6495586a3f09f6f07d9fb3c7cafe807e3c55
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Jul 15 10:17:52 2022 -0700

    tcp: Fix a data-race around sysctl_tcp_tw_reuse.

    While reading sysctl_tcp_tw_reuse, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2022-12-22 11:37:58 +01:00
Frantisek Hrbata e265d68e77 Merge: tcp: phase-1 backports for RHEL-9.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1504

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: All mainline in net-next.git.
Tested: boot-tested only
Conflicts: see individual patches

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 02:40:21 -05:00
Davide Caratti 728983215c tcp: Access &tcp_hashinfo via net.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 4461568aa4e5
Conflicts:
 - net/ipv4/tcp_ipv4.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address") and 08eaef904031 ("tcp: Clean up some functions.")
 - net/ipv6/tcp_ipv6.c: context mismatch as we don't have upstream
   commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and
   address")
 - net/ipv4/tcp_minisocks.c: hunk applied manually to fix a build issue
   caused by missing upstream commit 08eaef904031 ("tcp: Clean up some
   functions.")

commit 4461568aa4e565de2c336f4875ddf912f26da8a5
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Wed Sep 7 18:10:20 2022 -0700

    tcp: Access &tcp_hashinfo via net.

    We will soon introduce an optional per-netns ehash.

    This means we cannot use tcp_hashinfo directly in most places.

    Instead, access it via net->ipv4.tcp_death_row.hashinfo.

    The access will be valid only while initialising tcp_hashinfo
    itself and creating/destroying each netns.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:59 +01:00
Davide Caratti 9aac6c4346 net: add per_cpu_fw_alloc field to struct proto
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit 0defbb0af775
Conflicts:
 - net/core/sock.c: context mismatch because of missing backport of
   upstream commit f20cfd662a62 ("net: add sanity check in proto_register()")

commit 0defbb0af775ef037913786048d099bbe8b9a2c2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 23:34:08 2022 -0700

    net: add per_cpu_fw_alloc field to struct proto

    Each protocol having a ->memory_allocated pointer gets a corresponding
    per-cpu reserve, that following patches will use.

    Instead of having reserved bytes per socket,
    we want to have per-cpu reserves.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:55 +01:00
Davide Caratti a3894ee946 net: inet: Retire port only listening_hash
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137858
Upstream Status: net.git commit cae3873c5b3a

commit cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed May 11 17:06:05 2022 -0700

    net: inet: Retire port only listening_hash

    The listen sk is currently stored in two hash tables,
    listening_hash (hashed by port) and lhash2 (hashed by port and address).

    After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
    and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
    the TCP-SYN lookup fast path does not use listening_hash.

    The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2")
    also moved the seq_file (/proc/net/tcp) iteration usage from
    listening_hash to lhash2.

    There are still a few listening_hash usages left.
    One of them is inet_reuseport_add_sock() which uses the listening_hash
    to search a listen sk during the listen() system call.  This turns
    out to be very slow on use cases that listen on many different
    VIPs at a popular port (e.g. 443).  [ On top of the slowness in
    adding to the tail in the IPv6 case ].  The latter patch has a
    selftest to demonstrate this case.

    This patch takes this chance to move all remaining listening_hash
    usages to lhash2 and then retire listening_hash.

    Since most changes need to be done together, it is hard to cut
    the listening_hash to lhash2 switch into small patches.  The
    changes in this patch are highlighted here for review
    purposes.

    1. Because of the listening_hash removal, lhash2 can use the
       sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
       This will also keep the sk_unhashed() check working as is
       after we stop adding sk to listening_hash.

       The union is removed from inet_listen_hashbucket because
       only nulls_head is needed.

    2. icsk->icsk_listen_portaddr_node and its helpers are removed.

    3. The current lhash2 users need to iterate with sk_nulls_node
       instead of icsk_listen_portaddr_node.

       One case is in the inet[6]_lhash2_lookup().

       Another case is the seq_file iterator in tcp_ipv4.c.
       One thing to note is sk_nulls_next() is needed
       because the old inet_lhash2_for_each_icsk_continue()
       does a "next" first before iterating.

    4. Move the remaining listening_hash usage to lhash2

       inet_reuseport_add_sock() which this series is
       trying to improve.

       inet_diag.c and mptcp_diag.c are the final two
       remaining use cases and are now moved to lhash2 as well.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-08 17:10:54 +01:00
Davide Caratti c7ab33ab51 tcp: add a missing nf_reset_ct() in 3WHS handling
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2136491
Upstream Status: net-next.git commit 6f0012e35160

commit 6f0012e35160cd08a53e46e3b3bbf724b92dfe68
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 23 05:04:36 2022 +0000

    tcp: add a missing nf_reset_ct() in 3WHS handling

    When the third packet of 3WHS connection establishment
    contains payload, it is added into socket receive queue
    without the XFRM check and the drop of connection tracking
    context.

    This means that if the data is left unread in the socket
    receive queue, the conntrack module cannot be unloaded.

    As most applications usually read the incoming data
    immediately after accept(), the bug has been hiding for
    quite a long time.

    Commit 68822bdf76f1 ("net: generalize skb freeing
    deferral to per-cpu lists") exposed this bug because
    even if the application reads this data, the skb
    with nfct state could stay in a per-cpu cache for
    an arbitrary time, if said cpu no longer processes RX softirqs.

    Many thanks to Ilya Maximets for reporting this issue,
    and for testing various patches:
    https://lore.kernel.org/netdev/20220619003919.394622-1-i.maximets@ovn.org/

    Note that I also added a missing xfrm4_policy_check() call,
    although this is probably not a big issue, as the SYN
    packet should have been dropped earlier.

    Fixes: b59c270104 ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
    Reported-by: Ilya Maximets <i.maximets@ovn.org>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Cc: Pablo Neira Ayuso <pablo@netfilter.org>
    Cc: Steffen Klassert <steffen.klassert@secunet.com>
    Tested-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
    Link: https://lore.kernel.org/r/20220623050436.1290307-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2022-11-07 10:10:25 +01:00
Antoine Tenart 626c678449 net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit f8319dfd1b3b3be6c08795017fc30f880f8bc861
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri May 13 11:03:39 2022 +0800

    net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()

    The 'drop_reason' that is passed to kfree_skb_reason() in tcp_v4_rcv()
    and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
    return value of tcp_inbound_md5_hash().

    This can panic the kernel with a NULL pointer dereference in
    net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
    is NULL.
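
The failure mode and the reset can be sketched as follows. All mini_* names are hypothetical and the table is a shortened stand-in for the real name table; the sketch only shows why the zero value must be replaced before the reason is used as an index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened reason list; 0 means "not dropped yet". */
enum mini_reason {
	MINI_NOT_DROPPED_YET = 0,
	MINI_NOT_SPECIFIED,
	MINI_MD5_FAILURE,
	MINI_REASON_MAX,
};

/* Name table with no entry for 0, mirroring drop_reasons[0] == NULL. */
static const char *const mini_reason_names[MINI_REASON_MAX] = {
	[MINI_NOT_SPECIFIED] = "NOT_SPECIFIED",
	[MINI_MD5_FAILURE]   = "TCP_MD5FAILURE",
};

/* Replace the "not dropped" value before it reaches the free path. */
static enum mini_reason reason_for_free(enum mini_reason reason)
{
	return reason == MINI_NOT_DROPPED_YET ? MINI_NOT_SPECIFIED : reason;
}
```

Here mini_reason_names[MINI_NOT_DROPPED_YET] is NULL, while any value returned by reason_for_free() indexes a valid string.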

    Fixes: 1330b6ef3313 ("skb: make drop reason booleanable")
    Reviewed-by: Jiang Biao <benbjiang@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-14 17:40:26 +02:00
Antoine Tenart 04f4917aca skb: make drop reason booleanable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 1330b6ef3313fcec577d2b020c290dc8b9f11f1a
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Mar 7 16:44:21 2022 -0800

    skb: make drop reason booleanable

    We have a number of cases where a function returns a drop/no drop
    decision as a boolean. Now that we want to report the reason
    code as well we have to pass extra output arguments.

    We can make the reason code evaluate correctly as bool.

    I believe we're good to reorder the reasons as they are
    reported to user space as strings.
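
A minimal sketch of the idea, using a hypothetical enum and function names: when the "no drop" value is 0, a function can return the reason itself and callers can still test the result like a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened enum; only the zero value matters here. */
enum mini_drop {
	MINI_KEEP = 0,		/* evaluates as false */
	MINI_DROP_NOT_SPECIFIED,
	MINI_DROP_MD5_FAILURE,
};

/* Returns the decision and the reason in one value. */
static enum mini_drop check_md5(bool signature_ok)
{
	return signature_ok ? MINI_KEEP : MINI_DROP_MD5_FAILURE;
}

/* The caller tests the result as a boolean and keeps the reason. */
static bool should_drop(bool signature_ok, enum mini_drop *reason)
{
	*reason = check_md5(signature_ok);
	return *reason;		/* non-zero reason converts to true */
}
```

The extra output argument in should_drop() exists only to expose the reason to the caller in this sketch; check_md5() itself needs none.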

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 997d93a49f net/tcp: Merge TCP-MD5 inbound callbacks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7bbb765b73496699a165d505ecdce962f903b422
Author: Dmitry Safonov <0x7f454c46@gmail.com>
Date:   Wed Feb 23 17:57:40 2022 +0000

    net/tcp: Merge TCP-MD5 inbound callbacks

    The functions do essentially the same work to verify the TCP-MD5
    signature. The code can be merged into one family-independent function
    to reduce copy'n'paste and generated code.
    Later, with the TCP-AO option added, this will allow creating one
    function that's responsible for segment verification and has all the
    different checks for MD5/AO/non-signed packets, which in turn will help
    to see the checks for all corner cases in one function, rather than
    spread around different families and functions.

    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
    Signed-off-by: Dmitry Safonov <dima@arista.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 7e7867a749 net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 8eba65fa5f06519042b98564089b942d795e3f8d
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:34 2022 +0800

    net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv()

    Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with
    kfree_skb_reason().

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart 0b99c6c861 net: tcp: add skb drop reasons to tcp_add_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflicts:\
- In tcp.h due to missing commit f35f821935d8 ("tcp: defer skb freeing
  after socket lock is released") in C9S; which is fine btw as the chunk
  in tcp.h was later removed upstream by commit 68822bdf76f1 ("net:
  generalize skb freeing deferral to per-cpu lists").

commit 7a26dc9e7b43f5a24c4b843713e728582adf1c38
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:33 2022 +0800

    net: tcp: add skb drop reasons to tcp_add_backlog()

    Pass the address of drop_reason to tcp_add_backlog() to store the
    reason for an skb drop when it fails. The following drop reason is
    introduced:

    SKB_DROP_REASON_SOCKET_BACKLOG

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart de5f3d75e9 net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 643b622b51f1f0015e0a80f90b4ef9032e6ddb1b
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:32 2022 +0800

    net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()

    Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
    tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
    function fails. Therefore, the drop reason can be passed to
    kfree_skb_reason() when the skb needs to be freed.

    The following drop reasons are added:

    SKB_DROP_REASON_TCP_MD5NOTFOUND
    SKB_DROP_REASON_TCP_MD5UNEXPECTED
    SKB_DROP_REASON_TCP_MD5FAILURE

    SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
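
Illustration (not part of the commit): the out-parameter pattern described
in the tcp_v{4,6}_inbound_md5_hash() message above can be sketched in
userspace roughly as follows. This is a minimal mock, not the kernel code:
struct mock_segment, mock_inbound_md5_hash() and mock_rcv() are hypothetical
names, the enum is a reduced copy with illustrative values, and the mapping
of each case to a reason is an assumption based on the reason names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock subset of the kernel's enum skb_drop_reason; values are illustrative. */
enum skb_drop_reason {
	SKB_DROP_REASON_NOT_SPECIFIED,
	SKB_DROP_REASON_TCP_MD5NOTFOUND,
	SKB_DROP_REASON_TCP_MD5UNEXPECTED,
	SKB_DROP_REASON_TCP_MD5FAILURE,
};

/* Hypothetical, simplified stand-in for the state the real check inspects. */
struct mock_segment {
	bool key_configured;	/* an MD5 key is expected for this peer */
	bool option_present;	/* the segment carries an MD5 option */
	bool digest_matches;	/* computed digest equals the received one */
};

/* Returns true when the segment must be dropped, storing why in *reason. */
static bool mock_inbound_md5_hash(const struct mock_segment *seg,
				  enum skb_drop_reason *reason)
{
	assert(seg != NULL && reason != NULL);

	if (!seg->key_configured && !seg->option_present)
		return false;	/* nothing expected, nothing found: accept */
	if (seg->key_configured && !seg->option_present) {
		*reason = SKB_DROP_REASON_TCP_MD5NOTFOUND;
		return true;
	}
	if (!seg->key_configured && seg->option_present) {
		*reason = SKB_DROP_REASON_TCP_MD5UNEXPECTED;
		return true;
	}
	if (!seg->digest_matches) {
		*reason = SKB_DROP_REASON_TCP_MD5FAILURE;
		return true;
	}
	return false;
}

/* Caller side: the reason filled in by the callee is what gets reported. */
static enum skb_drop_reason mock_rcv(const struct mock_segment *seg)
{
	enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;

	if (mock_inbound_md5_hash(seg, &reason))
		return reason;	/* the kernel would pass this to kfree_skb_reason() */
	return SKB_DROP_REASON_NOT_SPECIFIED;
}
```

The caller initialises the reason to a generic value and only reads it back
when the check reports a drop, so a callee that accepts the segment never
has to touch the out-parameter.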
Antoine Tenart 21c3e93b20 net: tcp: add skb drop reasons to tcp_v4_rcv()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 255f9034d3050fb1d0691226712c6b7f1ca674cd
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Feb 20 15:06:30 2022 +0800

    net: tcp: add skb drop reasons to tcp_v4_rcv()

    Use kfree_skb_reason() for some paths in tcp_v4_rcv() that were missed
    before, including:

    SKB_DROP_REASON_SOCKET_FILTER
    SKB_DROP_REASON_XFRM_POLICY

    Reviewed-by: Mengen Sun <mengensun@tencent.com>
    Reviewed-by: Hao Peng <flyingpeng@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:22 +02:00
Antoine Tenart 36fdf75633 net: socket: rename SKB_DROP_REASON_SOCKET_FILTER
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 364df53c081d93fcfd6b91085ff2650c7f17b3c7
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Jan 27 17:13:01 2022 +0800

    net: socket: rename SKB_DROP_REASON_SOCKET_FILTER

    Rename SKB_DROP_REASON_SOCKET_FILTER, which is used
    as the reason for skb drops caused by the socket filter,
    before it becomes part of a released kernel. It will be
    used for more protocols than just TCP in a future series.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Link: https://lore.kernel.org/all/20220127091308.91401-2-imagedong@tencent.com/
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:21 +02:00
Felix Maurer de20724127 net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 91a760b26926265a60c77ddf016529bcf3e17a04
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Jan 6 21:20:20 2022 +0800

    net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()

    The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
    __inet_bind() is not handled properly. When the return value
    is non-zero, it sets inet_saddr and inet_rcv_saddr to 0 and
    exits:

            err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
            if (err) {
                    inet->inet_saddr = inet->inet_rcv_saddr = 0;
                    goto out_release_sock;
            }

    Take UDP as an example. A UDP socket is added to
    'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2'
    once sk->sk_prot->get_port() succeeds. If 'inet->inet_rcv_saddr'
    was specified, 'sk' ends up in an 'hslot2' of 'hash2' that it does
    not belong to (because inet_saddr is changed to 0), and received
    UDP packets are not passed to this sock. If 'inet->inet_rcv_saddr'
    was not specified, the sock works fine and receives packets
    properly, which is weird, as the 'bind()' has already failed.

    To undo the get_port() operation, introduce the 'put_port' field
    for 'struct proto'. For the TCP proto, it is inet_put_port(); for
    the UDP proto, it is udp_lib_unhash(); for the icmp proto, it is
    ping_unhash().

    Therefore, after sys_bind() fails because of
    BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket is unbound, which
    means that it can then be bound to another port.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 16:53:48 +02:00
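
Illustration (not part of the commit): the put_port undo described in the
POST_BIND message above can be sketched in userspace roughly as follows.
This is a minimal mock under stated assumptions: struct mock_proto,
struct mock_sock and mock_bind() are hypothetical names, the hash tables
are reduced to a single 'hashed' flag, and post_bind_err stands in for the
return value of the BPF program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_sock;

/* Mock of the two 'struct proto' hooks the commit message talks about. */
struct mock_proto {
	int  (*get_port)(struct mock_sock *sk, unsigned short port);
	void (*put_port)(struct mock_sock *sk);	/* undoes get_port() */
};

struct mock_sock {
	const struct mock_proto *prot;
	unsigned int rcv_saddr;	/* bound address, 0 when unbound */
	unsigned short num;	/* bound port, 0 when unbound */
	bool hashed;		/* stands in for membership in the hash tables */
};

static int mock_get_port(struct mock_sock *sk, unsigned short port)
{
	sk->num = port;
	sk->hashed = true;
	return 0;
}

static void mock_put_port(struct mock_sock *sk)
{
	sk->hashed = false;
	sk->num = 0;
}

static const struct mock_proto mock_udp_prot = {
	.get_port = mock_get_port,
	.put_port = mock_put_port,
};

/* post_bind_err is a hypothetical parameter replacing the BPF hook result. */
static int mock_bind(struct mock_sock *sk, unsigned int addr,
		     unsigned short port, int post_bind_err)
{
	assert(sk != NULL && sk->prot != NULL);

	sk->rcv_saddr = addr;
	if (sk->prot->get_port(sk, port)) {
		sk->rcv_saddr = 0;
		return -1;
	}
	if (post_bind_err) {
		/* Clear the address and also release the port again. */
		sk->rcv_saddr = 0;
		if (sk->prot->put_port)
			sk->prot->put_port(sk);
		return post_bind_err;
	}
	return 0;
}
```

Without the put_port() call the mock socket would stay hashed with a
cleared address after the failed bind, which is the inconsistent state the
commit message describes.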
Paolo Abeni 036c0e121e tcp: add accessors to read/set tp->snd_cwnd
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101465
Tested: LNST, Tier1

Upstream commit:
commit 40570375356c874b1578e05c1dcc3ff7c1322dbe
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 5 16:35:38 2022 -0700

    tcp: add accessors to read/set tp->snd_cwnd

    We had various bugs over the years with code
    breaking the assumption that tp->snd_cwnd is greater
    than zero.

    Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
    in commit 8b8a321ff7 ("tcp: fix zero cwnd in tcp_cwnd_reduction")
    can trigger, and without a repro we would have to spend
    considerable time finding the bug.

    Instead of complaining too late, we want to catch where
    and when tp->snd_cwnd is set to an illegal value.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Suggested-by: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-27 16:43:55 +02:00
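
Illustration (not part of the commit): the idea of routing every write to
snd_cwnd through an accessor that complains about illegal values can be
sketched in userspace roughly as follows. This is a minimal mock, not the
kernel helpers: struct mock_tcp_sock, the mock_ function names and the
warning counter are hypothetical, and treating values that are zero or
negative when viewed as a signed integer as illegal is an assumption drawn
from the "greater than zero" wording above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field of struct tcp_sock used here. */
struct mock_tcp_sock {
	uint32_t snd_cwnd;
};

/* Counts complaints; the kernel would emit a one-time warning instead. */
static unsigned int mock_warn_count;

/* Read accessor: a single place through which all reads go. */
static inline uint32_t mock_tcp_snd_cwnd(const struct mock_tcp_sock *tp)
{
	assert(tp != NULL);
	return tp->snd_cwnd;
}

/* Write accessor: flags the illegal value at the point where it is set. */
static inline void mock_tcp_snd_cwnd_set(struct mock_tcp_sock *tp, uint32_t val)
{
	assert(tp != NULL);
	if ((int32_t)val <= 0)
		mock_warn_count++;
	tp->snd_cwnd = val;
}
```

Because the check sits in the setter, a bad value is reported where it is
written rather than later, when some other code path trips over it.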
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00