Commit Graph

59 Commits

Author SHA1 Message Date
Felix Maurer 2acef5216c tcp_bpf: fix return value of tcp_bpf_sendmsg()
JIRA: https://issues.redhat.com/browse/RHEL-68071
JIRA: https://issues.redhat.com/browse/RHEL-59445
CVE: CVE-2024-46783

Conflicts:
- net/ipv4/tcp_bpf.c: We don't have dc97391e6610 ("sock: Remove
  ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)"), therefore we have
  both tcp_bpf_sendmsg() and tcp_bpf_sendpage(), where upstream only has
  tcp_bpf_sendmsg(). tcp_bpf_sendpage() follows a pattern similar to the one
  described in the patch: copied could become negative through
  tcp_bpf_send_verdict() and would be returned directly. Apply the same fix
  so that copied is only returned when its value is positive.
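
  A minimal sketch of that guarded return (illustrative only, not the exact
  diff applied to either function):

    err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
    ...
    /* 'copied' may have gone negative via the __SK_DROP handling in
     * tcp_bpf_send_verdict(); never report a negative value as the
     * number of bytes sent.
     */
    return copied > 0 ? copied : err;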

commit fe1910f9337bd46a9343967b547ccab26b4b2c6e
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Tue Aug 20 20:07:44 2024 -0700

    tcp_bpf: fix return value of tcp_bpf_sendmsg()

    When we cork messages in psock->cork, the last message that triggers the
    flushing results in sending an sk_msg larger than the current message
    size. In this case, in tcp_bpf_send_verdict(), 'copied' becomes
    negative at least in the following case:

    468         case __SK_DROP:
    469         default:
    470                 sk_msg_free_partial(sk, msg, tosend);
    471                 sk_msg_apply_bytes(psock, tosend);
    472                 *copied -= (tosend + delta); // <==== HERE
    473                 return -EACCES;

    Therefore, with such a value of 'copied' it could lead to the following
    BUG (thanks to syzbot). We should not use a negative 'copied' as the
    return value here.

      ------------[ cut here ]------------
      kernel BUG at net/socket.c:733!
      Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 UID: 0 PID: 3265 Comm: syz-executor510 Not tainted 6.11.0-rc3-syzkaller-00060-gd07b43284ab3 #0
      Hardware name: linux,dummy-virt (DT)
      pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      pc : sock_sendmsg_nosec net/socket.c:733 [inline]
      pc : sock_sendmsg_nosec net/socket.c:728 [inline]
      pc : __sock_sendmsg+0x5c/0x60 net/socket.c:745
      lr : sock_sendmsg_nosec net/socket.c:730 [inline]
      lr : __sock_sendmsg+0x54/0x60 net/socket.c:745
      sp : ffff800088ea3b30
      x29: ffff800088ea3b30 x28: fbf00000062bc900 x27: 0000000000000000
      x26: ffff800088ea3bc0 x25: ffff800088ea3bc0 x24: 0000000000000000
      x23: f9f00000048dc000 x22: 0000000000000000 x21: ffff800088ea3d90
      x20: f9f00000048dc000 x19: ffff800088ea3d90 x18: 0000000000000001
      x17: 0000000000000000 x16: 0000000000000000 x15: 000000002002ffaf
      x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: ffff8000815849c0 x9 : ffff8000815b49c0
      x8 : 0000000000000000 x7 : 000000000000003f x6 : 0000000000000000
      x5 : 00000000000007e0 x4 : fff07ffffd239000 x3 : fbf00000062bc900
      x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000fffffdef
      Call trace:
       sock_sendmsg_nosec net/socket.c:733 [inline]
       __sock_sendmsg+0x5c/0x60 net/socket.c:745
       ____sys_sendmsg+0x274/0x2ac net/socket.c:2597
       ___sys_sendmsg+0xac/0x100 net/socket.c:2651
       __sys_sendmsg+0x84/0xe0 net/socket.c:2680
       __do_sys_sendmsg net/socket.c:2689 [inline]
       __se_sys_sendmsg net/socket.c:2687 [inline]
       __arm64_sys_sendmsg+0x24/0x30 net/socket.c:2687
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall+0x48/0x110 arch/arm64/kernel/syscall.c:49
       el0_svc_common.constprop.0+0x40/0xe0 arch/arm64/kernel/syscall.c:132
       do_el0_svc+0x1c/0x28 arch/arm64/kernel/syscall.c:151
       el0_svc+0x34/0xec arch/arm64/kernel/entry-common.c:712
       el0t_64_sync_handler+0x100/0x12c arch/arm64/kernel/entry-common.c:730
       el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:598
      Code: f9404463 d63f0060 3108441f 54fffe81 (d4210000)
      ---[ end trace 0000000000000000 ]---

    Fixes: 4f738adba3 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
    Reported-by: syzbot+58c03971700330ce14d8@syzkaller.appspotmail.com
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://patch.msgid.link/20240821030744.320934-1-xiyou.wangcong@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2025-01-09 11:58:42 +01:00
Paolo Abeni c02fe63081 tcp_bpf: properly release resources on error paths
JIRA: https://issues.redhat.com/browse/RHEL-12593
Tested: vs jira reproducer
Conflicts: different context in tcp_bpf_recvmsg_parser() as rhel lacks \
  the upstream commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling \
  copied_seq"). The conflict could be resolved reusing the existing \
  'out' label, but the present option has been chosen to explicitly \
  cause a conflict in should e5c6de5fa025 be later backported and \
  hopefully help handling the OoO backport correctly.

Upstream commit:
commit 68b54aeff804acceb02f228ea2e28419272c1fb9
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Oct 17 17:49:51 2023 +0200

    tcp_bpf: properly release resources on error paths

    In the blamed commit below, I completely forgot to release the acquired
    resources before erroring out in the TCP BPF code, as reported by Dan.

    Address the issues by replacing the bogus return with a jump to the
    relevant cleanup code.
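
    The shape of the fix (illustrative sketch; the label name is assumed,
    not the exact upstream diff):

        data = tcp_msg_wait_data(sk, psock, timeo);
        if (data < 0) {
                copied = data;
                goto unlock;    /* was a bare 'return data;', which skipped
                                 * release_sock()/sk_psock_put() */
        }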

    Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting")
    Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/r/8f99194c698bcef12666f0a9a999c58f8b1cb52c.1697557782.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-23 16:47:47 +02:00
Paolo Abeni 0240ed7c51 tcp: allow again tcp_disconnect() when threads are waiting
JIRA: https://issues.redhat.com/browse/RHEL-12593
Tested: vs bz reproducer
Conflicts: the tls_sw chunk is mangled and applied in \
  tls_rx_reader_acquire(), as rhel lacks the upstream commit \
  f9ae3204fb45 ("net/tls: split  tls_rx_reader_lock"). \
  the wait_on_pending_writer() chunk did not contain the ONCE \
  annotation, as rhel lacks the upstream commit d0ac89f6f987 ("net: \
  deal with most data-races in sk_wait_event()"). The same for \
  sk_stream_wait_memory() chunk.

Upstream commit:
commit 419ce133ab928ab5efd7b50b2ef36ddfd4eadbd2
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Oct 11 09:20:55 2023 +0200

    tcp: allow again tcp_disconnect() when threads are waiting

    As reported by Tom, .NET and applications built on top of it rely
    on connect(AF_UNSPEC) to asynchronously cancel pending I/O operations
    on a TCP socket.

    The blamed commit below caused a regression, as such cancellation
    can now fail.

    As suggested by Eric, this change addresses the problem by explicitly
    causing blocking I/O operations to terminate immediately (with an error)
    when a concurrent disconnect() is executed.

    Instead of tracking the number of threads blocked on a given socket,
    track the number of disconnect() issued on such socket. If such counter
    changes after a blocking operation releasing and re-acquiring the socket
    lock, error out the current operation.
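
    A sketch of the idea (field name and placement assumed for illustration,
    not the exact upstream diff):

        int dis = READ_ONCE(sk->sk_disconnects);    /* before sleeping */

        /* ... blocking wait releases and re-acquires the socket lock ... */

        if (READ_ONCE(sk->sk_disconnects) != dis)
                err = -EPIPE;    /* a concurrent disconnect() ran meanwhile */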

    Fixes: 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting")
    Reported-by: Tom Deseyn <tdeseyn@redhat.com>
    Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-23 16:47:41 +02:00
Felix Maurer 2af4cdfa1b bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- net/ipv4/udp_bpf.c: context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities")
- net/ipv4/tcp_bpf.c: context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities")

commit d900f3d20cc3169ce42ec72acc850e662a4d4db2
Author: Liu Jian <liujian56@huawei.com>
Date:   Fri Mar 3 16:09:46 2023 +0800

    bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()

    When the buffer length of the recvmsg system call is 0, we get the
    following soft lockup problem:

    watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
    CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
    RIP: 0010:remove_wait_queue+0xb/0xc0
    Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
    RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
    RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
    RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
    R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
    R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
    FS:  00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     tcp_msg_wait_data+0x279/0x2f0
     tcp_bpf_recvmsg_parser+0x3c6/0x490
     inet_recvmsg+0x280/0x290
     sock_recvmsg+0xfc/0x120
     ____sys_recvmsg+0x160/0x3d0
     ___sys_recvmsg+0xf0/0x180
     __sys_recvmsg+0xea/0x1a0
     do_syscall_64+0x3f/0x90
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    The logic in tcp_bpf_recvmsg_parser is as follows:

    msg_bytes_ready:
            copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
            if (!copied) {
                    wait data;
                    goto msg_bytes_ready;
            }

    In this case, "copied" always is 0, the infinite loop occurs.

    According to the Linux system call man page, 0 should be returned in this
    case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, return
    directly. Also modify several other functions with the same problem.
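
    A minimal sketch of the early return added to the affected ->recvmsg()
    handlers (placement illustrative):

        if (unlikely(!len))
                return 0;    /* recv(2): a zero-length read returns 0 */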

    Fixes: 1f5be6b3b0 ("udp: Implement udp_bpf_recvmsg() for sockmap")
    Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
    Fixes: c5d2177a72a1 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-14 10:44:28 +02:00
Jan Stancek 704d11b087 Merge: enable io_uring
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2375

# Merge Request Required Information

## Summary of Changes
This MR updates RHEL's io_uring implementation to match upstream v6.2 + fixes (Fixes: tags and Cc: stable commits).  The final 3 RHEL-only patches disable the feature by default, taint the kernel when it is enabled, and turn on the config option.

## Approved Development Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2170014
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2214
Omitted-fix: b0b7a7d24b66 ("io_uring: return back links tw run optimisation")
  This is actually just an optimization, and it has non-trivial conflicts
  which would require additional backports to resolve. Skip it.
Omitted-fix: 7d481e035633 ("io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce")
  This fix is incorrectly tagged.  The code that it applies to is not present in our tree.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: John Meneghini <jmeneghi@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Maurizio Lombardi <mlombard@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-05-17 07:47:08 +02:00
Jeff Moyer 2c3731e9b2 net: remove SOCK_SUPPORT_ZC from sockmap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit fee9ac06647e59a69fb7aec58f25267c134264b4
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Oct 27 00:25:57 2022 +0100

    net: remove SOCK_SUPPORT_ZC from sockmap
    
    sockmap replaces ->sk_prot with its own callbacks, so we should remove
    SOCK_SUPPORT_ZC, as the new proto doesn't support msghdr::ubuf_info.
    
    Cc: <stable@vger.kernel.org> # 6.0
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Fixes: e993ffe3da4bc ("net: flag sockets supporting msghdr originated zerocopy")
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-05-05 15:24:12 -04:00
Jiri Benc 04ea5b929a bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
Bugzilla: https://bugzilla.redhat.com/2177177

commit ddce1e091757d0259107c6c0c7262df201de2b66
Author: Jakub Sitnicki <jakub@cloudflare.com>
Date:   Sat Jan 21 13:41:44 2023 +0100

    bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
    
    A listening socket linked to a sockmap has its sk_prot overridden. It
    points to one of the struct proto variants in tcp_bpf_prots. The variant
    depends on the socket's family and which sockmap programs are attached.
    
    A child socket cloned from a TCP listener initially inherits the
    listener's sk_prot.
    But before cloning is finished, we restore the child's proto to the
    listener's original non-tcp_bpf_prots one. This happens in
    tcp_create_openreq_child -> tcp_bpf_clone.
    
    Today, in tcp_bpf_clone we detect if the child's proto should be restored
    by checking only for the TCP_BPF_BASE proto variant. This is not
    correct. The sk_prot of a listening socket linked to a sockmap can point
    to any variant in tcp_bpf_prots.

    If the listener's sk_prot happens not to be the TCP_BPF_BASE variant, then
    the child socket is unintentionally left with the inherited sk_prot by
    tcp_bpf_clone.
    
    This leads to issues like infinite recursion on close [1], because the
    child state is otherwise not set up for use with tcp_bpf_prot operations.
    
    Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants.
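
    A sketch of such a membership check (helper name assumed, not the exact
    upstream diff):

        /* Is this proto any entry of tcp_bpf_prots[][]? */
        static bool is_tcp_bpf_prot(struct proto *prot)
        {
                int i, j;

                for (i = 0; i < ARRAY_SIZE(tcp_bpf_prots); i++)
                        for (j = 0; j < ARRAY_SIZE(tcp_bpf_prots[i]); j++)
                                if (prot == &tcp_bpf_prots[i][j])
                                        return true;
                return false;
        }

        void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
        {
                /* restore for any tcp_bpf variant, not only TCP_BPF_BASE */
                if (is_tcp_bpf_prot(newsk->sk_prot))
                        newsk->sk_prot = sk->sk_prot_creator;
        }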
    
    Note that it wouldn't be sufficient to check the socket state when
    overriding the sk_prot in tcp_bpf_update_proto in order to always use the
    TCP_BPF_BASE variant for listening sockets. Since commit
    b8b8315e39ff ("bpf, sockmap: Remove unhash handler for BPF sockmap usage")
    it is possible for a socket to transition to TCP_LISTEN state while already
    linked to a sockmap, e.g. connect() -> insert into map ->
    connect(AF_UNSPEC) -> listen().
    
    [1]: https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/
    
    Fixes: e80251555f ("tcp_bpf: Don't let child socket inherit parent protocol ops on copy")
    Reported-by: syzbot+04c21ed96d861dccc5cd@syzkaller.appspotmail.com
    Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-2-1e0ee7ac2f90@cloudflare.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2023-04-28 11:43:23 +02:00
Jerome Marchand 3318134b4d bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect
Bugzilla: https://bugzilla.redhat.com/2177177

commit 9072931f020bfd907d6d89ee21ff1481cd78b407
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Nov 29 18:40:40 2022 +0800

    bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect

    When apply_bytes is used on an ingress redirect and apply_bytes is less
    than the length of the msg data, some data may be skipped and lost in
    bpf_tcp_ingress().

    If there is still data in the scatterlist that has not been consumed,
    we cannot move the msg iter.

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1669718441-2654-4-git-send-email-yangpc@wangsu.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:13 +02:00
Jerome Marchand f79873d0b0 bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
Bugzilla: https://bugzilla.redhat.com/2177177

commit a351d6087bf7d3d8440d58d3bf244ec64b89394a
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Nov 29 18:40:39 2022 +0800

    bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes

    When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
    flag from msg->flags. If apply_bytes is used and it is larger than
    the current data being processed, sk_psock_msg_verdict() will not be
    called when sendmsg() is called again. At this point msg->flags is 0,
    and we have lost the BPF_F_INGRESS flag.

    So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
    redirecting.
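
    A sketch of the idea (field name and parameter shape assumed):

        /* In sk_psock_msg_verdict(): remember the verdict's ingress flag */
        psock->redir_ingress = sk_msg_to_ingress(msg);

        /* Later sends that reuse the cached verdict pass the saved flag
         * instead of reading msg->flags (which may already be 0):
         */
        ret = tcp_bpf_sendmsg_redir(sk_redir, psock->redir_ingress,
                                    msg, tosend, flags);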

    Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:13 +02:00
Jerome Marchand fd0bd4e708 bpf, sockmap: Fix repeated calls to sock_put() when msg has more_data
Bugzilla: https://bugzilla.redhat.com/2177177

commit 7a9841ca025275b5b0edfb0b618934abb6ceec15
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Nov 29 18:40:38 2022 +0800

    bpf, sockmap: Fix repeated calls to sock_put() when msg has more_data

    In tcp_bpf_send_verdict() redirection, the eval variable is assigned
    __SK_REDIRECT after the apply_bytes data is sent; if msg has more_data,
    sock_put() will be called multiple times.

    We should reset the eval variable to __SK_NONE every time more_data
    starts.
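
    A sketch of the reset (placement illustrative, not the exact diff):

        if (msg && msg->sg.size) {          /* msg still has unsent data */
                psock->eval = __SK_NONE;    /* drop the stale verdict */
                goto more_data;
        }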

    This causes:

    IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7
    ------------[ cut here ]------------
    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110
    Modules linked in:
    CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1
    Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
    Call Trace:
     <TASK>
     __tcp_transmit_skb+0xa1b/0xb90
     ? __alloc_skb+0x8c/0x1a0
     ? __kmalloc_node_track_caller+0x184/0x320
     tcp_write_xmit+0x22a/0x1110
     __tcp_push_pending_frames+0x32/0xf0
     do_tcp_sendpages+0x62d/0x640
     tcp_bpf_push+0xae/0x2c0
     tcp_bpf_sendmsg_redir+0x260/0x410
     ? preempt_count_add+0x70/0xa0
     tcp_bpf_send_verdict+0x386/0x4b0
     tcp_bpf_sendmsg+0x21b/0x3b0
     sock_sendmsg+0x58/0x70
     __sys_sendto+0xfa/0x170
     ? xfd_validate_state+0x1d/0x80
     ? switch_fpu_return+0x59/0xe0
     __x64_sys_sendto+0x24/0x30
     do_syscall_64+0x37/0x90
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1669718441-2654-2-git-send-email-yangpc@wangsu.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:13 +02:00
Artem Savkov 5120f0250e bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
Bugzilla: https://bugzilla.redhat.com/2137876

commit d8616ee2affcff37c5d315310da557a694a3303d
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Tue May 24 15:53:11 2022 +0800

    bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
    
    During TCP sockmap redirect pressure test, the following warning is triggered:
    
    WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
    CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
    Call Trace:
     inet_csk_destroy_sock+0x55/0x110
     inet_csk_listen_stop+0xbb/0x380
     tcp_close+0x41b/0x480
     inet_release+0x42/0x80
     __sock_release+0x3d/0xa0
     sock_close+0x11/0x20
     __fput+0x9d/0x240
     task_work_run+0x62/0x90
     exit_to_user_mode_prepare+0x110/0x120
     syscall_exit_to_user_mode+0x27/0x190
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    The reason we observed this is:

    When the listener is closing, a connection may have completed the
    three-way handshake but not yet been accepted, and the client has sent
    some packets. The child sks in the accept queue are released by
    inet_child_forget()->inet_csk_destroy_sock(), but the psocks of the
    child sks have not been released.
    
    To fix, add sock_map_destroy to release psocks.
    
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220524075311.649153-1-wangyufen@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:29 +01:00
Yauheni Kaliuta 1780186ed1 bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
Bugzilla: http://bugzilla.redhat.com/2120968

commit 8ec95b94716a1e4d126edc3fb2bc426a717e2dba
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Tue Nov 1 09:31:36 2022 +0800

    bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
    
    When running `test_sockmap` selftests, the following warning appears:
    
      WARNING: CPU: 2 PID: 197 at net/core/stream.c:205 sk_stream_kill_queues+0xd3/0xf0
      Call Trace:
      <TASK>
      inet_csk_destroy_sock+0x55/0x110
      tcp_rcv_state_process+0xd28/0x1380
      ? tcp_v4_do_rcv+0x77/0x2c0
      tcp_v4_do_rcv+0x77/0x2c0
      __release_sock+0x106/0x130
      __tcp_close+0x1a7/0x4e0
      tcp_close+0x20/0x70
      inet_release+0x3c/0x80
      __sock_release+0x3a/0xb0
      sock_close+0x14/0x20
      __fput+0xa3/0x260
      task_work_run+0x59/0xb0
      exit_to_user_mode_prepare+0x1b3/0x1c0
      syscall_exit_to_user_mode+0x19/0x50
      do_syscall_64+0x48/0x90
      entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    The root cause is in commit 84472b436e76 ("bpf, sockmap: Fix more uncharged
    while msg has more_data"), where I used msg->sg.size to replace tosend,
    causing breakage:
    
      if (msg->apply_bytes && msg->apply_bytes < tosend)
        tosend = psock->apply_bytes;
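
    The fix makes the condition use the same source as the assignment
    (sketch of the corrected form):

      if (psock->apply_bytes && psock->apply_bytes < tosend)
        tosend = psock->apply_bytes;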
    
    Fixes: 84472b436e76 ("bpf, sockmap: Fix more uncharged while msg has more_data")
    Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1667266296-8794-1-git-send-email-wangyufen@huawei.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:11 +02:00
Jiri Benc 5fcd440c95 bpf, sockmap: Fix double uncharge the mem of sk_msg
Bugzilla: https://bugzilla.redhat.com/2120966

commit 2486ab434b2c2a14e9237296db00b1e1b7ae3273
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Fri Mar 4 16:11:45 2022 +0800

    bpf, sockmap: Fix double uncharge the mem of sk_msg

    If tcp_bpf_sendmsg is running during a tear down operation, psock may be
    freed.

    tcp_bpf_sendmsg()
     tcp_bpf_send_verdict()
      sk_msg_return()
      tcp_bpf_sendmsg_redir()
       unlikely(!psock))
         sk_msg_free()

    The mem of msg has already been uncharged in tcp_bpf_send_verdict() by
    sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock
    is null, we can simply return an error code; this then triggers
    sk_msg_free_nocharge in the error path of __SK_REDIRECT and has the side
    effect of throwing an error up to user space. This would be a slight
    change in behavior from the user side, but would look the same as an
    error if the redirect on the socket threw an error.
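
    A sketch of the resulting check in tcp_bpf_sendmsg_redir() (illustrative,
    error code assumed):

        psock = sk_psock_get(sk);
        if (unlikely(!psock))
                return -EPIPE;    /* no sk_msg_free() here: the caller's
                                   * __SK_REDIRECT error path uncharges via
                                   * sk_msg_free_nocharge() */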

    This issue can cause the following info:
    WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
    Call Trace:
     <TASK>
     __sk_destruct+0x24/0x1f0
     sk_psock_destroy+0x19b/0x1c0
     process_one_work+0x1b3/0x3c0
     worker_thread+0x30/0x350
     ? process_one_work+0x3c0/0x3c0
     kthread+0xe6/0x110
     ? kthread_complete_and_exit+0x20/0x20
     ret_from_fork+0x22/0x30
     </TASK>

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:01 +02:00
Jiri Benc 490dac177d bpf, sockmap: Fix more uncharged while msg has more_data
Bugzilla: https://bugzilla.redhat.com/2120966

commit 84472b436e760ba439e1969a9e3c5ae7c86de39d
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Fri Mar 4 16:11:44 2022 +0800

    bpf, sockmap: Fix more uncharged while msg has more_data

    In tcp_bpf_send_verdict(), if msg has more data after
    tcp_bpf_sendmsg_redir():

    tcp_bpf_send_verdict()
     tosend = msg->sg.size  //msg->sg.size = 22220
     case __SK_REDIRECT:
      sk_msg_return()  //uncharged msg->sg.size(22220) sk->sk_forward_alloc
      tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000
     goto more_data;
     tosend = msg->sg.size  //msg->sg.size = 11000
     case __SK_REDIRECT:
      sk_msg_return()  //uncharged msg->sg.size(11000) to sk->sk_forward_alloc

    The msg->sg.size (11000) has been uncharged twice; to fix this, we can
    charge the remaining msg->sg.size before goto more_data.
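
    A sketch of the re-charge before looping back (placement illustrative):

        if (msg && msg->sg.size) {
                if (eval == __SK_REDIRECT)
                        sk_mem_charge(sk, msg->sg.size); /* re-charge remainder */
                goto more_data;
        }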

    This issue can cause the following info:
    WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
    Call Trace:
     <TASK>
     inet_csk_destroy_sock+0x55/0x110
     __tcp_close+0x279/0x470
     tcp_close+0x1f/0x60
     inet_release+0x3f/0x80
     __sock_release+0x3d/0xb0
     sock_close+0x11/0x20
     __fput+0x92/0x250
     task_work_run+0x6a/0xa0
     do_exit+0x33b/0xb60
     do_group_exit+0x2f/0xa0
     get_signal+0xb6/0x950
     arch_do_signal_or_restart+0xac/0x2a0
     ? vfs_write+0x237/0x290
     exit_to_user_mode_prepare+0xa9/0x200
     syscall_exit_to_user_mode+0x12/0x30
     do_syscall_64+0x46/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
     </TASK>

    WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
    Call Trace:
     <TASK>
     __sk_destruct+0x24/0x1f0
     sk_psock_destroy+0x19b/0x1c0
     process_one_work+0x1b3/0x3c0
     worker_thread+0x30/0x350
     ? process_one_work+0x3c0/0x3c0
     kthread+0xe6/0x110
     ? kthread_complete_and_exit+0x20/0x20
     ret_from_fork+0x22/0x30
     </TASK>

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:01 +02:00
Hangbin Liu 92ae9687c5 sock: redo the psock vs ULP protection check
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit e34a07c0ae39

commit e34a07c0ae3906f97eb18df50902e2a01c1015b6
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Jun 20 12:13:53 2022 -0700

    sock: redo the psock vs ULP protection check

    Commit 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to
    the new tcp_bpf_update_proto() function. I'm guessing that this
    was done to allow creating psocks for non-inet sockets.

    Unfortunately the destruction path for psock includes the ULP
    unwind, so we need to fail the sk_psock_init() itself.
    Otherwise if ULP is already present we'll notice that later,
    and call tcp_update_ulp() with the sk_proto of the ULP
    itself, which will most likely result in the ULP looping
    its callbacks.
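
    A sketch of the check moving back into sk_psock_init() (illustrative):

        /* sk_psock_init(), early on, before overriding anything: */
        if (inet_csk_has_ulp(sk)) {
                psock = ERR_PTR(-EINVAL);    /* fail early: ULP already set */
                goto out;
        }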

    Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 14:15:22 +08:00
Felix Maurer 095298045b bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619

commit 5b2c5540b8110eea0d67a78fb0ddb9654c58daeb
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Tue Jan 4 12:59:18 2022 -0800

    bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()

    Applications can be slightly confused because we do not always return the
    same error code as expected, e.g. what the TCP stack normally returns. For
    example, on a socket error (sk->sk_err), instead of returning the
    sock_error we return EAGAIN. This usually means the application will 'try
    again' instead of aborting immediately. Another example: when a shutdown
    event is received we should immediately abort instead of waiting for data
    when the user provides a timeout.

    These tend not to be fatal and applications usually recover, but they
    introduce bogus errors to the user or add unexpected latency. Before
    'c5d2177a72a16' we fell back to the TCP stack when no data was available,
    so we managed to catch many of the cases here, although with the extra
    latency cost of calling tcp_msg_wait_data() first.

    To fix this, let's duplicate the TCP stack's error handling in tcp_bpf so
    that we get the same error codes.
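
    A sketch of the duplicated checks, mirroring tcp_recvmsg() (illustrative,
    not the exact diff):

        if (sk->sk_err) {
                copied = sock_error(sk);
                goto out;
        }
        if (sk->sk_shutdown & RCV_SHUTDOWN)
                goto out;                   /* returns 0: orderly shutdown */
        if (sk->sk_state == TCP_CLOSE) {
                copied = -ENOTCONN;
                goto out;
        }
        if (!timeo) {
                copied = -EAGAIN;
                goto out;
        }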

    These were found in our CI tests that run applications against sockmap
    and do longer lived testing, at least compared to test_sockmap that
    does short-lived ping/pong tests, and in some of our test clusters
    we deploy.

    It's non-trivial to do these as shorter-form CI tests that would be
    appropriate for BPF selftests, but we are looking into it so we can
    ensure this keeps working going forward. As a preview, one idea is to
    pull in the packetdrill testing, which catches some of this.

    Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-06-07 20:22:42 +02:00
Felix Maurer 4ff960fab1 bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619

commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Wed Nov 3 13:47:34 2021 -0700

    bpf, sockmap: Fix race in ingress receive verdict with redirect to self

    A socket in a sockmap may have different combinations of programs attached
    depending on configuration. There can be no programs, in which case the socket
    acts as a sink only. There can be a TX program, in which case a BPF program is
    attached to the sending side but no RX program is attached. There can be an RX
    program only, where sends have no BPF program attached but receives are hooked
    with BPF. And finally, both TX and RX programs may be attached. This gives us
    the permutations:

     None, Tx, Rx, and TxRx

    To date most of our use cases have been the TX case, used as a fast datapath
    to copy directly between a local application and a userspace proxy, or Rx and
    TxRx applications that operate an in-kernel proxy. The traffic in the first
    case, where we hook applications into a userspace application, looks like
    this:

      AppA  redirect   AppB
       Tx <-----------> Rx
       |                |
       +                +
       TCP <--> lo <--> TCP

    In this case all traffic from AppA (after 3whs) is copied into the AppB
    ingress queue and no traffic is ever on the TCP receive_queue.

    In the second case the application never receives, except in some rare error
    cases, traffic on the actual user space socket. Instead the send happens in
    the kernel.

               AppProxy       socket pool
           sk0 ------------->{sk1,sk2, skn}
            ^                      |
            |                      |
            |                      v
           ingress              lb egress
           TCP                  TCP

    Here because traffic is never read off the socket with userspace recv() APIs
    there is only ever one reader on the sk receive_queue. Namely the BPF programs.

    However, we've started to introduce a third configuration where the BPF program
    on receive should process the data, but then the normal case is to push the
    data into the receive queue of AppB.

           AppB
           recv()                (userspace)
         -----------------------
           tcp_bpf_recvmsg()     (kernel)
             |             |
             |             |
             |             |
           ingress_msgQ    |
             |             |
           RX_BPF          |
             |             |
             v             v
           sk->receive_queue

    This is different from the App{A,B} redirect because traffic is first received
    on the sk->receive_queue.

    Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
    queue for any data handled by the BPF rx program and returned with a PASS code
    so that it was enqueued on the ingress msg queue. Then, if no data exists on
    that queue, it checks the socket receive queue. Unfortunately, this is the same
    receive_queue the BPF program is reading data off of. So we get a race. It's
    possible for the recvmsg() hook to pull data off the receive_queue before the
    BPF hook has a chance to read it. It typically happens when an application is
    banging on recv() and getting EAGAINs, until they manage to race with the RX
    BPF program.

    To fix this we note that, before this patch, at attach time when the socket is
    loaded into the map we check if it needs a TX program or just the base set of
    proto bpf hooks. It then uses the above general RX hook regardless of whether
    a BPF program is attached at rx or not. This patch now extends this check to
    handle all cases enumerated above: TX, RX, TXRX, and none. And to fix the above
    race when an RX program is attached, we use a new hook that is nearly identical
    to the old one except that now we do not let the recv() call skip the RX BPF
    program. Now only the BPF program pulls data from sk->receive_queue and recv()
    only pulls data from the ingress msgQ after BPF program handling.

    With this resolved, our AppB from above has been up and running for many hours
    without detecting any errors. We do this by correlating counters in RX BPF
    events and in AppB to ensure data never skips the BPF program. Selftests were
    not able to detect this because we only run them for a short period of time
    on well-ordered send/recvs, so we don't get any of the noise we see in real
    application environments.

    Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Jussi Maki <joamaki@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-06-07 20:22:41 +02:00
Felix Maurer 9f295aea23 bpf, sockmap: Remove unhash handler for BPF sockmap usage
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619

commit b8b8315e39ffaca82e79d86dde26e9144addf66b
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Wed Nov 3 13:47:33 2021 -0700

    bpf, sockmap: Remove unhash handler for BPF sockmap usage

    We do not need to handle unhash from the BPF side; we can simply wait for the
    close to happen. The original concern was that a socket could transition from
    ESTABLISHED state to a new state while the BPF hook was still attached.
    But we convinced ourselves this is no longer possible, and we also improved
    BPF sockmap to handle listen sockets, so this is no longer a problem.

    More importantly, though, there are cases where unhash is called while data is
    in the receive queue. The BPF unhash logic will flush this data, which is
    wrong. To be correct it should keep the data in the receive queue and allow
    a receiving application to continue reading the data. This may happen when
    tcp_abort() is received, for example. Instead of complicating the logic in
    unhash, simply moving all this to the tcp_close() hook solves this.

    Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Jussi Maki <joamaki@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-06-07 20:22:41 +02:00
Jiri Benc 4a6d693164 tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit cd9733f5d75c94a32544d6ce5be47e14194cf137
Author: Liu Jian <liujian56@huawei.com>
Date:   Tue Oct 12 13:20:19 2021 +0800

    tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function

    With two msgs, msgA and msgB, and a user doing nonblocking sendmsg calls (or
    using multiple cores) on a single socket 'sk', we could get the following flow.

     msgA, sk                               msgB, sk
     -----------                            ---------------
     tcp_bpf_sendmsg()
     lock(sk)
     psock = sk->psock
                                            tcp_bpf_sendmsg()
                                            lock(sk) ... blocking
    tcp_bpf_send_verdict
    if (psock->eval == NONE)
       psock->eval = sk_psock_msg_verdict
     ..
     < handle SK_REDIRECT case >
       release_sock(sk)                     < lock dropped so grab here >
       ret = tcp_bpf_sendmsg_redir
                                            psock = sk->psock
                                            tcp_bpf_send_verdict
     lock_sock(sk) ... blocking on B
                                            if (psock->eval == NONE) <- boom.
                                             psock->eval will have msgA state

    The problem here is we dropped the lock on msgA and grabbed it with msgB.
    Now we have old state in psock and importantly psock->eval has not been
    cleared. So msgB will run whatever action was done on A and the verdict
    program may never see it.

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20211012052019.184398-1-liujian56@huawei.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:54 +02:00
Jiri Benc 7a0400e360 skmsg: Extract and reuse sk_msg_is_readable()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit fb4e0a5e73d4bb5ab69b7905abd2ec3b580e9b59
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Fri Oct 8 13:33:04 2021 -0700

    skmsg: Extract and reuse sk_msg_is_readable()

    tcp_bpf_sock_is_readable() is pretty much generic;
    we can extract it and reuse it for non-TCP sockets.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211008203306.37525-3-xiyou.wangcong@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
Jiri Benc df10d51307 net: Rename ->stream_memory_read to ->sock_is_readable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- [minor] Context difference in struct proto due to missing 6c302e799a0d
  "net: forward_alloc_get depends on CONFIG_MPTCP".
- [minor] Context difference in sock.h due to out of order backport of
  4c1e34c0dbff "vsock: Enable y2038 safe timeval for timeout".

commit 7b50ecfcc6cdfe87488576bc3ed443dc8d083b90
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Fri Oct 8 13:33:03 2021 -0700

    net: Rename ->stream_memory_read to ->sock_is_readable

    The proto ops ->stream_memory_read() is currently only used
    by TCP to check whether the psock queue is empty or not. We need
    to rename it before reusing it for non-TCP protocols, and
    adjust the existing users accordingly.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211008203306.37525-2-xiyou.wangcong@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
John Fastabend 228a4a7ba8 bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats
The proc socket stats use the sk_prot->inuse_idx value to record in-use sock
stats. We currently do not set this correctly from the sockmap side. The
result is that reading the sock stats in '/proc/net/sockstat' gives incorrect
values. The socket counter is incremented correctly, but because we don't set
the counter correctly when we replace sk_prot, we may omit the decrement.

To get the correct inuse_idx value move the core_initcall that initializes
the TCP proto handlers to late_initcall. This way it is initialized after
TCP has the chance to assign the inuse_idx value from the register protocol
handler.
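
A one-line sketch of the change (illustrative):

  -core_initcall(tcp_bpf_v4_build_proto);
  +late_initcall(tcp_bpf_v4_build_proto);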

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20210712195546.423990-3-john.fastabend@gmail.com
2021-07-15 19:54:22 +02:00
Jakub Kicinski b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Cong Wang 9f2470fbc4 skmsg: Improve udp_bpf_recvmsg() accuracy
I tried to reuse sk_msg_wait_data() for different protocols,
but it turns out it cannot simply be reused. For example,
UDP actually uses two queues to receive skbs:
udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
to check both of them to know whether we have received any
packet.

Also, UDP does not lock the sock during BH Rx path, it makes
no sense for its ->recvmsg() to lock the sock. It is always
possible for ->recvmsg() to be called before packets actually
arrive in the receive queue, we just use best effort to make
it accurate here.
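
A sketch of the data-availability check UDP needs (shape assumed, not the
exact diff):

  /* Either UDP queue containing skbs means data has arrived. */
  if (!skb_queue_empty_lockless(&udp_sk(sk)->reader_queue) ||
      !skb_queue_empty_lockless(&sk->sk_receive_queue))
          return 1;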

Fixes: 1f5be6b3b0 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com
2021-06-21 16:48:11 +02:00
Cong Wang c49661aa6f skmsg: Remove unused parameters of sk_msg_wait_data()
'err' and 'flags' are not used, we can just get rid of them.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210517022348.50555-1-xiyou.wangcong@gmail.com
2021-05-18 16:44:19 +02:00
David S. Miller 5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Cong Wang 51e0158a54 skmsg: Pass psock pointer to ->psock_update_sk_prot()
Using sk_psock() to retrieve the psock pointer from the sock requires the
RCU read lock, but we already have the psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it in without
bothering with sk_psock().

Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com
2021-04-12 17:34:27 +02:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Cong Wang 2bc793e327 skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
Although these two functions are only used by TCP, they are not
specific to TCP at all; both operate on skmsg and ingress_msg,
so they fit in net/core/skmsg.c very well.

And we will need them for non-TCP, so rename and move them to
skmsg.c and export them to modules.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang 8a59f9d1e3 sock: Introduce sk->sk_prot->psock_update_sk_prot()
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang b01fd6e802 skmsg: Introduce a spinlock to protect ingress_msg
Currently we rely on lock_sock to protect ingress_msg;
it is too heavyweight for this, and we can actually just use a
spinlock to protect this list, as we do for other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking,
it still has to use lock_sock.
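
A sketch of a queueing helper with the new lock (names assumed):

  static inline void sk_psock_queue_msg(struct sk_psock *psock,
                                        struct sk_msg *msg)
  {
          spin_lock_bh(&psock->ingress_lock);
          list_add_tail(&msg->list, &psock->ingress_msg);
          spin_unlock_bh(&psock->ingress_lock);
  }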

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang 887596095e bpf: Clean up sockmap related Kconfigs
As suggested by John, clean up sockmap related Kconfigs:

Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
parser, to reflect its name.

Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
and CONFIG_INET, the latter is still needed at this point because
of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
as it is used by non-sockmap cases.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
John Fastabend 36cd0e696a bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.

We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
2020-11-18 00:12:34 +01:00
John Fastabend c9c89dcd87 bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
If copy_page_to_iter() fails, or even partially completes but with fewer
bytes copied than expected, we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before returning an error: we leave the copied data in the user buffer
and fail to unwind the scatterlist, so the kernel side believes data
has been copied and the user side believes data has _not_ been received.

Expected behavior should be to return the number of bytes copied and then
on the next read return the error, assuming it is still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.

The error is rare enough that in my normal testing with server-side
programs, such as nginx, httpd, envoy, etc., I have never seen it. The
only reliable way to reproduce it that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.

When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.

The other way to try and reproduce this is to run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
2020-11-18 00:12:10 +01:00
Lorenz Bauer 7b219da43f net: sk_msg: Simplify sk_psock initialization
Initializing psock->sk_proto and other saved callbacks is only
done in sk_psock_update_proto, after sk_psock_init has returned.
The logic for this is difficult to follow, and needlessly complex.

Instead, initialize psock->sk_proto whenever we allocate a new
psock. Additionally, assert the following invariants:

* The SK has no ULP: ULP does its own finagling of sk->sk_prot
* sk_user_data is unused: we need it to store sk_psock

Protect our access to sk_user_data with sk_callback_lock, which
is what other users like reuseport arrays, etc. do.

The result is that an sk_psock is always fully initialized, and
that psock->sk_proto is always the "original" struct proto.
The latter allows us to use psock->sk_proto when initializing
IPv6 TCP / UDP callbacks for sockmap.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Sabrina Dubroca 2c7269b231 bpf: tcp: Recv() should return 0 when the peer socket is closed
If the peer is closed, we will never get more data, so
tcp_bpf_wait_data will get stuck forever. In case we passed
MSG_DONTWAIT to recv(), we get EAGAIN but we should actually get
0.

From man 2 recv:

    RETURN VALUE

    When a stream socket peer has performed an orderly shutdown, the
    return value will be 0 (the traditional "end-of-file" return).

This patch makes tcp_bpf_wait_data always return 1 when the peer
socket has been shutdown. Either we have data available, and it would
have returned 1 anyway, or there isn't, in which case we'll call
tcp_recvmsg which does the right thing in this situation.
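
A sketch of the added check in tcp_bpf_wait_data() (illustrative):

  if (sk->sk_shutdown & RCV_SHUTDOWN)
          return 1;   /* let the caller fall through to tcp_recvmsg(),
                       * which returns 0 on an orderly shutdown */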

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/26038a28c21fea5d04d4bd4744c5686d3f2e5504.1591784177.git.sd@queasysnail.net
2020-06-12 15:10:12 -07:00
dihu 487082fb7b bpf/sockmap: Fix kernel panic at __tcp_bpf_recvmsg
When a user application calls read() with the MSG_PEEK flag to read data
from a bpf sockmap socket, a kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. The sk_msg is not removed from the ingress_msg
queue after being read out while the MSG_PEEK flag is set. Because it is not
checked whether the sk_msg is the last msg of the ingress_msg queue, the next
sk_msg may be the head of the ingress_msg queue, whose sg page memory address
is invalid. So it is necessary to add checks to prevent this problem.
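
A sketch of the peek handling in __tcp_bpf_recvmsg() (illustrative; the
exact bookkeeping differs):

  if (unlikely(flags & MSG_PEEK)) {
          /* Stop instead of walking past the last queued msg. */
          if (msg_rx == list_last_entry(&psock->ingress_msg,
                                        struct sk_msg, list))
                  break;
          msg_rx = list_next_entry(msg_rx, list);
  }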

[20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G            E 5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: dihu <anny.hu@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com
2020-06-09 10:56:36 -07:00
John Fastabend 81aabbb9fb bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size
In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size,
which is used to track the total bytes in a message. But this is not
correct, because apply_bytes is itself modified in the main loop doing
the mem_charge.

At the end of this, sg.size is set incorrectly and out of sync with the
actual sk values, and we can then get a splat if we try to cork the data
later and again try to redirect the msg to ingress. To fix this, instead
of trying to track msg.size separately, do the easy thing and make it
part of the sk_msg_xfer logic, so that sg.size is always correct when
the msg is moved.
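
For illustration, what folding it into the transfer helper looks like (field names follow struct sk_msg; treat this as a sketch rather than the exact diff):

    static inline void sk_msg_xfer(struct sk_msg *dst, struct sk_msg *src,
                                   int which, u32 size)
    {
            dst->sg.data[which] = src->sg.data[which];
            dst->sg.data[which].length  = size;
            dst->sg.size               += size;
            /* Keep the source's total in sync as bytes move, so no caller
             * has to fix up sg.size separately (the bug described above).
             */
            src->sg.size               -= size;
            src->sg.data[which].length -= size;
            src->sg.data[which].offset += size;
    }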

To reproduce the splat below, users will need ingress + cork and must hit an
error path that will then try to 'free' the skmsg.

[  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
[  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317

[  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
[  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[  173.700009] Call Trace:
[  173.700021]  dump_stack+0x8e/0xcb
[  173.700029]  ? sk_msg_free_elem+0xdd/0x120
[  173.700034]  ? sk_msg_free_elem+0xdd/0x120
[  173.700042]  __kasan_report+0x102/0x15f
[  173.700052]  ? sk_msg_free_elem+0xdd/0x120
[  173.700060]  kasan_report+0x32/0x50
[  173.700070]  sk_msg_free_elem+0xdd/0x120
[  173.700080]  __sk_msg_free+0x87/0x150
[  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
[  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
2020-05-06 00:22:22 +02:00
Xiyu Yang 18f02ad19e bpf: Fix sk_psock refcnt leak when receiving message
tcp_bpf_recvmsg() invokes sk_psock_get(), which returns a reference of
the specified sk_psock object to "psock" with increased refcnt.

When tcp_bpf_recvmsg() returns, the local variable "psock" becomes invalid,
so the refcount should be decreased to keep it balanced.

The reference counting issue happens in several exception handling paths
of tcp_bpf_recvmsg(). When those error scenarios occur, such as when "flags"
includes MSG_ERRQUEUE, the function forgets to decrease the refcnt
increased by sk_psock_get(), causing a refcnt leak.

Fix this issue by calling sk_psock_put() or pulling up the error queue
read handling when those error scenarios occur.
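
For illustration, the balanced pattern this describes at the top of tcp_bpf_recvmsg() (a sketch; the exact ordering in the fix may differ):

    /* Handle the error queue before taking a psock reference, so this
     * early return cannot leak one.
     */
    if (unlikely(flags & MSG_ERRQUEUE))
            return inet_recv_error(sk, msg, len, addr_len);

    psock = sk_psock_get(sk);
    if (unlikely(!psock))
            return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);

    if (!skb_queue_empty(&sk->sk_receive_queue) &&
        sk_psock_queue_empty(psock)) {
            sk_psock_put(sk, psock);        /* drop the reference on this exit too */
            return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
    }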

Fixes: e7a5f1f1cd ("bpf/sockmap: Read psock ingress_msg before sk_receive_queue")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1587872115-42805-1-git-send-email-xiyuyang19@fudan.edu.cn
2020-04-27 23:04:05 +02:00
YueHaibing c0fd336ea4 bpf, tcp: Make tcp_bpf_recvmsg static
After commit f747632b60 ("bpf: sockmap: Move generic sockmap
hooks from BPF TCP"), tcp_bpf_recvmsg() is not used outside of
tcp_bpf.c, so make it static and remove it from tcp.h. Also move
it under the BPF_STREAM_PARSER #ifdef to fix unused function warnings.

Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20 15:56:55 +01:00
YueHaibing a26527981a bpf, tcp: Fix unused function warnings
If BPF_STREAM_PARSER is not set, gcc warns:

  net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function]
  net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function]
  net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function]

Move the unused functions under #ifdef CONFIG_BPF_STREAM_PARSER.
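
The pattern is simply to keep such definitions inside the same guard as their callers, e.g. (illustrative stub body):

    #ifdef CONFIG_BPF_STREAM_PARSER
    static bool tcp_bpf_stream_read(const struct sock *sk)
    {
            return false;   /* stub; the real body checks psock ingress state */
    }
    #endif /* CONFIG_BPF_STREAM_PARSER */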

Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
2020-03-20 15:56:25 +01:00
Lorenz Bauer f747632b60 bpf: sockmap: Move generic sockmap hooks from BPF TCP
The init, close and unhash handlers from TCP sockmap are generic,
and can be reused by UDP sockmap. Move the helpers into the sockmap code
base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to
be conditional on BPF_STREAM_PARSER.

The moved functions are unmodified, except that sk_psock_unlink is
renamed to sock_map_unlink to better match its behaviour.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer d19da360ee bpf: tcp: Move assertions into tcp_bpf_get_proto
We need to ensure that sk->sk_prot uses certain callbacks, so that
code which directly calls e.g. tcp_sendmsg in certain corner cases
keeps working. To avoid spurious asserts, we must do this only if
sk_psock_update_proto has not yet been called. The same invariants
apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well.

Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer 1a2e20132d skmsg: Update saved hooks only once
Only update psock->saved_* if psock->sk_proto has not been initialized
yet. This allows us to get rid of tcp_bpf_reinit_sk_prot.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-3-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Jakub Sitnicki e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket's protocol callbacks before it gets hashed and
before any of the callbacks can get invoked.
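
For illustration, a sketch from memory of how that restore can look (the proto-table and enum names used here, tcp_bpf_prots and TCP_BPF_*, are assumptions, not confirmed by this log):

    /* Runs while the child is being set up, before it is hashed. */
    void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
    {
            int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
            struct proto *prot = newsk->sk_prot;

            /* The child copied the parent's overridden callbacks, but owns
             * no psock in sk_user_data, so put the default proto back.
             */
            if (prot == &tcp_bpf_prots[family][TCP_BPF_BASE])
                    newsk->sk_prot = sk->sk_prot_creator;
    }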

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki b8e202d1d1 net, sk_msg: Annotate lockless access to sk_prot on clone
sk_msg and ULP frameworks override the protocol callbacks pointer in
sk->sk_prot, while tcp accesses it locklessly when cloning the listening
socket, that is, with neither sk_lock nor sk_callback_lock held.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there will be shared access to sk->sk_prot if the socket is being cloned
while being inserted into or deleted from the sockmap on another CPU:

Read side:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

Write side:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
     sock_map_unref
       sk_psock_put
         sk_psock_drop
           sk_psock_restore_proto
             tcp_update_ulp
               WRITE_ONCE(sk->sk_prot, proto)

Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
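
For illustration, the shape of the annotations on each side (simplified excerpts):

    /* Read side: sk_clone_lock(), no lock protecting sk_prot here. */
    struct proto *prot = READ_ONCE(sk->sk_prot);

    /* Write side: sockmap install / remove paths, see the chains above. */
    WRITE_ONCE(sk->sk_prot, ops);     /* install the sk_msg callbacks      */
    WRITE_ONCE(sk->sk_prot, proto);   /* restore the original proto later  */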

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
John Fastabend 7361d44896 bpf: Sockmap/tls, fix pop data with SK_DROP return code
When the user returns SK_DROP we need to reset the number of copied bytes
to indicate to the user that the bytes were dropped and not sent. If we
don't reset the copied arg, sendmsg will return as if those bytes were
copied, giving the user a positive return value.

This works as expected today except in the case where the user also
pops bytes. In the pop case sg.size is reduced, but we don't correctly
account for this when the copied count is reset. The popped bytes are not
accounted for, and we return a small positive value, potentially confusing
the user.

The reason this happens is a typo where we do the wrong comparison
when accounting for popped bytes. In this fix, notice that the if/else is
not needed and that we have a similar problem if we push data, except it's
not visible to the user, because if delta is larger than sg.size we return
a negative value, so it appears as an error regardless.
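
For illustration, the accounting this ends up with in tcp_bpf_send_verdict() (a sketch based on the description above; delta is the net shrink of sg.size across the verdict program):

    delta = msg->sg.size;
    psock->eval = sk_psock_msg_verdict(sk, psock, msg);
    delta -= msg->sg.size;  /* > 0 if bytes were popped, < 0 if pushed */

    ...

    case __SK_DROP:
    default:
            sk_msg_free_partial(sk, msg, tosend);
            sk_msg_apply_bytes(psock, tosend);
            /* Report the dropped bytes, including anything the program
             * popped, so sendmsg() does not claim they were sent.
             */
            *copied -= (tosend + delta);
            return -EACCES;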

Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
Lingpeng Chen e7a5f1f1cd bpf/sockmap: Read psock ingress_msg before sk_receive_queue
Right now in tcp_bpf_recvmsg, the socket reads data from sk_receive_queue
first if it is not empty, and from psock->ingress_msg otherwise. If a FIN
packet arrives while there is also some data in psock->ingress_msg, the
data in psock->ingress_msg will be purged. This always happens when making
a request to an HTTP/1.0 server like python SimpleHTTPServer, since the
server sends a FIN packet right after the data is sent out.
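
For illustration, a sketch of the reordered check near the top of tcp_bpf_recvmsg() (the sk_psock_queue_empty() helper name follows the upstream pattern; details are approximate):

    /* Only defer to tcp_recvmsg() when the socket queue has data AND the
     * psock ingress queue is empty; otherwise drain ingress_msg first so a
     * queued FIN cannot cause it to be purged.
     */
    if (!skb_queue_empty(&sk->sk_receive_queue) &&
        sk_psock_queue_empty(psock))
            return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);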

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: Lingpeng Chen <forrest0579@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
2020-01-09 23:13:48 +01:00
Jakub Kicinski 031097d9e0 net: skmsg: fix TLS 1.3 crash with full sk_msg
TLS 1.3 started using the entry at the end of the SG array
for chaining-in the single byte content type entry. This mostly
works:

[ E E E E E E . . ]
  ^           ^
   start       end

                 E < content type
               /
[ E E E E E E C . ]
  ^           ^
   start       end

(Where E denotes a populated SG entry; C denotes a chaining entry.)

If the array is full, however, the end will point to the start:

[ E E E E E E E E ]
  ^
   start
   end

And we end up overwriting the start:

    E < content type
   /
[ C E E E E E E E ]
  ^
   start
   end

The sg array is supposed to be a circular buffer with start and
end markers pointing anywhere. In the case where start > end
(i.e. the circular buffer has "wrapped"), there is an extra entry
reserved at the end to chain the two halves together.

[ E E E E E E . . l ]

(Where l is the reserved entry for "looping" back to the front.)

As suggested by John, let's reserve another entry for chaining
SG entries after the main circular buffer. Note that this entry
has to be pointed to by the end entry so its position is not fixed.

Examples of full messages:

[ E E E E E E E E . l ]
  ^               ^
   start           end

   <---------------.
[ E E . E E E E E E l ]
      ^ ^
   end   start

Now the end will always point to an unused entry, so TLS 1.3
can always use it.
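
For illustration, a tiny self-contained demo of the index arithmetic: if ring positions wrap modulo capacity + 1, the end marker can never collide with start, so the slot at end stays free for chaining (the MAX_FRAGS value here is arbitrary, not the kernel's MAX_SKB_FRAGS):

    #include <assert.h>
    #include <stdio.h>

    #define MAX_FRAGS   6                   /* illustrative capacity */
    #define NR_FRAG_IDS (MAX_FRAGS + 1)     /* indices wrap one past capacity */

    int main(void)
    {
            unsigned int start, end, i;

            for (start = 0; start < NR_FRAG_IDS; start++) {
                    end = start;
                    for (i = 0; i < MAX_FRAGS; i++) /* fill the ring completely */
                            end = (end + 1) % NR_FRAG_IDS;
                    /* Even when full, end != start, so the entry at end is
                     * always unused and can chain in the content-type byte.
                     */
                    assert(end != start);
            }
            printf("a full ring always leaves the end slot free\n");
            return 0;
    }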

Fixes: 130b392c6c ("net: tls: Add tls 1.3 support")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-28 22:40:29 -08:00
Jakub Kicinski 414776621d net/tls: prevent skb_orphan() from leaking TLS plain text with offload
sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket, such as skb_orphan() or skb_clone(), will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now;
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

A second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.
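
For illustration, the shape of the guard this implies in can_skb_orphan_partial() (a sketch; close to, but not guaranteed identical to, the actual change):

    static bool can_skb_orphan_partial(const struct sk_buff *skb)
    {
    #ifdef CONFIG_TLS_DEVICE
            /* Offload drivers rely on in-order delivery of the TLS record
             * stream; a partial orphan defeats that, so force a full orphan.
             */
            if (skb->decrypted)
                    return false;
    #endif
            return (skb->destructor == sock_wfree ||
                    (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree));
    }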

Reusing the sk_buff->decrypted flag also protects against
leaking clear text when an incoming, decrypted skb is redirected
(e.g. by TC).

See commit 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect
through ULP") for justification why the internal flag is safe.
The only location through which the flag could leak in is tcp_bpf_sendmsg(),
and that is taken care of by clearing the previously unused bit.

v2:
 - remove superfluous decrypted mark copy (Willem);
 - remove the stale doc entry (Boris);
 - rely entirely on EOR marking to prevent coalescing (Boris);
 - use an internal sendpages flag instead of marking the socket
   (Boris).
v3 (Willem):
 - reorganize the can_skb_orphan_partial() condition;
 - fix the flag leak-in through tcp_bpf_sendmsg.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 22:39:35 -07:00