94 Commits

Author SHA1 Message Date
Antoine Tenart 5e0d04b8ef net/sock: Introduce trace_sk_data_ready()
JIRA: https://issues.redhat.com/browse/RHEL-17413
Upstream Status: linux.git
Conflicts:
- drivers/infiniband/hw/erdma/erdma_cm.c chunk missing due to missing
  upstream commit 920d93eac8b9 ("RDMA/erdma: Add connection management
  (CM) support") in c9s.
- Context diff in fs/dlm/lowcomms.c due to missing upstream commit
  dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") in c9s.
- Context diff in net/core/net-traces.c as 8139dccd464a ("udp6: add a
  missing call into udp_fail_queue_rcv_skb tracepoint") was backported
  earlier in c9s.
- Context diff in net/tls/tls_sw.c as 74836ec828fe ("tls: rx: strp:
  don't use GFP_KERNEL in softirq context") was backported earlier in
  c9s.
- Context diff in net/sunrpc/svcsock.c as upstream commit fc80fc2d4e39
  ("SUNRPC: Fix UAF in svc_tcp_listen_data_ready()") was backported
  before in c9s.

commit 40e0b09081420853542571c38875b48b60404ebb
Author: Peilin Ye <peilin.ye@bytedance.com>
Date:   Thu Jan 19 16:45:16 2023 -0800

    net/sock: Introduce trace_sk_data_ready()

    As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
    callback implementations.  For example:

    <...>
      iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
      iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
    <...>

    Suggested-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2023-12-11 11:15:00 +01:00
Chris von Recklinghausen 48cb06d2f2 iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
Conflicts: fs/cifs/file.c, fs/cifs/misc.c - We don't have
	38c8a9a52082 ("smb: move client and server files to common directory fs/smb")
	so modify them instead of fs/smb/client/file.c and fs/smb/client/misc.c
	like upstream

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 1ef255e257173f4bc44317ef2076e7e0de688fdf
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu Jun 9 10:28:36 2022 -0400

    iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()

    Most of the users immediately follow successful iov_iter_get_pages()
    with advancing by the amount it had returned.

    Provide inline wrappers doing that, convert trivial open-coded
    uses of those.
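
The "get, then advance by what was returned" pattern described above can be sketched in userspace as follows. This is only an illustrative model: `struct demo_iter` and the `demo_*` helpers are hypothetical stand-ins, not the kernel's iov_iter API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an iterator over a byte range. */
struct demo_iter {
	size_t offset;    /* current position */
	size_t remaining; /* bytes left to hand out */
};

/* Non-advancing variant: reports how much it could grab (never more than
 * maxsize) but leaves the iterator where it was. */
static size_t demo_get(const struct demo_iter *it, size_t maxsize)
{
	return it->remaining < maxsize ? it->remaining : maxsize;
}

/* Move the iterator forward by n bytes. */
static void demo_advance(struct demo_iter *it, size_t n)
{
	assert(n <= it->remaining);
	it->offset += n;
	it->remaining -= n;
}

/* Advancing wrapper: what most callers open-coded by hand. */
static inline size_t demo_get_and_advance(struct demo_iter *it, size_t maxsize)
{
	size_t got = demo_get(it, maxsize);

	if (got > 0)
		demo_advance(it, got);
	return got;
}
```

A caller that previously wrote the two steps separately would call `demo_get_and_advance()` once; the return value still tells it how much was obtained.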

    BTW, iov_iter_get_pages() never returns more than it had been asked
    to; such checks in cifs ought to be removed someday...

    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:04 -04:00
Felix Maurer 2d92cf1f17 bpf, sockmap: Pass skb ownership through read_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities") and db39dfdc1c3b
  ("udp: Use WARN_ON_ONCE() in udp_read_skb()"); 31f1fbcb346c ("udp:
  Refactor udp_read_skb()") was adapted to reflect this
- net/vmw_vsock/virtio_transport_common.c: Skipped, because the relevant
  code is not there, missing 634f1a7110b4 ("vsock: support sockmap")

commit 78fa0d61d97a728d306b0c23d353c0e340756437
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Mon May 22 19:56:05 2023 -0700

    bpf, sockmap: Pass skb ownership through read_skb

    The read_skb hook calls consume_skb() now, but this means that if the
    recv_actor program wants to use the skb it needs to inc the ref cnt
    so that the consume_skb() doesn't kfree the sk_buff.

    This is problematic because in some error cases under memory pressure
    we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
    Then we get this,

     skb_linearize()
       __pskb_pull_tail()
         pskb_expand_head()
           BUG_ON(skb_shared(skb))

    Because sk_psock_verdict_recv() has incremented the users refcnt,
    the refcnt is > 1 and we trip the BUG_ON().

    To fix, let's simply pass ownership of the sk_buff through the skb_read
    call. Then we can drop the consume from read_skb handlers and assume
    the verdict recv does any required kfree.
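
The refcount problem and the ownership-passing fix can be modelled in userspace roughly as below. Everything here is a hypothetical simplification: `struct demo_skb` only carries a user count, and `demo_linearize()` returns an error where the kernel would hit BUG_ON(skb_shared(skb)).

```c
#include <assert.h>

/* Hypothetical stand-in for an sk_buff: only the user count matters. */
struct demo_skb {
	int users;
};

/* Models the linearize path: refuses to work on a shared buffer. */
static int demo_linearize(const struct demo_skb *skb)
{
	return skb->users > 1 ? -1 : 0;
}

/* Old scheme: the actor takes its own reference because the read handler
 * will consume the buffer afterwards, so the buffer is shared inside. */
static int demo_old_read(struct demo_skb *skb)
{
	int ret;

	assert(skb->users == 1);
	skb->users++;              /* actor's extra reference */
	ret = demo_linearize(skb); /* fails: users == 2 */
	skb->users--;              /* actor drops its reference */
	skb->users--;              /* handler's consume */
	return ret;
}

/* New scheme: ownership moves to the actor, no extra reference is taken,
 * and the actor does the final free. */
static int demo_new_read(struct demo_skb *skb)
{
	int ret;

	assert(skb->users == 1);
	ret = demo_linearize(skb); /* succeeds: users == 1 */
	skb->users--;              /* actor, as owner, frees */
	return ret;
}
```

In both schemes the buffer ends with zero users; the difference is only whether it is shared at the moment linearization is attempted.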

    Bug found while testing in our CI which runs in VMs that hit memory
    constraints rather regularly. William tested TCP read_skb handlers.

    [  106.536188] ------------[ cut here ]------------
    [  106.536197] kernel BUG at net/core/skbuff.c:1693!
    [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
    [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
    [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
    [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
    [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
    [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
    [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
    [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
    [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
    [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
    [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  106.542255] Call Trace:
    [  106.542383]  <IRQ>
    [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
    [  106.542681]  skb_ensure_writable+0x85/0xa0
    [  106.542882]  sk_skb_pull_data+0x18/0x20
    [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
    [  106.543536]  ? migrate_disable+0x66/0x80
    [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
    [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
    [  106.544561]  tcp_read_skb+0x7b/0x120
    [  106.544740]  tcp_data_queue+0x904/0xee0
    [  106.544931]  tcp_rcv_established+0x212/0x7c0
    [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
    [  106.545326]  tcp_v4_rcv+0xe70/0xf60
    [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
    [  106.545744]  ip_local_deliver_finish+0xa7/0x150

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Reported-by: William Findlay <will@isovalent.com>
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
Jerome Marchand f79873d0b0 bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
Bugzilla: https://bugzilla.redhat.com/2177177

commit a351d6087bf7d3d8440d58d3bf244ec64b89394a
Author: Pengcheng Yang <yangpc@wangsu.com>
Date:   Tue Nov 29 18:40:39 2022 +0800

    bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes

    When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
    flag from the msg->flags. If apply_bytes is used and it is larger than
    the current data being processed, sk_psock_msg_verdict() will not be
    called when sendmsg() is called again. At this time, the msg->flags is 0,
    and we lost the BPF_F_INGRESS flag.

    So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
    redirecting.
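
A simplified model of why the flag has to be remembered across calls is sketched below. The names (`struct demo_psock`, `demo_send()`, `DEMO_F_INGRESS`) are hypothetical, and the function assumes that a verdict only runs when no `apply_bytes` budget is left, which is how the message above describes the behaviour.

```c
#include <assert.h>

#define DEMO_F_INGRESS 1u /* hypothetical stand-in for BPF_F_INGRESS */

/* Hypothetical per-socket state. */
struct demo_psock {
	unsigned int apply_bytes;   /* bytes the last verdict still covers */
	unsigned int redir_ingress; /* direction saved from the last verdict */
};

/* Models one send call of len bytes. verdict_flags and verdict_apply are
 * only consulted when a new verdict actually runs. Returns nonzero when
 * the data is redirected to ingress. */
static unsigned int demo_send(struct demo_psock *psock, unsigned int len,
			      unsigned int verdict_flags,
			      unsigned int verdict_apply)
{
	unsigned int used;

	assert(psock);
	if (psock->apply_bytes == 0) {
		/* A verdict runs: remember its direction for later calls. */
		psock->redir_ingress = verdict_flags & DEMO_F_INGRESS;
		psock->apply_bytes = verdict_apply;
	}
	used = len < psock->apply_bytes ? len : psock->apply_bytes;
	psock->apply_bytes -= used;
	return psock->redir_ingress;
}
```

With the direction stored in the per-socket state, a second call that is still covered by the earlier verdict keeps going to ingress even though it carries no flags of its own.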

    Fixes: 8934ce2fd0 ("bpf: sockmap redirect ingress support")
    Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:13 +02:00
Felix Maurer 657dee5ce4 bpf, sock_map: Move cancel_work_sync() out of sock lock
Bugzilla: https://bugzilla.redhat.com/2166911

commit 8bbabb3fddcd0f858be69ed5abc9b470a239d6f2
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Tue Nov 1 21:34:17 2022 -0700

    bpf, sock_map: Move cancel_work_sync() out of sock lock
    
    Stanislav reported a lockdep warning, which is caused by the
    cancel_work_sync() called inside sock_map_close(), as analyzed
    below by Jakub:
    
    psock->work.func = sk_psock_backlog()
      ACQUIRE psock->work_mutex
        sk_psock_handle_skb()
          skb_send_sock()
            __skb_send_sock()
              sendpage_unlocked()
                kernel_sendpage()
                  sock->ops->sendpage = inet_sendpage()
                    sk->sk_prot->sendpage = tcp_sendpage()
                      ACQUIRE sk->sk_lock
                        tcp_sendpage_locked()
                      RELEASE sk->sk_lock
      RELEASE psock->work_mutex
    
    sock_map_close()
      ACQUIRE sk->sk_lock
      sk_psock_stop()
        sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED)
        cancel_work_sync()
          __cancel_work_timer()
            __flush_work()
              // wait for psock->work to finish
      RELEASE sk->sk_lock
    
    We can move the cancel_work_sync() out of the sock lock protection,
    but still before saved_close() was called.
    
    Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
    Reported-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20221102043417.279409-1-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-03-06 14:54:39 +01:00
Felix Maurer c53b530137 skmsg: pass gfp argument to alloc_sk_msg()
Bugzilla: https://bugzilla.redhat.com/2137876

commit 2d1f274b95c6e4ba6a813b3b8e7a1a38d54a0a08
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Oct 15 21:24:41 2022 +0000

    skmsg: pass gfp argument to alloc_sk_msg()
    
    syzbot found that alloc_sk_msg() could be called from a
    non sleepable context. sk_psock_verdict_recv() uses
    rcu_read_lock() protection.
    
    We need the callers to pass a gfp_t argument to avoid issues.
    
    syzbot report was:
    
    BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
    in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 3613, name: syz-executor414
    preempt_count: 0, expected: 0
    RCU nest depth: 1, expected: 0
    INFO: lockdep is turned off.
    CPU: 0 PID: 3613 Comm: syz-executor414 Not tainted 6.0.0-syzkaller-09589-g55be6084c8e0 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
    Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106
    __might_resched+0x538/0x6a0 kernel/sched/core.c:9877
    might_alloc include/linux/sched/mm.h:274 [inline]
    slab_pre_alloc_hook mm/slab.h:700 [inline]
    slab_alloc_node mm/slub.c:3162 [inline]
    slab_alloc mm/slub.c:3256 [inline]
    kmem_cache_alloc_trace+0x59/0x310 mm/slub.c:3287
    kmalloc include/linux/slab.h:600 [inline]
    kzalloc include/linux/slab.h:733 [inline]
    alloc_sk_msg net/core/skmsg.c:507 [inline]
    sk_psock_skb_ingress_self+0x5c/0x330 net/core/skmsg.c:600
    sk_psock_verdict_apply+0x395/0x440 net/core/skmsg.c:1014
    sk_psock_verdict_recv+0x34d/0x560 net/core/skmsg.c:1201
    tcp_read_skb+0x4a1/0x790 net/ipv4/tcp.c:1770
    tcp_rcv_established+0x129d/0x1a10 net/ipv4/tcp_input.c:5971
    tcp_v4_do_rcv+0x479/0xac0 net/ipv4/tcp_ipv4.c:1681
    sk_backlog_rcv include/net/sock.h:1109 [inline]
    __release_sock+0x1d8/0x4c0 net/core/sock.c:2906
    release_sock+0x5d/0x1c0 net/core/sock.c:3462
    tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1483
    sock_sendmsg_nosec net/socket.c:714 [inline]
    sock_sendmsg net/socket.c:734 [inline]
    __sys_sendto+0x46d/0x5f0 net/socket.c:2117
    __do_sys_sendto net/socket.c:2129 [inline]
    __se_sys_sendto net/socket.c:2125 [inline]
    __x64_sys_sendto+0xda/0xf0 net/socket.c:2125
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
    Fixes: 43312915b5ba ("skmsg: Get rid of unncessary memset()")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Cong Wang <cong.wang@bytedance.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
Felix Maurer fdab1f5740 tcp: handle pure FIN case correctly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2137876

commit 2e23acd99efacfd2a63cb9725afbc65e4e964fb7
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Aug 17 12:54:45 2022 -0700

    tcp: handle pure FIN case correctly

    When skb->len==0, the recv_actor() returns 0 too, but we also use 0
    for error conditions. This patch amends this by propagating the errors
    to tcp_read_skb() so that we can distinguish skb->len==0 case from
    error cases.
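
The distinction between "zero bytes" and "error" can be illustrated with a small userspace sketch. `struct demo_skb`, `demo_recv_actor()` and `demo_read_skb()` are hypothetical names; the only point is that errors travel as negative values so that 0 remains a valid length.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: payload length plus a failure switch. */
struct demo_skb {
	size_t len;
	int fail; /* nonzero: the actor cannot process this buffer */
};

/* Actor: returns the number of bytes consumed (0 for an empty buffer)
 * or a negative errno, so 0 is never ambiguous. */
static int demo_recv_actor(const struct demo_skb *skb)
{
	if (skb->fail)
		return -ENOMEM;
	return (int)skb->len;
}

/* Reader: propagates errors unchanged and reports an empty buffer
 * (the pure FIN case) separately from a failure. */
static int demo_read_skb(const struct demo_skb *skb, int *empty)
{
	int copied = demo_recv_actor(skb);

	assert(empty);
	if (copied < 0)
		return copied;
	*empty = (copied == 0);
	return copied;
}
```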

    Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
    Reported-by: Eric Dumazet <edumazet@google.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:50:16 +01:00
Felix Maurer ac09ae99c2 net: fix refcount bug in sk_psock_get (2)
Bugzilla: https://bugzilla.redhat.com/2137876

commit 2a0133723f9ebeb751cfce19f74ec07e108bef1f
Author: Hawkins Jiawei <yin31149@gmail.com>
Date:   Fri Aug 5 15:48:34 2022 +0800

    net: fix refcount bug in sk_psock_get (2)
    
    Syzkaller reports refcount bug as follows:
    ------------[ cut here ]------------
    refcount_t: saturated; leaking memory.
    WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
    Modules linked in:
    CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
     <TASK>
     __refcount_add_not_zero include/linux/refcount.h:163 [inline]
     __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
     refcount_inc_not_zero include/linux/refcount.h:245 [inline]
     sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
     tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
     tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
     tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
     tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
     tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
     sk_backlog_rcv include/net/sock.h:1061 [inline]
     __release_sock+0x134/0x3b0 net/core/sock.c:2849
     release_sock+0x54/0x1b0 net/core/sock.c:3404
     inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
     __sys_shutdown_sock net/socket.c:2331 [inline]
     __sys_shutdown_sock net/socket.c:2325 [inline]
     __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
     __do_sys_shutdown net/socket.c:2351 [inline]
     __se_sys_shutdown net/socket.c:2349 [inline]
     __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0
     </TASK>
    
    During the SMC fallback process in the connect syscall, the kernel
    replaces TCP with SMC. In order to forward wakeups to the
    smc socket waitqueue after fallback, the kernel sets
    clcsk->sk_user_data to the original smc socket in
    smc_fback_replace_callbacks().
    
    Later, in the shutdown syscall, the kernel calls
    sk_psock_get(), which treats clcsk->sk_user_data
    as a psock, triggering the refcnt warning.
    
    So the root cause is that smc and psock both use the
    sk_user_data field, so they can easily misinterpret
    each other's value.
    
    This patch solves it by using another bit (defined as
    SK_USER_DATA_PSOCK) in PTRMASK to mark whether
    sk_user_data points to a psock object or not.
    This patch depends on a PTRMASK introduced in commit f1ff5ce2cd
    ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
    
    Since there may be more flags in the sk_user_data field in the future,
    this patch also refactors the sk_user_data flags code to be more generic
    to improve its maintainability.
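
The idea of tagging a shared pointer field with type bits can be sketched as follows. The macro and function names are hypothetical and the bit values are chosen for illustration only; the sketch assumes objects are at least 8-byte aligned so the three low bits are free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits stored in the low bits of an aligned pointer. */
#define DEMO_FLAG_NOCOPY ((uintptr_t)1)
#define DEMO_FLAG_PSOCK  ((uintptr_t)4)
#define DEMO_PTRMASK     (~(uintptr_t)7)

/* Hypothetical owner object, forced to 8-byte alignment. */
struct demo_psock {
	_Alignas(8) int refcnt;
};

/* Combine an aligned pointer with flag bits. */
static uintptr_t demo_tag(void *ptr, uintptr_t flags)
{
	assert(((uintptr_t)ptr & ~DEMO_PTRMASK) == 0);
	return (uintptr_t)ptr | flags;
}

/* Return the pointer only if it carries the requested flag; a value
 * stored by another subsystem without that flag yields NULL. */
static void *demo_get_if(uintptr_t tagged, uintptr_t flag)
{
	if (!(tagged & flag))
		return NULL;
	return (void *)(tagged & DEMO_PTRMASK);
}
```

A reader that expects a psock asks for `DEMO_FLAG_PSOCK` and gets NULL when the field was set by someone else, instead of misinterpreting a foreign object.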
    
    Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
    Suggested-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Wen Gu <guwen@linux.alibaba.com>
    Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer cb1d716af6 skmsg: Get rid of unncessary memset()
Bugzilla: https://bugzilla.redhat.com/2137876

commit 43312915b5ba20741617dd2119e835205fa8580c
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:14 2022 -0700

    skmsg: Get rid of unncessary memset()
    
    We always allocate skmsg with kzalloc(), so there is no need
    to call memset(0) on it; the only thing we need from
    sk_msg_init() is sg_init_marker(). So introduce a new helper
    which is just kzalloc()+sg_init_marker(). This saves an
    unnecessary memset(0) for skmsg on the fast path.
    
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-5-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer 6aa8ccbbdc skmsg: Get rid of skb_clone()
Bugzilla: https://bugzilla.redhat.com/2137876

commit 57452d767feaeab405de3bff0d240c3ac84bfe0d
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:13 2022 -0700

    skmsg: Get rid of skb_clone()
    
    With ->read_skb() we now have an entire skb dequeued from the
    receive queue, so we just need to grab an additional refcnt
    before passing its ownership to recv actors.
    
    And we should not touch them any more, particularly for
    skb->sk. Fortunately, skb->sk is already set for most of
    the protocols except UDP where skb->sk has been stolen,
    so we have to fix it up for UDP case.
    
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Felix Maurer 09faf01cb9 net: Introduce a new proto_ops ->read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")

commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:12 2022 -0700

    net: Introduce a new proto_ops ->read_skb()

    Currently both splice() and sockmap use ->read_sock() to
    read skbs from the receive queue, but for sockmap we only read
    one entire skb at a time, so ->read_sock() is too conservative
    to use. Introduce a new proto_ops ->read_skb() which supports
    this semantic; with this we can finally pass the ownership of
    the skb to recv actors.

    For non-TCP protocols, all ->read_sock() can be simply
    converted to ->read_skb().

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Artem Savkov 5120f0250e bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
Bugzilla: https://bugzilla.redhat.com/2137876

commit d8616ee2affcff37c5d315310da557a694a3303d
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Tue May 24 15:53:11 2022 +0800

    bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
    
    During TCP sockmap redirect pressure test, the following warning is triggered:
    
    WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
    CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
    Call Trace:
     inet_csk_destroy_sock+0x55/0x110
     inet_csk_listen_stop+0xbb/0x380
     tcp_close+0x41b/0x480
     inet_release+0x42/0x80
     __sock_release+0x3d/0xa0
     sock_close+0x11/0x20
     __fput+0x9d/0x240
     task_work_run+0x62/0x90
     exit_to_user_mode_prepare+0x110/0x120
     syscall_exit_to_user_mode+0x27/0x190
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    The reason we observed is that:
    
    When the listener is closing, a connection may have completed the three-way
    handshake but not been accepted, and the client has sent some packets. The
    child sks in the accept queue are released by
    inet_child_forget()->inet_csk_destroy_sock(), but the psocks of the child
    sks have not been released.
    
    To fix, add sock_map_destroy to release psocks.
    
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220524075311.649153-1-wangyufen@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:29 +01:00
Yauheni Kaliuta 4739f91034 bpf, sockmap: Call skb_linearize only when required in sk_psock_skb_ingress_enqueue
Bugzilla: https://bugzilla.redhat.com/2120968

commit 3527bfe6a92d940abfca87929207e734039f496b
Author: Liu Jian <liujian56@huawei.com>
Date:   Wed Apr 27 19:51:50 2022 +0800

    bpf, sockmap: Call skb_linearize only when required in sk_psock_skb_ingress_enqueue
    
    The skb_to_sgvec fails only when the number of frag_list and frags
    exceeds MAX_MSG_FRAGS. Therefore, we can call skb_linearize only
    when the conversion fails.
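
The "try the cheap conversion first, linearize only on failure" flow can be sketched like this. `demo_to_sgvec()`, `demo_enqueue()` and `DEMO_MAX_FRAGS` are hypothetical, and linearization is modelled simply as collapsing the buffer into a single piece.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_MAX_FRAGS 4 /* hypothetical limit on scatterlist entries */

/* Models the conversion: fails only when there are too many pieces. */
static int demo_to_sgvec(int nfrags)
{
	return nfrags > DEMO_MAX_FRAGS ? -EMSGSIZE : nfrags;
}

/* Convert first; pay for linearization only if the conversion fails. */
static int demo_enqueue(int nfrags, int *linearized)
{
	int num_sge;

	assert(nfrags > 0);
	*linearized = 0;
	num_sge = demo_to_sgvec(nfrags);
	if (num_sge < 0) {
		/* Too fragmented: collapse into one piece and retry. */
		*linearized = 1;
		num_sge = demo_to_sgvec(1);
	}
	return num_sge;
}
```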
    
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220427115150.210213-1-liujian56@huawei.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-30 12:47:01 +02:00
Jiri Benc 1bb9b37492 bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full
Bugzilla: https://bugzilla.redhat.com/2120966

commit 9c34e38c4a870eb30b13f42f5b44f42e9d19ccb8
Author: Wang Yufen <wangyufen@huawei.com>
Date:   Fri Mar 4 16:11:43 2022 +0800

    bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full

    If tcp_bpf_sendmsg() is running while the sk msg is full and sk_msg_alloc()
    returns an -ENOMEM error, tcp_bpf_sendmsg() goes to wait_for_memory. If
    partial memory has been allocated by sk_msg_alloc(), that is, msg_tx->sg.size
    is greater than osize after sk_msg_alloc(), a memleak occurs. To fix this we
    use sk_msg_trim() to release the allocated memory, then go to wait for memory.

    Other call paths of sk_msg_alloc() have a similar issue, such as
    tls_sw_sendmsg(), so handle the sk_msg_trim logic inside sk_msg_alloc(),
    as Cong Wang suggested.

    This issue can cause the following info:
    WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
    Call Trace:
     <TASK>
     inet_csk_destroy_sock+0x55/0x110
     __tcp_close+0x279/0x470
     tcp_close+0x1f/0x60
     inet_release+0x3f/0x80
     __sock_release+0x3d/0xb0
     sock_close+0x11/0x20
     __fput+0x92/0x250
     task_work_run+0x6a/0xa0
     do_exit+0x33b/0xb60
     do_group_exit+0x2f/0xa0
     get_signal+0xb6/0x950
     arch_do_signal_or_restart+0xac/0x2a0
     exit_to_user_mode_prepare+0xa9/0x200
     syscall_exit_to_user_mode+0x12/0x30
     do_syscall_64+0x46/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
     </TASK>

    WARNING: CPU: 3 PID: 2094 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
    Call Trace:
     <TASK>
     __sk_destruct+0x24/0x1f0
     sk_psock_destroy+0x19b/0x1c0
     process_one_work+0x1b3/0x3c0
     kthread+0xe6/0x110
     ret_from_fork+0x22/0x30
     </TASK>

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Wang Yufen <wangyufen@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220304081145.2037182-3-wangyufen@huawei.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:01 +02:00
Paolo Abeni f332e54b50 skmsg: Fix wrong last sg check in sk_msg_recvmsg()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 583585e48d965338e73e1eb383768d16e0922d73
Author: Liu Jian <liujian56@huawei.com>
Date:   Tue Aug 9 17:49:15 2022 +0800

    skmsg: Fix wrong last sg check in sk_msg_recvmsg()

    Fix one kernel NULL pointer dereference as below:

    [  224.462334] Call Trace:
    [  224.462394]  __tcp_bpf_recvmsg+0xd3/0x380
    [  224.462441]  ? sock_has_perm+0x78/0xa0
    [  224.462463]  tcp_bpf_recvmsg+0x12e/0x220
    [  224.462494]  inet_recvmsg+0x5b/0xd0
    [  224.462534]  __sys_recvfrom+0xc8/0x130
    [  224.462574]  ? syscall_trace_enter+0x1df/0x2e0
    [  224.462606]  ? __do_page_fault+0x2de/0x500
    [  224.462635]  __x64_sys_recvfrom+0x24/0x30
    [  224.462660]  do_syscall_64+0x5d/0x1d0
    [  224.462709]  entry_SYSCALL_64_after_hwframe+0x65/0xca

    In commit 9974d37ea75f ("skmsg: Fix invalid last sg check in
    sk_msg_recvmsg()"), we change last sg check to sg_is_last(),
    but in sockmap redirection case (without stream_parser/stream_verdict/
    skb_verdict), we did not mark the end of the scatterlist. Check the
    sk_msg_alloc, sk_msg_page_add, and bpf_msg_push_data functions, they all
    do not mark the end of sg. They are expected to use sg.end for end
    judgment. So the judgment of '(i != msg_rx->sg.end)' is added back here.

    Fixes: 9974d37ea75f ("skmsg: Fix invalid last sg check in sk_msg_recvmsg()")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20220809094915.150391-1-liujian56@huawei.com

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 538a5f7a95 skmsg: Schedule psock work if the cached skb exists on the psock
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit bec217197b412d74168c6a42fc0f76d0cc9cad00
Author: Liu Jian <liujian56@huawei.com>
Date:   Wed Sep 7 15:13:11 2022 +0800

    skmsg: Schedule psock work if the cached skb exists on the psock

    In the sk_psock_backlog function, for an ingress direction skb, if no new data
    packet arrives after the skb is cached, the cached skb does not have a
    chance to be added to the receive queue of the psock. As a result, the cached
    skb cannot be received by the upper-layer application. Fix this by rescheduling
    the psock work to dispose of the cached skb in the sk_msg_recvmsg function.

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220907071311.60534-1-liujian56@huawei.com

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 0e61e407b0 skmsg: Fix invalid last sg check in sk_msg_recvmsg()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 9974d37ea75f01b47d16072b5dad305bd8d23fcc
Author: Liu Jian <liujian56@huawei.com>
Date:   Tue Jun 28 20:36:16 2022 +0800

    skmsg: Fix invalid last sg check in sk_msg_recvmsg()

    In sk_psock_skb_ingress_enqueue function, if the linear area + nr_frags +
    frag_list of the SKB has NR_MSG_FRAG_IDS blocks in total, skb_to_sgvec
    will return NR_MSG_FRAG_IDS, then msg->sg.end will be set to
    NR_MSG_FRAG_IDS, and in addition, (NR_MSG_FRAG_IDS - 1) is set to the last
    SG of msg. Recv the msg in sk_msg_recvmsg, when i is (NR_MSG_FRAG_IDS - 1),
    the sk_msg_iter_var_next(i) will change i to 0 (not NR_MSG_FRAG_IDS), the
    judgment condition "msg_rx->sg.start==msg_rx->sg.end" and
    "i != msg_rx->sg.end" can not work.

    As a result, the processed msg cannot be deleted from ingress_msg list.
    But the length of all the sge of the msg has changed to 0. Then the next
    recvmsg syscall will process the msg repeatedly, because the length of sge
    is 0, the -EFAULT error is always returned.
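
The wrap-around described above can be reproduced with a toy ring. `DEMO_RING_SLOTS` and the two helpers are hypothetical; the iterator wraps to 0 on reaching the slot count, as the message says sk_msg_iter_var_next() does.

```c
#include <assert.h>

#define DEMO_RING_SLOTS 4 /* hypothetical stand-in for NR_MSG_FRAG_IDS */

/* Advance a ring index, wrapping to 0 at the slot count. */
static unsigned int demo_iter_next(unsigned int i)
{
	assert(i < DEMO_RING_SLOTS);
	i++;
	if (i == DEMO_RING_SLOTS)
		i = 0;
	return i;
}

/* Returns 1 if stepping from start ever lands on end within one full
 * lap of the ring, 0 otherwise. */
static int demo_reaches_end(unsigned int start, unsigned int end)
{
	unsigned int i = start;

	for (unsigned int step = 0; step < DEMO_RING_SLOTS; step++) {
		i = demo_iter_next(i);
		if (i == end)
			return 1;
	}
	return 0;
}
```

With `end` set to the slot count itself, the index wraps to 0 before it can ever compare equal, which is the situation in which the "i != end" test stops working.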

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220628123616.186950-1-liujian56@huawei.com

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Felix Maurer 1f429e7ec2 bpf, sockmap: Do not ignore orig_len parameter
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 60ce37b03917e593d8e5d8bcc7ec820773daf81d
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Mar 2 08:17:22 2022 -0800

    bpf, sockmap: Do not ignore orig_len parameter

    Currently, sk_psock_verdict_recv() returns skb->len

    This is problematic because tcp_read_sock() might have
    passed orig_len < skb->len, due to the presence of TCP urgent data.

    This causes an infinite loop from tcp_read_sock()

    Followup patch will make tcp_read_sock() more robust vs bad actors.

    Fixes: ef5659280e ("bpf, sockmap: Allow skipping sk_skb parser program")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20220302161723.3910001-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 12:53:58 +02:00
Hangbin Liu 92ae9687c5 sock: redo the psock vs ULP protection check
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit e34a07c0ae39

commit e34a07c0ae3906f97eb18df50902e2a01c1015b6
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Jun 20 12:13:53 2022 -0700

    sock: redo the psock vs ULP protection check

    Commit 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to
    the new tcp_bpf_update_proto() function. I'm guessing that this
    was done to allow creating psocks for non-inet sockets.

    Unfortunately the destruction path for psock includes the ULP
    unwind, so we need to fail the sk_psock_init() itself.
    Otherwise if ULP is already present we'll notice that later,
    and call tcp_update_ulp() with the sk_proto of the ULP
    itself, which will most likely result in the ULP looping
    its callbacks.

    Fixes: 8a59f9d1e3 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 14:15:22 +08:00
Felix Maurer 90900a0f40 bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619

commit c0d95d3380ee099d735e08618c0d599e72f6c8b0
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Fri Nov 19 10:14:18 2021 -0800

    bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap

    When a sock is added to a sock map we evaluate what proto op hooks need to
    be used. However, when the program is removed from the sock map we have not
    been evaluating if that changes the required program layout.

    Before the patch listed in the 'fixes' tag this was not causing failures
    because the base program set handles all cases. Specifically, the case with
    a stream parser and the case without a stream parser are both handled. With
    the fix below we identified a race when running with a proto op that attempts
    to read skbs off both the stream parser and the skb->receive_queue. Namely,
    that a race existed where when the stream parser is empty checking the
    skb->receive_queue from recvmsg at the precise moment when the parser is
    paused and the receive_queue is not empty could result in skipping the stream
    parser. This may break a RX policy depending on the parser to run.

    The fix tag then loads a specific proto ops that resolved this race. But, we
    missed removing that proto ops recv hook when the sock is removed from the
    sockmap. The result is the stream parser is stopped so no more skbs will be
    aggregated there, but the hook and BPF program continues to be attached on
    the psock. User space will then get an EBUSY when trying to read the socket
    because the recvmsg() handler is now waiting on a stopped stream parser.

    To fix we rerun the proto ops init() function which will look at the new set
    of progs attached to the psock and reset the proto ops hook to the correct
    handlers. And in the above case where we remove the sock from the sock map
    the RX prog will no longer be listed so the proto ops is removed.

    Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-06-07 20:22:42 +02:00
Felix Maurer 73fb414f47 skmsg: Lose offset info in sk_psock_skb_ingress
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619

commit 7303524e04af49a47991e19f895c3b8cdc3796c7
Author: Liu Jian <liujian56@huawei.com>
Date:   Fri Oct 29 22:12:14 2021 +0800

    skmsg: Lose offset info in sk_psock_skb_ingress

    If sockmap enables strparser, offset info is lost in
    sk_psock_skb_ingress(). If the length determined by the parse_msg function is
    not skb->len, the skb will be converted to sk_msg multiple times, and the
    userspace app will get the data multiple times.

    Fix this by getting the offset and length from strp_msg. And as Cong suggested,
    add one bit in skb->_sk_redir to distinguish whether strparser is enabled.
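
The idea of taking the offset and length from the parser metadata, rather than treating the whole buffer as one message, can be sketched in user-space C. This is only an illustration: `struct fake_strp_msg` and `ingress_one()` are hypothetical stand-ins, not the kernel's strp_msg or sk_psock_skb_ingress().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-message metadata a stream parser keeps:
 * where the message starts in the buffer and how long it is. */
struct fake_strp_msg {
    size_t offset;
    size_t full_len;
};

/* Copy exactly one parsed message out of buf into out.
 * Returns the number of bytes copied, or 0 if the bounds do not fit. */
static size_t ingress_one(const char *buf, size_t buf_len,
                          const struct fake_strp_msg *m,
                          char *out, size_t out_cap)
{
    if (m->offset > buf_len || m->full_len > buf_len - m->offset ||
        m->full_len > out_cap)
        return 0;
    memcpy(out, buf + m->offset, m->full_len);
    return m->full_len;
}
```

When a buffer holds two framed messages, calling `ingress_one()` once per message with its own offset and length delivers each byte once; copying the whole buffer for every parsed message would deliver the data more than once.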

    Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Liu Jian <liujian56@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Cong Wang <cong.wang@bytedance.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-06-07 20:22:40 +02:00
Jiri Benc 7a0400e360 skmsg: Extract and reuse sk_msg_is_readable()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit fb4e0a5e73d4bb5ab69b7905abd2ec3b580e9b59
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Fri Oct 8 13:33:04 2021 -0700

    skmsg: Extract and reuse sk_msg_is_readable()

    tcp_bpf_sock_is_readable() is pretty much generic,
    we can extract it and reuse it for non-TCP sockets.

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211008203306.37525-3-xiyou.wangcong@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
John Fastabend 9635720b7c bpf, sockmap: Fix memleak on ingress msg enqueue
If backlog handler is running during a tear down operation we may enqueue
data on the ingress msg queue while tear down is trying to free it.

 sk_psock_backlog()
   sk_psock_handle_skb()
     sk_psock_skb_ingress()
       sk_psock_skb_ingress_enqueue()
         sk_psock_queue_msg(psock,msg)
                                           spin_lock(ingress_lock)
                                            sk_psock_zap_ingress()
                                             _sk_psock_purge_ingerss_msg()
                                              _sk_psock_purge_ingress_msg()
                                            -- free ingress_msg list --
                                           spin_unlock(ingress_lock)
           spin_lock(ingress_lock)
           list_add_tail(msg,ingress_msg) <- entry on list with no one
                                             left to free it.
           spin_unlock(ingress_lock)

To fix we only enqueue from backlog if the ENABLED bit is set. The tear
down logic clears the bit with ingress_lock held so we won't enqueue the
msg in the last step.
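
A minimal user-space sketch of this pattern, using a pthread mutex in place of the spinlock. `struct fake_psock`, `queue_msg()` and `teardown()` are hypothetical names for illustration; the sketch prepends to a singly linked list rather than using the kernel list helpers, and it does not free anything.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct msg {
    struct msg *next;
};

/* Hypothetical stand-in for the psock state relevant to this race. */
struct fake_psock {
    pthread_mutex_t ingress_lock;
    bool enabled;     /* models the ENABLED bit */
    struct msg *head; /* models the ingress_msg list */
    size_t queued;
};

/* Backlog side: enqueue only while enabled, checked under the lock.
 * Returns false if the message was rejected; the caller still owns it. */
static bool queue_msg(struct fake_psock *p, struct msg *m)
{
    bool ok = false;

    pthread_mutex_lock(&p->ingress_lock);
    if (p->enabled) {
        m->next = p->head;
        p->head = m;
        p->queued++;
        ok = true;
    }
    pthread_mutex_unlock(&p->ingress_lock);
    return ok;
}

/* Tear down side: clear the flag and empty the list under the same lock,
 * so no enqueue can land after the purge. Returns how many were purged. */
static size_t teardown(struct fake_psock *p)
{
    size_t purged;

    pthread_mutex_lock(&p->ingress_lock);
    p->enabled = false;
    purged = p->queued;
    p->head = NULL;
    p->queued = 0;
    pthread_mutex_unlock(&p->ingress_lock);
    return purged;
}
```

Because the flag is tested and cleared under the same lock, an enqueue either happens before the purge and is purged with the rest, or happens after it and is rejected.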

Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
2021-07-27 14:55:30 -07:00
John Fastabend 476d98018f bpf, sockmap: On cleanup we additionally need to remove cached skb
It's possible that, if a socket is closed and the receive thread is under
memory pressure, it may have cached a skb. We need to ensure these skbs are
free'd along with the normal ingress_skb queue.

Before 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
down and backlog processing both had sock_lock for the common case of
socket close or unhash. So it was not possible to have both running in
parallel so all we would need is the kfree in those kernels.

But, latest kernels include the commit 799aa7f98d5e and this requires a
bit more work. Without the ingress_lock guarding reading/writing the
state->skb case it's possible the tear down could run before the state
update causing it to leak memory or worse when the backlog reads the state
it could potentially run interleaved with the tear down and we might end up
free'ing the state->skb from tear down side but already have the reference
from backlog side. To resolve such races we wrap accesses in ingress_lock
on both sides serializing tear down and backlog case. In both cases this
only happens after an EAGAIN error case so having an extra lock in place
is likely fine. The normal path will skip the locks.

Note, we check state->skb before grabbing the lock. This works because
we can only enqueue with the mutex we hold already, which avoids a race
on adding state->skb after the check. If the tear down path is running,
that is also fine: if the tear down path then removes state->skb, we
will simply set skb=NULL and the subsequent goto is skipped. This
slight complication avoids locking in the normal case.

With this fix we no longer see this warning splat from tcp side on
socket close when we hit the above case with redirect to ingress self.

[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
[224913.936033]  __tcp_close+0x620/0x790
[224913.936047]  tcp_close+0x20/0x80
[224913.936056]  inet_release+0x8f/0xf0
[224913.936070]  __sock_release+0x72/0x120
[224913.936083]  sock_close+0x14/0x20

Fixes: a136678c0b ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
2021-07-27 14:55:21 -07:00
John Fastabend 343597d558 bpf, sockmap: Zap ingress queues after stopping strparser
We don't want strparser to run and pass skbs into skmsg handlers when
the psock is null. We just sk_drop them in this case. When removing
a live socket from map it means extra drops that we do not need to
incur. Move the zap below strparser close to avoid this condition.

This way we stop the stream parser first stopping it from processing
packets and then delete the psock.

Fixes: a136678c0b ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
2021-07-27 14:55:10 -07:00
John Fastabend 7e6b27a691 bpf, sockmap: Fix potential memory leak on unlikely error case
If skb_linearize is needed and fails we could leak a msg on the error
handling. To fix ensure we kfree the msg block before returning error.
Found during code review.

Fixes: 4363023d26 ("bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20210712195546.423990-2-john.fastabend@gmail.com
2021-07-15 19:49:12 +02:00
Cong Wang 781dd0431e skmsg: Increase sk->sk_drops when dropping packets
It is hard to observe packet drops without increasing relevant
drop counters, here we should increase sk->sk_drops which is
a protocol-independent counter. Fortunately psock is always
associated with a struct sock, we can just use psock->sk.

Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-9-xiyou.wangcong@gmail.com
2021-06-21 16:48:44 +02:00
Cong Wang 42830571f1 skmsg: Pass source psock to sk_psock_skb_redirect()
sk_psock_skb_redirect() only takes skb as a parameter, we
will need to know where this skb is from, so just pass
the source psock to this function as a new parameter.
This patch prepares for the next one.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-8-xiyou.wangcong@gmail.com
2021-06-21 16:48:41 +02:00
Cong Wang 1581a6c1c3 skmsg: Teach sk_psock_verdict_apply() to return errors
Currently sk_psock_verdict_apply() is void, but it handles some
error conditions too. Its caller cannot learn whether it succeeds
or fails, especially sk_psock_verdict_recv().

Make it return int to indicate error cases and propagate errors
to callers properly.

Fixes: ef5659280e ("bpf, sockmap: Allow skipping sk_skb parser program")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-7-xiyou.wangcong@gmail.com
2021-06-21 16:48:37 +02:00
Cong Wang 0cf6672b23 skmsg: Fix a memory leak in sk_psock_verdict_apply()
If the dest psock does not set SK_PSOCK_TX_ENABLED,
the skb can't be queued anywhere so must be dropped.

This one is found during code review.

Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-6-xiyou.wangcong@gmail.com
2021-06-21 16:48:33 +02:00
Cong Wang 30b9c54a70 skmsg: Clear skb redirect pointer before dropping it
When we drop skb inside sk_psock_skb_redirect(), we have to clear
its skb->_sk_redir pointer too, otherwise kfree_skb() would
misinterpret it as a valid skb->_skb_refdst and dst_release()
would eventually complain.

Fixes: e3526bb92a ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
Reported-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-5-xiyou.wangcong@gmail.com
2021-06-21 16:48:29 +02:00
Cong Wang 9f2470fbc4 skmsg: Improve udp_bpf_recvmsg() accuracy
I tried to reuse sk_msg_wait_data() for different protocols,
but it turns out it can not be simply reused. For example,
UDP actually uses two queues to receive skb:
udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have
to check both of them to know whether we have received any
packet.

Also, UDP does not lock the sock during BH Rx path, it makes
no sense for its ->recvmsg() to lock the sock. It is always
possible for ->recvmsg() to be called before packets actually
arrive in the receive queue, we just use best effort to make
it accurate here.

Fixes: 1f5be6b3b0 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com
2021-06-21 16:48:11 +02:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
John Fastabend 144748eb0c bpf, sockmap: Fix incorrect fwd_alloc accounting
Incorrect fwd_alloc accounting can result in a warning when the socket
is torn down:

 [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
 [...]
 [18455.319543] Call Trace:
 [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
 [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
 [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
 [18455.319631]  ? sk_reset_timer+0x15/0x70
 [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
 [18455.319663]  ? lock_release+0xb2/0x3f0
 [18455.319676]  ? __release_sock+0x8a/0x1b0
 [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319704]  ? lock_release+0x3f0/0x3f0
 [18455.319717]  ? __tcp_close+0x2c6/0x790
 [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
 [18455.319750]  tcp_v4_do_rcv+0x168/0x370
 [18455.319767]  __release_sock+0xbc/0x1b0
 [18455.319785]  __tcp_close+0x2ee/0x790
 [18455.319805]  tcp_close+0x20/0x80

This currently happens because on redirect case we do skb_set_owner_r()
with the original sock. This increments the fwd_alloc memory accounting
on the original sock. Then on redirect we may push this into the queue
of the psock we are redirecting to. When the skb is flushed from the
queue we give the memory back to the original sock. The problem is if
the original sock is destroyed/closed with skbs on another psock's queue
then the original sock will not have a way to reclaim the memory before
being destroyed. Then the above warning will be thrown:

  sockA                          sockB

  sk_psock_strp_read()
   sk_psock_verdict_apply()
     -- SK_REDIRECT --
     sk_psock_skb_redirect()
                                skb_queue_tail(psock_other->ingress_skb..)

  sk_close()
   sock_map_unref()
     sk_psock_put()
       sk_psock_drop()
         sk_psock_zap_ingress()

At this point we have torn down our own psock, but have the outstanding
skb in psock_other. Note that SK_PASS doesn't have this problem because
the sk_psock_drop() logic releases the skb, which is still associated with
our psock.

To resolve this, let's only account for sockets on the ingress queue that are
still associated with the current socket. On the redirect case we will
check memory limits per 6fa9201a89, but will omit fwd_alloc accounting
until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
or received with sk_psock_skb_ingress memory will be claimed on psock_other.

Fixes: 6fa9201a89 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370
2021-04-07 01:29:06 +02:00
Cong Wang 2bc793e327 skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
Although these two functions are only used by TCP, they are not
specific to TCP at all, both operate on skmsg and ingress_msg,
so fit in net/core/skmsg.c very well.

And we will need them for non-TCP, so rename and move them to
skmsg.c and export them to modules.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang 8a59f9d1e3 sock: Introduce sk->sk_prot->psock_update_sk_prot()
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang a7ba4558e6 sock_map: Introduce BPF_SK_SKB_VERDICT
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is
confusing and more importantly we still want to distinguish them
from user-space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to attach stream_verdict and skb_verdict programs to the
same map.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Cong Wang 190179f65b skmsg: Use GFP_KERNEL in sk_psock_create_ingress_msg()
This function is only called in process context.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-7-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang 7786dfc41a skmsg: Use rcu work for destroying psock
The RCU callback sk_psock_destroy() only queues work psock->gc,
so we can just switch to rcu work to simplify the code.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang 799aa7f98d skmsg: Avoid lock_sock() in sk_psock_backlog()
We do not have to lock the sock to avoid losing sk_socket,
instead we can purge all the ingress queues when we close
the socket. Sending or receiving packets after orphaning
socket makes no sense.

We do purge these queues when psock refcnt reaches zero but
here we want to purge them explicitly in sock_map_close().
There are also some nasty race conditions on testing bit
SK_PSOCK_TX_ENABLED and queuing/canceling the psock work,
we can expand psock->ingress_lock a bit to protect them too.

As noticed by John, we still have to lock the psock->work,
because the same work item could be running concurrently on
different CPUs.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang b01fd6e802 skmsg: Introduce a spinlock to protect ingress_msg
Currently we rely on lock_sock to protect ingress_msg,
it is too big for this, we can actually just use a spinlock
to protect this list like protecting other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking,
it still has to use lock_sock.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang 37f0e514db skmsg: Lock ingress_skb when purging
Currently we purge the ingress_skb queue only when psock
refcnt goes down to 0, so locking the queue is not necessary,
but in order to be called during ->close, we have to lock it
here.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-2-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang 5333423222 skmsg: Get rid of sk_psock_bpf_run()
It is now nearly identical to bpf_prog_run_pin_on_cpu() and
it has an unused parameter 'psock', so we can just get rid
of it and call bpf_prog_run_pin_on_cpu() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-9-xiyou.wangcong@gmail.com
2021-02-26 12:28:04 -08:00
Cong Wang cd81cefb1a skmsg: Make __sk_psock_purge_ingress_msg() static
It is only used within skmsg.c so can become static.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-8-xiyou.wangcong@gmail.com
2021-02-26 12:28:04 -08:00
Cong Wang ae8b8332fb sock_map: Rename skb_parser and skb_verdict
These two eBPF programs are tied to BPF_SK_SKB_STREAM_PARSER
and BPF_SK_SKB_STREAM_VERDICT, rename them to reflect the fact
they are only used for TCP. And save the name 'skb_verdict' for
general use later.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-6-xiyou.wangcong@gmail.com
2021-02-26 12:28:04 -08:00
Cong Wang e3526bb92a skmsg: Move sk_redir from TCP_SKB_CB to skb
Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly
does not work for any other non-TCP protocols. We can move them to
skb ext, but it introduces a memory allocation on fast path.

Fortunately, we only need a word-size field to store all the information,
because the flags actually only contain 1 bit so can be just packed
into the lowest bit of the "pointer", which is stored as unsigned
long.

Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is
no longer needed after ->sk_data_ready() so we can just drop it.
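
The packing described above can be sketched in user-space C. This is an illustration only, not the kernel helpers: `struct fake_sock`, `REDIR_INGRESS`, `redir_pack()`, `redir_sock()` and `redir_is_ingress()` are hypothetical names, `uintptr_t` stands in for the kernel's unsigned long, and the sketch assumes the pointed-to object is at least 2-byte aligned so that bit 0 of its address is always zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sock; its int member gives it an
 * alignment of at least 2, which keeps bit 0 of its address clear. */
struct fake_sock {
    int id;
};

#define REDIR_INGRESS ((uintptr_t)1) /* one-bit flag stored in bit 0 */

/* Pack a pointer and a one-bit flag into a single word. */
static uintptr_t redir_pack(struct fake_sock *sk, bool ingress)
{
    uintptr_t word = (uintptr_t)sk;

    assert((word & REDIR_INGRESS) == 0); /* relies on the alignment assumption */
    return ingress ? (word | REDIR_INGRESS) : word;
}

/* Recover the pointer by masking the flag bit off. */
static struct fake_sock *redir_sock(uintptr_t word)
{
    return (struct fake_sock *)(word & ~REDIR_INGRESS);
}

/* Recover the flag from bit 0. */
static bool redir_is_ingress(uintptr_t word)
{
    return (word & REDIR_INGRESS) != 0;
}
```

Packing `&sk` with the flag set and then calling `redir_sock()` returns the original pointer, so a single word carries both pieces of information without an extra allocation.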

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang 16137b09a6 bpf: Compute data_end dynamically with JIT code
Currently, we compute ->data_end with a compile-time constant
offset of skb. But as Jakub pointed out, we can actually compute
it in eBPF JIT code at run-time, so that we can completely get
rid of ->data_end. This is similar to skb_shinfo(skb) computation
in bpf_convert_shinfo_access().

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang 5a685cd94b skmsg: Get rid of struct sk_psock_parser
struct sk_psock_parser is embedded in sk_psock, it is
unnecessary as skb verdict also uses ->saved_data_ready.
We can simply fold these fields into sk_psock, and get rid
of ->enabled.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-3-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang 887596095e bpf: Clean up sockmap related Kconfigs
As suggested by John, clean up sockmap related Kconfigs:

Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
parser, to reflect its name.

Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
and CONFIG_INET, the latter is still needed at this point because
of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
as it is used by non-sockmap cases.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
2021-02-26 12:28:03 -08:00
Cong Wang 8063e184e4 skmsg: Make sk_psock_destroy() static
sk_psock_destroy() is a RCU callback, I can't see any reason why
it could be used outside.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210127221501.46866-1-xiyou.wangcong@gmail.com
2021-01-28 00:35:03 +01:00