Commit Graph

390 Commits

Author SHA1 Message Date
Jeff Moyer c46aaba751 net: change proto and proto_ops accept type
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: RHEL is missing commit 1ded5e5a5931 ("net: annotate
data-races around sock->ops"), which accounts for the differences in
ops structure dereferencing.

commit 92ef0fd55ac80dfc2e4654edfe5d1ddfa6e070fe
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu May 9 09:20:08 2024 -0600

    net: change proto and proto_ops accept type
    
    Rather than pass in flags, error pointer, and whether this is a kernel
    invocation or not, add a struct proto_accept_arg struct as the argument.
    This then holds all of these arguments, and prepares accept for being
    able to pass back more information.
    
    No functional changes in this patch.
    
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:12:33 -05:00
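
A minimal userspace sketch of the refactoring described in c46aaba751: the flags, the error pointer and the kernel-invocation marker are bundled into one argument struct. The names `demo_accept_arg`, `demo_sock` and `demo_accept` are hypothetical, and the field layout is only inferred from the commit text; it is not the kernel's definition.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the argument struct; fields follow the commit text. */
struct demo_accept_arg {
	int flags;	/* flags supplied by the caller */
	int err;	/* error code passed back to the caller */
	bool kern;	/* true when invoked from inside the kernel */
};

/* Hypothetical listening socket with a count of pending connections. */
struct demo_sock {
	int pending;
};

/*
 * Old shape: demo_accept(sk, flags, &err, kern).
 * New shape: every input and output travels in one struct, so a field can
 * be added later without changing the signature of each implementation.
 */
static struct demo_sock *demo_accept(struct demo_sock *sk,
				     struct demo_accept_arg *arg)
{
	if (sk->pending == 0) {
		arg->err = -EAGAIN;	/* nothing queued */
		return NULL;
	}
	sk->pending--;
	arg->err = 0;
	return sk;	/* the sketch hands back the same object for brevity */
}
```

A caller fills in `flags` and `kern`, calls `demo_accept(&sk, &arg)`, and reads `arg.err` when the call returns NULL.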
Ian Kent f7a70a9fc1 fs: port vfs_*() helpers to struct mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: There was a whitespace difference, possibly due to CentOS
	Stream commit c912400e45 ("fs: Fix description of
	vfs_tmpfile()").
	CentOS Stream commit c4f3dd0731 ("nfsd: handle failure to
	collect pre/post-op attrs more sanely") is present, which caused
	one hunk to be rejected in fs/nfsd/nfs3proc.c and two hunks to
	be rejected in fs/nfsd/vfs.c; the hunks were applied manually.
	Upstream commit 79b05beaa5c34 ("af_unix: Acquire/Release
	per-netns hash table's locks.") is not present in CentOS Stream;
	the conflict was fixed manually.
	Dropped ksmbd hunks, ksmbd source is not present.
	Upstream commit 3350607dc5637 ("security: Create file_truncate
	hook from path_truncate hook") is not present in CentOS Stream.

commit abf08576afe31506b812c8c1be9714f78613f300
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:10 2023 +0100

    fs: port vfs_*() helpers to struct mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two,
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:51 +08:00
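
A toy sketch of why a dedicated type helps, as the message of f7a70a9fc1 argues: once helpers accept only the mount-level type, a filesystem-level namespace can no longer be passed by accident without the compiler complaining. All names here (`demo_user_ns`, `demo_mnt_idmap`, `demo_map_id`) and the additive mapping are assumptions made for illustration, not the kernel's data structures.

```c
#include <stdint.h>

/* Hypothetical namespace: ids are shifted by a fixed base. */
struct demo_user_ns {
	uint32_t base;
};

/* Hypothetical dedicated wrapper type for the mount-level mapping. */
struct demo_mnt_idmap {
	struct demo_user_ns ns;
};

/*
 * The helper takes the wrapper type only. Passing a plain
 * struct demo_user_ns * here is an incompatible-pointer diagnostic,
 * so the two kinds of namespace cannot be mixed up silently.
 */
static uint32_t demo_map_id(const struct demo_mnt_idmap *idmap, uint32_t id)
{
	return id + idmap->ns.base;
}
```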
Lucas Zampieri 7d84201666 Merge: af_unix: Fix data races around sk->sk_shutdown.
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4606

JIRA: https://issues.redhat.com/browse/RHEL-43969
Upstream Status: linux.git
CVE: CVE-2024-38596

Signed-off-by: Guillaume Nault <gnault@redhat.com>

Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-07-08 13:00:12 +00:00
Guillaume Nault bb6c36110b af_unix: Fix data races in unix_release_sock/unix_stream_sendmsg
JIRA: https://issues.redhat.com/browse/RHEL-43969
Upstream Status: linux.git
CVE: CVE-2024-38596

commit 540bf24fba16b88c1b3b9353927204b4f1074e25
Author: Breno Leitao <leitao@debian.org>
Date:   Thu May 9 01:14:46 2024 -0700

    af_unix: Fix data races in unix_release_sock/unix_stream_sendmsg

    A data-race condition has been identified in af_unix. In one data path,
    the write function unix_release_sock() atomically writes to
    sk->sk_shutdown using WRITE_ONCE. However, on the reader side,
    unix_stream_sendmsg() does not read it atomically. Consequently, this
    issue is causing the following KCSAN splat to occur:

    	BUG: KCSAN: data-race in unix_release_sock / unix_stream_sendmsg

    	write (marked) to 0xffff88867256ddbb of 1 bytes by task 7270 on cpu 28:
    	unix_release_sock (net/unix/af_unix.c:640)
    	unix_release (net/unix/af_unix.c:1050)
    	sock_close (net/socket.c:659 net/socket.c:1421)
    	__fput (fs/file_table.c:422)
    	__fput_sync (fs/file_table.c:508)
    	__se_sys_close (fs/open.c:1559 fs/open.c:1541)
    	__x64_sys_close (fs/open.c:1541)
    	x64_sys_call (arch/x86/entry/syscall_64.c:33)
    	do_syscall_64 (arch/x86/entry/common.c:?)
    	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    	read to 0xffff88867256ddbb of 1 bytes by task 989 on cpu 14:
    	unix_stream_sendmsg (net/unix/af_unix.c:2273)
    	__sock_sendmsg (net/socket.c:730 net/socket.c:745)
    	____sys_sendmsg (net/socket.c:2584)
    	__sys_sendmmsg (net/socket.c:2638 net/socket.c:2724)
    	__x64_sys_sendmmsg (net/socket.c:2753 net/socket.c:2750 net/socket.c:2750)
    	x64_sys_call (arch/x86/entry/syscall_64.c:33)
    	do_syscall_64 (arch/x86/entry/common.c:?)
    	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

    	value changed: 0x01 -> 0x03

    The line numbers are related to commit dd5a440a31fa ("Linux 6.9-rc7").

    Commit e1d09c2c2f57 ("af_unix: Fix data races around sk->sk_shutdown.")
    addressed a comparable issue in the past regarding sk->sk_shutdown.
    However, it overlooked resolving this particular data path.
    This patch only fixes the offending unix_stream_sendmsg() function, since the
    other reads seem to be protected by unix_state_lock() as discussed in
    Link: https://lore.kernel.org/all/20240508173324.53565-1-kuniyu@amazon.com/

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20240509081459.2807828-1-leitao@debian.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-06-26 13:48:52 +02:00
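
A simplified userspace sketch of the annotation pattern that bb6c36110b applies: the writer stores through a volatile access and the lockless reader loads through one, so the compiler cannot tear, fuse or repeat the access. The `DEMO_*` macros are reduced stand-ins written for this example (scalar types only, relying on the GNU C `__typeof__` extension), not the kernel's READ_ONCE()/WRITE_ONCE() definitions, and `demo_sock` is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel macros; scalar types only. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

#define DEMO_RCV_SHUTDOWN	1
#define DEMO_SEND_SHUTDOWN	2
#define DEMO_SHUTDOWN_MASK	3

/* Hypothetical socket carrying only the field of interest. */
struct demo_sock {
	uint8_t sk_shutdown;
};

/* Writer side: marks both directions as shut down. */
static void demo_release(struct demo_sock *sk)
{
	DEMO_WRITE_ONCE(sk->sk_shutdown, DEMO_SHUTDOWN_MASK);
}

/* Reader side: checks the flag without holding the state lock. */
static int demo_sendmsg(struct demo_sock *sk)
{
	if (DEMO_READ_ONCE(sk->sk_shutdown) & DEMO_SEND_SHUTDOWN)
		return -EPIPE;
	return 0;
}
```

The annotation does not order the two threads; it only makes each single access well defined, which is what the KCSAN report asks for.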
Guillaume Nault 8574f1b610 af_unix: Fix data races around sk->sk_shutdown.
JIRA: https://issues.redhat.com/browse/RHEL-43969
Upstream Status: linux.git
CVE: CVE-2024-38596

commit e1d09c2c2f5793474556b60f83900e088d0d366d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue May 9 17:34:56 2023 -0700

    af_unix: Fix data races around sk->sk_shutdown.

    KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
    and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
    and unix_dgram_poll() read it locklessly.

    We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().

    BUG: KCSAN: data-race in unix_poll / unix_release_sock

    write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
     unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
     unix_release+0x59/0x80 net/unix/af_unix.c:1042
     __sock_release+0x7d/0x170 net/socket.c:653
     sock_close+0x19/0x30 net/socket.c:1397
     __fput+0x179/0x5e0 fs/file_table.c:321
     ____fput+0x15/0x20 fs/file_table.c:349
     task_work_run+0x116/0x1a0 kernel/task_work.c:179
     resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
     __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
     syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
     do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
     unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
     sock_poll+0xcf/0x2b0 net/socket.c:1385
     vfs_poll include/linux/poll.h:88 [inline]
     ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
     ep_send_events fs/eventpoll.c:1694 [inline]
     ep_poll fs/eventpoll.c:1823 [inline]
     do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
     __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
     __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
     __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    value changed: 0x00 -> 0x03

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

    Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2024-06-26 13:46:00 +02:00
Davide Caratti c211ba403d af_unix: fix lockdep positive in sk_diag_dump_icons()
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit 4d322dce82a1d44f8c83f0f54f95dd1b8dcf46c9

commit 4d322dce82a1d44f8c83f0f54f95dd1b8dcf46c9
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jan 30 18:42:35 2024 +0000

    af_unix: fix lockdep positive in sk_diag_dump_icons()

    syzbot reported a lockdep splat [1].

    Blamed commit hinted about the possible lockdep
    violation, and code used unix_state_lock_nested()
    in an attempt to silence lockdep.

    It is not sufficient, because unix_state_lock_nested()
    is already used from unix_state_double_lock().

    We need to use a separate subclass.

    This patch adds a distinct enumeration to make things
    more explicit.

    Also use swap() in unix_state_double_lock() as a clean up.

    v2: add a missing inline keyword to unix_state_lock_nested()

    [1]
    WARNING: possible circular locking dependency detected
    6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0 Not tainted

    syz-executor.1/2542 is trying to acquire lock:
     ffff88808b5df9e8 (rlock-AF_UNIX){+.+.}-{2:2}, at: skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863

    but task is already holding lock:
     ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&u->lock/1){+.+.}-{2:2}:
            lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
            _raw_spin_lock_nested+0x31/0x40 kernel/locking/spinlock.c:378
            sk_diag_dump_icons net/unix/diag.c:87 [inline]
            sk_diag_fill+0x6ea/0xfe0 net/unix/diag.c:157
            sk_diag_dump net/unix/diag.c:196 [inline]
            unix_diag_dump+0x3e9/0x630 net/unix/diag.c:220
            netlink_dump+0x5c1/0xcd0 net/netlink/af_netlink.c:2264
            __netlink_dump_start+0x5d7/0x780 net/netlink/af_netlink.c:2370
            netlink_dump_start include/linux/netlink.h:338 [inline]
            unix_diag_handler_dump+0x1c3/0x8f0 net/unix/diag.c:319
           sock_diag_rcv_msg+0xe3/0x400
            netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2543
            sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
            netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
            netlink_unicast+0x7e6/0x980 net/netlink/af_netlink.c:1367
            netlink_sendmsg+0xa37/0xd70 net/netlink/af_netlink.c:1908
            sock_sendmsg_nosec net/socket.c:730 [inline]
            __sock_sendmsg net/socket.c:745 [inline]
            sock_write_iter+0x39a/0x520 net/socket.c:1160
            call_write_iter include/linux/fs.h:2085 [inline]
            new_sync_write fs/read_write.c:497 [inline]
            vfs_write+0xa74/0xca0 fs/read_write.c:590
            ksys_write+0x1a0/0x2c0 fs/read_write.c:643
            do_syscall_x64 arch/x86/entry/common.c:52 [inline]
            do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
           entry_SYSCALL_64_after_hwframe+0x63/0x6b

    -> #0 (rlock-AF_UNIX){+.+.}-{2:2}:
            check_prev_add kernel/locking/lockdep.c:3134 [inline]
            check_prevs_add kernel/locking/lockdep.c:3253 [inline]
            validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
            __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
            lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
            __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
            _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
            skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
            unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
            sock_sendmsg_nosec net/socket.c:730 [inline]
            __sock_sendmsg net/socket.c:745 [inline]
            ____sys_sendmsg+0x592/0x890 net/socket.c:2584
            ___sys_sendmsg net/socket.c:2638 [inline]
            __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
            __do_sys_sendmmsg net/socket.c:2753 [inline]
            __se_sys_sendmmsg net/socket.c:2750 [inline]
            __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
            do_syscall_x64 arch/x86/entry/common.c:52 [inline]
            do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
           entry_SYSCALL_64_after_hwframe+0x63/0x6b

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&u->lock/1);
                                   lock(rlock-AF_UNIX);
                                   lock(&u->lock/1);
      lock(rlock-AF_UNIX);

     *** DEADLOCK ***

    1 lock held by syz-executor.1/2542:
      #0: ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089

    stack backtrace:
    CPU: 1 PID: 2542 Comm: syz-executor.1 Not tainted 6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
    Call Trace:
     <TASK>
      __dump_stack lib/dump_stack.c:88 [inline]
      dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
      check_noncircular+0x366/0x490 kernel/locking/lockdep.c:2187
      check_prev_add kernel/locking/lockdep.c:3134 [inline]
      check_prevs_add kernel/locking/lockdep.c:3253 [inline]
      validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
      __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
      lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
      __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
      _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
      skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
      unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
      sock_sendmsg_nosec net/socket.c:730 [inline]
      __sock_sendmsg net/socket.c:745 [inline]
      ____sys_sendmsg+0x592/0x890 net/socket.c:2584
      ___sys_sendmsg net/socket.c:2638 [inline]
      __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
      __do_sys_sendmmsg net/socket.c:2753 [inline]
      __se_sys_sendmmsg net/socket.c:2750 [inline]
      __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x63/0x6b
    RIP: 0033:0x7f26d887cda9
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f26d95a60c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
    RAX: ffffffffffffffda RBX: 00007f26d89abf80 RCX: 00007f26d887cda9
    RDX: 000000000000003e RSI: 00000000200bd000 RDI: 0000000000000004
    RBP: 00007f26d88c947a R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000008c0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000000000000000b R14: 00007f26d89abf80 R15: 00007ffcfe081a68

    Fixes: 2aac7a2cb0 ("unix_diag: Pending connections IDs NLA")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20240130184235.1620738-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:51:40 +02:00
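
A userspace sketch of the double-lock helper that c211ba403d tidies up with swap(): two locks are always taken in address order, so two threads locking the same pair cannot deadlock against each other. The lockdep subclass part of the fix has no pthread equivalent and is not modelled here; `demo_unix_sock`, `demo_double_lock` and `demo_double_unlock` are hypothetical names.

```c
#include <pthread.h>
#include <stdint.h>

struct demo_unix_sock {
	pthread_mutex_t lock;
};

/*
 * Lock two sockets in address order so that two threads locking the same
 * pair can never each hold one lock while waiting for the other.
 * Returns the socket that was locked first.
 */
static struct demo_unix_sock *demo_double_lock(struct demo_unix_sock *a,
					       struct demo_unix_sock *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
		return a;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct demo_unix_sock *tmp = a;	/* open-coded swap() */

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
	return a;
}

/* Release both locks; the order of unlocking does not matter. */
static void demo_double_unlock(struct demo_unix_sock *a,
			       struct demo_unix_sock *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}
```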
Davide Caratti 9a7e59dd2a af_unix: Fix data races around sk->sk_shutdown.
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit e1d09c2c2f5793474556b60f83900e088d0d366d

commit e1d09c2c2f5793474556b60f83900e088d0d366d
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue May 9 17:34:56 2023 -0700

    af_unix: Fix data races around sk->sk_shutdown.

    KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
    and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
    and unix_dgram_poll() read it locklessly.

    We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().

    BUG: KCSAN: data-race in unix_poll / unix_release_sock

    write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
     unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
     unix_release+0x59/0x80 net/unix/af_unix.c:1042
     __sock_release+0x7d/0x170 net/socket.c:653
     sock_close+0x19/0x30 net/socket.c:1397
     __fput+0x179/0x5e0 fs/file_table.c:321
     ____fput+0x15/0x20 fs/file_table.c:349
     task_work_run+0x116/0x1a0 kernel/task_work.c:179
     resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
     __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
     syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
     do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
     unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
     sock_poll+0xcf/0x2b0 net/socket.c:1385
     vfs_poll include/linux/poll.h:88 [inline]
     ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
     ep_send_events fs/eventpoll.c:1694 [inline]
     ep_poll fs/eventpoll.c:1823 [inline]
     do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
     __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
     __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
     __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    value changed: 0x00 -> 0x03

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

    Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:39:13 +02:00
Davide Caratti 426724b26b af_unix: Fix a data race of sk->sk_receive_queue->qlen.
JIRA: https://issues.redhat.com/browse/RHEL-33410
Upstream Status: net.git commit 679ed006d416ea0cecfe24a99d365d1dea69c683

commit 679ed006d416ea0cecfe24a99d365d1dea69c683
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue May 9 17:34:55 2023 -0700

    af_unix: Fix a data race of sk->sk_receive_queue->qlen.

    KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
    updates qlen under the queue lock and sendmsg() checks qlen under
    unix_state_lock(), not the queue lock, so the reader side needs
    READ_ONCE().

    BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer

    write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
     __skb_unlink include/linux/skbuff.h:2347 [inline]
     __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
     __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
     __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
     unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
     sock_recvmsg_nosec net/socket.c:1019 [inline]
     ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
     ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
     do_recvmmsg+0x182/0x560 net/socket.c:2858
     __sys_recvmmsg net/socket.c:2937 [inline]
     __do_sys_recvmmsg net/socket.c:2960 [inline]
     __se_sys_recvmmsg net/socket.c:2953 [inline]
     __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
     skb_queue_len include/linux/skbuff.h:2127 [inline]
     unix_recvq_full net/unix/af_unix.c:229 [inline]
     unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
     unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg+0x148/0x160 net/socket.c:747
     ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
     ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
     __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
     __do_sys_sendmmsg net/socket.c:2672 [inline]
     __se_sys_sendmmsg net/socket.c:2669 [inline]
     __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    value changed: 0x0000000b -> 0x00000001

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-06-06 11:39:13 +02:00
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Waiman Long 724656e7cf freezer,sched: Rewrite core freezer logic
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
 1) A merge conflict in the kernel/signal.c hunk due to the presence
    of RHEL-only commit 975d318867 ("signal: Don't disable preemption
    in ptrace_stop() on PREEMPT_RT.").
 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
    presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
    hrtimer mode into hrtimer_sleeper").
 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
    to the presence of upstream commit 38c8a9a52082 ("smb: move client
    and server files to common directory fs/smb").
 4) Similarly, the fs/cifs/transport.c hunk was applied to
    fs/smb/client/transport.c manually due to the presence of
    a later upstream commit d527f51331ca ("cifs: Fix UAF in
    cifs_demultiplex_thread()").

Note that all the prerequisite patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.

commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon, 22 Aug 2022 13:18:22 +0200

    freezer,sched: Rewrite core freezer logic

    Rewrite the core freezer to behave better wrt thawing and be simpler
    in general.

    By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
    ensured frozen tasks stay frozen until thawed and don't randomly wake
    up early, as is currently possible.

    As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
    two PF_flags (yay!).

    Specifically; the current scheme works a little like:

            freezer_do_not_count();
            schedule();
            freezer_count();

    And either the task is blocked, or it lands in try_to_freeze()
    through freezer_count(). Now, when it is blocked, the freezer
    considers it frozen and continues.

    However, on thawing, once pm_freezing is cleared, freezer_count()
    stops working, and any random/spurious wakeup will let a task run
    before its time.

    That is, thawing tries to thaw things in explicit order: kernel
    threads and workqueues before bringing SMP back, before userspace,
    etc. However, due to the above-mentioned races it is entirely possible
    for userspace tasks to thaw (by accident) before SMP is back.

    This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
    where the userspace task requires a special CPU to run.

    As said; replace this with a special task state TASK_FROZEN and add
    the following state transitions:

            TASK_FREEZABLE  -> TASK_FROZEN
            __TASK_STOPPED  -> TASK_FROZEN
            __TASK_TRACED   -> TASK_FROZEN

    The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
    (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
    is already required to deal with spurious wakeups and the freezer
    causes one such when thawing the task (since the original state is
    lost).

    The special __TASK_{STOPPED,TRACED} states *can* be restored since
    their canonical state is in ->jobctl.

    With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
    free of undue (early / spurious) wakeups.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:06 -04:00
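
A small sketch of the state transitions listed in the message of 724656e7cf: a task may move to the frozen state only from a freezable sleep or from the stopped/traced states. The bit values and the names `demo_task` and `demo_freeze` are placeholders chosen for this example; the checks a real implementation would need (run-queue status, locking) are left out.

```c
/* Placeholder state bits; the values are not the kernel's. */
enum {
	DEMO_INTERRUPTIBLE	= 0x01,
	DEMO_UNINTERRUPTIBLE	= 0x02,
	DEMO_STOPPED		= 0x04,
	DEMO_TRACED		= 0x08,
	DEMO_FREEZABLE		= 0x10,	/* modifier OR-ed onto a sleep state */
	DEMO_FROZEN		= 0x20,
};

struct demo_task {
	unsigned int state;
};

/*
 * Apply one of the allowed transitions:
 *   FREEZABLE -> FROZEN, STOPPED -> FROZEN, TRACED -> FROZEN.
 * Returns 1 when the task was frozen, 0 when its state does not allow it.
 */
static int demo_freeze(struct demo_task *t)
{
	if (!(t->state & (DEMO_FREEZABLE | DEMO_STOPPED | DEMO_TRACED)))
		return 0;
	t->state = DEMO_FROZEN;	/* the previous sleep state is lost here */
	return 1;
}
```

Because the sleep state is overwritten, a thawed task sees what looks like a spurious wakeup, which is why the message restricts TASK_FREEZABLE to states that already tolerate one.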
Guillaume Nault 4e2b5d2e07 af_unix: Fix null-ptr-deref in unix_stream_sendpage().
JIRA: https://issues.redhat.com/browse/RHEL-17264
Upstream Status: git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
CVE: CVE-2023-4622

commit 790c2f9d15b594350ae9bca7b236f2b1859de02c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Aug 21 10:55:05 2023 -0700

    af_unix: Fix null-ptr-deref in unix_stream_sendpage().

    Bing-Jhong Billy Jheng reported null-ptr-deref in unix_stream_sendpage()
    with detailed analysis and a nice repro.

    unix_stream_sendpage() tries to add data to the last skb in the peer's
    recv queue without locking the queue.

    If the peer's FD is passed to another socket and the socket's FD is
    passed to the peer, there is a loop between them.  If we close both
    sockets without receiving FD, the sockets will be cleaned up by garbage
    collection.

    The garbage collection iterates such sockets and unlinks skb with
    FD from the socket's receive queue under the queue's lock.

    So, there is a race where unix_stream_sendpage() could access an skb
    locklessly that is being released by garbage collection, resulting in
    use-after-free.

    To avoid the issue, unix_stream_sendpage() must lock the peer's recv
    queue.

    Note the issue does not exist in 6.5+ thanks to the recent sendpage()
    refactoring.

    This patch was originally written by Linus Torvalds.

    BUG: unable to handle page fault for address: ffff988004dd6870
    PF: supervisor read access in kernel mode
    PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    PREEMPT SMP PTI
    CPU: 4 PID: 297 Comm: garbage_uaf Not tainted 6.1.46 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmem_cache_alloc_node+0xa2/0x1e0
    Code: c0 0f 84 32 01 00 00 41 83 fd ff 74 10 48 8b 00 48 c1 e8 3a 41 39 c5 0f 85 1c 01 00 00 41 8b 44 24 28 49 8b 3c 24 48 8d 4a 40 <49> 8b 1c 06 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 74 a1 41 8b 44
    RSP: 0018:ffffc9000079fac0 EFLAGS: 00000246
    RAX: 0000000000000070 RBX: 0000000000000005 RCX: 000000000001a284
    RDX: 000000000001a244 RSI: 0000000000400cc0 RDI: 000000000002eee0
    RBP: 0000000000400cc0 R08: 0000000000400cc0 R09: 0000000000000003
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff888003970f00
    R13: 00000000ffffffff R14: ffff988004dd6800 R15: 00000000000000e8
    FS:  00007f174d6f3600(0000) GS:ffff88807db00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff988004dd6870 CR3: 00000000092be000 CR4: 00000000007506e0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? __die_body.cold+0x1a/0x1f
     ? page_fault_oops+0xa9/0x1e0
     ? fixup_exception+0x1d/0x310
     ? exc_page_fault+0xa8/0x150
     ? asm_exc_page_fault+0x22/0x30
     ? kmem_cache_alloc_node+0xa2/0x1e0
     ? __alloc_skb+0x16c/0x1e0
     __alloc_skb+0x16c/0x1e0
     alloc_skb_with_frags+0x48/0x1e0
     sock_alloc_send_pskb+0x234/0x270
     unix_stream_sendmsg+0x1f5/0x690
     sock_sendmsg+0x5d/0x60
     ____sys_sendmsg+0x210/0x260
     ___sys_sendmsg+0x83/0xd0
     ? kmem_cache_alloc+0xc6/0x1c0
     ? avc_disable+0x20/0x20
     ? percpu_counter_add_batch+0x53/0xc0
     ? alloc_empty_file+0x5d/0xb0
     ? alloc_file+0x91/0x170
     ? alloc_file_pseudo+0x94/0x100
     ? __fget_light+0x9f/0x120
     __sys_sendmsg+0x54/0xa0
     do_syscall_64+0x3b/0x90
     entry_SYSCALL_64_after_hwframe+0x69/0xd3
    RIP: 0033:0x7f174d639a7d
    Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 8a c1 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 de c1 f4 ff 48
    RSP: 002b:00007ffcb563ea50 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f174d639a7d
    RDX: 0000000000000000 RSI: 00007ffcb563eab0 RDI: 0000000000000007
    RBP: 00007ffcb563eb10 R08: 0000000000000000 R09: 00000000ffffffff
    R10: 00000000004040a0 R11: 0000000000000293 R12: 00007ffcb563ec28
    R13: 0000000000401398 R14: 0000000000403e00 R15: 00007f174d72c000
     </TASK>

    Fixes: 869e7c6248 ("net: af_unix: implement stream sendpage support")
    Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
    Reviewed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
    Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Guillaume Nault <gnault@redhat.com>
2023-12-01 17:16:48 +01:00
Jan Stancek 96911a0b20 Merge: net/other: phase-1 backports for RHEL-9.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3264

JIRA: https://issues.redhat.com/browse/RHEL-14526
Upstream Status: all mainline in net.git
Tested: boot-tested only
Conflicts: None

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-20 21:49:15 +01:00
Davide Caratti d9798a3459 af_unix: Fix data-race around unix_tot_inflight.
JIRA: https://issues.redhat.com/browse/RHEL-14526
Upstream Status: net.git commit ade32bd8a738d7497ffe9743c46728db26740f78

commit ade32bd8a738d7497ffe9743c46728db26740f78
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Sep 1 17:27:06 2023 -0700

    af_unix: Fix data-race around unix_tot_inflight.

    unix_tot_inflight is changed under spin_lock(unix_gc_lock), but
    unix_release_sock() reads it locklessly.

    Let's use READ_ONCE() for unix_tot_inflight.

    Note that the writer side was marked by commit 9d6d7f1cb67c ("af_unix:
    annote lockless accesses to unix_tot_inflight & gc_in_progress")
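
A minimal userspace sketch of the annotation pattern described above, not the kernel code itself. The `READ_ONCE()`/`WRITE_ONCE()` macros below are simplified stand-ins (an assumption; the kernel's definitions are more elaborate and this version relies on the GCC/Clang `__typeof__` extension), and every `demo_*` name is hypothetical.

```c
/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical counter playing the role of unix_tot_inflight. */
static unsigned int demo_tot_inflight;

/* Writer side: in the kernel this update happens under unix_gc_lock. */
void demo_inflight_inc(void)
{
	WRITE_ONCE(demo_tot_inflight, demo_tot_inflight + 1);
}

/* Reader side: no lock is held, so the load is marked with READ_ONCE(). */
int demo_gc_needed(void)
{
	return READ_ONCE(demo_tot_inflight) != 0;
}
```

The marked accesses do not add locking; they only tell the compiler (and KCSAN) that the concurrent access is intentional and must be performed as a single load or store.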

    BUG: KCSAN: data-race in unix_inflight / unix_release_sock

    write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1:
     unix_inflight+0x130/0x180 net/unix/scm.c:64
     unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123
     unix_scm_to_skb net/unix/af_unix.c:1832 [inline]
     unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg+0x148/0x160 net/socket.c:747
     ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493
     ___sys_sendmsg+0xc6/0x140 net/socket.c:2547
     __sys_sendmsg+0x94/0x140 net/socket.c:2576
     __do_sys_sendmsg net/socket.c:2585 [inline]
     __se_sys_sendmsg net/socket.c:2583 [inline]
     __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0:
     unix_release_sock+0x608/0x910 net/unix/af_unix.c:671
     unix_release+0x59/0x80 net/unix/af_unix.c:1058
     __sock_release+0x7d/0x170 net/socket.c:653
     sock_close+0x19/0x30 net/socket.c:1385
     __fput+0x179/0x5e0 fs/file_table.c:321
     ____fput+0x15/0x20 fs/file_table.c:349
     task_work_run+0x116/0x1a0 kernel/task_work.c:179
     resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
     __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
     syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
     do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x72/0xdc

    value changed: 0x00000000 -> 0x00000001

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

    Fixes: 9305cfa444 ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic")
    Reported-by: syzkaller <syzkaller@googlegroups.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-10-24 15:12:12 +02:00
Chris von Recklinghausen 1f619343f6 treewide: use get_random_u32() when possible
Conflicts:
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c - We already have
		ce28ab1380e8 ("drm/tests: Add back seed value information")
		so keep calls to kunit_info.
	drop changes to drivers/misc/habanalabs/gaudi2/gaudi2.c
		fs/ntfs3/fslog.c - files not in CS9
	net/sunrpc/auth_gss/gss_krb5_wrap.c - We already have
		7f675ca7757b ("SUNRPC: Improve Kerberos confounder generation")
		so code to change is gone.
	drivers/gpu/drm/i915/i915_gem_gtt.c
	drivers/gpu/drm/i915/selftests/i915_selftest.c
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c
		change added under
		4cb818386e ("Merge DRM changes from upstream v6.0.8..v6.1")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 17:43:22 2022 +0200

    treewide: use get_random_u32() when possible

    The prandom_u32() function has been a deprecated inline wrapper around
    get_random_u32() for several releases now, and compiles down to the
    exact same code. Replace the deprecated wrapper with a direct call to
    the real function. The same also applies to get_random_int(), which is
    just a wrapper around get_random_u32(). This was done as a basic find
    and replace.

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
    Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
    Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Acked-by: Helge Deller <deller@gmx.de> # for parisc
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:03 -04:00
Felix Maurer 2d92cf1f17 bpf, sockmap: Pass skb ownership through read_skb
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
Conflicts:
- net/ipv4/udp.c: Context difference due to missing ec095263a965 ("net:
  remove noblock parameter from recvmsg() entities") and db39dfdc1c3b
  ("udp: Use WARN_ON_ONCE() in udp_read_skb()"); 31f1fbcb346c ("udp:
  Refactor udp_read_skb()") was adapted to reflect this
- net/vmw_vsock/virtio_transport_common.c: Skipped, because the relevant
  code is not there, missing 634f1a7110b4 ("vsock: support sockmap")

commit 78fa0d61d97a728d306b0c23d353c0e340756437
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Mon May 22 19:56:05 2023 -0700

    bpf, sockmap: Pass skb ownership through read_skb

    The read_skb hook calls consume_skb() now, but this means that if the
    recv_actor program wants to use the skb it needs to inc the ref cnt
    so that the consume_skb() doesn't kfree the sk_buff.

    This is problematic because in some error cases under memory pressure
    we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
    Then we get this,

     skb_linearize()
       __pskb_pull_tail()
         pskb_expand_head()
           BUG_ON(skb_shared(skb))

    Because we incremented the users refcnt from sk_psock_verdict_recv() we
    hit the BUG_ON() with refcnt > 1.

    To fix, let's simply pass ownership of the sk_buff through the read_skb
    call. Then we can drop the consume from read_skb handlers and assume
    the verdict recv does any required kfree.
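
The refcount problem above can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct demo_skb` and all `demo_*` functions are hypothetical, and a return value of -1 stands in for the place where the kernel would hit `BUG_ON(skb_shared(skb))`.

```c
/* Toy buffer with a reference count, standing in for struct sk_buff. */
struct demo_skb {
	int users;
};

/* Mirrors the idea of skb_shared(): more than one holder of the buffer. */
static int demo_skb_shared(const struct demo_skb *skb)
{
	return skb->users != 1;
}

/* Returns 0 on success, -1 where the kernel would trip the BUG_ON(). */
static int demo_linearize(struct demo_skb *skb)
{
	return demo_skb_shared(skb) ? -1 : 0;
}

/* Old scheme: the actor takes an extra reference because the read_skb
 * handler will consume one afterwards; the buffer is now shared. */
int demo_recv_old(struct demo_skb *skb)
{
	skb->users++;
	return demo_linearize(skb);
}

/* New scheme: ownership is passed to the actor, so no extra reference
 * is needed and the buffer stays unshared. */
int demo_recv_new(struct demo_skb *skb)
{
	return demo_linearize(skb);
}
```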

    Bug found while testing in our CI which runs in VMs that hit memory
    constraints rather regularly. William tested TCP read_skb handlers.

    [  106.536188] ------------[ cut here ]------------
    [  106.536197] kernel BUG at net/core/skbuff.c:1693!
    [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
    [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
    [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
    [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
    [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
    [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
    [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
    [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
    [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
    [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
    [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  106.542255] Call Trace:
    [  106.542383]  <IRQ>
    [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
    [  106.542681]  skb_ensure_writable+0x85/0xa0
    [  106.542882]  sk_skb_pull_data+0x18/0x20
    [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
    [  106.543536]  ? migrate_disable+0x66/0x80
    [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
    [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
    [  106.544561]  tcp_read_skb+0x7b/0x120
    [  106.544740]  tcp_data_queue+0x904/0xee0
    [  106.544931]  tcp_rcv_established+0x212/0x7c0
    [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
    [  106.545326]  tcp_v4_rcv+0xe70/0xf60
    [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
    [  106.545744]  ip_local_deliver_finish+0xa7/0x150

    Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
    Reported-by: William Findlay <will@isovalent.com>
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: William Findlay <will@isovalent.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
Felix Maurer 8058591656 af_unix: Refactor unix_read_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2218483
commit d6e3b27cbd2df555ff0736796ad2f9a17e74be8b
Author: Peilin Ye <peilin.ye@bytedance.com>
Date:   Thu Sep 22 21:59:26 2022 -0700

    af_unix: Refactor unix_read_skb()
    
    Similar to udp_read_skb(), delete the unnecessary while loop in
    unix_read_skb() for readability.  Since recv_actor() cannot return a
    value greater than skb->len (see sk_psock_verdict_recv()), remove the
    redundant check.
    
    Suggested-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
    Link: https://lore.kernel.org/r/7009141683ad6cd3785daced3e4a80ba0eb773b5.1663909008.git.peilin.ye@bytedance.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-06-29 15:45:40 +02:00
Davide Caratti a94bc10366 af_unix: Fix a data-race in unix_dgram_peer_wake_me().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190429
Upstream Status: net.git commit 662a80946ce1

commit 662a80946ce13633ae90a55379f1346c10f0c432
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Sun Jun 5 16:23:25 2022 -0700

    af_unix: Fix a data-race in unix_dgram_peer_wake_me().

    unix_dgram_poll() calls unix_dgram_peer_wake_me() without `other`'s
    lock held and check if its receive queue is full.  Here we need to
    use unix_recvq_full_lockless() instead of unix_recvq_full(), otherwise
    KCSAN will report a data-race.

    Fixes: 7d267278a9 ("unix: avoid use-after-free in ep_remove_wait_queue")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20220605232325.11804-1-kuniyu@amazon.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-04-28 14:11:35 +02:00
Davide Caratti 081d5c4598 unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2164865
Upstream Status: net.git commit 3ff8bff704f4

commit 3ff8bff704f4de125dca2262e5b5b963a3da1d87
Author: Kirill Tkhai <tkhai@ya.ru>
Date:   Tue Dec 13 00:05:53 2022 +0300

    unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

    There is a race in which a live SOCK_SEQPACKET socket
    may change its state from TCP_ESTABLISHED to TCP_CLOSE:

    unix_release_sock(peer)                  unix_dgram_sendmsg(sk)
      sock_orphan(peer)
        sock_set_flag(peer, SOCK_DEAD)
                                               sock_alloc_send_pskb()
                                                 if !(sk->sk_shutdown & SEND_SHUTDOWN)
                                                   OK
                                               if sock_flag(peer, SOCK_DEAD)
                                                 sk->sk_state = TCP_CLOSE
      sk->sk_shutdown = SHUTDOWN_MASK

    After that, socket sk remains almost normal: it is able to connect, listen, accept
    and recvmsg, but it can't sendmsg.

    Since this is the only way for a live SOCK_SEQPACKET socket to change
    its state like this, we had better fix this strange and potentially
    dangerous corner case.

    Note that we will return EPIPE here, as is normally done in sock_alloc_send_pskb().
    The originally used ECONNREFUSED looks strange, since it's odd to return
    a specific retval depending on a race in the kernel that the user can't affect.

    Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock
    to fix race with unix_dgram_connect():

    unix_dgram_connect(other)            unix_dgram_sendmsg(sk)
                                           unix_peer(sk) = NULL
                                           unix_state_unlock(sk)
      unix_state_double_lock(sk, other)
      sk->sk_state  = TCP_ESTABLISHED
      unix_peer(sk) = other
      unix_state_double_unlock(sk, other)
                                           sk->sk_state  = TCP_CLOSE

    This patch fixes both of these races.

    Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
    Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
    Link: https://lore.kernel.org/r/135fda25-22d5-837a-782b-ceee50e19844@ya.ru
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-02-02 16:24:03 +01:00
Davide Caratti c2e1968f90 af_unix: call proto_unregister() in the error path in af_unix_init()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2164865
Upstream Status: net.git commit 73e341e0281a

commit 73e341e0281a35274629e9be27eae2f9b1b492bf
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Thu Dec 8 23:01:58 2022 +0800

    af_unix: call proto_unregister() in the error path in af_unix_init()

    If registering unix_stream_proto returns an error, unix_dgram_proto needs to
    be unregistered.
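
The unwinding rule above can be sketched as follows. This is an illustrative model, not the af_unix_init() code: the `demo_*` helpers and the `fail_stream` switch are hypothetical and only simulate a registration that can fail.

```c
/* Registration state for two hypothetical protocols: 0 = dgram, 1 = stream. */
static int demo_registered[2];

static int demo_register(int id, int fail)
{
	if (fail)
		return -1;
	demo_registered[id] = 1;
	return 0;
}

static void demo_unregister(int id)
{
	demo_registered[id] = 0;
}

int demo_is_registered(int id)
{
	return demo_registered[id];
}

/* Register dgram first, then stream; undo dgram if stream fails. */
int demo_init(int fail_stream)
{
	int rc;

	rc = demo_register(0, 0);
	if (rc)
		return rc;

	rc = demo_register(1, fail_stream);
	if (rc) {
		/* The error path must not leave the first registration behind. */
		demo_unregister(0);
		return rc;
	}

	return 0;
}
```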

    Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-02-02 16:24:03 +01:00
Felix Maurer 09faf01cb9 net: Introduce a new proto_ops ->read_skb()
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: Context difference due to not yet applied 314001f0bf927
("af_unix: Add OOB support") and already applied 3f92a64e44e5 ("tcp:
allow tls to decrypt directly from the tcp rcv queue")

commit 965b57b469a589d64d81b1688b38dcb537011bb0
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Wed Jun 15 09:20:12 2022 -0700

    net: Introduce a new proto_ops ->read_skb()

    Currently both splice() and sockmap use ->read_sock() to
    read skb from receive queue, but for sockmap we only read
    one entire skb at a time, so ->read_sock() is too conservative
    to use. Introduce a new proto_ops ->read_skb() which supports
    this semantic, with this we can finally pass the ownership of
    skb to recv actors.

    For non-TCP protocols, all ->read_sock() can be simply
    converted to ->read_skb().

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-01-05 15:46:53 +01:00
Frantisek Hrbata 34b02be423 Merge: CNB: net: remove noblock parameter from skb_recv_datagram()
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1655

Bugzilla: https://bugzilla.redhat.com/2143360
Tested: build, boot

Conflicts:
 - isotp: missing many commits, such as:
   30ffd5332e06 ("can: isotp: return -EADDRNOTAVAIL when reading from unbound socket")
   42bf50a1795a ("can: isotp: support MSG_TRUNC flag when reading from socket")
   e382fea8ae54 ("can: isotp: restore accidentally removed MSG_PEEK feature")
 - removed chunks of non existent net/mctp

```
commit f4b41f062c424209e3939a81e6da022e049a45f2
Author: Oliver Hartkopp <socketcan@hartkopp.net>
Date:   Mon Apr 4 18:30:22 2022 +0200

    net: remove noblock parameter from skb_recv_datagram()

    skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
    merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'

    As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
    into 'flags' and 'noblock' with finally obsolete bit operations like this:

    skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);

    And this is not even done consistently with the 'flags' parameter.

    This patch removes the obsolete and costly splitting into two parameters
    and only performs bit operations when really needed on the caller side.

    One missing conversion thankfully reported by kernel test robot. I missed
    to enable kunit tests to build the mctp code.

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
```

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-30 08:10:47 -05:00
Íñigo Huguet e24462420c net: remove noblock parameter from skb_recv_datagram()
Bugzilla: https://bugzilla.redhat.com/2143360

Conflicts:
 - isotp: missing many commits, such as:
   30ffd5332e06 ("can: isotp: return -EADDRNOTAVAIL when reading from unbound socket")
   42bf50a1795a ("can: isotp: support MSG_TRUNC flag when reading from socket")
   e382fea8ae54 ("can: isotp: restore accidentally removed MSG_PEEK feature")
 - removed chunks of non existent net/mctp

commit f4b41f062c424209e3939a81e6da022e049a45f2
Author: Oliver Hartkopp <socketcan@hartkopp.net>
Date:   Mon Apr 4 18:30:22 2022 +0200

    net: remove noblock parameter from skb_recv_datagram()
    
    skb_recv_datagram() has two parameters 'flags' and 'noblock' that are
    merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)'
    
    As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags'
    into 'flags' and 'noblock' with finally obsolete bit operations like this:
    
    skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);
    
    And this is not even done consistently with the 'flags' parameter.
    
    This patch removes the obsolete and costly splitting into two parameters
    and only performs bit operations when really needed on the caller side.
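
A small sketch of why the split is redundant. The `demo_*` functions are hypothetical and only model the bit arithmetic quoted above; `MSG_DONTWAIT` and `MSG_PEEK` come from `<sys/socket.h>`.

```c
#include <sys/socket.h>

/* Models the merge the old skb_recv_datagram() performed internally. */
int demo_merge_flags(int flags, int noblock)
{
	return flags | (noblock ? MSG_DONTWAIT : 0);
}

/* Models an old-style caller: split 'flags' in two, then let the callee
 * merge the halves again. The result is always the original 'flags'. */
int demo_split_then_merge(int flags)
{
	return demo_merge_flags(flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT);
}
```

Because the round trip is an identity on `flags`, passing `flags` straight through gives the same behavior without the extra bit operations.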
    
    One missing conversion thankfully reported by kernel test robot. I missed
    to enable kunit tests to build the mctp code.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2022-11-18 11:18:14 +01:00
Jiri Benc 418019c715 bpf: Support bpf_(get|set)sockopt() in bpf unix iter.
Bugzilla: https://bugzilla.redhat.com/2120966

commit eb7d8f1d9ebc7379f09a51bf4faa35e0bfa7437d
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Thu Jan 13 09:28:47 2022 +0900

    bpf: Support bpf_(get|set)sockopt() in bpf unix iter.

    This patch makes bpf_(get|set)sockopt() available when iterating AF_UNIX
    sockets.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Link: https://lore.kernel.org/r/20220113002849.4384-4-kuniyu@amazon.co.jp
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 1e320cfd7c bpf: af_unix: Use batching algorithm in bpf unix iter.
Bugzilla: https://bugzilla.redhat.com/2120966

commit 855d8e77ffb05be6e54c34dababccb20318aec00
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Thu Jan 13 09:28:46 2022 +0900

    bpf: af_unix: Use batching algorithm in bpf unix iter.

    The commit 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
    introduces the batching algorithm to iterate TCP sockets with more
    consistency.

    This patch uses the same algorithm to iterate AF_UNIX sockets.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Link: https://lore.kernel.org/r/20220113002849.4384-3-kuniyu@amazon.co.jp
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc f83ccdbf81 af_unix: Refactor unix_next_socket().
Bugzilla: https://bugzilla.redhat.com/2120966

commit 4408d55a64677febdcb50d1b44d0dc714ce4187e
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Thu Jan 13 09:28:45 2022 +0900

    af_unix: Refactor unix_next_socket().

    Currently, unix_next_socket() is overloaded depending on the 2nd argument.
    If it is NULL, unix_next_socket() returns the first socket in the hash.  If
    not NULL, it returns the next socket in the same hash list or the first
    socket in the next non-empty hash list.

    This patch refactors unix_next_socket() into two functions unix_get_first()
    and unix_get_next().  unix_get_first() newly acquires a lock and returns
    the first socket in the list.  unix_get_next() returns the next socket in a
    list or releases a lock and falls back to unix_get_first().

    In the following patch, bpf iter holds entire sockets in a list and always
    releases the lock before .show().  It always calls unix_get_first() to
    acquire a lock in each iteration.  So, this patch makes the change easier
    to follow.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Link: https://lore.kernel.org/r/20220113002849.4384-2-kuniyu@amazon.co.jp
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 50833f03dd af_unix: Relax race in unix_autobind().
Bugzilla: https://bugzilla.redhat.com/2120966

commit 9acbc584c3a4e9706703039708ec947ffc152c66
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:31 2021 +0900

    af_unix: Relax race in unix_autobind().

    When we bind an AF_UNIX socket without a name specified, the kernel selects
    an available one from 0x00000 to 0xFFFFF.  unix_autobind() starts searching
    from a number in the 'static' variable and increments it after acquiring
    two locks.

    If multiple processes try autobind, they obtain the same lock and check if
    a socket in the hash list has the same name.  If not, one process uses it,
    and all except one end up retrying the _next_ number (actually not, it may
    be incremented by the other processes).  The more we autobind sockets in
    parallel, the longer the latency gets.  We can avoid such a race by
    searching for a name from a random number.
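
The search strategy described above can be sketched like this. It is a simplified model, not the unix_autobind() code: `demo_autobind()`, the `in_use` callback and the two sample predicates are hypothetical, and the caller is assumed to supply `start` from a random source.

```c
#include <stdint.h>

#define DEMO_NAME_MASK 0xFFFFFu /* names range over 0x00000..0xFFFFF */

/* Scan the name space once, starting at 'start' and wrapping around.
 * Returns the first free name, or -1 if every name is taken. */
long demo_autobind(uint32_t start, int (*in_use)(uint32_t))
{
	uint32_t first = start & DEMO_NAME_MASK;
	uint32_t candidate = first;

	do {
		if (!in_use(candidate))
			return (long)candidate;
		candidate = (candidate + 1) & DEMO_NAME_MASK;
	} while (candidate != first);

	return -1;
}

/* Sample predicates for trying the function out. */
int demo_few_taken(uint32_t name)
{
	return name < 3 || name == DEMO_NAME_MASK;
}

int demo_all_taken(uint32_t name)
{
	(void)name;
	return 1;
}
```

With random starting points, concurrent callers tend to probe different parts of the name space, so they are less likely to collide on the same candidate and retry.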

    These show latency in unix_autobind() while 64 CPUs are simultaneously
    autobind-ing 1024 sockets for each.

      Without this patch:

         usec          : count     distribution
            0          : 1176     |***                                     |
            2          : 3655     |***********                             |
            4          : 4094     |*************                           |
            6          : 3831     |************                            |
            8          : 3829     |************                            |
            10         : 3844     |************                            |
            12         : 3638     |***********                             |
            14         : 2992     |*********                               |
            16         : 2485     |*******                                 |
            18         : 2230     |*******                                 |
            20         : 2095     |******                                  |
            22         : 1853     |*****                                   |
            24         : 1827     |*****                                   |
            26         : 1677     |*****                                   |
            28         : 1473     |****                                    |
            30         : 1573     |*****                                   |
            32         : 1417     |****                                    |
            34         : 1385     |****                                    |
            36         : 1345     |****                                    |
            38         : 1344     |****                                    |
            40         : 1200     |***                                     |

      With this patch:

         usec          : count     distribution
            0          : 1855     |******                                  |
            2          : 6464     |*********************                   |
            4          : 9936     |********************************        |
            6          : 12107    |****************************************|
            8          : 10441    |**********************************      |
            10         : 7264     |***********************                 |
            12         : 4254     |**************                          |
            14         : 2538     |********                                |
            16         : 1596     |*****                                   |
            18         : 1088     |***                                     |
            20         : 800      |**                                      |
            22         : 670      |**                                      |
            24         : 601      |*                                       |
            26         : 562      |*                                       |
            28         : 525      |*                                       |
            30         : 446      |*                                       |
            32         : 378      |*                                       |
            34         : 337      |*                                       |
            36         : 317      |*                                       |
            38         : 314      |*                                       |
            40         : 298      |                                        |

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc ca09b582d6 af_unix: Replace the big lock with small locks.
Bugzilla: https://bugzilla.redhat.com/2120966

commit afd20b9290e184c203fe22f2d6b80dc7127ba724
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:30 2021 +0900

    af_unix: Replace the big lock with small locks.

    The hash table of AF_UNIX sockets is protected by the single lock.  This
    patch replaces it with per-hash locks.
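
A minimal sketch of the per-hash locking idea, using C11 `atomic_flag` as a toy try-lock. This is an assumption-laden illustration: the kernel uses spinlocks, and `DEMO_HASH_SIZE` and the `demo_*` functions are hypothetical.

```c
#include <stdatomic.h>

#define DEMO_HASH_SIZE 256

/* One lock per hash bucket instead of one lock for the whole table. */
static atomic_flag demo_locks[DEMO_HASH_SIZE];

void demo_locks_init(void)
{
	for (int i = 0; i < DEMO_HASH_SIZE; i++)
		atomic_flag_clear(&demo_locks[i]);
}

/* Returns 1 if the bucket lock for 'hash' was acquired, 0 if it is held. */
int demo_trylock(unsigned int hash)
{
	return !atomic_flag_test_and_set(&demo_locks[hash % DEMO_HASH_SIZE]);
}

void demo_unlock(unsigned int hash)
{
	atomic_flag_clear(&demo_locks[hash % DEMO_HASH_SIZE]);
}
```

Two sockets that hash to different buckets no longer contend with each other; only sockets landing in the same bucket serialize on the same lock.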

    The effect is noticeable when we handle multiple sockets simultaneously.
    Here is a test result on an EC2 c5.24xlarge instance.  It shows latency
    (under 10us only) in unix_insert_unbound_socket() while 64 CPUs creating
    1024 sockets for each in parallel.

      Without this patch:

         nsec          : count     distribution
            0          : 179      |                                        |
            500        : 3021     |*********                               |
            1000       : 6271     |*******************                     |
            1500       : 6318     |*******************                     |
            2000       : 5828     |*****************                       |
            2500       : 5124     |***************                         |
            3000       : 4426     |*************                           |
            3500       : 3672     |***********                             |
            4000       : 3138     |*********                               |
            4500       : 2811     |********                                |
            5000       : 2384     |*******                                 |
            5500       : 2023     |******                                  |
            6000       : 1954     |*****                                   |
            6500       : 1737     |*****                                   |
            7000       : 1749     |*****                                   |
            7500       : 1520     |****                                    |
            8000       : 1469     |****                                    |
            8500       : 1394     |****                                    |
            9000       : 1232     |***                                     |
            9500       : 1138     |***                                     |
            10000      : 994      |***                                     |

      With this patch:

         nsec          : count     distribution
            0          : 1634     |****                                    |
            500        : 13170    |****************************************|
            1000       : 13156    |*************************************** |
            1500       : 9010     |***************************             |
            2000       : 6363     |*******************                     |
            2500       : 4443     |*************                           |
            3000       : 3240     |*********                               |
            3500       : 2549     |*******                                 |
            4000       : 1872     |*****                                   |
            4500       : 1504     |****                                    |
            5000       : 1247     |***                                     |
            5500       : 1035     |***                                     |
            6000       : 889      |**                                      |
            6500       : 744      |**                                      |
            7000       : 634      |*                                       |
            7500       : 498      |*                                       |
            8000       : 433      |*                                       |
            8500       : 355      |*                                       |
            9000       : 336      |*                                       |
            9500       : 284      |                                        |
            10000      : 243      |                                        |

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 264d6b03a0 af_unix: Save hash in sk_hash.
Bugzilla: https://bugzilla.redhat.com/2120966

commit e6b4b873896f0e9298f70d25726f4bb1e1b265ba
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:29 2021 +0900

    af_unix: Save hash in sk_hash.

    To replace unix_table_lock with per-hash locks in the next patch, we need
    to save a hash in each socket because /proc/net/unix and BPF programs iterate
    over sockets while holding a hash table lock and release it later in a
    different function.

    Currently, we store a real/pseudo hash in struct unix_address.  However, we
    do not allocate it for unbound sockets, nor should we do so just for that.  For
    this purpose, we can use sk_hash.  Then, we no longer use the hash field in
    struct unix_address and can remove it.

    Also, this patch:
      - renames unix_insert_socket() to unix_insert_unbound_socket()
      - removes the redundant list argument from __unix_insert_socket() and
         unix_insert_unbound_socket()
      - uses 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
      - removes 'inline' from unix_remove_socket() and
         unix_insert_unbound_socket().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 6450454e10 af_unix: Add helpers to calculate hashes.
Bugzilla: https://bugzilla.redhat.com/2120966

commit f452be496a5c8f58b1a67cde79e89b9f1cfde31c
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:28 2021 +0900

    af_unix: Add helpers to calculate hashes.

    This patch adds three helper functions that calculate hashes for unbound
    sockets and bound sockets with BSD/abstract addresses.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 9f9e8cf942 af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead.
Bugzilla: https://bugzilla.redhat.com/2120966

commit 5ce7ab4961a9320ca0836e06849210d088723a56
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:27 2021 +0900

    af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead.

    In BSD and abstract address cases, we store sockets in the hash table with
    keys between 0 and UNIX_HASH_SIZE - 1.  However, the hash saved in a socket
    varies depending on its address type; sockets with BSD addresses always
    have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.

    This is just for the UNIX_ABSTRACT() macro used to check the address type.
    The difference of the saved hashes comes from the first byte of the address
    in the first place.  So, we can test it directly.

    Then we can keep a real hash in each socket and replace unix_table_lock
    with per-hash locks in a later patch.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
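
The sun_path[0] test described in 9f9e8cf942 can be sketched in user space. This is a minimal illustration, not the kernel code: addr_is_abstract() and make_addr() are hypothetical helpers introduced only for this example, and the sketch assumes the Linux convention that an abstract address starts with a NUL byte in sun_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: true if the address names an abstract socket,
 * decided only by looking at the first byte of sun_path. */
static bool addr_is_abstract(const struct sockaddr_un *addr, socklen_t len)
{
    assert(addr != NULL);
    /* An address with no sun_path bytes at all is unnamed, not abstract. */
    if (len <= offsetof(struct sockaddr_un, sun_path))
        return false;
    return addr->sun_path[0] == '\0';
}

/* Hypothetical helper: fill in a pathname (BSD) or abstract address and
 * return the length to pass to bind() or connect(). */
static socklen_t make_addr(struct sockaddr_un *addr, const char *name,
                           bool abstract)
{
    size_t n = strlen(name);
    size_t skip = abstract ? 1 : 0;   /* leading NUL for abstract names */

    assert(n + 1 <= sizeof(addr->sun_path));
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path + skip, name, n);
    /* Pathname: n bytes plus the trailing NUL.
     * Abstract: the leading NUL plus n bytes.  Both add up to n + 1. */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + n + 1);
}
```

A caller would build an address with make_addr() and branch on addr_is_abstract() instead of comparing a stored hash against UNIX_HASH_SIZE.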
Jiri Benc bde4862c31 af_unix: Allocate unix_address in unix_bind_(bsd|abstract)().
Bugzilla: https://bugzilla.redhat.com/2120966

commit 12f21c49ad83eba93d0485b8c9edcc28201bee93
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:26 2021 +0900

    af_unix: Allocate unix_address in unix_bind_(bsd|abstract)().

    To terminate the address with '\0' in unix_bind_bsd(), we add
    unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().

    Also, unix_bind_abstract() does not return -EEXIST.  Only
    kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it,
    so we move the last error check in unix_bind() to unix_bind_bsd().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 9399405ae2 af_unix: Remove unix_mkname().
Bugzilla: https://bugzilla.redhat.com/2120966

commit 5c32a3ed64b4c87ed6d9978074db5f0a54c4cd20
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:25 2021 +0900

    af_unix: Remove unix_mkname().

    This patch removes unix_mkname() and postpones calculating a hash to
    unix_bind_abstract().  Some BSD-specific code still remains in
    unix_bind(), though; the next patch packs it into unix_bind_bsd().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 9d1470dc1f af_unix: Copy unix_mkname() into unix_find_(bsd|abstract)().
Bugzilla: https://bugzilla.redhat.com/2120966

commit d2d8c9fddb1c11ccfa73bf0ad2b1e6b4ea7afdaf
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:24 2021 +0900

    af_unix: Copy unix_mkname() into unix_find_(bsd|abstract)().

    We should not call unix_mkname() before unix_find_other() and instead do
    the same thing where necessary based on the address type:

      - terminating the address with '\0' in unix_find_bsd()
      - calculating the hash in unix_find_abstract().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:53 +02:00
Jiri Benc 9d10cafdc9 af_unix: Cut unix_validate_addr() out of unix_mkname().
Bugzilla: https://bugzilla.redhat.com/2120966

commit b8a58aa6fccc5b2940f0da18c7f02e8a1deb693a
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:23 2021 +0900

    af_unix: Cut unix_validate_addr() out of unix_mkname().

    unix_mkname() tests socket address length and family and does some
    processing based on the address type.  It is called at an early stage,
    so some of that work is redundant and can end up wasted.

    The address length/family tests are done twice in unix_bind().  Also, the
    address type is rechecked later in unix_bind() and unix_find_other(), where
    we can do the same processing.  Moreover, in the BSD address case, the hash
    is set to 0 but never used, which is confusing.

    This patch moves the address tests out of unix_mkname(), and the following
    patches move the other part into appropriate places and remove
    unix_mkname() finally.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:52 +02:00
Jiri Benc 5384473aa6 af_unix: Return an error as a pointer in unix_find_other().
Bugzilla: https://bugzilla.redhat.com/2120966

commit aed26f557bbc94f0c778f63d7dfe86af99208f68
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:22 2021 +0900

    af_unix: Return an error as a pointer in unix_find_other().

    We can return an error as a pointer and need not pass an additional
    argument to unix_find_other().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:52 +02:00
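
The error-as-pointer convention adopted in 5384473aa6 can be illustrated with a user-space sketch. The macro names mirror the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), but they are re-implemented here for illustration; struct peer, peers[] and find_peer() are hypothetical and do not appear in af_unix.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top MAX_ERRNO values of the pointer range. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical lookup table standing in for the socket hash table. */
struct peer { const char *name; };
static struct peer peers[] = { { "alpha" }, { "beta" } };

/* Returns the peer, or an encoded errno instead of NULL plus an
 * extra "int *error" out-parameter. */
static struct peer *find_peer(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(peers) / sizeof(peers[0]); i++)
        if (strcmp(peers[i].name, name) == 0)
            return &peers[i];
    return ERR_PTR(-ECONNREFUSED);
}
```

The caller checks IS_ERR() on the result and recovers the error code with PTR_ERR(), so the function signature needs one argument less.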
Jiri Benc 04e0d4ff01 af_unix: Factorise unix_find_other() based on address types.
Bugzilla: https://bugzilla.redhat.com/2120966

commit fa39ef0e472961baef49ddb0e6f7b8ebb555bd8f
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:21 2021 +0900

    af_unix: Factorise unix_find_other() based on address types.

    As done in commit fa42d910a3 ("unix_bind(): take BSD and abstract
    address cases into new helpers"), this patch moves BSD and abstract address
    cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:52 +02:00
Jiri Benc 8bb71496a8 af_unix: Pass struct sock to unix_autobind().
Bugzilla: https://bugzilla.redhat.com/2120966

commit f7ed31f4615f4e1d97c0e4325c5b8a240e10073c
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:20 2021 +0900

    af_unix: Pass struct sock to unix_autobind().

    We do not use struct socket in unix_autobind() and pass struct sock to
    unix_bind_bsd() and unix_bind_abstract().  Let's pass it to unix_autobind()
    as well.

    Also, this patch fixes these errors reported by checkpatch.pl.

      ERROR: do not use assignment in if condition
      #1795: FILE: net/unix/af_unix.c:1795:
      +     if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr

      CHECK: Logical continuations should be on the previous line
      #1796: FILE: net/unix/af_unix.c:1796:
      +     if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
      +         && (err = unix_autobind(sock)) != 0)

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:52 +02:00
Jiri Benc b1374dfe08 af_unix: Use offsetof() instead of sizeof().
Bugzilla: https://bugzilla.redhat.com/2120966

commit 755662ce78d14c1a9118df921c528b1f992ded2e
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Wed Nov 24 11:14:19 2021 +0900

    af_unix: Use offsetof() instead of sizeof().

    The length of the AF_UNIX socket address contains an offset to the member
    sun_path of struct sockaddr_un.

    Currently, the preceding member is just sun_family, whose type is
    sa_family_t, which resolves to short.  Therefore, the offset is represented
    by sizeof(short).  However, this is unclear and fragile against changes in
    struct sockaddr_storage or sockaddr_un.

    This commit makes it clear and robust by rewriting sizeof() with
    offsetof().

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:57:52 +02:00
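
The offsetof() form described in b1374dfe08 is also the usual way to compute an AF_UNIX address length in user space. A minimal sketch follows; unix_addr_len() is a hypothetical helper written for this example, not a kernel or libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: length to pass to bind()/connect() for a pathname
 * address.  The header size is taken from the struct layout itself rather
 * than assumed to be sizeof(short). */
static socklen_t unix_addr_len(const char *path)
{
    size_t n = strlen(path);

    /* The path and its trailing NUL must fit in sun_path. */
    assert(n < sizeof(((struct sockaddr_un *)0)->sun_path));
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + n + 1);
}
```

If a member were ever added in front of sun_path, the offsetof() expression would follow the layout automatically, while a hard-coded sizeof(short) would silently become wrong.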
Petr Oros 21e2fb0e83 net: Don't include filter.h from net/sock.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101792

Conflicts:
drivers/infiniband/core/cache.c
- adjusted context conflict due to missing b74525f21e33ab ("RDMA/core:
  Delete useless module.h include")
drivers/infiniband/hw/mlx5/fs.c
- missing upstream commit ffa501ef196312 ("RDMA/mlx5: Add steering support in
  optional flow counters"), which adds the net/inet_ecn.h include. Without
  inet_ecn.h, the declarations of ether_addr_copy() and
  is_multicast_ether_addr() are missing, so this commit adds the
  net/inet_ecn.h include.
drivers/net/amt.c
- Unmerged because file missing in RHEL

Upstream commit(s):
commit b6459415b384cb829f0b2a4268f211c789f6cf0b
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Dec 28 16:49:13 2021 -0800

    net: Don't include filter.h from net/sock.h

    sock.h is pretty heavily used (5k objects rebuilt on x86 after
    it's touched). We can drop the include of filter.h from it and
    add a forward declaration of struct sk_filter instead.
    This decreases the number of rebuilt objects when bpf.h
    is touched from ~5k to ~1k.

    There's a lot of missing includes this was masking. Primarily
    in networking tho, this time.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
    Acked-by: Florian Fainelli <f.fainelli@gmail.com>
    Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
    Acked-by: Stefano Garzarella <sgarzare@redhat.com>
    Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org

Signed-off-by: Petr Oros <poros@redhat.com>
2022-07-13 10:49:16 +02:00
Ivan Vecera fa0c210030 net: drop nopreempt requirement on sock_prot_inuse_add()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit b3cb764aa1d753cf6a58858f9e2097ba71e8100b
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:11:50 2021 -0800

    net: drop nopreempt requirement on sock_prot_inuse_add()

    This is distracting really, let's make this simpler,
    because many callers had to take care of this
    by themselves, even if on x86 this adds more
    code than really needed.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:35:56 +02:00
Jiri Benc 54697ceb89 af_unix: fix regression in read after shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit f9390b249c90a15a4d9e69fbfb7a53c860b1fcaf
Author: Vincent Whitchurch <vincent.whitchurch@axis.com>
Date:   Fri Nov 19 13:05:21 2021 +0100

    af_unix: fix regression in read after shutdown

    On kernels before v5.15, calling read() on a unix socket after
    shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data
    previously written or EOF.  But now, while read() after
    shutdown(SHUT_RD) still behaves the same way, read() after
    shutdown(SHUT_RDWR) always fails with -EINVAL.

    This behaviour change was apparently inadvertently introduced as part of
    a bug fix for a different regression caused by the commit adding sockmap
    support to af_unix, commit 94531cfcbe79c359 ("af_unix: Add
    unix_stream_proto for sockmap").  Those commits, for unclear reasons,
    started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR),
    while this state change had previously only been done in
    unix_release_sock().

    Restore the original behaviour.  The sockmap tests in
    tools/testing/selftests/bpf continue to pass after this patch.

    Fixes: d0c6416bd7091647f60 ("unix: Fix an issue in unix_shutdown causing the other end read/write failures")
    Link: https://lore.kernel.org/lkml/20211111140000.GA10779@axis.com/
    Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
    Tested-by: Casey Schaufler <casey@schaufler-ca.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:54 +02:00
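
The behaviour discussed in 54697ceb89 can be probed from user space with a small sketch. read_after_shutdown() is a hypothetical helper written for this example; what it returns depends on the running kernel, so no particular output is claimed here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe: queue 4 bytes on a unix stream socket pair, shut down
 * the receiving end with `how`, then try to read.
 * Returns the number of bytes read (>= 0) or a negative errno. */
static long read_after_shutdown(int how)
{
    int sv[2];
    char buf[16];
    ssize_t n;
    long ret;

    assert(how == SHUT_RD || how == SHUT_RDWR);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -errno;

    n = write(sv[1], "ping", 4);
    if (n != 4) {
        ret = n < 0 ? -errno : -EIO;
        goto out;
    }
    if (shutdown(sv[0], how) < 0) {
        ret = -errno;
        goto out;
    }
    n = read(sv[0], buf, sizeof(buf));
    ret = n < 0 ? -errno : n;
out:
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

According to the commit message, a kernel with the regression would be expected to return -EINVAL for SHUT_RDWR, while a kernel with the original behaviour returns the queued data or EOF.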
Jiri Benc 5d30002e41 af_unix: Rename UNIX-DGRAM to UNIX to maintain backwards compatability
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 0edf0824e0dc359ed76bf96af986e6570ca2c0b9
Author: Stephen Boyd <swboyd@chromium.org>
Date:   Fri Oct 8 14:59:45 2021 -0700

    af_unix: Rename UNIX-DGRAM to UNIX to maintain backwards compatability

    The name of this protocol changed in commit 94531cfcbe79 ("af_unix: Add
    unix_stream_proto for sockmap") because that commit added stream support
    to the af_unix protocol. Renaming the existing protocol makes a ChromeOS
    protocol test[1] fail now that the name has changed in
    /proc/net/protocols from "UNIX" to "UNIX-DGRAM".

    Let's put the name back to how it was while keeping the stream protocol
    as "UNIX-STREAM" so that the procfs interface doesn't change. This fixes
    the test and maintains backwards compatibility in proc.

    Cc: Jiang Wang <jiang.wang@bytedance.com>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: Cong Wang <cong.wang@bytedance.com>
    Cc: Jakub Sitnicki <jakub@cloudflare.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Dmitry Osipenko <digetx@gmail.com>
    Link: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/bundles/cros/network/supported_protocols.go;l=50;drc=e8b1c3f94cb40a054f4aa1ef1aff61e75dc38f18 [1]
    Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
    Signed-off-by: Stephen Boyd <swboyd@chromium.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
Jiri Benc 14f633cc1d net: Implement ->sock_is_readable() for UDP and AF_UNIX
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit af493388950b6ea3a86f860cfaffab137e024fc8
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Fri Oct 8 13:33:05 2021 -0700

    net: Implement ->sock_is_readable() for UDP and AF_UNIX

    Yucong noticed we can't poll() sockets in sockmap even
    when they are the destination sockets of redirections.
    This is because we never poll any psock queues in ->poll(),
    except for TCP. With ->sock_is_readable() now we can
    overwrite ->sock_is_readable(), invoke and implement it for
    both UDP and AF_UNIX sockets.

    Reported-by: Yucong Sun <sunyucong@gmail.com>
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211008203306.37525-4-xiyou.wangcong@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
Jiri Benc 2044e01cf4 unix: Fix an issue in unix_shutdown causing the other end read/write failures
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit d0c6416bd7091647f6041599f396bfa19ae30368
Author: Jiang Wang <jiang.wang@bytedance.com>
Date:   Mon Oct 4 23:25:28 2021 +0000

    unix: Fix an issue in unix_shutdown causing the other end read/write failures

    Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") sets
    unix domain socket peer state to TCP_CLOSE in unix_shutdown. This could
    happen when the local end is shut down but the other end is not. Then,
    the other end will get read or write failures, which is not expected.
    Fix the issue by setting the local state to shutdown.

    Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
    Reported-by: Casey Schaufler <casey@schaufler-ca.com>
    Suggested-by: Cong Wang <cong.wang@bytedance.com>
    Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Casey Schaufler <casey@schaufler-ca.com>
    Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20211004232530.2377085-1-jiang.wang@bytedance.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:53 +02:00
Jiri Benc 11722ad22c af_unix: fix potential NULL deref in unix_dgram_connect()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit dc56ad7028c5f559b3ce90d5cca2e6b7b839f1d5
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Aug 30 10:21:37 2021 -0700

    af_unix: fix potential NULL deref in unix_dgram_connect()

    syzbot was able to trigger NULL deref in unix_dgram_connect() [1]

    This happens in

            if (unix_peer(sk))
                    sk->sk_state = other->sk_state = TCP_ESTABLISHED; // crash because @other is NULL

    Because locks have been dropped, unix_peer() might be non-NULL
    while @other is NULL (AF_UNSPEC case).

    We need to move code around, so that we no longer access
    unix_peer() and sk_state while locks have been released.

    [1]
    general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
    CPU: 0 PID: 10341 Comm: syz-executor239 Not tainted 5.14.0-rc7-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:unix_dgram_connect+0x32a/0xc60 net/unix/af_unix.c:1226
    Code: 00 00 45 31 ed 49 83 bc 24 f8 05 00 00 00 74 69 e8 eb 5b a6 f9 48 8d 7d 12 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 e0 07 00 00
    RSP: 0018:ffffc9000a89fcd8 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000000004 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: ffffffff87cf4ef5 RDI: 0000000000000012
    RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88802e1917c3
    R10: ffffffff87cf4eba R11: 0000000000000001 R12: ffff88802e191740
    R13: 0000000000000000 R14: ffff88802e191d38 R15: ffff88802e1917c0
    FS:  00007f3eb0052700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004787d0 CR3: 0000000029c0a000 CR4: 00000000001506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     __sys_connect_file+0x155/0x1a0 net/socket.c:1890
     __sys_connect+0x161/0x190 net/socket.c:1907
     __do_sys_connect net/socket.c:1917 [inline]
     __se_sys_connect net/socket.c:1914 [inline]
     __x64_sys_connect+0x6f/0xb0 net/socket.c:1914
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x446a89
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 a1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f3eb0052208 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    RAX: ffffffffffffffda RBX: 00000000004cc4d8 RCX: 0000000000446a89
    RDX: 000000000000006e RSI: 0000000020000180 RDI: 0000000000000003
    RBP: 00000000004cc4d0 R08: 00007f3eb0052700 R09: 0000000000000000
    R10: 00007f3eb0052700 R11: 0000000000000246 R12: 00000000004cc4dc
    R13: 00007ffd791e79cf R14: 00007f3eb0052300 R15: 0000000000022000
    Modules linked in:
    ---[ end trace 4eb809357514968c ]---
    RIP: 0010:unix_dgram_connect+0x32a/0xc60 net/unix/af_unix.c:1226
    Code: 00 00 45 31 ed 49 83 bc 24 f8 05 00 00 00 74 69 e8 eb 5b a6 f9 48 8d 7d 12 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 e0 07 00 00
    RSP: 0018:ffffc9000a89fcd8 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: 0000000000000004 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: ffffffff87cf4ef5 RDI: 0000000000000012
    RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88802e1917c3
    R10: ffffffff87cf4eba R11: 0000000000000001 R12: ffff88802e191740
    R13: 0000000000000000 R14: ffff88802e191d38 R15: ffff88802e1917c0
    FS:  00007f3eb0052700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffd791fe960 CR3: 0000000029c0a000 CR4: 00000000001506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Cong Wang <cong.wang@bytedance.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:52 +02:00
Jiri Benc f8db6053d4 af_unix: Fix NULL pointer bug in unix_shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit d359902d5c357b280e7a0862bb8a1ba56b3fc197
Author: Jiang Wang <jiang.wang@bytedance.com>
Date:   Sat Aug 21 18:07:36 2021 +0000

    af_unix: Fix NULL pointer bug in unix_shutdown

    Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
    introduced a bug for the af_unix SEQPACKET type.  In unix_shutdown(), the
    unhash function calls prot->unhash(), which is NULL for SEQPACKET, so the
    kernel panics.  On ARM32 it shows the messages below (it likely affects
    x86 too).

    Fix the bug by first checking whether prot->unhash is NULL.

    Kernel log:
    <--- cut here ---
     Unable to handle kernel NULL pointer dereference at virtual address 00000000
     pgd = 2fba1ffb
     *pgd=00000000
     Internal error: Oops: 80000005 [#1] PREEMPT SMP THUMB2
     Modules linked in:
     CPU: 1 PID: 1999 Comm: falkon Tainted: G        W 5.14.0-rc5-01175-g94531cfcbe79-dirty #9240
     Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
     PC is at 0x0
     LR is at unix_shutdown+0x81/0x1a8
     pc : [<00000000>]    lr : [<c08f3311>]    psr: 600f0013
     sp : e45aff70  ip : e463a3c0  fp : beb54f04
     r10: 00000125  r9 : e45ae000  r8 : c4a56664
     r7 : 00000001  r6 : c4a56464  r5 : 00000001  r4 : c4a56400
     r3 : 00000000  r2 : c5a6b180  r1 : 00000000  r0 : c4a56400
     Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
     Control: 50c5387d  Table: 05aa804a  DAC: 00000051
     Register r0 information: slab PING start c4a56400 pointer offset 0
     Register r1 information: NULL pointer
     Register r2 information: slab task_struct start c5a6b180 pointer offset 0
     Register r3 information: NULL pointer
     Register r4 information: slab PING start c4a56400 pointer offset 0
     Register r5 information: non-paged memory
     Register r6 information: slab PING start c4a56400 pointer offset 100
     Register r7 information: non-paged memory
     Register r8 information: slab PING start c4a56400 pointer offset 612
     Register r9 information: non-slab/vmalloc memory
     Register r10 information: non-paged memory
     Register r11 information: non-paged memory
     Register r12 information: slab filp start e463a3c0 pointer offset 0
     Process falkon (pid: 1999, stack limit = 0x9ec48895)
     Stack: (0xe45aff70 to 0xe45b0000)
     ff60:                                     e45ae000 c5f26a00 00000000 00000125
     ff80: c0100264 c07f7fa3 beb54f04 fffffff7 00000001 e6f3fc0e b5e5e9ec beb54ec4
     ffa0: b5da0ccc c010024b b5e5e9ec beb54ec4 0000000f 00000000 00000000 beb54ebc
     ffc0: b5e5e9ec beb54ec4 b5da0ccc 00000125 beb54f58 00785238 beb5529c beb54f04
     ffe0: b5da1e24 beb54eac b301385c b62b6ee8 600f0030 0000000f 00000000 00000000
     [<c08f3311>] (unix_shutdown) from [<c07f7fa3>] (__sys_shutdown+0x2f/0x50)
     [<c07f7fa3>] (__sys_shutdown) from [<c010024b>] (__sys_trace_return+0x1/0x16)
     Exception stack(0xe45affa8 to 0xe45afff0)

    Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
    Reported-by: Dmitry Osipenko <digetx@gmail.com>
    Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Dmitry Osipenko <digetx@gmail.com>
    Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Link: https://lore.kernel.org/bpf/20210821180738.1151155-1-jiang.wang@bytedance.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:51 +02:00
Jiri Benc 028135f373 af_unix: Add unix_stream_proto for sockmap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- Code difference in unix_create and context difference in unix_create1 and
  unix_stream_connect due to out-of-order backport of f4bd73b5a950 "af_unix:
  Return errno instead of NULL in unix_create1()." The resulting code
  matches the current upstream.

commit 94531cfcbe79c3598acf96806627b2137ca32eb9
Author: Jiang Wang <jiang.wang@bytedance.com>
Date:   Mon Aug 16 19:03:21 2021 +0000

    af_unix: Add unix_stream_proto for sockmap

    Previously, sockmap for the AF_UNIX protocol only supported the
    dgram type. This patch adds unix stream type support, which
    is similar to unix_dgram_proto. To support sockmap, dgram
    and stream cannot share the same unix_proto anymore, because
    they have different implementations, such as unhash for the stream
    type (which will remove closed or disconnected sockets from the map),
    so rename unix_proto to unix_dgram_proto and add a new
    unix_stream_proto.

    Also implement stream-related sockmap functions,
    and add dgram keywords to the dgram-specific functions.

    Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Reviewed-by: Cong Wang <cong.wang@bytedance.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:50 +02:00
Jiri Benc e3a1d16e8b af_unix: Add read_sock for stream socket types
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- [minor] context difference due to missing af_unix OOB support (commit
  314001f0bf92 "af_unix: Add OOB support")

commit 77462de14a43f4d98dbd8de0f5743a4e02450b1d
Author: Jiang Wang <jiang.wang@bytedance.com>
Date:   Mon Aug 16 19:03:20 2021 +0000

    af_unix: Add read_sock for stream socket types

    To support sockmap for af_unix stream type, implement
    read_sock, which is similar to the read_sock for unix
    dgram sockets.

    Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Reviewed-by: Cong Wang <cong.wang@bytedance.com>
    Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20210816190327.2739291-2-jiang.wang@bytedance.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:50 +02:00
Jiri Benc 3694be0f5b bpf: af_unix: Implement BPF iterator for UNIX domain socket.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 2c860a43dd77f969bb959336a2f743d7103a8f63
Author: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Sat Aug 14 10:57:15 2021 +0900

    bpf: af_unix: Implement BPF iterator for UNIX domain socket.

    This patch implements the BPF iterator for the UNIX domain socket.

    Currently, the batch optimisation introduced for the TCP iterator in the
    commit 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") is not
    used for the UNIX domain socket.  It will require replacing the big lock
    for the hash table with small locks for each hash list so as not to block
    other processes.

    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210814015718.42704-2-kuniyu@amazon.co.jp

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:49 +02:00