Commit Graph

1181 Commits

Chris von Recklinghausen b92cce1ea6 mm: multi-gen LRU: support page table walks
Conflicts:
	fs/exec.c - We already have
		33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
		so don't add call to timens_on_fork back in
	include/linux/mmzone.h - We already have
		e6ad640bc404 ("mm: deduplicate cacheline padding code")
		so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
	mm/vmscan.c - The backport of
		badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
		added an #include <linux/debugfs.h>. Keep it.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:05 2022 -0600

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page tables
    to search for young PTEs and promote hot pages.  A kill switch will be
    added in the next patch to disable this behavior.  When disabled, the
    aging relies on the rmap only.

    NB: this behavior has nothing in common with the page table scanning in
    the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
    pages to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the traversal
    of an entire mm_struct list; the term "walk" will be applied to page
    tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct follows
    its owner task to the new memcg when this task is migrated.  Given an
    lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot pages
    before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them gets
    a unique mm_struct; therefore they can run concurrently.  Page table
    walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
    pages it left in the previous memcg will not be promoted when its current
    memcg is under reclaim.  Similarly, page table walkers will not promote
    pages from nodes other than the one under reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
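
    To make item 2 above concrete, here is a compact userspace sketch of a
    two-generation Bloom filter (illustration only, not the kernel's code;
    sizes, hash mixing and names are made up): walkers query the filter
    filled during the previous iteration while refilling the other one, so
    stale branches age out after one generation flip.

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define BLOOM_BITS (1 << 15)

        struct gen_bloom {
                uint64_t bits[2][BLOOM_BITS / 64]; /* two generations */
                unsigned int gen;                  /* filter being filled */
        };

        static void bloom_hashes(const void *item, uint32_t *h1, uint32_t *h2)
        {
                uintptr_t key = (uintptr_t)item >> 4; /* drop alignment bits */

                *h1 = (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> 40) % BLOOM_BITS;
                *h2 = (uint32_t)((key * 0xC2B2AE3D27D4EB4FULL) >> 40) % BLOOM_BITS;
        }

        /* record a populated branch in the current generation */
        static void bloom_set(struct gen_bloom *b, const void *item)
        {
                uint32_t h1, h2;

                bloom_hashes(item, &h1, &h2);
                b->bits[b->gen][h1 / 64] |= 1ULL << (h1 % 64);
                b->bits[b->gen][h2 / 64] |= 1ULL << (h2 % 64);
        }

        /* query the previous generation; false => likely safe to skip */
        static bool bloom_test(struct gen_bloom *b, const void *item)
        {
                uint32_t h1, h2;
                const uint64_t *prev = b->bits[!b->gen];

                bloom_hashes(item, &h1, &h2);
                return (prev[h1 / 64] & (1ULL << (h1 % 64))) &&
                       (prev[h2 / 64] & (1ULL << (h2 % 64)));
        }

        /* start a new iteration: flip generations, clear the new one */
        static void bloom_flip(struct gen_bloom *b)
        {
                b->gen = !b->gen;
                memset(b->bits[b->gen], 0, sizeof(b->bits[0]));
        }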

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1147696.57   44640.29
          patch1-8: 1245274.91   48435.66

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset

      Configurations:
        no change

    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>

    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
Wander Lairson Costa e41e816446 kernel/fork: beware of __put_task_struct() calling context
Bugzilla: https://bugzilla.redhat.com/2060283

commit d243b34459cea30cfe5f3a9b2feb44e7daff9938
Author: Wander Lairson Costa <wander@redhat.com>
Date:   Wed Jun 14 09:23:21 2023 -0300

    kernel/fork: beware of __put_task_struct() calling context

    Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
    locks. Therefore, it can't be called from a non-preemptible context.

    One practical example is a splat inside inactive_task_timer(), which is
    called in an interrupt context:

      CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
       Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
       Call Trace:
       dump_stack_lvl+0x57/0x7d
       mark_lock_irq.cold+0x33/0xba
       mark_lock+0x1e7/0x400
       mark_usage+0x11d/0x140
       __lock_acquire+0x30d/0x930
       lock_acquire.part.0+0x9c/0x210
       rt_spin_lock+0x27/0xe0
       refill_obj_stock+0x3d/0x3a0
       kmem_cache_free+0x357/0x560
       inactive_task_timer+0x1ad/0x340
       __run_hrtimer+0x8a/0x1a0
       __hrtimer_run_queues+0x91/0x130
       hrtimer_interrupt+0x10f/0x220
       __sysvec_apic_timer_interrupt+0x7b/0xd0
       sysvec_apic_timer_interrupt+0x4f/0xd0
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       RIP: 0033:0x7fff196bf6f5

    Instead of calling __put_task_struct() directly, we defer it using
    call_rcu(). A more natural approach would use a workqueue, but since
    we can't allocate dynamic memory from atomic context under PREEMPT_RT,
    the code would become more complex: we would need to embed a
    work_struct instance in task_struct and initialize it whenever we
    allocate a new task_struct.
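
    A condensed sketch of the deferral this patch introduces (the real
    change touches include/linux/sched/task.h and kernel/fork.c):

        /* runs from RCU callback context, where sleeping is allowed */
        static void __put_task_struct_rcu_cb(struct rcu_head *rhp)
        {
                struct task_struct *task = container_of(rhp, struct task_struct, rcu);

                __put_task_struct(task);
        }

        static inline void put_task_struct(struct task_struct *t)
        {
                if (!refcount_dec_and_test(&t->usage))
                        return;

                /*
                 * Under PREEMPT_RT the free path may sleep, so from a
                 * non-preemptible context defer the free to RCU.
                 */
                if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
                        call_rcu(&t->rcu, __put_task_struct_rcu_cb);
                else
                        __put_task_struct(t);
        }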

    The issue is reproducible with stress-ng:

      while true; do
          stress-ng --sched deadline --sched-period 1000000000 \
                  --sched-runtime 800000000 --sched-deadline \
                  1000000000 --mmapfork 23 -t 20
      done

    Reported-by: Hu Chunyu <chuhu@redhat.com>
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Suggested-by: Valentin Schneider <vschneid@redhat.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Wander Lairson Costa <wander@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2023-08-30 10:15:24 -03:00
Viktor Malik eedd4d3960 seccomp: Move copy_seccomp() to no failure path.
Bugzilla: https://bugzilla.redhat.com/2218682

commit a1140cb215fa13dcec06d12ba0c3ee105633b7c4
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 08:45:32 2022 -0700

    seccomp: Move copy_seccomp() to no failure path.

    Our syzbot instance reported memory leaks in do_seccomp() [0], similar
    to the report [1].  It shows that we miss freeing struct seccomp_filter
    and some objects included in it.

    We can reproduce the issue with the program below [2] which calls one
    seccomp() and two clone() syscalls.

    The first clone()d child exits earlier than its parent and sends a
    signal to kill it during the second clone(), more precisely before the
    fatal_signal_pending() test in copy_process().  When the parent receives
    the signal, it has to destroy the embryonic process and return -EINTR to
    user space.  In the failure path, we have to call seccomp_filter_release()
    to decrement the filter's refcount.

    Initially, seccomp_filter_release() was called in free_task(), which runs
    on the failure path, but commit 3a15fb6ed9 ("seccomp: release filter
    after task is fully dead") moved it to release_task() to notify user
    space as early as possible that the filter is no longer used.

    To keep the change and current seccomp refcount semantics, let's move
    copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in
    free_task() for future debugging.
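
    Abridged, the reordering looks like this (sketch based on the patch;
    the filter copy now happens only after the last point that can fail):

        /* kernel/fork.c: copy_process(), near the end */
        if (fatal_signal_pending(current)) {
                retval = -EINTR;
                goto bad_fork_cancel_cgroup;
        }

        /* Past the last failure point; no bad_fork_* unwinding follows. */
        copy_seccomp(p);

        /* kernel/fork.c: free_task(), for future debugging */
        WARN_ON_ONCE(tsk->seccomp.filter);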

    [0]:
    unreferenced object 0xffff8880063add00 (size 256):
      comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s)
      hex dump (first 32 bytes):
        01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
        ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
      backtrace:
        do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
        do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    unreferenced object 0xffffc90000035000 (size 4096):
      comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
      hex dump (first 32 bytes):
        01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00  ................
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      backtrace:
        __vmalloc_node_range (mm/vmalloc.c:3226)
        __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4))
        bpf_prog_alloc_no_stats (kernel/bpf/core.c:91)
        bpf_prog_alloc (kernel/bpf/core.c:129)
        bpf_prog_create_from_user (net/core/filter.c:1414)
        do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
        do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    unreferenced object 0xffff888003fa1000 (size 1024):
      comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      backtrace:
        bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95)
        bpf_prog_alloc (kernel/bpf/core.c:129)
        bpf_prog_create_from_user (net/core/filter.c:1414)
        do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
        do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    unreferenced object 0xffff888006360240 (size 16):
      comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
      hex dump (first 16 bytes):
        01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff  ..7.verl........
      backtrace:
        bpf_prog_store_orig_filter (net/core/filter.c:1137)
        bpf_prog_create_from_user (net/core/filter.c:1428)
        do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
        do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    unreferenced object 0xffff8880060183e0 (size 8):
      comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
      hex dump (first 8 bytes):
        06 00 00 00 00 00 ff 7f                          ........
      backtrace:
        kmemdup (mm/util.c:129)
        bpf_prog_store_orig_filter (net/core/filter.c:1144)
        bpf_prog_create_from_user (net/core/filter.c:1428)
        do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
        do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

    [1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a987

    [2]:
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    void main(void)
    {
            struct sock_filter filter[] = {
                    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            };
            struct sock_fprog fprog = {
                    .len = sizeof(filter) / sizeof(filter[0]),
                    .filter = filter,
            };
            long i, pid;

            syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);

            for (i = 0; i < 2; i++) {
                    pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0);
                    if (pid == 0)
                            return;
            }
    }

    Fixes: 3a15fb6ed9 ("seccomp: release filter after task is fully dead")
    Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com
    Reported-by: Ayushman Dutta <ayudutta@amazon.com>
    Suggested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-07-11 13:39:56 +02:00
Nico Pache 9735c28de1 hugetlb: add vma based lock for pmd sharing
commit 8d9bfb2608145cf3e408428c224099e1585471af
Author: Mike Kravetz <mike.kravetz@oracle.com>
Date:   Wed Sep 14 15:18:07 2022 -0700

    hugetlb: add vma based lock for pmd sharing

    Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
    synchronization use by vmas that could be involved in pmd sharing.  This
    data structure contains a rw semaphore that is the primary tool used for
    synchronization.

    This new structure is ref counted, so that it can exist when NOT attached
    to a vma.  This is only helpful in resolving lock ordering issues where
    code may need to obtain the vma_lock while there is no guarantee the vma
    will not go away.  By obtaining a ref on the structure, it can be
    guaranteed that at least the rw semaphore will not go away.

    Only add infrastructure for the new lock here.  Actual use will be added
    in subsequent patches.
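
    The structure itself is small; roughly as added by this patch (helper
    abridged, names approximate):

        struct hugetlb_vma_lock {
                struct kref refs;               /* lock can outlive the vma */
                struct rw_semaphore rw_sema;    /* primary synchronization tool */
                struct vm_area_struct *vma;     /* owner, for the release path */
        };

        void hugetlb_vma_lock_read(struct vm_area_struct *vma)
        {
                if (__vma_shareable_flags_pmd(vma)) {
                        struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                        down_read(&vma_lock->rw_sema);
                }
        }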

    [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
      Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
    Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: James Houghton <jthoughton@google.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Waiman Long d4d0032407 rcu-tasks: Add data structures for lightweight grace periods
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 434c9eefb959c36331a93617ea95df903469b99f
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon May 16 17:56:16 2022 -0700

    rcu-tasks: Add data structures for lightweight grace periods

    This commit adds fields to task_struct and to rcu_tasks_percpu that will
    be used to avoid the task-list scan for RCU Tasks Trace grace periods,
    and also initializes these fields.
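
    Abridged, the additions are pure bookkeeping (field names from the
    upstream patch; surrounding members elided):

        struct task_struct {
                /* ... */
                struct list_head        trc_blkd_node;  /* queued while blocked */
                int                     trc_blkd_cpu;   /* CPU owning that list */
        };

        struct rcu_tasks_percpu {
                /* ... */
                struct list_head        rtp_blkd_tasks; /* blocked readers */
        };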

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: Martin KaFai Lau <kafai@fb.com>
    Cc: KP Singh <kpsingh@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:09 -04:00
Chris von Recklinghausen c902f8fb40 mm: khugepaged: make khugepaged_enter() void function
Bugzilla: https://bugzilla.redhat.com/2160210

commit d2081b2bf8195b8239c67fdd61518e077da7cbec
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu May 19 14:08:49 2022 -0700

    mm: khugepaged: make khugepaged_enter() void function

    Most callers of khugepaged_enter() don't care about the return value.
    Only dup_mmap(), the anonymous THP page fault path, and MADV_HUGEPAGE
    handle the error by returning -ENOMEM.  Actually it is not harmful for
    them to ignore the error case either.  It also seems like overkill to fail
    fork() and page fault early due to a khugepaged_enter() error, and
    MADV_HUGEPAGE does set the VM_HUGEPAGE flag regardless of the error.

    Link: https://lkml.kernel.org/r/20220510203222.24246-6-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Song Liu <song@kernel.org>
    Acked-by: Vlastmil Babka <vbabka@suse.cz>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:09 -04:00
Ming Lei 4ea71b71db blk-cgroup: store a gendisk to throttle in struct task_struct
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175212

commit f05837ed73d0c73e950b2d9f2612febb0d3d451e
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Feb 3 16:03:48 2023 +0100

    blk-cgroup: store a gendisk to throttle in struct task_struct

    Switch from a request_queue pointer and reference to a gendisk once
    for the throttle information in struct task_struct.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
    Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-03-11 23:27:40 +08:00
Oleg Nesterov 9da64baa41 fs/exec: switch timens when a task gets a new mm
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2116442

commit 2b5f9dad32ed19e8db3b0f10a84aa824a219803b
Author: Andrei Vagin <avagin@gmail.com>
Date:   Tue Sep 20 17:31:19 2022 -0700

    fs/exec: switch timens when a task gets a new mm

    Changing a time namespace requires remapping a vvar page, so we don't want
    to allow doing that if any other tasks can use the same mm.

    Currently, we install a time namespace when a task is created with a new
    mm. exec() is another case when a task gets a new mm and so it can switch
    a time namespace safely, but it isn't handled now.

    One more issue of the current interface is that clone() with CLONE_VM isn't
    allowed if the current task has unshared a time namespace
    (timens_for_children doesn't match the current timens).

    Both of these issues are inconvenient for users. For example, Alexey and
    Florian reported that posix_spawn() uses vfork+exec, and this pattern
    doesn't work with time namespaces due to both described issues.
    LXC needed to work around the exec() issue by calling setns.

    In commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on
    vfork+exec"), we tried to fix these issues with minimal impact on UAPI,
    but it added extra complexity and some undesirable side effects. Eric
    suggested fixing the issues properly, since there is every reason to
    believe that no users depend on the old behavior.

    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Kees Cook <keescook@chromium.org>
    Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Origin-author: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrei Vagin <avagin@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-01-11 10:43:02 +01:00
Herton R. Krzesinski 961ed794e3 Merge: rv: Add Runtime Verification (RV) interface
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1402

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2129758
Tested: Used Qemu and custom userspace in https://gitlab.com/acarmina/test-files

Signed-off-by: Alessandro Carminati <acarmina@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Juri Lelli <juri.lelli@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-12-09 16:18:07 +00:00
Alessandro Carminati 4ad522d3e0 rv/include: Add deterministic automata monitor definition via C macros
Bugzilla: https://bugzilla.redhat.com/2129758

commit 792575348ff70e05c6040d02fce38e949ef92c37
Author: Daniel Bristot de Oliveira <bristot@kernel.org>
Date:   Fri Jul 29 11:38:43 2022 +0200

    rv/include: Add deterministic automata monitor definition via C macros

    In Linux terms, the runtime verification monitors are encapsulated
    inside the "RV monitor" abstraction. The "RV monitor" includes a set
    of instances of the monitor (per-cpu monitor, per-task monitor, and
    so on), the helper functions that glue the monitor to the system
    reference model, and the trace output as a reaction for event parsing
    and exceptions, as depicted below:

    Linux  +----- RV Monitor ----------------------------------+ Formal
     Realm |                                                   |  Realm
     +-------------------+     +----------------+     +-----------------+
     |   Linux kernel    |     |     Monitor    |     |     Reference   |
     |     Tracing       |  -> |   Instance(s)  | <-  |       Model     |
     | (instrumentation) |     | (verification) |     | (specification) |
     +-------------------+     +----------------+     +-----------------+
            |                          |                       |
            |                          V                       |
            |                     +----------+                 |
            |                     | Reaction |                 |
            |                     +--+--+--+-+                 |
            |                        |  |  |                   |
            |                        |  |  +-> trace output ?  |
            +------------------------|--|----------------------+
                                     |  +----> panic ?
                                     +-------> <user-specified>

    Add the rv/da_monitor.h, enabling automatic code generation for the
    *Monitor Instance(s)* using C macros, and code to support it.

    The benefits of using macros for monitor synthesis are threefold, as it:

    - Reduces code duplication;
    - Facilitates bug fixes and improvements;
    - Avoids developers changing the core of the monitor code to
      manipulate the model in a (let's say) non-standard way.

    This initial implementation presents three different types of monitor
    instances:

    - DECLARE_DA_MON_GLOBAL(name, type)
    - DECLARE_DA_MON_PER_CPU(name, type)
    - DECLARE_DA_MON_PER_TASK(name, type)

    The first declares the functions for a global deterministic automata monitor,
    the second for monitors with per-cpu instances, and the third with per-task
    instances.
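
    As a usage sketch (modeled on the kernel's wwnr sample monitor; treat
    the wiring as illustrative), a per-task monitor reduces to including
    the generated model and one macro line:

        #include <rv/da_monitor.h>
        #include "wwnr.h"       /* generated reference model */

        /* expands to da_monitor_init_wwnr(), da_handle_event_wwnr(), ... */
        DECLARE_DA_MON_PER_TASK(wwnr, unsigned char);

        static void handle_wakeup(void *data, struct task_struct *p)
        {
                /* push the tracepoint event through the automaton */
                da_handle_event_wwnr(p, wakeup_wwnr);
        }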

    Link: https://lkml.kernel.org/r/51b0bf425a281e226dfeba7401d2115d6091f84e.1659052063.git.bristot@kernel.org

    Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
    Cc: Guenter Roeck <linux@roeck-us.net>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Will Deacon <will@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Marco Elver <elver@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Shuah Khan <skhan@linuxfoundation.org>
    Cc: Gabriele Paoloni <gpaoloni@redhat.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Clark Williams <williams@redhat.com>
    Cc: Tao Zhou <tao.zhou@linux.dev>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-trace-devel@vger.kernel.org
    Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
2022-11-14 12:55:51 +01:00
Jerome Marchand 3f3cfe65ec rethook: Add a generic return hook
Bugzilla: https://bugzilla.redhat.com/2120966

commit 54ecbe6f1ed5138c895bdff55608cf502755b20e
Author: Masami Hiramatsu <mhiramat@kernel.org>
Date:   Tue Mar 15 23:00:50 2022 +0900

    rethook: Add a generic return hook

    Add a return hook framework which hooks the function return. Most of the
    logic came from the kretprobe, but this is independent of kretprobe.

    Note that this is expected to be used with other function entry hooking
    features, like ftrace, fprobe, and kprobes. Eventually this will replace
    the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
    this is just an additional hook.
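
    A hedged sketch of how a function-entry hook would drive the new API
    (the entry-hook wiring and my_* names are hypothetical; the rethook
    calls are the ones this commit introduces):

        #include <linux/rethook.h>

        struct my_rh_node {
                struct rethook_node node;       /* framework node, embedded */
                unsigned long entry_ip;         /* per-invocation data */
        };

        static void my_ret_handler(struct rethook_node *rhn, void *data,
                                   struct pt_regs *regs)
        {
                struct my_rh_node *n = container_of(rhn, struct my_rh_node, node);

                pr_info("return from %pS\n", (void *)n->entry_ip);
        }

        /* called from an entry hook (ftrace, fprobe, kprobes, ...) */
        static void my_entry_hook(struct rethook *rh, unsigned long ip,
                                  struct pt_regs *regs)
        {
                struct rethook_node *rhn = rethook_try_get(rh);

                if (!rhn)
                        return; /* node pool exhausted; skip this hit */

                container_of(rhn, struct my_rh_node, node)->entry_ip = ip;
                rethook_hook(rhn, regs, true);  /* arm the return hook */
        }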

    Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:58:03 +02:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Chris von Recklinghausen 34f31c17ed kthread: Don't allocate kthread_struct for init and umh
Bugzilla: https://bugzilla.redhat.com/2120352

commit 343f4c49f2438d8920f1f76fa823ee59b91f02e4
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Apr 11 11:40:14 2022 -0500

    kthread: Don't allocate kthread_struct for init and umh

    If kthread_is_per_cpu runs concurrently with free_kthread_struct the
    kthread_struct that was just freed may be read from.

    This bug was introduced by commit 40966e316f86 ("kthread: Ensure
    struct kthread is present for all kthreads"), when kthread_struct
    started to be allocated for all tasks that have PF_KTHREAD set.  This
    in turn required the kthread_struct to be freed in kernel_execve and
    violated the assumption that kthread_struct will have the same
    lifetime as the task.

    Looking a bit deeper, this only applies to callers of kernel_execve,
    which is just the init process and the user mode helper processes.
    These processes really don't want to be kernel threads but are for
    historical reasons.  Mostly this is because copy_thread does not know
    how to hand a kernel mode function to a process unless PF_KTHREAD or
    PF_IO_WORKER is set.

    Solve this by not allocating kthread_struct for the init process and
    the user mode helper processes.

    This is done by adding a kthread member to struct kernel_clone_args,
    setting kthread in fork_idle and kernel_thread, adding a
    user_mode_thread helper that works like kernel_thread except that it
    does not set kthread, and in fork allocating the kthread_struct only
    if .kthread is set, as sketched below.
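
    Abridged from the patch, the new helper is kernel_thread() minus the
    .kthread flag:

        pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
        {
                struct kernel_clone_args args = {
                        .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                            CLONE_UNTRACED) & ~CSIGNAL),
                        .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                        .stack          = (unsigned long)fn,
                        .stack_size     = (unsigned long)arg,
                        /* unlike kernel_thread(), no .kthread = 1 here */
                };

                return kernel_clone(&args);
        }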

    I have looked at kernel/kthread.c and since commit 40966e316f86
    ("kthread: Ensure struct kthread is present for all kthreads") there
    have been no assumptions added that to_kthread or __to_kthread will
    not return NULL.

    There are a few callers of to_kthread or __to_kthread that assume a
    non-NULL struct kthread pointer will be returned.  These functions are
    kthread_data(), kthread_parkme(), kthread_exit(), kthread(),
    kthread_park(), kthread_unpark(), kthread_stop().  All of those functions
    can reasonably be expected to be called when it is known that a task is a
    kthread, so that assumption seems reasonable.

    Cc: stable@vger.kernel.org
    Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
    Reported-by: Максим Кутявин <maximkabox13@gmail.com>
    Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen ead48fb27e kasan, arm64: reset pointer tags of vmapped stacks
Bugzilla: https://bugzilla.redhat.com/2120352

commit 51fb34de2a4c8fa0f221246313700bfe3b6c586d
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:10 2022 -0700

    kasan, arm64: reset pointer tags of vmapped stacks

    Once tag-based KASAN modes start tagging vmalloc() allocations, kernel
    stacks start getting tagged if CONFIG_VMAP_STACK is enabled.

    Reset the tag of kernel stack pointers after allocation in
    arch_alloc_vmap_stack().

    For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation
    can't handle the SP register being tagged.

    For HW_TAGS KASAN, there's no instrumentation-related issues.  However,
    the impact of having a tagged SP register needs to be properly evaluated,
    so keep it non-tagged for now.

    Note that the memory for the stack allocation still gets tagged to catch
    vmalloc-into-stack out-of-bounds accesses.
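
    On arm64 the change boils down to one kasan_reset_tag() on the freshly
    vmalloc'ed stack (arch/arm64/include/asm/vmap_stack.h, lightly
    abridged):

        static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
        {
                void *p;

                BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK));

                p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
                                   __builtin_return_address(0));
                /* keep the SP register untagged; the memory stays tagged */
                return kasan_reset_tag(p);
        }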

    [andreyknvl@google.com: fix case when a stack is retrieved from cached_stacks]
      Link: https://lkml.kernel.org/r/f50c5f96ef896d7936192c888b0c0a7674e33184.1644943792.git.andreyknvl@google.com
    [dan.carpenter@oracle.com: remove unnecessary check in alloc_thread_stack_node()]
      Link: https://lkml.kernel.org/r/20220301080706.GB17208@kili

    Link: https://lkml.kernel.org/r/698c5ab21743c796d46c15d075b9481825973e34.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Marco Elver <elver@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen d23b04c12d kasan, fork: reset pointer tags of vmapped stacks
Bugzilla: https://bugzilla.redhat.com/2120352

commit c08e6a1206e6876c66e0528b3ec717f557b07dd4
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:07 2022 -0700

    kasan, fork: reset pointer tags of vmapped stacks

    Once tag-based KASAN modes start tagging vmalloc() allocations, kernel
    stacks start getting tagged if CONFIG_VMAP_STACK is enabled.

    Reset the tag of kernel stack pointers after allocation in
    alloc_thread_stack_node().

    For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation
    can't handle the SP register being tagged.

    For HW_TAGS KASAN, there's no instrumentation-related issues.  However,
    the impact of having a tagged SP register needs to be properly evaluated,
    so keep it non-tagged for now.

    Note that the memory for the stack allocation still gets tagged to catch
    vmalloc-into-stack out-of-bounds accesses.

    Link: https://lkml.kernel.org/r/c6c96f012371ecd80e1936509ebcd3b07a5956f7.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen 71958349db mm: refactor vm_area_struct::anon_vma_name usage code
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5c26f6ac9416b63d093e29c30e79b3297e425472
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Mar 4 20:28:51 2022 -0800

    mm: refactor vm_area_struct::anon_vma_name usage code

    Avoid mixing strings and their anon_vma_name referenced pointers by
    using struct anon_vma_name whenever possible.  This simplifies the code
    and allows easier sharing of anon_vma_name structures when they
    represent the same name.
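
    The refactor centers the code on one small refcounted object and its
    get/put helpers, roughly as introduced here:

        struct anon_vma_name {
                struct kref kref;
                /* The name needs to be at the end because it is dynamically sized. */
                char name[];
        };

        static inline void anon_vma_name_get(struct anon_vma_name *anon_name)
        {
                if (anon_name)
                        kref_get(&anon_name->kref);
        }

        static inline void anon_vma_name_put(struct anon_vma_name *anon_name)
        {
                if (anon_name)
                        kref_put(&anon_name->kref, anon_vma_name_free);
        }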

    [surenb@google.com: fix comment]

    Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Colin Cross <ccross@google.com>
    Cc: Sumit Semwal <sumit.semwal@linaro.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Alexey Gladkov <legion@kernel.org>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Chris Hyser <chris.hyser@oracle.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen 283b28d4ba mm: move anon_vma declarations to linux/mm_inline.h
Bugzilla: https://bugzilla.redhat.com/2120352

commit 17fca131cee21724ee953a17c185c14e9533af5b
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Jan 14 14:06:07 2022 -0800

    mm: move anon_vma declarations to linux/mm_inline.h

    The patch to add anonymous vma names causes a build failure in some
    configurations:

      include/linux/mm_types.h: In function 'is_same_vma_anon_name':
      include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
        924 |         return name && vma_name && !strcmp(name, vma_name);
            |                                     ^~~~~~
      include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?

    This should not really be part of linux/mm_types.h in the first place,
    as that header is meant to contain only structure definitions and needs
    a minimal set of indirect includes itself.

    While the header clearly includes more than it should at this point,
    let's not make it worse by including string.h as well, which would pull
    in the expensive (compile-speed wise) fortify-string logic.

    Move the new functions into a separate header that only needs to be
    included in a couple of locations.

    Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
    Fixes: "mm: add a field to store names for private anonymous memory"
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Colin Cross <ccross@google.com>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen 70649ff1fb mm: add a field to store names for private anonymous memory
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b
Author: Colin Cross <ccross@google.com>
Date:   Fri Jan 14 14:05:59 2022 -0800

    mm: add a field to store names for private anonymous memory

    In many userspace applications, and especially in VM based applications
    like those Android uses heavily, there are multiple different allocators
    in use.  At a minimum there is libc malloc and the stack, and in many cases
    there are libc malloc, the stack, direct syscalls to mmap anonymous
    memory, and multiple VM heaps (one for small objects, one for big
    objects, etc.).  Each of these layers usually has its own tools to
    inspect its usage; malloc by compiling a debug version, the VM through
    heap inspection tools, and for direct syscalls there is usually no way
    to track them.

    On Android we heavily use a set of tools that use an extended version of
    the logic covered in Documentation/vm/pagemap.txt to walk all pages
    mapped in userspace and slice their usage by process, shared (COW) vs.
    unique mappings, backing, etc.  This can account for real physical
    memory usage even in cases like fork without exec (which Android uses
    heavily to share as many private COW pages as possible between
    processes), Kernel SamePage Merging, and clean zero pages.  It produces
    a measurement of the pages that only exist in that process (USS, for
    unique), and a measurement of the physical memory usage of that process
    with the cost of shared pages being evenly split between processes that
    share them (PSS).

    If all anonymous memory is indistinguishable then figuring out the real
    physical memory usage (PSS) of each heap requires either a pagemap
    walking tool that can understand the heap debugging of every layer, or
    for every layer's heap debugging tools to implement the pagemap walking
    logic, in which case it is hard to get a consistent view of memory
    across the whole system.

    Tracking the information in userspace leads to all sorts of problems.
    It either needs to be stored inside the process, which means every
    process has to have an API to export its current heap information upon
    request, or it has to be stored externally in a filesystem that somebody
    needs to clean up on crashes.  It needs to be readable while the process
    is still running, so it has to have some sort of synchronization with
    every layer of userspace.  Efficiently tracking the ranges requires
    reimplementing something like the kernel vma trees, and linking to it
    from every layer of userspace.  It requires more memory, more syscalls,
    more runtime cost, and more complexity to separately track regions that
    the kernel is already tracking.

    This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
    userspace-provided name for anonymous vmas.  The names of named
    anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
    [anon:<name>].

    Userspace can set the name for a region of memory by calling

       prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)

    Setting the name to NULL clears it.  The name length limit is 80 bytes
    including NUL-terminator and is checked to contain only printable ascii
    characters (including space), except '[',']','\','$' and '`'.

    Ascii strings are used to provide descriptive identifiers for vmas,
    which can be understood by users reading /proc/pid/maps or
    /proc/pid/smaps.  Names can be standardized for a given system and they
    can include some variable parts such as the name of the allocator or a
    library, tid of the thread using it, etc.

    The name is stored in a pointer in the shared union in vm_area_struct
    that points to a null-terminated string.  Anonymous vmas that have the
    same name (equivalent strings) and are otherwise mergeable will be merged.
    The name pointers are not shared between vmas even if they contain the
    same name.  The name pointer is stored in a union with fields that are
    only used on file-backed mappings, so it does not increase memory usage.

    CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
    feature.  It keeps the feature disabled by default to prevent any
    additional memory overhead and to avoid confusing procfs parsers on
    systems which are not ready to support named anonymous vmas.

    The patch is based on the original patch developed by Colin Cross, more
    specifically on its latest version [1] posted upstream by Sumit Semwal.
    It used a userspace pointer to store vma names.  In that design, name
    pointers could be shared between vmas.  However during the last
    upstreaming attempt, Kees Cook raised concerns [2] about this approach
    and suggested to copy the name into kernel memory space, perform
    validity checks [3] and store as a string referenced from
    vm_area_struct.

    One big concern is about fork() performance which would need to strdup
    anonymous vma names.  Dave Hansen suggested experimenting with
    worst-case scenario of forking a process with 64k vmas having longest
    possible names [4].  I ran this experiment on an ARM64 Android device
    and recorded a worst-case regression of almost 40% when forking such a
    process.

    This regression is addressed in the followup patch which replaces the
    pointer to a name with a refcounted structure that allows sharing the
    name pointer between vmas of the same name.  Instead of duplicating the
    string during fork() or when splitting a vma it increments the refcount.

    [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
    [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
    [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
    [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

    Changes for prctl(2) manual page (in the options section):

    PR_SET_VMA
            Sets an attribute specified in arg2 for virtual memory areas
            starting from the address specified in arg3 and spanning the
            size specified  in arg4. arg5 specifies the value of the attribute
            to be set. Note that assigning an attribute to a virtual memory
            area might prevent it from being merged with adjacent virtual
            memory areas due to the difference in that attribute's value.

            Currently, arg2 must be one of:

            PR_SET_VMA_ANON_NAME
                    Set a name for anonymous virtual memory areas. arg5 should
                    be a pointer to a null-terminated string containing the
                    name. The name length including null byte cannot exceed
                    80 bytes. If arg5 is NULL, the name of the appropriate
                    anonymous virtual memory areas will be reset. The name
                    can contain only printable ascii characters (including
                    space), except '[',']','\','$' and '`'.

                    This feature is available only if the kernel is built with
                    the CONFIG_ANON_VMA_NAME option enabled.

    [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
      Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
    [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
     added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
     work here was done by Colin Cross, therefore, with his permission, keeping
     him as the author]

    Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
    Signed-off-by: Colin Cross <ccross@google.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jan Glauber <jan.glauber@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rob Landley <rob@landley.net>
    Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
    Cc: Shaohua Li <shli@fusionio.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen e3594b8ced exit: Remove profile_handoff_task
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2873cd31a20c25b5e763b35e5fb886f0938c6dd5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Jan 8 10:03:24 2022 -0600

    exit: Remove profile_handoff_task

    All profile_handoff_task does is notify the task_free_notifier chain.
    The helpers task_handoff_register and task_handoff_unregister are used
    to add and delete entries from that chain and are never called.

    So remove the dead code and make it much easier to read and reason
    about __put_task_struct.

    Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
    Link: https://lkml.kernel.org/r/87fspyw6m0.fsf@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen 2d6179b3cd kthread: Generalize pf_io_worker so it can point to struct kthread
Bugzilla: https://bugzilla.redhat.com/2120352

commit e32cf5dfbe227b355776948b2c9b5691b84d1cbd
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Dec 22 22:10:09 2021 -0600

    kthread: Generalize pf_io_worker so it can point to struct kthread

    The point of using set_child_tid to hold the kthread pointer was that
    it already did what is necessary.  There are now restrictions on when
    set_child_tid can be initialized and when set_child_tid can be used in
    schedule_tail, which indicates that continuing to use set_child_tid
    to hold the kthread pointer is a bad idea.

    Instead of continuing to use the set_child_tid field of task_struct
    generalize the pf_io_worker field of task_struct and use it to hold
    the kthread pointer.

    Rename pf_io_worker (which is a void * pointer) to worker_private so
    it can be used to store a kthread's struct kthread pointer.  Update the
    kthread code to store the kthread pointer in the worker_private field.
    Remove the places where set_child_tid had to be dealt with carefully
    because kthreads also used it.
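
    Schematically, the change described above amounts to the following
    (an editorial sketch, not the exact upstream hunk):

        /* task_struct: the void * field is renamed and generalized */
        struct task_struct {
                /* ... */
                void *worker_private;   /* was: pf_io_worker */
                /* ... */
        };

        /* kthread code now keeps its per-thread state here instead of
         * abusing set_child_tid */
        static inline struct kthread *to_kthread(struct task_struct *k)
        {
                return k->worker_private;
        }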

    Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:35 -04:00
Chris von Recklinghausen e69fa4c5ac fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct
Bugzilla: https://bugzilla.redhat.com/2120352

commit ff8288ff475e47544569359772f88f2b39fd2cf9
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Dec 20 10:42:18 2021 -0600

    fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct

    I just fixed a bug in copy_process when using the label
    bad_fork_cleanup_threadgroup_lock.  While fixing the bug I looked
    closer at the label and realized it has been misnamed since
    568ac88821 ("cgroup: reduce read locked section of
    cgroup_threadgroup_rwsem during fork").

    Fix the name so that fork is easier to understand.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:34 -04:00
Chris von Recklinghausen 13df86d418 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6692c98c7df53502adb8b8b73ab9bcbd399f7a06
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Dec 20 10:20:14 2021 -0600

    fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA

    Mark Brown <broonie@kernel.org> reported:

    > This is also causing further build errors including but not limited to:
    >
    > /tmp/next/build/kernel/fork.c: In function 'copy_process':
    > /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined
    >  2106 |    goto bad_fork_cleanup_threadgroup_lock;
    >       |    ^~~~

    It turns out that I messed up and was depending upon a label protected
    by an ifdef.  Move the label out of the ifdef as the ifdef around the label
    no longer makes sense (if it ever did).
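
    A minimal illustration of the failure mode (hypothetical names; the
    real code involves bad_fork_cleanup_threadgroup_lock in
    kernel/fork.c):

        int copy_something(int err)
        {
                if (err)
                        goto bad_cleanup;       /* always compiled in */
        #ifdef CONFIG_FOO
                /* ... CONFIG_FOO-only teardown ... */
        bad_cleanup:                            /* only defined if FOO */
        #endif
                return err;
        }

        /* With CONFIG_FOO unset this fails to build:
         *   error: label 'bad_cleanup' used but not defined
         * The fix moves the label below the #endif. */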

    Link: https://lkml.kernel.org/r/YbugCP144uxXvRsk@sirena.org.uk
    Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:34 -04:00
Chris von Recklinghausen e1e51160dc kthread: Ensure struct kthread is present for all kthreads
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40966e316f86b8cfd83abd31ccb4df729309d3e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Dec 2 09:56:14 2021 -0600

    kthread: Ensure struct kthread is present for all kthreads

    Today the rules are a bit iffy and arbitrary about which kernel
    threads have struct kthread present.  Both idle threads and threads
    started with create_kthread want struct kthread present, so that is
    effectively all kernel threads.  Make the rule that if PF_KTHREAD is
    set and the task is running, then struct kthread is present.

    This will allow the kernel thread code to use tsk->exit_code
    with different semantics from ordinary processes.

    To ensure that struct kthread is present for all
    kernel threads, move its allocation into copy_process.

    Add a deallocation of struct kthread in exec for processes
    that were kernel threads.

    Move the allocation of struct kthread for the initial thread
    earlier so that it is not repeated for each additional idle
    thread.

    Move the initialization of struct kthread into set_kthread_struct
    so that the structure is always and reliably initialized.

    Clear set_child_tid in free_kthread_struct to ensure the kthread
    struct is reliably freed during exec.  The function
    free_kthread_struct does not need to clear vfork_done during exec as
    exec_mm_release called from exec_mmap has already cleared vfork_done.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen 2ef8f32db2 coredump: Limit coredumps to a single thread group
Bugzilla: https://bugzilla.redhat.com/2120352

commit 0258b5fd7c7124b87e185a1a9322d2c66b1876b7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 22 11:24:02 2021 -0500

    coredump: Limit coredumps to a single thread group

    Today when a signal is delivered with a handler of SIG_DFL whose
    default behavior is to generate a core dump not only that process but
    every process that shares the mm is killed.

    In the case of vfork this looks like a real-world problem.  Consider
    the following well-defined sequence.

            if (vfork() == 0) {
                    execve(...);
                    _exit(EXIT_FAILURE);
            }

    If a signal that generates a core dump is received after vfork but
    before the execve changes the mm the process that called vfork will
    also be killed (as the mm is shared).

    Similarly if the execve fails after the point of no return the kernel
    delivers SIGSEGV which will kill both the exec'ing process and because
    the mm is shared the process that called vfork as well.

    As far as I can tell, this behavior violates people's reasonable
    expectations and POSIX, and it is unnecessarily fragile when the
    system is low on memory.

    Solve this by making a userspace visible change to only kill a single
    process/thread group.  This is possible because Jann Horn recently
    modified[1] the coredump code so that the mm can safely be modified
    while the coredump is happening.  With LinuxThreads long gone, I don't
    expect anyone to notice this behavior change in practice.

    To accomplish this, move the core_state pointer from mm_struct to
    signal_struct, which allows different thread groups to coredump
    simultaneously.

    In zap_threads remove the work to kill anything except for the current
    thread group.

    v2: Remove core_state from the VM_BUG_ON_MM print to fix
        compile failure when CONFIG_DEBUG_VM is enabled.
        Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>

    [1] a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
    Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
    History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
    Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Chris von Recklinghausen f7fb43f6b1 coredump: Don't perform any cleanups before dumping core
Bugzilla: https://bugzilla.redhat.com/2120352

commit 92307383082daff5df884a25df9e283efb7ef261
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 1 11:33:50 2021 -0500

    coredump: Don't perform any cleanups before dumping core

    Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
    before PTRACE_EVENT_EXIT, and before any cleanup work for a task
    happens.  This ensures that an accurate copy of the process can be
    captured in the coredump as no cleanup for the process happens before
    the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
    will not be visited by any thread until the coredump is complete.

    Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
    coredump_task_exit can be recognized and ignored in zap_process.

    Now that all of the coredumping happens before exit_mm, remove the
    code to test for a coredump in progress from mm_release.

    Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
    The other tests in may_ptrace_stop all concern avoiding stopping
    during a coredump.  These tests are no longer necessary as it is now
    guaranteed that fatal_signal_pending will be set if the code enters
    ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
    not to stop if fatal_signal_pending returns true.

    Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
    ptrace_stop without fatal_signal_pending being true, as signals are
    dequeued in get_signal before calling do_exit.  This is no longer
    an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
    until after the coredump completes.

    Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Chris von Recklinghausen ce19bd2383 Revert "fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA"
Bugzilla: https://bugzilla.redhat.com/2120352
Upstream status: RHEL only

6692c98c7df5 is a fix for upstream 40966e316f86 which was not in RHEL until
this series. We'll re-add 6692c98c7df5 after 40966e316f86.

This reverts commit b23c298982.

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:22 -04:00
Kamal Heib 765dacfaa6 IB/core: Fix a nested dead lock as part of ODP flow
Bugzilla: https://bugzilla.redhat.com/2120665

commit 85eaeb5058f0f04dffb124c97c86b4f18db0b833
Author: Yishai Hadas <yishaih@nvidia.com>
Date:   Wed Aug 24 09:10:36 2022 +0300

    IB/core: Fix a nested dead lock as part of ODP flow

    Fix a nested dead lock as part of ODP flow by using mmput_async().

    From the call trace below [1] we can see that calling mmput() once we
    have the umem_odp->umem_mutex locked, as required by
    ib_umem_odp_map_dma_and_lock(), might trigger in the same task the
    exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() path,
    which may deadlock when trying to lock the same mutex.

    Moving to mmput_async() solves the problem, as the exit_mmap() flow
    above will then run in another task and execute once the lock becomes
    available (a schematic of the pattern follows the trace below).

    [1]
    [64843.077665] task:kworker/u133:2  state:D stack:    0 pid:80906 ppid:
    2 flags:0x00004000
    [64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
    [64843.077719] Call Trace:
    [64843.077722]  <TASK>
    [64843.077724]  __schedule+0x23d/0x590
    [64843.077729]  schedule+0x4e/0xb0
    [64843.077735]  schedule_preempt_disabled+0xe/0x10
    [64843.077740]  __mutex_lock.constprop.0+0x263/0x490
    [64843.077747]  __mutex_lock_slowpath+0x13/0x20
    [64843.077752]  mutex_lock+0x34/0x40
    [64843.077758]  mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib]
    [64843.077808]  __mmu_notifier_release+0x1a4/0x200
    [64843.077816]  exit_mmap+0x1bc/0x200
    [64843.077822]  ? walk_page_range+0x9c/0x120
    [64843.077828]  ? __cond_resched+0x1a/0x50
    [64843.077833]  ? mutex_lock+0x13/0x40
    [64843.077839]  ? uprobe_clear_state+0xac/0x120
    [64843.077860]  mmput+0x5f/0x140
    [64843.077867]  ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core]
    [64843.077931]  pagefault_real_mr+0x9a/0x140 [mlx5_ib]
    [64843.077962]  pagefault_mr+0xb4/0x550 [mlx5_ib]
    [64843.077992]  pagefault_single_data_segment.constprop.0+0x2ac/0x560
    [mlx5_ib]
    [64843.078022]  mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib]
    [64843.078051]  process_one_work+0x22b/0x3d0
    [64843.078059]  worker_thread+0x53/0x410
    [64843.078065]  ? process_one_work+0x3d0/0x3d0
    [64843.078073]  kthread+0x12a/0x150
    [64843.078079]  ? set_kthread_struct+0x50/0x50
    [64843.078085]  ret_from_fork+0x22/0x30
    [64843.078093]  </TASK>
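
    Schematically, the fix swaps the final reference drop for its
    deferred variant (editorial sketch of the pattern, not the exact
    ib_umem_odp_map_dma_and_lock() hunk):

        mutex_lock(&umem_odp->umem_mutex);
        /* ... fault in and map the pages ... */

        /* mmput() here could drop the last mm reference and run
         * exit_mmap() -> __mmu_notifier_release() ->
         * mlx5_ib_invalidate_range() synchronously, which would try to
         * re-take umem_mutex: an AA deadlock.  mmput_async() defers the
         * release to a workqueue task instead. */
        mmput_async(mm);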

    Fixes: 36f30e486d ("IB/core: Improve ODP to use hmm_range_fault()")
    Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
    Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
    Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-10-06 15:48:09 -04:00
Jerry Snitselaar f80179b6f2 mm: Fix PASID use-after-free issue
Bugzilla: https://bugzilla.redhat.com/2113044
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 2667ed10d9f01e250ba806276740782c89d77fda
Author: Fenghua Yu <fenghua.yu@intel.com>
Date:   Thu Apr 28 11:00:41 2022 -0700

    mm: Fix PASID use-after-free issue

    The PASID is being freed too early.  It needs to stay around until after
    device drivers that might be using it have had a chance to clear it out
    of the hardware.

    The relevant refcounts are:

      mmget() /mmput()  refcount the mm's address space
      mmgrab()/mmdrop() refcount the mm itself

    The PASID is currently tied to the life of the mm's address space and freed
    in __mmput().  This makes logical sense because the PASID can't be used
    once the address space is gone.

    But, this misses an important point: even after the address space is gone,
    the PASID will still be programmed into a device.  Device drivers might,
    for instance, still need to flush operations that are outstanding and need
    to use that PASID.  They do this at file->release() time.

    Device drivers call the IOMMU driver to hold a reference on the mm itself
    and drop it at file->release() time.  But, the IOMMU driver holds a
    reference on the mm itself, not the address space.  The address space (and
    the PASID) is long gone by the time the driver tries to clean up.  This is
    effectively a use-after-free bug on the PASID.

    To fix this, move the PASID free operation from __mmput() to __mmdrop().
    This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
    allocated until it drops its mm reference.
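
    Schematically (editorial sketch of the lifetime change; mm_pasid_drop()
    is the helper name used by this patch series):

        static inline void __mmput(struct mm_struct *mm)
        {
                /* mm_pasid_drop(mm) no longer lives here: the address
                 * space dies before drivers are done with the PASID */
                /* ... free the address space ... */
        }

        void __mmdrop(struct mm_struct *mm)
        {
                /* ... */
                mm_pasid_drop(mm);      /* runs only after the IOMMU
                                         * driver's mmgrab() ref is gone */
        }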

    Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit")
    Reported-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
    Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
    Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
    Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
    Link: https://lore.kernel.org/r/20220428180041.806809-1-fenghua.yu@intel.com

(cherry picked from commit 2667ed10d9f01e250ba806276740782c89d77fda)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2022-07-31 00:26:56 -07:00
Phil Auld 99ae4e07c3 posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
Bugzilla: https://bugzilla.redhat.com/2078906

commit ca7752caeaa70bd31d1714af566c9809688544af
Author: Michael Pratt <mpratt@google.com>
Date:   Mon Nov 1 17:06:15 2021 -0400

    posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()

    copy_process currently copies task_struct.posix_cputimers_work as-is. If a
    timer interrupt arrives while handling clone and before dup_task_struct
    completes then the child task will have:

    1. posix_cputimers_work.scheduled = true
    2. posix_cputimers_work.work queued.

    copy_process clears task_struct.task_works, so (2) will have no effect and
    posix_cpu_timers_work will never run (not to mention it doesn't make sense
    for two tasks to share a common linked list).

    Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
    never cleared. Since scheduled is set, future timer interrupts will skip
    scheduling work, with the ultimate result that the task will never receive
    timer expirations.

    Together, the complete flow is:

    1. Task 1 calls clone(), enters kernel.
    2. Timer interrupt fires, schedules task work on Task 1.
       2a. task_struct.posix_cputimers_work.scheduled = true
       2b. task_struct.posix_cputimers_work.work added to
           task_struct.task_works.
    3. dup_task_struct() copies Task 1 to Task 2.
    4. copy_process() clears task_struct.task_works for Task 2.
    5. Future timer interrupts on Task 2 see
       task_struct.posix_cputimers_work.scheduled = true and skip scheduling
       work.

    Fix this by explicitly clearing the contents of
    task_struct.posix_cputimers_work in copy_process(). This state was
    never meant to be shared or inherited across tasks in the first place.
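
    A sketch of the fix, closely following the description above (the
    upstream patch adds a helper along these lines and calls it from
    copy_process()):

        static void clear_posix_cputimers_work(struct task_struct *p)
        {
                /* A work entry copied from the parent is meaningless;
                 * init_task_work() alone does not clear it. */
                memset(&p->posix_cputimers_work.work, 0,
                       sizeof(p->posix_cputimers_work.work));
                init_task_work(&p->posix_cputimers_work.work,
                               posix_cpu_timers_work);
                p->posix_cputimers_work.scheduled = false;
        }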

    Fixes: 1fb497dd00 ("posix-cpu-timers: Provide mechanisms to defer timer handling to task_work")
    Reported-by: Rhys Hiltner <rhys@justin.tv>
    Signed-off-by: Michael Pratt <mpratt@google.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/20211101210615.716522-1-mpratt@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-01 13:54:12 -04:00
Patrick Talbert 092af648a0 Merge: bpf: update to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/572

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041365

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-26 09:27:25 +02:00
Patrick Talbert f9a5b7f4d0 Merge: Scheduler RT prerequisites
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/754

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076594
Tested:  Sanity tested with scheduler stress tests.

This is a handful of commits to help the RT merge. Keeping the differences
as small as possible reduces the maintenance.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Fernando Pacheco <fpacheco@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-12 09:28:27 +02:00
Patrick Talbert e03a17d432 Merge: IDXD driver update for 9.1.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/692

Bugzilla: https://bugzilla.redhat.com/1971962

Bugzilla: https://bugzilla.redhat.com/1973884

Bugzilla: https://bugzilla.redhat.com/2004573

Bugzilla: https://bugzilla.redhat.com/2040041

Bugzilla: https://bugzilla.redhat.com/2040044

Bugzilla: https://bugzilla.redhat.com/2040046

Bugzilla: https://bugzilla.redhat.com/2040048

Bugzilla: https://bugzilla.redhat.com/2040052

Bugzilla: https://bugzilla.redhat.com/2040496

Bugzilla: https://bugzilla.redhat.com/2046470

Bugzilla: https://bugzilla.redhat.com/2072168

Testing: ran dsa_user_test_runner.sh on sapphire rapids system. Intel is also testing.

Conflicts: Noted in the individual commits. As with the RHEL8 MR, there are a couple of
           conflicts caused by having to deal with cleanups that were done in
           the upstream merge commits, plus one RH_KABI workaround to task_struct. The end
           result was compared with upstream; the only difference comes from a patch
           changing a callback function to void, which was not backported since the general
           kernel patch it depends on hasn't been backported.

This patchset updates the idxd driver to 5.18, and also pulls in upstream fixes to re-enable ENQCMD feature support.

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Myron Stowe <mstowe@redhat.com>
Approved-by: Dean Nelson <dnelson@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-06 10:36:21 +02:00
Jerome Marchand 8bdb9654b1 bpf: Add ambient BPF runtime context stored in current
Bugzilla: http://bugzilla.redhat.com/2041365

Conflicts: Minor conflict with commit a2baf4e8bb ("bpf: Fix
potentially incorrect results with bpf_get_local_storage()")

commit c7603cfa04e7c3a435b31d065f7cbdc829428f6e
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Mon Jul 12 16:06:15 2021 -0700

    bpf: Add ambient BPF runtime context stored in current

    b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
    helper") fixed the problem with cgroup-local storage use in BPF by
    pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
    possible BPF program preemptions and nested executions.

    While this seems to work well in practice, it introduces a new and
    unnecessary failure mode in which not all BPF programs might be executed
    if we fail to find an unused slot for cgroup storage, however unlikely
    that is. It might also not be so unlikely when/if we allow sleepable
    cgroup BPF programs in the future.

    Further, the way that cgroup storage is implemented as an
    ambiently-available property during the entire BPF program execution is
    a convenient way to pass extra information to a BPF program and helpers
    without requiring user code to pass around extra arguments explicitly.
    So it would be good to have a generic solution that can allow
    implementing this without arbitrary restrictions.
    Ideally, such solution would work for both preemptable and sleepable BPF
    programs in exactly the same way.

    This patch introduces such a solution, bpf_run_ctx. It adds one pointer
    field (bpf_ctx) to task_struct. This field is maintained by the
    BPF_PROG_RUN family of macros in such a way that it always stays valid
    throughout BPF program
    execution. BPF program preemption is handled by remembering previous
    current->bpf_ctx value locally while executing nested BPF program and
    restoring old value after nested BPF program finishes. This is handled by two
    helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
    supposed to be used before and after BPF program runs, respectively.
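
    The save/restore pattern described above reduces to a few lines
    (sketch based on this commit's description):

        static inline struct bpf_run_ctx *bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
        {
                struct bpf_run_ctx *old_ctx = current->bpf_ctx;

                current->bpf_ctx = new_ctx;
                return old_ctx;         /* caller keeps this on its stack */
        }

        static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
        {
                current->bpf_ctx = old_ctx;     /* unwinds nesting and
                                                 * preemption correctly */
        }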

    Restoring the old value of the pointer handles preemption, while the
    bpf_run_ctx pointer being a property of the current task_struct
    naturally solves this problem for sleepable BPF programs by "following"
    BPF program execution as it is scheduled in and out of the CPU. It would
    even allow CPU migration of BPF programs, even though that's not
    currently allowed by the BPF infra.

    This patch cleans up cgroup local storage handling as a first application. The
    design itself is generic, though, with bpf_run_ctx being an empty struct that
    is supposed to be embedded into a specific struct for a given BPF program type
    (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
    this mechanism for other uses within tracing BPF programs.

    To verify that this change doesn't revert the fix to the original cgroup
    storage issue, I ran the same repro as in the original report ([0]) and didn't
    get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
    bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).

      [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

    Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:33 +02:00
Patrick Talbert dbecc7b791 Merge: Scheduler updates and fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/627

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2062831
Upstream Status: Linux
Tested: By me with a series of scheduler stress and performance tests

Scheduler updates and fixes from 5.17 and 5.18-rc1. This series
keeps the scheduler up to date and addresses a handful of potential
issues.

Signed-off-by: Phil Auld <pauld@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-28 10:01:36 +02:00
Patrick Talbert 66c2474052 Merge: ucounts: Backport fixes for ucount rlimits
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/571

Bugzilla: https://bugzilla.redhat.com/2061724
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

This is another batch of rlimits and ucounts fixes from upstream.

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: ebiederm <ebiederm@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-28 10:01:28 +02:00
Phil Auld ed88e6ec64 fork: Use IS_ENABLED() in account_kernel_stack()
Bugzilla: https://bugzilla.redhat.com/2076594

commit 0ce055f85335e48bc571114d61a70ae217039362
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:06 2022 +0100

    fork: Use IS_ENABLED() in account_kernel_stack()

    Not strictly needed, but checking CONFIG_VMAP_STACK instead of
    task_stack_vm_area()'s result allows the compiler to remove the else
    path in the CONFIG_VMAP_STACK case, where the pointer can't be NULL.

    Check for CONFIG_VMAP_STACK in order to use the proper path.
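
    Schematically (editorial sketch; IS_ENABLED() expands to a
    compile-time constant, so the compiler drops the dead branch
    entirely):

        if (IS_ENABLED(CONFIG_VMAP_STACK)) {
                struct vm_struct *vm = task_stack_vm_area(tsk);

                /* vm cannot be NULL in this configuration */
                /* ... account the vmapped stack via vm ... */
        } else {
                /* ... account the directly mapped stack pages ... */
        }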

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-9-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:06:57 -04:00
Phil Auld 3377627ebb fork: Only cache the VMAP stack in finish_task_switch()
Bugzilla: https://bugzilla.redhat.com/2076594

commit e540bf3162e822d7a1f07e69e3bb1b4f925ca368
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:05 2022 +0100

    fork: Only cache the VMAP stack in finish_task_switch()

    The task stack could be deallocated later, but for fork()/exec()-heavy
    workloads (say, a shell script executing several commands) it is important
    that the stack is released in finish_task_switch() so that in VMAP_STACK
    case it can be cached and reused in the new task.

    For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could
    be avoided in the scheduling path. Far worse are the other
    free_thread_stack() implementations which invoke __free_pages()/
    kmem_cache_free() with disabled preemption.

    Cache the stack in free_thread_stack() in the VMAP_STACK case and
    RCU-delay the free path otherwise. Free the stack in the RCU callback.
    In the VMAP_STACK case this is another opportunity to fill the cache.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-8-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:06:57 -04:00
Phil Auld 12a445f420 fork: Move task stack accounting to do_exit()
Bugzilla: https://bugzilla.redhat.com/2076594
Conflicts: Fuzz in one hunk of dup_task_struct due to context diff.

commit 1a03d3f13ffe5dd24142d6db629e72c11b704d99
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:04 2022 +0100

    fork: Move task stack accounting to do_exit()

    There is no need to perform the stack accounting of the outgoing task in
    its final schedule() invocation which happens with preemption disabled.
    The task is leaving, the resources will be freed and the accounting can
    happen in do_exit() before the actual schedule invocation which
    frees the stack memory.

    Move the accounting of the stack memory from release_task_stack() to
    exit_task_stack_account() which then can be invoked from do_exit().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-7-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:06:03 -04:00
Phil Auld b467d819fd fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
Bugzilla: https://bugzilla.redhat.com/2076594

commit f1c1a9ee00e4c53c9ccc03ec1aff4792948a25eb
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:03 2022 +0100

    fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK

    memcg_charge_kernel_stack() is only used in the CONFIG_VMAP_STACK case.

    Move memcg_charge_kernel_stack() into the CONFIG_VMAP_STACK block and
    invoke it from within alloc_thread_stack_node().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-6-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:34 -04:00
Phil Auld 853ec22562 fork: Don't assign the stack pointer in dup_task_struct()
Bugzilla: https://bugzilla.redhat.com/2076594

commit 7865aba3ade4cf30f0ac08e015550084a50d9afb
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:02 2022 +0100

    fork: Don't assign the stack pointer in dup_task_struct()

    All four versions of alloc_thread_stack_node() now assign
    task_struct::stack if the allocation was successful.

    Let alloc_thread_stack_node() return an error code instead of the stack
    pointer and remove the stack assignment in dup_task_struct().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-5-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:34 -04:00
Phil Auld 4bc2d72c11 fork, IA64: Provide alloc_thread_stack_node() for IA64
Bugzilla: https://bugzilla.redhat.com/2076594

commit 2bb0529c0bc0698f3baf3e88ffd61a18eef252a7
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:01 2022 +0100

    fork, IA64: Provide alloc_thread_stack_node() for IA64

    Provide a generic alloc_thread_stack_node() for IA64 and
    CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns the stack pointer and
    sets task_struct::stack so it behaves exactly like the other
    implementations.

    Rename IA64's alloc_thread_stack_node() and add the generic version to the
    fork code so it is in one place _and_ to drastically lower the chances of
    fat fingering the IA64 code.  Do the same for free_thread_stack().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-4-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:34 -04:00
Phil Auld 33dc3c163e fork: Duplicate task_struct before stack allocation
Bugzilla: https://bugzilla.redhat.com/2076594

commit 546c42b2c5c161619736dd730d3df709181999d0
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:24:00 2022 +0100

    fork: Duplicate task_struct before stack allocation

    alloc_thread_stack_node() already populates the task_struct::stack
    member except on IA64. The stack pointer is saved and populated again
    because IA64 needs it and arch_dup_task_struct() overwrites it.

    Allocate thread's stack after task_struct has been duplicated as a
    preparation for further changes.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-3-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:34 -04:00
Phil Auld 3f417b0af8 fork: Redo ifdefs around task stack handling
Bugzilla: https://bugzilla.redhat.com/2076594

commit be9a2277cafd318976d59c41a7f45a934ec43b26
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Feb 17 11:23:59 2022 +0100

    fork: Redo ifdefs around task stack handling

    The use of ifdef CONFIG_VMAP_STACK is confusing in terms of what
    actually happens and what can happen.

    For instance, from reading free_thread_stack() it appears that in the
    CONFIG_VMAP_STACK case it may receive a non-NULL vm pointer, but it may
    also be NULL, in which case __free_pages() is used to free the stack.
    This is however not the case, because in the VMAP case a non-NULL
    pointer is always returned here.  Since it looks like this might
    happen, the compiler creates the correct dead code with the invocation
    of __free_pages() and everything around it.  Twice.

    Add spaces between the ifdef and the identifier so that the ifdef
    level which is currently in scope is easy to recognize.

    Add the current identifier as a comment behind #else and #endif (see
    the sketch below).
    Move the code within free_thread_stack() and alloc_thread_stack_node()
    into the relevant ifdef blocks.
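
    The ifdef hygiene described above looks roughly like this
    (CONFIG_FOO is an editorial stand-in for the real nested option):

        #ifdef CONFIG_VMAP_STACK
        /* ... VMAP-only helpers live here now ... */
        #  ifdef CONFIG_FOO
        /* ... */
        #  endif /* CONFIG_FOO */
        #else /* !CONFIG_VMAP_STACK */
        /* ... page-based helpers live here now ... */
        #endif /* CONFIG_VMAP_STACK */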

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220217102406.3697941-2-bigeasy@linutronix.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-19 15:01:34 -04:00
Ming Lei d3eb0b2e5b fork: move copy_io to block/blk-ioc.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2066297

commit 88c9a2ce520ba381bb70658c80ec704f4d60f728
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 26 12:58:05 2021 +0100

    fork: move copy_io to block/blk-ioc.c

    Move the copying of the I/O context to the block layer as that is where
    we can use the proper low-level interfaces.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-04-11 11:44:22 +08:00
Jerry Snitselaar 1865727f74 sched: Define and initialize a flag to identify valid PASID in the task
Bugzilla: https://bugzilla.redhat.com/2004573
Upstream Status: kernel/git/torvalds/linux.git
Conflict: RH_KABI work around for addition to task_struct

commit a3d29e8291b622780eb6e4e3eeaf2b24ec78fd43
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Feb 7 15:02:50 2022 -0800

    sched: Define and initialize a flag to identify valid PASID in the task

    Add a new single-bit field to the task structure to track whether this
    task has initialized the IA32_PASID MSR to the mm's PASID.

    Initialize the field to zero when creating a new task with fork/clone.

    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com

(cherry picked from commit a3d29e8291b622780eb6e4e3eeaf2b24ec78fd43)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2022-04-07 16:24:04 -07:00
Jerry Snitselaar b98576031e iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
Bugzilla: https://bugzilla.redhat.com/2004573
Upstream Status: kernel/git/torvalds/linux.git
Conflict: 30209b93177a ("iommu: Fix some W=1 warnings") not backported yet.

commit 701fac40384f07197b106136012804c3cae0b3de
Author: Fenghua Yu <fenghua.yu@intel.com>
Date:   Mon Feb 7 15:02:48 2022 -0800

    iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

    PASIDs are process-wide. It was attempted to use refcounted PASIDs to
    free them when the last thread drops the refcount. This turned out to
    be complex and error prone. Given the fact that the PASID space is 20
    bits, which allows up to 1M processes to have a PASID associated
    concurrently, PASID resource exhaustion is not a realistic concern.

    Therefore, it was decided to simplify the approach and stick with lazy
    on demand PASID allocation, but drop the eager free approach and make an
    allocated PASID's lifetime bound to the lifetime of the process.

    Get rid of the refcounting mechanisms and replace/rename the interfaces
    to reflect this new approach.

      [ bp: Massage commit message. ]

    Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
    Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Joerg Roedel <jroedel@suse.de>
    Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com

(cherry picked from commit 701fac40384f07197b106136012804c3cae0b3de)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2022-04-07 16:24:03 -07:00
Jerry Snitselaar 357c3fe294 kernel/fork: Initialize mm's PASID
Bugzilla: https://bugzilla.redhat.com/2004573
Upstream Status: kernel/git/torvalds/linux.git

commit a6cbd44093ef305b02ad5f80ed54abf0148a696c
Author: Fenghua Yu <fenghua.yu@intel.com>
Date:   Mon Feb 7 15:02:47 2022 -0800

    kernel/fork: Initialize mm's PASID

    A new mm doesn't have a PASID yet when it's created. Initialize
    the mm's PASID to INVALID_IOASID (-1) on fork() and for init_mm.

    INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be
    allocated to a user process. Initializing the process's PASID to 0 may
    cause confusion about why the process uses the reserved kernel legacy
    DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly
    indicates that the process doesn't have a valid PASID yet.

    Even though the only user of mm_pasid_init() is in fork.c, define it in
    <linux/sched/mm.h> as the first of three mm/pasid life cycle functions
    (init/set/drop) to keep these all together.
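
    Per the description above, the helper is essentially (sketch; guarded
    by the relevant IOMMU config option upstream):

        /* in <linux/sched/mm.h> */
        static inline void mm_pasid_init(struct mm_struct *mm)
        {
                mm->pasid = INVALID_IOASID;     /* -1: no valid PASID yet */
        }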

    Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com

(cherry picked from commit a6cbd44093ef305b02ad5f80ed54abf0148a696c)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2022-04-07 16:24:03 -07:00
Jerry Snitselaar 4d821fa118 mm: Change CONFIG option for mm->pasid field
Bugzilla: https://bugzilla.redhat.com/2004573
Upstream Status: kernel/git/torvalds/linux.git

commit 7a853c2d5951419fdf3c1c9d2b6f5a38f6a6857d
Author: Fenghua Yu <fenghua.yu@intel.com>
Date:   Mon Feb 7 15:02:45 2022 -0800

    mm: Change CONFIG option for mm->pasid field

    This currently depends on CONFIG_IOMMU_SUPPORT. But it is only
    needed when the CONFIG_IOMMU_SVA option is enabled.

    Change the CONFIG guards around definition and initialization
    of mm->pasid field.

    Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
    Link: https://lore.kernel.org/r/20220207230254.3342514-3-fenghua.yu@intel.com

(cherry picked from commit 7a853c2d5951419fdf3c1c9d2b6f5a38f6a6857d)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2022-04-07 16:24:03 -07:00
Patrick Talbert b402b1b887 Merge: copy_process(): Move fd_install() out of sighand->siglock critical section
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/594

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2051855

commit ddc204b517e60ae64db34f9832dc41dafa77c751    
Author: Waiman Long <longman@redhat.com>    
Date:   Tue, 8 Feb 2022 11:39:12 -0500

    copy_process(): Move fd_install() out of sighand->siglock critical section

    I was made aware of the following lockdep splat:

    [ 2516.308763] =====================================================
    [ 2516.309085] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
    [ 2516.309433] 5.14.0-51.el9.aarch64+debug #1 Not tainted
    [ 2516.309703] -----------------------------------------------------
    [ 2516.310149] stress-ng/153663 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 2516.310512] ffff0000e422b198 (&newf->file_lock){+.+.}-{2:2}, at: fd_install+0x368/0x4f0
    [ 2516.310944]
                   and this task is already holding:
    [ 2516.311248] ffff0000c08140d8 (&sighand->siglock){-.-.}-{2:2}, at: copy_process+0x1e2c/0x3e80
    [ 2516.311804] which would create a new lock dependency:
    [ 2516.312066]  (&sighand->siglock){-.-.}-{2:2} -> (&newf->file_lock){+.+.}-{2:2}
    [ 2516.312446]
                   but this new dependency connects a HARDIRQ-irq-safe lock:
    [ 2516.312983]  (&sighand->siglock){-.-.}-{2:2}
       :
    [ 2516.330700]  Possible interrupt unsafe locking scenario:

    [ 2516.331075]        CPU0                    CPU1
    [ 2516.331328]        ----                    ----
    [ 2516.331580]   lock(&newf->file_lock);
    [ 2516.331790]                                local_irq_disable();
    [ 2516.332231]                                lock(&sighand->siglock);
    [ 2516.332579]                                lock(&newf->file_lock);
    [ 2516.332922]   <Interrupt>
    [ 2516.333069]     lock(&sighand->siglock);
    [ 2516.333291]
                    *** DEADLOCK ***
    [ 2516.389845]
                   stack backtrace:
    [ 2516.390101] CPU: 3 PID: 153663 Comm: stress-ng Kdump: loaded Not tainted 5.14.0-51.el9.aarch64+debug #1
    [ 2516.390756] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
    [ 2516.391155] Call trace:
    [ 2516.391302]  dump_backtrace+0x0/0x3e0
    [ 2516.391518]  show_stack+0x24/0x30
    [ 2516.391717]  dump_stack_lvl+0x9c/0xd8
    [ 2516.391938]  dump_stack+0x1c/0x38
    [ 2516.392247]  print_bad_irq_dependency+0x620/0x710
    [ 2516.392525]  check_irq_usage+0x4fc/0x86c
    [ 2516.392756]  check_prev_add+0x180/0x1d90
    [ 2516.392988]  validate_chain+0x8e0/0xee0
    [ 2516.393215]  __lock_acquire+0x97c/0x1e40
    [ 2516.393449]  lock_acquire.part.0+0x240/0x570
    [ 2516.393814]  lock_acquire+0x90/0xb4
    [ 2516.394021]  _raw_spin_lock+0xe8/0x154
    [ 2516.394244]  fd_install+0x368/0x4f0
    [ 2516.394451]  copy_process+0x1f5c/0x3e80
    [ 2516.394678]  kernel_clone+0x134/0x660
    [ 2516.394895]  __do_sys_clone3+0x130/0x1f4
    [ 2516.395128]  __arm64_sys_clone3+0x5c/0x7c
    [ 2516.395478]  invoke_syscall.constprop.0+0x78/0x1f0
    [ 2516.395762]  el0_svc_common.constprop.0+0x22c/0x2c4
    [ 2516.396050]  do_el0_svc+0xb0/0x10c
    [ 2516.396252]  el0_svc+0x24/0x34
    [ 2516.396436]  el0t_64_sync_handler+0xa4/0x12c
    [ 2516.396688]  el0t_64_sync+0x198/0x19c
    [ 2517.491197] NET: Registered PF_ATMPVC protocol family
    [ 2517.491524] NET: Registered PF_ATMSVC protocol family
    [ 2591.991877] sched: RT throttling activated

    One way to solve this problem is to move the fd_install() call out of
    the sighand->siglock critical section.

    Before commit 6fd2fe494b ("copy_process(): don't use ksys_close()
    on cleanups"), the pidfd installation was done without holding both
    the task_list lock and the sighand->siglock. Obviously, holding these
    two locks is not really needed to protect the fd_install() call.
    So move the fd_install() call down to after the release of both locks.
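
    Schematically, the reordering in copy_process() (editorial sketch,
    not the exact hunk):

        spin_lock(&current->sighand->siglock);
        /* ... final, unfailable setup ... */
        /* fd_install(pidfd, pidfile) used to sit here */
        spin_unlock(&current->sighand->siglock);
        write_unlock_irq(&tasklist_lock);

        if (pidfile)
                fd_install(pidfd, pidfile);     /* now runs lock-free */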

    Link: https://lore.kernel.org/r/20220208163912.1084752-1-longman@redhat.com
    Fixes: 6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups")
    Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-03-30 05:46:21 +00:00
Phil Auld b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
Bugzilla: http://bugzilla.redhat.com/2062831

commit 6692c98c7df53502adb8b8b73ab9bcbd399f7a06
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Dec 20 10:20:14 2021 -0600

    fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA

    Mark Brown <broonie@kernel.org> reported:

    > This is also causing further build errors including but not limited to:
    >
    > /tmp/next/build/kernel/fork.c: In function 'copy_process':
    > /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined
    >  2106 |    goto bad_fork_cleanup_threadgroup_lock;
    >       |    ^~~~

    It turns out that I messed up and was depending upon a label protected
    by an ifdef.  Move the label out of the ifdef as the ifdef around the label
    no longer makes sense (if it ever did).

    Link: https://lkml.kernel.org/r/YbugCP144uxXvRsk@sirena.org.uk
    Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:36 -04:00
Waiman Long cbc5ee16ab copy_process(): Move fd_install() out of sighand->siglock critical section
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2051855

commit ddc204b517e60ae64db34f9832dc41dafa77c751
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 8 Feb 2022 11:39:12 -0500

    copy_process(): Move fd_install() out of sighand->siglock critical section

    I was made aware of the following lockdep splat:

    [ 2516.308763] =====================================================
    [ 2516.309085] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
    [ 2516.309433] 5.14.0-51.el9.aarch64+debug #1 Not tainted
    [ 2516.309703] -----------------------------------------------------
    [ 2516.310149] stress-ng/153663 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 2516.310512] ffff0000e422b198 (&newf->file_lock){+.+.}-{2:2}, at: fd_install+0x368/0x4f0
    [ 2516.310944]
                   and this task is already holding:
    [ 2516.311248] ffff0000c08140d8 (&sighand->siglock){-.-.}-{2:2}, at: copy_process+0x1e2c/0x3e80
    [ 2516.311804] which would create a new lock dependency:
    [ 2516.312066]  (&sighand->siglock){-.-.}-{2:2} -> (&newf->file_lock){+.+.}-{2:2}
    [ 2516.312446]
                   but this new dependency connects a HARDIRQ-irq-safe lock:
    [ 2516.312983]  (&sighand->siglock){-.-.}-{2:2}
       :
    [ 2516.330700]  Possible interrupt unsafe locking scenario:

    [ 2516.331075]        CPU0                    CPU1
    [ 2516.331328]        ----                    ----
    [ 2516.331580]   lock(&newf->file_lock);
    [ 2516.331790]                                local_irq_disable();
    [ 2516.332231]                                lock(&sighand->siglock);
    [ 2516.332579]                                lock(&newf->file_lock);
    [ 2516.332922]   <Interrupt>
    [ 2516.333069]     lock(&sighand->siglock);
    [ 2516.333291]
                    *** DEADLOCK ***
    [ 2516.389845]
                   stack backtrace:
    [ 2516.390101] CPU: 3 PID: 153663 Comm: stress-ng Kdump: loaded Not tainted 5.14.0-51.el9.aarch64+debug #1
    [ 2516.390756] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
    [ 2516.391155] Call trace:
    [ 2516.391302]  dump_backtrace+0x0/0x3e0
    [ 2516.391518]  show_stack+0x24/0x30
    [ 2516.391717]  dump_stack_lvl+0x9c/0xd8
    [ 2516.391938]  dump_stack+0x1c/0x38
    [ 2516.392247]  print_bad_irq_dependency+0x620/0x710
    [ 2516.392525]  check_irq_usage+0x4fc/0x86c
    [ 2516.392756]  check_prev_add+0x180/0x1d90
    [ 2516.392988]  validate_chain+0x8e0/0xee0
    [ 2516.393215]  __lock_acquire+0x97c/0x1e40
    [ 2516.393449]  lock_acquire.part.0+0x240/0x570
    [ 2516.393814]  lock_acquire+0x90/0xb4
    [ 2516.394021]  _raw_spin_lock+0xe8/0x154
    [ 2516.394244]  fd_install+0x368/0x4f0
    [ 2516.394451]  copy_process+0x1f5c/0x3e80
    [ 2516.394678]  kernel_clone+0x134/0x660
    [ 2516.394895]  __do_sys_clone3+0x130/0x1f4
    [ 2516.395128]  __arm64_sys_clone3+0x5c/0x7c
    [ 2516.395478]  invoke_syscall.constprop.0+0x78/0x1f0
    [ 2516.395762]  el0_svc_common.constprop.0+0x22c/0x2c4
    [ 2516.396050]  do_el0_svc+0xb0/0x10c
    [ 2516.396252]  el0_svc+0x24/0x34
    [ 2516.396436]  el0t_64_sync_handler+0xa4/0x12c
    [ 2516.396688]  el0t_64_sync+0x198/0x19c
    [ 2517.491197] NET: Registered PF_ATMPVC protocol family
    [ 2517.491524] NET: Registered PF_ATMSVC protocol family
    [ 2591.991877] sched: RT throttling activated

    One way to solve this problem is to move the fd_install() call out of
    the sighand->siglock critical section.

    Before commit 6fd2fe494b ("copy_process(): don't use ksys_close()
    on cleanups"), the pidfd installation was done without holding both
    the task_list lock and the sighand->siglock. Obviously, holding these
    two locks is not really needed to protect the fd_install() call.
    So move the fd_install() call down to after the release of both locks.

    Link: https://lore.kernel.org/r/20220208163912.1084752-1-longman@redhat.com
    Fixes: 6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups")
    Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-03-16 16:01:59 -04:00
Phil Auld fc23011d32 sched: Fix yet more sched_fork() races
Bugzilla: http://bugzilla.redhat.com/2062836

commit b1e8206582f9d680cff7d04828708c8b6ab32957
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon Feb 14 10:16:57 2022 +0100

    sched: Fix yet more sched_fork() races

    Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
    invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
    race vs syscalls by not placing the task on the runqueue before it
    gets exposed through the pidhash.

    Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
    trying to fix a single instance of this; instead, fix the whole class
    of issues, effectively reverting this commit.

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
    Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-14 09:25:27 -04:00
Alexey Gladkov 944a7e8443 ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
Bugzilla: https://bugzilla.redhat.com/2061724
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 8f2f9c4d82f24f172ae439e5035fc1e0e4c229dd
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Feb 9 20:03:19 2022 -0600

    ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1

    Michal Koutný <mkoutny@suse.com> wrote:

    > It was reported that v5.14 behaves differently when enforcing
    > RLIMIT_NPROC limit, namely, it allows one more task than previously.
    > This is consequence of the commit 21d1c5e386 ("Reimplement
    > RLIMIT_NPROC on top of ucounts") that missed the sharpness of
    > equality in the forking path.

    This can be fixed either by fixing the test or by moving the increment
    to before the test.  Fix it by moving copy_creds, which contains
    the increment, before is_ucounts_overlimit.

    In the case of CLONE_NEWUSER, the ucounts in the task_cred change.
    The function is_ucounts_overlimit needs to use the final version of
    the ucounts for the new process.  Which means moving the
    is_ucounts_overlimit test after copy_creds is necessary.

    Both the test in fork and the test in set_user were semantically
    changed when the code moved to ucounts.  The change of the test in
    fork was bad because it was before the increment.  The test in
    set_user was wrong and the change to ucounts fixed it.  So this
    fix only restores the old behavior in one location, not two.
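
    A minimal sketch of the resulting order in copy_process(), assuming
    the upstream helper names:

        /* copy_creds() charges the new task against RLIMIT_NPROC and,
         * for CLONE_NEWUSER, switches to the final ucounts */
        retval = copy_creds(p, clone_flags);
        if (retval < 0)
                goto bad_fork_free;

        /* the limit test now runs after the increment and against the
         * final ucounts, so RLIMIT_NPROC=N really allows N tasks */
        retval = -EAGAIN;
        if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC,
                                 rlimit(RLIMIT_NPROC))) {
                if (p->real_cred->user != INIT_USER &&
                    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
                        goto bad_fork_cleanup_count;
        }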

    Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
    Link: https://lkml.kernel.org/r/20220216155832.680775-2-ebiederm@xmission.com
    Cc: stable@vger.kernel.org
    Reported-by: Michal Koutný <mkoutny@suse.com>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
2022-03-08 13:37:50 +01:00
Herton R. Krzesinski f13f32b81b Merge: sched: backports from 5.16 merge window
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/217
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2020279

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029640

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1921343

Upstream Status: Linux
Tested: By me, with scheduler stress and sanity tests. Boot tested
    on Alderlake for topology changes.

5.16+ scheduler fixes. This includes some commits requested by
the Livepatch team and some AlderLake topology changes. A few
additional patches were pulled in to make the rest apply. With
those and the dependency all patches apply cleanly.

v2: added 3 more commits from sched/urgent.

Added one last (hopefully) fix from sched/urgent.

Signed-off-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-22 10:22:13 -03:00
Phil Auld eec2be9bfb kernel/sched: Fix sched_fork() access an invalid sched_task_group
Bugzilla: http://bugzilla.redhat.com/2020279

commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a
Author: Zhang Qiao <zhangqiao22@huawei.com>
Date:   Wed Sep 15 14:40:30 2021 +0800

    kernel/sched: Fix sched_fork() access an invalid sched_task_group

    There is a small race between copy_process() and sched_fork()
    where child->sched_task_group can point to an already freed task_group.

            parent doing fork()      | someone moving the parent
                                     | to another cgroup
      -------------------------------+-------------------------------
      copy_process()
          + dup_task_struct()<1>
                                      parent move to another cgroup,
                                      and free the old cgroup. <2>
          + sched_fork()
            + __set_task_cpu()<3>
            + task_fork_fair()
              + sched_slice()<4>

    In the worst case, this bug can lead to a use-after-free and
    cause a panic, as shown below:

      (1) the parent copies its sched_task_group to the child at <1>;

      (2) someone moves the parent to another cgroup and frees the old
          cgroup at <2>;

      (3) the sched_task_group and cfs_rq that belong to the old cgroup
          are accessed at <3> and <4>, which causes a panic:

      [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
      [] Oops: 0000 [#1] SMP PTI
      [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0.x86_64+ #1
      [] RIP: 0010:sched_slice+0x84/0xc0

      [] Call Trace:
      []  task_fork_fair+0x81/0x120
      []  sched_fork+0x132/0x240
      []  copy_process.part.5+0x675/0x20e0
      []  ? __handle_mm_fault+0x63f/0x690
      []  _do_fork+0xcd/0x3b0
      []  do_syscall_64+0x5d/0x1d0
      []  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [] RIP: 0033:0x7f04418cd7e1

    Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
    membership and thus sched_task_group can't change. So update the child's
    sched_task_group at sched_post_fork() and move task_fork() and
    __set_task_cpu() (which access the sched_task_group) from sched_fork()
    to sched_post_fork().
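
    A minimal sketch of the resulting sched_post_fork(), assuming the
    upstream helper names (CONFIG_CGROUP_SCHED details elided):

        void sched_post_fork(struct task_struct *p)
        {
                unsigned long flags;

                raw_spin_lock_irqsave(&p->pi_lock, flags);
                /* the cgroup is stable here, so the task group is valid */
                p->sched_task_group = task_group(current);
                __set_task_cpu(p, smp_processor_id());
                if (p->sched_class->task_fork)
                        p->sched_class->task_fork(p);
                raw_spin_unlock_irqrestore(&p->pi_lock, flags);
        }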

    Fixes: 8323f26ce3 ("sched: Fix race in task_group")
    Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:47 -05:00
Ming Lei 3032dc0359 kernel: remove spurious blkdev.h includes
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit 545c6647d2d9f3bf8a086d5ff47fb85e5c5dca28
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Sep 20 14:33:17 2021 +0200

    kernel: remove spurious blkdev.h includes

    Various files have acquired spurious includes of <linux/blkdev.h> over
    time.  Remove them.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:42:47 +08:00
Rafael Aquini 4a7cb3d485 mm/hugetlb: initialize hugetlb_usage in mm_init
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 13db8c50477d83ad3e3b9b0ae247e5cd833a7ae4
Author: Liu Zixian <liuzixian4@huawei.com>
Date:   Wed Sep 8 18:10:05 2021 -0700

    mm/hugetlb: initialize hugetlb_usage in mm_init

    After fork, the child process will get incorrect (2x) hugetlb_usage.  If
    a process uses 5 2MB hugetlb pages in an anonymous mapping,

            HugetlbPages:      10240 kB

    and then forks, the child will show,

            HugetlbPages:      20480 kB

    The amount doubles because hugetlb_usage is copied from the parent
    and then increased again when we copy the page tables from parent to
    child, leaving the child with 2x the actual usage.

    Fix this by adding hugetlb_count_init in mm_init.
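
    A minimal sketch of that helper and its call site, assuming the
    upstream field name mm->hugetlb_usage:

        /* include/linux/hugetlb.h */
        static inline void hugetlb_count_init(struct mm_struct *mm)
        {
                atomic_long_set(&mm->hugetlb_usage, 0);
        }

        /* kernel/fork.c: mm_init() now zeroes the counter copied
         * from the parent before any page tables are duplicated */
        hugetlb_count_init(mm);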

    Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
    Fixes: 5d317b2b65 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
    Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:41 -05:00
Rafael Aquini 27a596f01a mm: remove VM_DENYWRITE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 8d0920bde5eb8ec7e567939b85e65a0596c8580d
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Apr 22 12:08:20 2021 +0200

    mm: remove VM_DENYWRITE

    All in-tree users of MAP_DENYWRITE are gone, and MAP_DENYWRITE cannot
    be set from user space, so VM_DENYWRITE has no users left; let's
    remove it.

    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:33 -05:00
Rafael Aquini 77323173e0 kernel/fork: always deny write access to current MM exe_file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit fe69d560b5bd9ec77b5d5749bd7027344daef47e
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Apr 23 10:29:59 2021 +0200

    kernel/fork: always deny write access to current MM exe_file

    We want to remove VM_DENYWRITE, which is currently only used when
    mapping the executable during exec. During exec, we already
    deny_write_access() the executable; however, after exec completes, the
    VMAs mapped with VM_DENYWRITE effectively keep write access denied via
    deny_write_access().

    Let's deny write access when setting or replacing the MM exe_file. With
    this change, we can remove VM_DENYWRITE for mapping executables.

    Make set_mm_exe_file() return an error in case deny_write_access()
    fails; note that this should never happen, because exec code does a
    deny_write_access() early and keeps write access denied when calling
    set_mm_exe_file. However, it makes the code easier to read and makes
    set_mm_exe_file() and replace_mm_exe_file() look more similar.

    This represents a minor user space visible change:
    sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
    opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
    cannot be opened writable. Note that we can already fail with -EACCES if
    the file doesn't have execute permissions.
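
    A minimal sketch of the resulting set_mm_exe_file(), assuming the
    upstream deny_write_access()/allow_write_access() helpers:

        int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
        {
                struct file *old_exe_file = rcu_dereference_raw(mm->exe_file);

                if (new_exe_file) {
                        /* exec already denied writes, so this is unlikely
                         * to fail outside of prctl(PR_SET_MM_MAP) */
                        if (unlikely(deny_write_access(new_exe_file)))
                                return -EACCES;
                        get_file(new_exe_file);
                }
                rcu_assign_pointer(mm->exe_file, new_exe_file);
                if (old_exe_file) {
                        allow_write_access(old_exe_file);
                        fput(old_exe_file);
                }
                return 0;
        }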

    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:30 -05:00
Rafael Aquini 8a0e26470c kernel/fork: factor out replacing the current MM exe_file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 35d7bdc86031a2c1ae05ac27dfa93b2acdcbaecc
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Apr 23 10:20:25 2021 +0200

    kernel/fork: factor out replacing the current MM exe_file

    Let's factor the main logic out into replace_mm_exe_file(), such that
    all mm->exe_file logic is contained in kernel/fork.c.

    While at it, perform some simple cleanups that are possible now that
    we're simplifying the individual functions.

    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:30 -05:00
Phil Auld 864855cf60 sched: Introduce task_struct::user_cpus_ptr to track requested affinity
Bugzilla: http://bugzilla.redhat.com/1992256

commit b90ca8badbd11488e5f762346b028666808164e7
Author: Will Deacon <will@kernel.org>
Date:   Fri Jul 30 12:24:33 2021 +0100

    sched: Introduce task_struct::user_cpus_ptr to track requested affinity

    In preparation for saving and restoring the user-requested CPU affinity
    mask of a task, add a new cpumask_t pointer to 'struct task_struct'.

    If the pointer is non-NULL, then the mask is copied across fork() and
    freed on task exit.
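
    A minimal sketch of the fork-side copy, assuming the upstream helper
    name dup_user_cpus_ptr():

        int dup_user_cpus_ptr(struct task_struct *dst,
                              struct task_struct *src, int node)
        {
                /* nothing to do unless the parent requested an affinity */
                if (!src->user_cpus_ptr)
                        return 0;

                dst->user_cpus_ptr = kmalloc_node(cpumask_size(),
                                                  GFP_KERNEL, node);
                if (!dst->user_cpus_ptr)
                        return -ENOMEM;

                cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
                return 0;
        }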

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
    Link: https://lore.kernel.org/r/20210730112443.23245-7-will@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-01 18:14:46 -04:00
Eric W. Biederman 5ddf994fa2 ucounts: Fix regression preventing increasing of rlimits in init_user_ns
"Ma, XinjianX" <xinjianx.ma@intel.com> reported:

> When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
> in kselftest failed with following message.
>
> # selftests: mqueue: mq_perf_tests
> #
> # Initial system state:
> #       Using queue path:                       /mq_perf_tests
> #       RLIMIT_MSGQUEUE(soft):                  819200
> #       RLIMIT_MSGQUEUE(hard):                  819200
> #       Maximum Message Size:                   8192
> #       Maximum Queue Size:                     10
> #       Nice value:                             0
> #
> # Adjusted system state for testing:
> #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
> #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
> #       Maximum Message Size:                   16777216
> #       Maximum Queue Size:                     65530
> #       Nice value:                             -20
> #       Continuous mode:                        (disabled)
> #       CPUs to pin:                            3
> # ./mq_perf_tests: mq_open() at 296: Too many open files
> not ok 2 selftests: mqueue: mq_perf_tests # exit=1
> ```
>
> Test env:
> rootfs: debian-10
> gcc version: 9

After investigation, the problem turned out to be that ucount_max for
the rlimits in init_user_ns was being set to the initial rlimit value.
The practical problem is that ucount_max provides a limit that
applications inside the user namespace can not exceed.  In practice
this meant that rlimits that had been converted to use the ucount
infrastructure could not exceed their initial rlimits.

Solve this by setting the relevant values of ucount_max to
RLIM_INFINITY.  A limit in init_user_ns is pointless, so the code
should allow the values to grow as large as possible without risking
an underflow or an overflow.
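
A minimal sketch of the intent (approximated; since ucount_max[] is a
signed long, "infinity" ends up being the largest value the type can
hold, as per the companion ucounts change):

    /* init_user_ns: no meaningful cap, let setrlimit() grow freely */
    ns->ucount_max[UCOUNT_RLIMIT_NPROC]      = LONG_MAX;
    ns->ucount_max[UCOUNT_RLIMIT_MSGQUEUE]   = LONG_MAX;
    ns->ucount_max[UCOUNT_RLIMIT_SIGPENDING] = LONG_MAX;
    ns->ucount_max[UCOUNT_RLIMIT_MEMLOCK]    = LONG_MAX;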

As the ltp test case was a bit of a pain, I have reproduced the rlimit
failure and tested the fix with the following little C program:
> #include <stdio.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <mqueue.h>
> #include <sys/time.h>
> #include <sys/resource.h>
> #include <errno.h>
> #include <string.h>
> #include <stdlib.h>
> #include <limits.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
> 	struct mq_attr mq_attr;
> 	struct rlimit rlim;
> 	mqd_t mqd;
> 	int ret;
>
> 	ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
> 	if (ret != 0) {
> 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
> 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
> 	       rlim.rlim_cur, rlim.rlim_max);
> 	rlim.rlim_cur = RLIM_INFINITY;
> 	rlim.rlim_max = RLIM_INFINITY;
> 	ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
> 	if (ret != 0) {
> 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
>
> 	memset(&mq_attr, 0, sizeof(struct mq_attr));
> 	mq_attr.mq_maxmsg = 65536 - 1;
> 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
>
> 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
> 	if (mqd == (mqd_t)-1) {
> 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
> 	ret = mq_close(mqd);
> 	if (ret) {
> 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
> 		exit(EXIT_FAILURE);
> 	}
>
> 	return EXIT_SUCCESS;
> }

Fixes: 6e52a9f053 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-23 16:10:42 -05:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Andrea Arcangeli a458b76a41 mm: gup: pack has_pinned in MMF_HAS_PINNED
The 32-bit has_pinned can be packed into the MMF_HAS_PINNED bit as a
no-op cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.

set_bit(MMF_HAS_PINNED) is theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) still has to be issued only once per "mm",
so the difference between the two is lost in the noise.
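
A minimal sketch of the helper this introduces (named in the notes
below), testing before setting to keep the common path read-only:

    static void mm_set_has_pinned_flag(unsigned long *mm_flags)
    {
            /* avoid dirtying the shared mm cacheline once the bit is set */
            if (!test_bit(MMF_HAS_PINNED, mm_flags))
                    set_bit(MMF_HAS_PINNED, mm_flags);
    }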

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
  Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Linus Torvalds c54b245d01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
2021-06-28 20:39:26 -07:00
Linus Torvalds 54a728dc5e Scheduler updates for this cycle:
- Changes to core scheduling facilities:
 
     - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
       coordinated scheduling across SMT siblings. This is a much
       requested feature for cloud computing platforms, to allow
       the flexible utilization of SMT siblings, without exposing
       untrusted domains to information leaks & side channels, plus
       to ensure more deterministic computing performance on SMT
       systems used by heterogeneous workloads.
 
       There are new prctls to set core scheduling groups, which
       allow more flexible management of workloads that can share
       siblings.
 
     - Fix task->state access anti-patterns that may result in missed
       wakeups and rename it to ->__state in the process to catch new
       abuses.
 
  - Load-balancing changes:
 
      - Tweak newidle_balance for fair-sched, to improve
        'memcache'-like workloads.
 
      - "Age" (decay) average idle time, to better track & improve workloads
        such as 'tbench'.
 
      - Fix & improve energy-aware (EAS) balancing logic & metrics.
 
      - Fix & improve the uclamp metrics.
 
      - Fix task migration (taskset) corner case on !CONFIG_CPUSET.
 
      - Fix RT and deadline utilization tracking across policy changes
 
      - Introduce a "burstable" CFS controller via cgroups, which allows
        bursty CPU-bound workloads to borrow a bit against their future
        quota to improve overall latencies & batching. Can be tweaked
        via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
 
      - Rework asymmetric topology/capacity detection & handling.
 
  - Scheduler statistics & tooling:
 
      - Disable delayacct by default, but add a sysctl to enable
        it at runtime if tooling needs it. Use static keys and
        other optimizations to make it more palatable.
 
      - Use sched_clock() in delayacct, instead of ktime_get_ns().
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
 vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
 vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
 b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
 4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
 Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
 5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
 3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
 GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
 ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
 +U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
 UmG7bt94Trk=
 =3VDr
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks & side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogeneous workloads.

      There are new prctls to set core scheduling groups, which allow
      more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
      wakeups and rename it to ->__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track & improve
      workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies & batching. Can be tweaked via
      /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

 - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
2021-06-28 12:14:19 -07:00
Linus Torvalds b4b27b9eed Revert "signal: Allow tasks to cache one sigqueue struct"
This reverts commits 4bad58ebc8 (and
399f8dd9a8, which tried to fix it).

I do not believe these are correct, and I'm about to release 5.13, so am
reverting them out of an abundance of caution.

The locking is odd, and appears broken.

On the allocation side (in __sigqueue_alloc()), the locking is somewhat
straightforward: it depends on sighand->siglock.  Since one caller
doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
the case with no locks held.

On the freeing side (in sigqueue_cache_or_free()), there is no locking
at all, and the logic instead depends on 'current' being a single
thread, and not able to race with itself.

To make things more exciting, there's also the data race between freeing
a signal and allocating one, which is handled by using WRITE_ONCE() and
READ_ONCE(), and being mutually exclusive wrt the initial state (ie
freeing will only free if the old state was NULL, while allocating will
obviously only use the value if it was non-NULL, so only one or the
other will actually act on the value).

However, while the free->alloc paths do seem mutually exclusive thanks
to just the data value dependency, it's not clear what the memory
ordering constraints are on it.  Could writes from the previous
allocation possibly be delayed and seen by the new allocation later,
causing logical inconsistencies?

So it's all very exciting and unusual.

And in particular, it seems that the freeing side is incorrect in
depending on "current" being single-threaded.  Yes, 'current' is a
single thread, but in the presence of asynchronous events even a single
thread can have data races.

And such asynchronous events can and do happen, with interrupts causing
signals to be flushed and thus free'd (for example - sending a
SIGCONT/SIGSTOP can happen from interrupt context, and can flush
previously queued process control signals).

So regardless of all the other questions about the memory ordering and
locking for this new cached allocation, the sigqueue_cache_or_free()
assumptions seem to be fundamentally incorrect.

It may be that people will show me the errors of my ways, and tell me
why this is all safe after all.  We can reinstate it if so.  But my
current belief is that the WRITE_ONCE() that sets the cached entry needs
to be a smp_store_release(), and the READ_ONCE() that finds a cached
entry needs to be a smp_load_acquire() to handle memory ordering
correctly.

And the sequence in sigqueue_cache_or_free() would need to either use a
lock or at least be interrupt-safe in some way (perhaps by using something
like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
percpu operations it needs to be interrupt-safe).
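
A minimal sketch of the release/acquire pairing suggested above
(ordering only; it does not address the interrupt race, and the field
and cache names follow the reverted patches):

    /* free side: publish the node, and everything written into it,
     * with release semantics */
    if (!READ_ONCE(current->sigqueue_cache))
            smp_store_release(&current->sigqueue_cache, q);
    else
            kmem_cache_free(sigqueue_cachep, q);

    /* alloc side: the acquire pairs with the release above, so the
     * node's prior contents are visible before it is reused */
    q = smp_load_acquire(&current->sigqueue_cache);
    if (q)
            WRITE_ONCE(current->sigqueue_cache, NULL);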

Fixes: 399f8dd9a8 ("signal: Prevent sigqueue caching after task got released")
Fixes: 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-27 13:32:54 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
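
A minimal sketch of the resulting access pattern (the helpers shown are
the ones added around this change):

    /* writes: WRITE_ONCE() stops the compiler from tearing the store */
    WRITE_ONCE(p->__state, TASK_RUNNING);

    /* reads likewise go through READ_ONCE()-based helpers */
    #define get_current_state()   READ_ONCE(current->__state)
    #define task_is_running(task) \
            (READ_ONCE((task)->__state) == TASK_RUNNING)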

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Frederic Weisbecker a8ea6fc9b0 sched: Stop PF_NO_SETAFFINITY from being inherited by various init system threads
Commit:

  00b89fe019 ("sched: Make the idle task quack like a per-CPU kthread")

... added PF_KTHREAD | PF_NO_SETAFFINITY to the idle kernel threads.

Unfortunately these properties are inherited by the init/0 children
through kernel_thread() calls: init/1 and kthreadd. There are several
side effects to that:

1) kthreadd affinity can not be reset anymore from userspace. Also
   PF_NO_SETAFFINITY propagates to all kthreadd children, including
   the unbound kthreads. Therefore it's not possible anymore to overwrite
   the affinity of any of them. Here is an example of a warning reported
   by rcutorture:

		WARNING: CPU: 0 PID: 116 at kernel/rcu/tree_nocb.h:1306 rcu_bind_current_to_nocb+0x31/0x40
		Call Trace:
		 rcu_torture_fwd_prog+0x62/0x730
		 kthread+0x122/0x140
		 ret_from_fork+0x22/0x30

2) init/1 does an exec() in the end which clears both
   PF_KTHREAD and PF_NO_SETAFFINITY so we are fine once kernel_init()
   escapes to userspace. But until then, no initcall or init code can
   successfully call sched_setaffinity() on init/1.

   Also PF_KTHREAD looks legit on init/1 before it calls exec() but
   we better be careful with unknown introduced side effects.

One way to solve the PF_NO_SETAFFINITY issue is to not inherit this flag
in copy_process() at all (see the sketch after this list). The cases
where it matters are:

* fork_idle(): explicitly set the flag already.
* fork() syscalls: userspace tasks that shouldn't be concerned by that.
* create_io_thread(): the callers explicitly attribute the flag to the
                      newly created tasks.
* kernel_thread():
	- Fix the issues on init/1 and kthreadd
	- Fix the issues on kthreadd children.
	- Usermode helper created by an unbound workqueue. This shouldn't
	  matter. In the worst case it gives more control to userspace
	  on setting the affinity of these short-lived tasks, although this
	  can already be tuned with inherited unbound workqueue affinity.
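
A minimal sketch of the copy_process() change (the other flags were
already being cleared there):

    /* never inherit these; callers that really want PF_NO_SETAFFINITY
     * (e.g. fork_idle() or create_io_thread()) set it themselves */
    p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE |
                  PF_NO_SETAFFINITY);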

Fixes: 00b89fe019 ("sched: Make the idle task quack like a per-CPU kthread")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20210525235849.441842-1-frederic@kernel.org
2021-05-26 08:58:53 +02:00
Valentin Schneider f1a0a376ca sched/core: Initialize the idle task with preemption disabled
As pointed out by commit

  de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
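
For reference, a minimal sketch of the new initial value (guards
approximated; per the description above):

    #ifdef CONFIG_PREEMPT_COUNT
    /* the idle task starts, and stays, with preemption disabled */
    #define PREEMPT_DISABLED        (1 + PREEMPT_ENABLED)
    #else
    #define PREEMPT_DISABLED        PREEMPT_ENABLED
    #endif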

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 13:01:45 +02:00
Peter Zijlstra 85dd3f6120 sched: Inherit task cookie on fork()
Note that sched_core_fork() is called from under tasklist_lock, and
not from sched_fork() earlier. This avoids a few races later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.980003687@infradead.org
2021-05-12 11:43:31 +02:00
Peter Zijlstra 6e33cad0af sched: Trivial core scheduling cookie management
In order to not have to use pid_struct, create a new, smaller,
structure to manage task cookies for core scheduling.
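
A minimal sketch of the structure and its allocation, following the
upstream kernel/sched/core_sched.c code (details approximated):

    struct sched_core_cookie {
            refcount_t refcnt;
    };

    /* tasks holding equal cookie values may share an SMT core */
    static unsigned long sched_core_alloc_cookie(void)
    {
            struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);

            if (!ck)
                    return 0;

            refcount_set(&ck->refcnt, 1);
            return (unsigned long)ck;
    }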

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org
2021-05-12 11:43:31 +02:00
Xiaofeng Cao a8ca6b1388 kernel/fork.c: fix typos
change 'ancestoral' to 'ancestral'
change 'reuseable' to 'reusable'
delete 'do' grammatically

Link: https://lkml.kernel.org/r/20210317082031.11692-1-caoxiaofeng@yulong.com
Signed-off-by: Xiaofeng Cao <caoxiaofeng@yulong.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-06 19:24:13 -07:00
Rolf Eike Beer a689539938 kernel/fork.c: simplify copy_mm()
All this can happen without a single goto.

Link: https://lkml.kernel.org/r/2072685.XptgVkyDqn@devpool47
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-06 19:24:13 -07:00
Alexey Gladkov c1ada3dc72 ucounts: Set ucount_max to the largest positive value the type can hold
The ns->ucount_max[] entries are signed long, which is smaller than the
rlimit type. We have to protect ucount_max[] from overflow and only use
the largest value that the type can hold.

On 32bit using "long" instead of "unsigned long" to hold the counts has
the downside that RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK are limited to 2GiB
instead of 4GiB. I don't think anyone cares but it should be mentioned
in case someone does.

The RLIMIT_NPROC and RLIMIT_SIGPENDING used atomic_t so their maximum
hasn't changed.
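
A minimal sketch of the clamp (helper name as in the upstream change;
layout approximated):

    static inline void set_rlimit_ucount_max(struct user_namespace *ns,
                                             enum ucount_type type,
                                             unsigned long max)
    {
            /* rlimits are unsigned long, ucount_max[] is long, so an
             * RLIM_INFINITY-sized value must be clamped to LONG_MAX */
            ns->ucount_max[type] = max <= LONG_MAX ? max : LONG_MAX;
    }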

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/1825a5dfa18bc5a570e79feb05e2bd07fd57e7e3.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:03 -05:00
Alexey Gladkov d7c9e99aee Reimplement RLIMIT_MEMLOCK on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:02 -05:00
Alexey Gladkov d646969055 Reimplement RLIMIT_SIGPENDING on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Revert most of changes to fix performance issues.

v10:
* Fix memory leak on get_ucounts failure.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:02 -05:00
Alexey Gladkov 6e52a9f053 Reimplement RLIMIT_MSGQUEUE on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:01 -05:00
Alexey Gladkov 21d1c5e386 Reimplement RLIMIT_NPROC on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits may be
exceeded by root or by a user with the appropriate capability.

Changelog

v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
  inc_rlimit_ucounts() can be checked.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:01 -05:00
Alexey Gladkov 905ae01c4a Add a reference to ucounts for each cred
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.

Updating ucounts may require a memory allocation, which may fail. So we
cannot change cred.ucounts in commit_creds(), because this function
cannot fail and should always return 0. For this reason, we modify
cred.ucounts before calling commit_creds().
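
A minimal sketch of the resulting call pattern (set_cred_ucounts() is
the helper this change adds; the error label is illustrative):

    /* may allocate and may fail, so it runs before commit_creds() */
    ret = set_cred_ucounts(new);
    if (ret < 0)
            goto error_put;

    return commit_creds(new);       /* cannot fail, always returns 0 */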

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:00 -05:00
Muchun Song 27faca83a7 mm: memcontrol: fix kernel stack account
For simplification, commit 991e767385 ("mm: memcontrol: account kernel
stack per node") changed the accounting of vmalloc-backed stack pages
from per-zone to per-node.

By doing that we lost a certain precision, because those pages might
live on different NUMA nodes.  In the end, NR_KERNEL_STACK_KB exported
to userspace might be overestimated on some nodes and underestimated on
others.  But this is not a real-world problem, just a problem found by
reading the code, and there is no actual data showing how much impact
it has on users.

This doesn't pose any real problem for the correctness of kernel
behavior, as the counter is not used for any internal processing, but
it can cause some confusion in userspace.

Address the problem by accounting each vmalloc backing page to its own
node.
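
A minimal sketch of the per-page accounting, assuming the vmalloc-backed
(CONFIG_VMAP_STACK) branch of account_kernel_stack(), where 'account' is
+1 on allocation and -1 on free:

    struct vm_struct *vm = task_stack_vm_area(tsk);
    int i;

    /* charge each backing page to the node it actually lives on */
    for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
            mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
                                  account * (PAGE_SIZE / 1024));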

Link: https://lkml.kernel.org/r/20210303151843.81156-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:37 -07:00
Linus Torvalds 9d31d23389 Networking changes for 5.13.
Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its header files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifying ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS change race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074, an 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BMC4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress rate limiting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
 Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
 5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
 Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
 q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
 I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
 wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
 7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
 zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
 rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
 pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
 B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
 =vcbA
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its header files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.

   - allow specifying ifindex when device is moved to another namespace

   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS change race with other
     re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs

   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFP which
          define EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support

   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver

   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074, an 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BMC4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary

   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic

   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress
     and egress rate limiting.

   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment

   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"
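
A minimal sketch of the new u8 sysctl type mentioned above, assuming the
proc_dou8vec_minmax() handler added this cycle; the knob and table names
here are made up for illustration:

    static u8 some_feature_enabled;     /* hypothetical knob */

    static struct ctl_table example_table[] = {
        {
            .procname     = "some_feature_enabled",
            .data         = &some_feature_enabled,
            .maxlen       = sizeof(u8),         /* 1 byte instead of 4 */
            .mode         = 0644,
            .proc_handler = proc_dou8vec_minmax,
            .extra1       = SYSCTL_ZERO,        /* clamp value to 0..1 */
            .extra2       = SYSCTL_ONE,
        },
        { }
    };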

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-29 11:57:23 -07:00
Linus Torvalds 625434dafd for-5.13/io_uring-2021-04-27

Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for multi-shot mode for POLL requests (see the sketch after
   this list)

 - More efficient reference counting. This is shamelessly stolen from
   the mm side. Even though referencing is mostly single/dual user, the
   128 count was retained to keep the code the same. Maybe this
   should/could be made generic at some point.

 - Removal of the need to have a manager thread for each ring. The
   manager thread's only job was checking for and creating new
   io-threads as needed; we now handle this from the queue path instead.

 - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this
   thread is "just" a regular application thread, so no need to restrict
   use of it anymore.

 - Cleanup of how internal async poll data lifetime is managed.

 - Fix for syzbot reported crash on SQPOLL cancelation.

 - Make buffer registration more like file registrations, which includes
   flexibility in avoiding full set unregistration and re-registration.

 - Fix for io-wq affinity setting.

 - Be a bit more defensive in task->pf_io_worker setup.

 - Various SQPOLL fixes.

 - Cleanup of SQPOLL creds handling.

 - Improvements to in-flight request tracking.

 - File registration cleanups.

 - Tons of cleanups and little fixes
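
A minimal sketch of arming a multi-shot POLL request, assuming liburing
for the setup; with IORING_POLL_ADD_MULTI the request stays armed and
posts one CQE (flagged IORING_CQE_F_MORE) per readiness event:

    #include <liburing.h>
    #include <poll.h>

    static void arm_multishot_poll(struct io_uring *ring, int fd)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_poll_add(sqe, fd, POLLIN);
            sqe->len |= IORING_POLL_ADD_MULTI;  /* don't disarm after one CQE */
            io_uring_submit(ring);
    }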

* tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits)
  io_uring: maintain drain logic for multishot poll requests
  io_uring: Check current->io_uring in io_uring_cancel_sqpoll
  io_uring: fix NULL reg-buffer
  io_uring: simplify SQPOLL cancellations
  io_uring: fix work_exit sqpoll cancellations
  io_uring: Fix uninitialized variable up.resv
  io_uring: fix invalid error check after malloc
  io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
  kernel: always initialize task->pf_io_worker to NULL
  io_uring: update sq_thread_idle after ctx deleted
  io_uring: add full-fledged dynamic buffers support
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: prepare fixed rw for dynamic buffers
  io_uring: keep table of pointers to ubufs
  io_uring: add generic rsrc update with tags
  io_uring: add IORING_REGISTER_RSRC
  io_uring: enumerate dynamic resources
  io_uring: add generic path for rsrc update
  io_uring: preparation for rsrc tagging
  io_uring: decouple CQE filling from requests
  ...
2021-04-28 14:56:09 -07:00
Linus Torvalds 16b3d0cf5b Scheduler updates for this cycle

Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Clean up SCHED_DEBUG: move the decades-old mess of sysctl, procfs and
   debugfs interfaces to a unified debugfs interface.

 - Signals: Allow caching one sigqueue object per task, to improve
   performance & latencies.

 - Improve newidle_balance() irq-off latencies on systems with a large
   number of CPU cgroups.

 - Improve energy-aware scheduling

 - Improve the PELT metrics for certain workloads

 - Reintroduce select_idle_smt() to improve load-balancing locality -
   but without the previous regressions

 - Add 'scheduler latency debugging': warn after long periods of pending
   need_resched. This is an opt-in feature that requires the enabling of
   the LATENCY_WARN scheduler feature, or the use of the
   resched_latency_warn_ms=xx boot parameter.

 - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
   remaining balance_push() vs. hotplug holes/races

 - PSI fixes, plus allow /proc/pressure/ files to be written by
   CAP_SYS_RESOURCE tasks as well

 - Fix/improve various load-balancing corner cases vs. capacity margins

 - Fix sched topology on systems with NUMA diameter of 3 or above

 - Fix PF_KTHREAD vs to_kthread() race

 - Minor rseq optimizations

 - Misc cleanups, optimizations, fixes and smaller updates

* tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  cpumask/hotplug: Fix cpu_dying() state tracking
  kthread: Fix PF_KTHREAD vs to_kthread() race
  sched/debug: Fix cgroup_path[] serialization
  sched,psi: Handle potential task count underflow bugs more gracefully
  sched: Warn on long periods of pending need_resched
  sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
  sched/debug: Rename the sched_debug parameter to sched_verbose
  sched,fair: Alternative sched_slice()
  sched: Move /proc/sched_debug to debugfs
  sched,debug: Convert sysctl sched_domains to debugfs
  debugfs: Implement debugfs_create_str()
  sched,preempt: Move preempt_dynamic to debug.c
  sched: Move SCHED_DEBUG sysctl to debugfs
  sched: Don't make LATENCYTOP select SCHED_DEBUG
  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
  cpumask: Introduce DYING mask
  cpumask: Make cpu_{online,possible,present,active}() inline
  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
  ...
2021-04-28 13:33:57 -07:00
Linus Torvalds 42dec9a936 Perf events changes in this cycle

Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - Improve Intel uncore PMU support:

     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.

       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.

     - Add Alder Lake support

     - Improve IIO stacks to PMON mapping support on Skylake servers

 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.

   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.

 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.

 - Improve ring buffer allocation on NUMA systems

 - Put 'struct perf_event' into its own separate kmem_cache pool

 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:

     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.

     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.

     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with a u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT (see the sketch after this list).

   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.

 - Misc fixes, cleanups and smaller updates.
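
A minimal userspace sketch of the sigtrap interface described above
(error handling elided; field names follow the list above, exact uapi
details may differ):

    #include <linux/perf_event.h>
    #include <linux/hw_breakpoint.h>
    #include <sys/syscall.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static volatile int watched;

    static void trap_handler(int sig, siginfo_t *info, void *uc)
    {
            /* info->si_addr holds the breakpoint address and ::si_perf
             * the event-defined metadata mentioned above. */
    }

    int main(void)
    {
            struct sigaction sa = { .sa_sigaction = trap_handler,
                                    .sa_flags = SA_SIGINFO };
            struct perf_event_attr attr;

            sigaction(SIGTRAP, &sa, NULL);

            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_BREAKPOINT;
            attr.size = sizeof(attr);
            attr.bp_type = HW_BREAKPOINT_W;
            attr.bp_addr = (unsigned long)&watched;
            attr.bp_len = HW_BREAKPOINT_LEN_4;
            attr.sample_period = 1;
            attr.sigtrap = 1;           /* deliver synchronous SIGTRAP */
            attr.remove_on_exec = 1;    /* don't leak through exec() */

            syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
            watched = 1;                /* the write fires the breakpoint */
            return 0;
    }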

* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ...
2021-04-28 13:03:44 -07:00
Stefan Metzmacher ff24430330 kernel: always initialize task->pf_io_worker to NULL
Otherwise io_wq_worker_{running,sleeping}() may dereference an invalid
pointer (in the future). Currently all users of create_io_thread() are
fine and get task->pf_io_worker = NULL implicitly from the wq_manager,
which got it either from the userspace thread or the sq_thread, which
explicitly reset it to NULL.

I think it's safer to always reset it in order to avoid future
problems.
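
A sketch of the change (the real hunk lives in copy_process() in
kernel/fork.c):

    /* always start from a clean slate, whatever the caller was */
    p->pf_io_worker = NULL;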

Fixes: 3bfe610669 ("io-wq: fork worker threads from original task")
cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:29:03 -06:00
Ingo Molnar d0d252b8ca Linux 5.12-rc8

Merge tag 'v5.12-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-20 10:13:58 +02:00
Marco Elver 2b26f0aa00 perf: Support only inheriting events if cloned with CLONE_THREAD
Add the perf_event_attr::inherit_thread bit, which restricts inheriting
events to children cloned with CLONE_THREAD.

This option supports the case where an event is supposed to be
process-wide only (including subthreads), but should not propagate
beyond the current process's shared environment.
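
A minimal userspace sketch of using the new bit (error handling elided):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int open_thread_scoped_counter(void)
    {
            struct perf_event_attr attr = {
                    .type = PERF_TYPE_HARDWARE,
                    .size = sizeof(struct perf_event_attr),
                    .config = PERF_COUNT_HW_INSTRUCTIONS,
                    .inherit = 1,           /* children inherit the event */
                    .inherit_thread = 1,    /* but only CLONE_THREAD children */
            };

            return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }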

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/YBvj6eJR%2FDY2TsEB@hirez.programming.kicks-ass.net/
2021-04-16 16:32:40 +02:00
Thomas Gleixner 4bad58ebc8 signal: Allow tasks to cache one sigqueue struct
The idea for this originates from the real-time tree to make signal
delivery for realtime applications more efficient. In quite a few of
these application scenarios a control task signals workers to start
their computations. There is usually only one signal per worker in
flight. This works nicely as long as the kmem cache allocations do not
hit the slow path and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task
always goes to the kmem cache allocator.
task exits and is freed when task::sighand is dropped.
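
An illustrative sketch of the mechanism (locking and RLIMIT accounting
elided; the real code serializes access via sighand->siglock and differs
in detail):

    static struct sigqueue *sigqueue_alloc_cached(struct task_struct *t,
                                                  gfp_t gfp)
    {
            struct sigqueue *q = t->sigqueue_cache; /* target's parked entry */

            if (q)
                    t->sigqueue_cache = NULL;
            else
                    q = kmem_cache_alloc(sigqueue_cachep, gfp);
            return q;
    }

    static void sigqueue_cache_or_free(struct sigqueue *q)
    {
            if (!current->sigqueue_cache)
                    current->sigqueue_cache = q;    /* park for next signal */
            else
                    kmem_cache_free(sigqueue_cachep, q);
    }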

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large
number of pointlessly cached sigqueues in live tasks. The other concern
was RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a
regular distro install there are fewer than 10 sigqueues cached in
~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then it would need an extra counter in struct
user, more atomic instructions and a separate rlimit - yet another
tunable which would be mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64-CPU machine with make -j128 the number of 'allocations' looks like
this:

  From slab:	   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free:   cache ffff8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149 exit_task_sighand: free ffff8881132df460
  bash-1149 __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the
sigqueue from bash's task cache to signal exit and bash sticks it back
into its own cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the
benefit for latency-sensitive applications is clear. While kmem caches
usually serve from the fast path, slab merging (the default) can,
depending on the usage pattern of the merged slabs, cause occasional
slow-path allocations.

The time saved per cached entry is a few microseconds per signal, which
is not relevant for e.g. a kernel build, but for signal-heavy workloads
it is measurable.

As there is no real downside to this caching mechanism, making it
unconditionally available is preferred over more conditional code or
new magic tunables.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
2021-04-14 18:04:08 +02:00
Jens Axboe 66ae0d1e2d kernel: allow fork with TIF_NOTIFY_SIGNAL pending
fork() fails if signal_pending() is true, but there are two conditions
that can lead to that:

1) An actual signal is pending. We want fork to fail for that one, like
   we always have.

2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
   We don't need to make it fail for that case.

Allow fork() to proceed if just task_work is pending, by changing the
signal_pending() check to task_sigpending().
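
A sketch of the resulting check in copy_process(); task_sigpending()
looks only at TIF_SIGPENDING, while signal_pending() also reacts to
TIF_NOTIFY_SIGNAL:

    retval = -ERESTARTNOINTR;
    if (task_sigpending(current))   /* was: signal_pending(current) */
            goto bad_fork_cancel_cgroup;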

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Linus Torvalds b44d1ddcf8 io_uring-5.12-2021-03-27

Merge tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Use thread info versions of flag testing, as discussed last week.

 - The series enabling PF_IO_WORKER to just take signals, instead of
   needing to special case that they do not in a bunch of places. Ends
   up being pretty trivial to do, and then we can revert all the special
   casing we're currently doing.

 - Kill dead pointer assignment

 - Fix hashed part of async work queue trace

 - Fix sign extension issue for IORING_OP_PROVIDE_BUFFERS

 - Fix a link completion ordering regression in this merge window

 - Cancellation fixes

* tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block:
  io_uring: remove unused assignment to pointer io
  io_uring: don't cancel extra on files match
  io_uring: don't cancel-track common timeouts
  io_uring: do post-completion chore on t-out cancel
  io_uring: fix timeout cancel return code
  Revert "signal: don't allow STOP on PF_IO_WORKER threads"
  Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"
  Revert "kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals"
  Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"
  kernel: stop masking signals in create_io_thread()
  io_uring: handle signals for IO threads like a normal thread
  kernel: don't call do_exit() for PF_IO_WORKER threads
  io_uring: maintain CQE order of a failed link
  io-wq: fix race around pending work on teardown
  io_uring: do ctx sqd ejection in a clear context
  io_uring: fix provide_buffers sign extension
  io_uring: don't skip file_end_write() on reissue
  io_uring: correct io_queue_async_work() traces
  io_uring: don't use {test,clear}_tsk_thread_flag() for current
2021-03-28 11:42:05 -07:00
Jens Axboe b16b3855d8 kernel: stop masking signals in create_io_thread()
This is racy - move the signal blocking to where the task is created
and marked as PF_IO_WORKER anyway. The IO threads are now
prepared to handle signals like SIGSTOP as well, so clear that from
the mask to allow proper stopping of IO threads.
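
A sketch of the creation-time setup in copy_process(), assuming the
io_thread flag in kernel_clone_args:

    if (args->io_thread) {
            /* block everything except what's needed to stop/kill us */
            p->flags |= PF_IO_WORKER;
            siginitsetinv(&p->blocked,
                          sigmask(SIGKILL) | sigmask(SIGSTOP));
    }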

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:10 -06:00
David S. Miller efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Linus Torvalds 0ada2dad8b io_uring-5.12-2021-03-19

Merge tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Quieter week this time, which was both expected and desired. About
  half of the below is fixes for this release, the other half are just
  fixes in general. In detail:

   - Fix the freezing of IO threads, by making the freezer not send them
     fake signals. Make them freezable by default.

   - Like we did for personalities, move the buffer IDR to xarray. Kills
     some code and avoids a use-after-free on teardown.

   - SQPOLL cleanups and fixes (Pavel)

   - Fix linked timeout race (Pavel)

   - Fix potential completion post use-after-free (Pavel)

   - Cleanup and move internal structures outside of general kernel view
     (Stefan)

   - Use MSG_SIGNAL for send/recv from io_uring (Stefan)"

* tag 'io_uring-5.12-2021-03-19' of git://git.kernel.dk/linux-block:
  io_uring: don't leak creds on SQO attach error
  io_uring: use typesafe pointers in io_uring_task
  io_uring: remove structures from include/linux/io_uring.h
  io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls
  io_uring: fix sqpoll cancellation via task_work
  io_uring: add generic callback_head helpers
  io_uring: fix concurrent parking
  io_uring: halt SQO submission on ctx exit
  io_uring: replace sqd rw_semaphore with mutex
  io_uring: fix complete_post use ctx after free
  io_uring: fix ->flags races by linked timeouts
  io_uring: convert io_buffer_idr to XArray
  io_uring: allow IO worker threads to be frozen
  kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing
2021-03-19 17:01:09 -07:00
Linus Torvalds 50eb842fe5 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "28 patches.

  Subsystems affected by this series: mm (memblock, pagealloc, hugetlb,
  highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and
  zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and
  ia64"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits)
  zram: fix broken page writeback
  zram: fix return value on writeback_store
  mm/memcg: set memcg when splitting page
  mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
  ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
  ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
  mm/userfaultfd: fix memory corruption due to writeprotect
  kasan: fix KASAN_STACK dependency for HW_TAGS
  kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
  mm/madvise: replace ptrace attach requirement for process_madvise
  include/linux/sched/mm.h: use rcu_dereference in in_vfork()
  kfence: fix reports if constant function prefixes exist
  kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
  kfence: fix printk format for ptrdiff_t
  linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
  MAINTAINERS: exclude uapi directories in API/ABI section
  binfmt_misc: fix possible deadlock in bm_register_write
  mm/highmem.c: fix zero_user_segments() with start > end
  hugetlb: do early cow when page pinned on src mm
  mm: use is_cow_mapping() across tree where proper
  ...
2021-03-14 12:23:34 -07:00
Fenghua Yu 82e69a121b mm/fork: clear PASID for new mm
When a new mm is created, its PASID should be cleared, i.e.  the PASID is
initialized to its init state 0 on both ARM and X86.

This patch was part of the series introducing mm->pasid, but got lost
along the way [1].  It still makes sense to have it, because each address
space has a different PASID.  And the IOMMU code in
iommu_sva_alloc_pasid() expects the pasid field of a new mm struct to be
cleared.

[1] https://lore.kernel.org/linux-iommu/YDgh53AcQHT+T3L0@otcwcpicx3.sc.intel.com/
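
A sketch of the fix as a small helper in kernel/fork.c, assuming an
INIT_PASID define for the init state 0:

    static void mm_init_pasid(struct mm_struct *mm)
    {
    #ifdef CONFIG_IOMMU_SUPPORT
            mm->pasid = INIT_PASID;     /* 0 on both ARM and x86 */
    #endif
    }
    /* called from mm_init() alongside the other mm_init_*() helpers */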

Link: https://lkml.kernel.org/r/20210302103837.2562625-1-jean-philippe@linaro.org
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13 11:27:30 -08:00
Jens Axboe 16efa4fce3 io_uring: allow IO worker threads to be frozen
With the freezer using the proper signaling to notify us of when it's
time to freeze a thread, we can re-enable normal freezer usage for the
IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call
try_to_freeze() appropriately, and remove the default setting of
PF_NOFREEZE from create_io_thread().
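
The worker loops then cooperate with the freezer roughly like this
(simplified shape only, not the actual io-wq code; the helpers are
hypothetical stand-ins):

    while (!io_worker_should_stop()) {          /* hypothetical */
            if (try_to_freeze())                /* park during suspend */
                    continue;
            handle_pending_work();              /* hypothetical */
            schedule_if_idle();                 /* hypothetical */
    }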

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-12 20:26:13 -07:00
Jens Axboe e22bc9b481 kernel: make IO threads unfreezable by default
The io-wq threads were already marked as no-freeze, but the manager was
not. On resume, we perpetually have signal_pending() being true, and
hence the manager will loop and spin 100% of the time.

Just mark the tasks created by create_io_thread() as PF_NOFREEZE by
default, and remove any knowledge of it in io-wq and io_uring.

Reported-by: Kevin Locke <kevin@kevinlocke.name>
Tested-by: Kevin Locke <kevin@kevinlocke.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 07:28:43 -07:00
David S. Miller c1acda9807 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-03-09

The following pull-request contains BPF updates for your *net-next* tree.

We've added 90 non-merge commits during the last 17 day(s) which contain
a total of 114 files changed, 5158 insertions(+), 1288 deletions(-).

The main changes are:

1) Faster bpf_redirect_map(), from Björn.

2) skmsg cleanup, from Cong.

3) Support for floating point types in BTF, from Ilya.

4) Documentation for sys_bpf commands, from Joe.

5) Support for sk_lookup in bpf_prog_test_run, from Lorenz.

6) Enable task local storage for tracing programs, from Song.

7) bpf_for_each_map_elem() helper, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-09 18:07:05 -08:00