Commit Graph

172 Commits

Author SHA1 Message Date
Radostin Stoyanov ce70e6b0f4 kthread: unpark only parked kthread
JIRA: https://issues.redhat.com/browse/RHEL-63788
CVE: CVE-2024-50019

commit 214e01ad4ed7158cab66498810094fac5d09b218
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri Sep 13 23:46:34 2024 +0200

    kthread: unpark only parked kthread

    Calling into kthread unparking unconditionally is mostly harmless when
    the kthread is already unparked. The wake up is then simply ignored
    because the target is not in TASK_PARKED state.

    However if the kthread is per CPU, the wake up is preceded by a call
    to kthread_bind() which expects the task to be inactive and in
    TASK_PARKED state, which obviously isn't the case if it is unparked.

    As a result, calling kthread_stop() on an unparked per-cpu kthread
    triggers such a warning:

            WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525
             <TASK>
             kthread_stop+0x17a/0x630 kernel/kthread.c:707
             destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810
             wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257
             netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693
             default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769
             ops_exit_list net/core/net_namespace.c:178 [inline]
             cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640
             process_one_work kernel/workqueue.c:3231 [inline]
             process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
             worker_thread+0x86d/0xd70 kernel/workqueue.c:3393
             kthread+0x2f0/0x390 kernel/kthread.c:389
             ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
             ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
             </TASK>

    Fix this by skipping the unnecessary unparking while stopping a kthread.
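
    The shape of the fix in kthread_stop(), sketched (flag names as in
    kernel/kthread.c):

            /* Only unpark the kthread if it was actually parked. */
            kthread = to_kthread(k);
            set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
            if (test_bit(KTHREAD_SHOULD_PARK, &kthread->flags))
                    kthread_unpark(k);
            wake_up_process(k);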

    Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org
    Fixes: 5c25b5ff89 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
    Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-11-25 12:31:48 +00:00
Audra Mitchell 8947f5b14c lazy tlb: introduce lazy tlb mm refcount helper functions
JIRA: https://issues.redhat.com/browse/RHEL-55462

This patch is a backport of the following upstream commit:
commit aa464ba9a1e444d5ef95bb63ee3b2ef26fc96ed7
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:34 2023 +1000

    lazy tlb: introduce lazy tlb mm refcount helper functions

    Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
    This makes the lazy tlb mm references more obvious, and allows the
    refcounting scheme to be modified in later changes.  There is no
    functional change with this patch.
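
    For reference, a sketch of the helpers as introduced; at this point
    they simply wrap the plain refcount operations:

            static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
            {
                    mmgrab(mm);
            }

            static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
            {
                    mmdrop(mm);
            }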

    Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
Audra Mitchell 4d3fb3dd76 kthread: simplify kthread_use_mm refcounting
JIRA: https://issues.redhat.com/browse/RHEL-55462

This patch is a backport of the following upstream commit:
commit 6cad87b0d216c6acdc40c5531c7b62db33fef5b1
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:33 2023 +1000

    kthread: simplify kthread_use_mm refcounting

    Patch series "shoot lazy tlbs (lazy tlb refcount scalability
    improvement)", v7.

    This series improves scalability of context switching between user and
    kernel threads on large systems with a threaded process spread across a
    lot of CPUs.

    Discussion of v6 here:
    https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/

    This patch (of 5):

    Remove the special case avoiding refcounting when the mm to be used is the
    same as the kernel thread's active (lazy tlb) mm.  kthread_use_mm() should
    not be such a performance critical path that this matters much.  This
    simplifies a later change to lazy tlb mm refcounting.
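
    The shape of the change in kthread_use_mm(), sketched:

            /* before: skip the reference when mm is already the lazy mm */
            if (active_mm != mm)
                    mmgrab(mm);

            /* after: always take a real reference on the new mm */
            mmgrab(mm);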

    Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com
    Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
Jerry Snitselaar f967ec8f33 kthread: add kthread_stop_put
JIRA: https://issues.redhat.com/browse/RHEL-55466
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Conflicts: cd7272215c44 ("accel/ivpu: Add command buffer submission logic") not backported
           to RHEL. fe4f7940d212 ("gfs2: Fix asynchronous thread destruction") not backported
           to RHEL. The dma-buf and drm changes were previously backported in the
           9ec5e27a98 and 01dac368a4 updates. Dropped unsupported xen-netback bits.

commit 6309727ef27162deabd5c095c11af24970fba5a2
Author: Andreas Gruenbacher <agruenba@redhat.com>
Date:   Fri Sep 8 01:40:48 2023 +0200

    kthread: add kthread_stop_put

    Add a kthread_stop_put() helper that stops a thread and puts its task
    struct.  Use it to replace the various instances of kthread_stop()
    followed by put_task_struct().

    Remove the kthread_stop_put() macro in usbip that is similar but doesn't
    return the result of kthread_stop().
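
    The new helper is essentially (sketch):

            int kthread_stop_put(struct task_struct *k)
            {
                    int ret;

                    ret = kthread_stop(k);
                    put_task_struct(k);
                    return ret;
            }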

    [agruenba@redhat.com: fix kerneldoc comment]
      Link: https://lkml.kernel.org/r/20230911111730.2565537-1-agruenba@redhat.com
    [akpm@linux-foundation.org: document kthread_stop_put()'s argument]
    Link: https://lkml.kernel.org/r/20230907234048.2499820-1-agruenba@redhat.com
    Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 6309727ef27162deabd5c095c11af24970fba5a2)
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
2024-09-20 12:26:34 -07:00
Nico Pache fa294ea05c kunit: Handle test faults
commit 3a35c13007dea132a65f07de05c26b87837fadc2
Author: Mickaël Salaün <mic@digikod.net>
Date:   Mon Apr 8 09:46:22 2024 +0200

    kunit: Handle test faults

    Previously, when a kernel test thread crashed (e.g. NULL pointer
    dereference, general protection fault), the KUnit test hung for 30
    seconds and exited with a timeout error.

    Fix this issue by waiting on task_struct->vfork_done instead of the
    custom kunit_try_catch.try_completion, and track the execution state by
    initially setting try_result to -EINTR and only setting it to 0 if
    the test passed.

    Fix the kunit_generic_run_threadfn_adapter() signature by returning 0
    instead of calling kthread_complete_and_exit().  Because the thread's
    exit code is never checked, always set it to 0 to make that clear.  To
    make this explicit, export kthread_exit() for KUnit tests built as a
    module.

    Fix the -EINTR error message, which couldn't be reached until now.

    This is tested with a following patch.
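
    A sketch of the new wait, with field and helper names taken from the
    description above (the timeout handling shown here is an assumption):

            /* vfork_done is completed on any exit path, even on a crash. */
            task_done = task_struct->vfork_done;
            time_remaining = wait_for_completion_timeout(task_done,
                                                         kunit_test_timeout());
            if (time_remaining == 0) {
                    try_catch->try_result = -ETIMEDOUT;
                    kthread_stop(task_struct);
            }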

    Cc: Brendan Higgins <brendanhiggins@google.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Shuah Khan <skhan@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: David Gow <davidgow@google.com>
    Tested-by: Rae Moar <rmoar@google.com>
    Signed-off-by: Mickaël Salaün <mic@digikod.net>
    Link: https://lore.kernel.org/r/20240408074625.65017-5-mic@digikod.net
    Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-39303
Signed-off-by: Nico Pache <npache@redhat.com>
2024-07-31 20:32:29 -06:00
Waiman Long 50392f94c5 treewide: Drop WARN_ON_FUNCTION_MISMATCH
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4b24356312fbe1bace72f9905d529b14fc34c1c3
Author: Sami Tolvanen <samitolvanen@google.com>
Date:   Thu, 8 Sep 2022 14:54:56 -0700

    treewide: Drop WARN_ON_FUNCTION_MISMATCH

    CONFIG_CFI_CLANG no longer breaks cross-module function address
    equality, which makes WARN_ON_FUNCTION_MISMATCH unnecessary. Remove
    the definition and switch back to WARN_ON_ONCE.
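
    The resulting change at the kthread call site, sketched:

            /* before */
            WARN_ON_FUNCTION_MISMATCH(timer->function, kthread_delayed_work_timer_fn);
            /* after */
            WARN_ON_ONCE(timer->function != kthread_delayed_work_timer_fn);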

    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kees Cook <keescook@chromium.org>
    Tested-by: Nathan Chancellor <nathan@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220908215504.3686827-15-samitolvanen@google.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Phil Auld 6633de32e0 sched/wait: Fix a kthread_park race with wait_woken()
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit ef73d6a4ef0b35524125c3cfc6deafc26a0c966a
Author: Arve Hjønnevåg <arve@android.com>
Date:   Fri Jun 2 21:23:46 2023 +0000

    sched/wait: Fix a kthread_park race with wait_woken()

    kthread_park and wait_woken have a similar race that
    kthread_stop and wait_woken used to have before it was fixed in
    commit cb6538e740 ("sched/wait: Fix a kthread race with
    wait_woken()"). Extend that fix to also cover kthread_park.

    [jstultz: Made changes suggested by Peter to optimize
     memory loads]

    Signed-off-by: Arve Hjønnevåg <arve@android.com>
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20230602212350.535358-1-jstultz@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:30:59 -04:00
Chris von Recklinghausen ea2fa2fb80 uaccess: remove CONFIG_SET_FS
Conflicts: in arch/, only keep changes to arch/Kconfig and
	arch/arm64/kernel/traps.c. All other arch files in the upstream version
	of this patch are dropped.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 967747bbc084b93b54e66f9047d342232314cd25
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Feb 11 21:42:45 2022 +0100

    uaccess: remove CONFIG_SET_FS

    There are no remaining callers of set_fs(), so CONFIG_SET_FS
    can be removed globally, along with the thread_info field and
    any references to it.

    This turns access_ok() into a cheaper check against TASK_SIZE_MAX.

    As CONFIG_SET_FS is now gone, drop all remaining references to
    set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
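
    With the address limit gone, the generic check reduces to roughly the
    following (a sketch, not the exact asm-generic implementation):

            static inline int __access_ok(unsigned long addr, unsigned long size)
            {
                    /* constant bound, no more get_fs() limit to consult */
                    return size <= TASK_SIZE_MAX && addr <= TASK_SIZE_MAX - size;
            }

            #define access_ok(addr, size) \
                    likely(__access_ok((unsigned long)(addr), (size)))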

    Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
    Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
    Acked-by: Dinh Nguyen <dinguyen@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen 2d6179b3cd kthread: Generalize pf_io_worker so it can point to struct kthread
Bugzilla: https://bugzilla.redhat.com/2120352

commit e32cf5dfbe227b355776948b2c9b5691b84d1cbd
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Dec 22 22:10:09 2021 -0600

    kthread: Generalize pf_io_worker so it can point to struct kthread

    The point of using set_child_tid to hold the kthread pointer was that
    it already did what is necessary.  There are now restrictions on when
    set_child_tid can be initialized and when set_child_tid can be used in
    schedule_tail, which indicates that continuing to use set_child_tid
    to hold the kthread pointer is a bad idea.

    Instead of continuing to use the set_child_tid field of task_struct
    generalize the pf_io_worker field of task_struct and use it to hold
    the kthread pointer.

    Rename pf_io_worker (which is a void * pointer) to worker_private so
    it can be used to store a kthread's struct kthread pointer.  Update the
    kthread code to store the kthread pointer in the worker_private field.
    Remove the places where set_child_tid had to be dealt with carefully
    because kthreads also used it.
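
    The task_struct side of the change, sketched:

            struct task_struct {
                    /* ... */
                    void *worker_private;   /* was pf_io_worker; kthreads
                                               store their struct kthread
                                               here */
                    /* ... */
            };

            static inline struct kthread *to_kthread(struct task_struct *k)
            {
                    WARN_ON(!(k->flags & PF_KTHREAD));
                    return k->worker_private;
            }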

    Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:35 -04:00
Chris von Recklinghausen 76ac069af7 exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5eb6f22823e023ee13391978c329c4bd18f82552
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Dec 14 11:25:01 2021 -0600

    exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit

    I misspelled kthread_complete_and_exit in the kerneldoc comment; fix
    it.

    Link: https://lkml.kernel.org/r/202112141329.KBkyJ5ql-lkp@intel.com
    Link: https://lkml.kernel.org/r/202112141422.Cykr6YUS-lkp@intel.com
    Reported-by: kernel test robot <lkp@intel.com>
    Fixes: cead18552660 ("exit: Rename complete_and_exit to kthread_complete_and_exit")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen ea8c54be27 exit/kthread: Move the exit code for kernel threads into struct kthread
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6b1248798eb6f6d5285db214299996ecc5dc1e6b
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Dec 3 11:42:49 2021 -0600

    exit/kthread: Move the exit code for kernel threads into struct kthread

    The exit code of kernel threads has different semantics than the
    exit_code of userspace tasks.  To avoid confusion and allow
    the userspace implementation to change as needed, move
    the kernel thread exit code into struct kthread.
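
    Sketch of the change (field name as used upstream):

            struct kthread {
                    /* ... */
                    int result;     /* threadfn() return value; read back
                                       by kthread_stop() */
                    /* ... */
            };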

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen e1e51160dc kthread: Ensure struct kthread is present for all kthreads
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40966e316f86b8cfd83abd31ccb4df729309d3e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Dec 2 09:56:14 2021 -0600

    kthread: Ensure struct kthread is present for all kthreads

    Today the rules are a bit iffy and arbitrary about which kernel
    threads have struct kthread present.  Both idle threads and threads
    started with create_kthread want struct kthread present, so that is
    effectively all kernel threads.  Make the rule that if PF_KTHREAD is
    set and the task is running, then struct kthread is present.

    This will allow the kernel thread code to use tsk->exit_code
    with different semantics from ordinary processes.

    To ensure that struct kthread is present for all
    kernel threads, move its allocation into copy_process.

    Add a deallocation of struct kthread in exec for processes
    that were kernel threads.

    Move the allocation of struct kthread for the initial thread
    earlier so that it is not repeated for each additional idle
    thread.

    Move the initialization of struct kthread into set_kthread_struct
    so that the structure is always and reliably initialized.

    Clear set_child_tid in free_kthread_struct to ensure the kthread
    struct is reliably freed during exec.  The function
    free_kthread_struct does not need to clear vfork_done during exec as
    exec_mm_release called from exec_mmap has already cleared vfork_done.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen 92bc08e714 exit: Rename complete_and_exit to kthread_complete_and_exit
Bugzilla: https://bugzilla.redhat.com/2120352

commit cead18552660702a4a46f58e65188fe5f36e9dfe
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Nov 22 11:15:19 2021 -0600

    exit: Rename complete_and_exit to kthread_complete_and_exit

    Update complete_and_exit to call kthread_exit instead of do_exit.

    Change the name to reflect this change in functionality.  All of the
    users of complete_and_exit are causing the current kthread to exit, so
    this change makes it clear what is happening.

    Move the implementation of kthread_complete_and_exit from
    kernel/exit.c to kernel/kthread.c.  As this function is kthread
    specific, it makes the most sense for it to live with the other
    kthread functions.

    There is no functional change.
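
    The moved function is small; a sketch:

            void __noreturn kthread_complete_and_exit(struct completion *comp,
                                                      long code)
            {
                    if (comp)
                            complete(comp);

                    kthread_exit(code);
            }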

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen acf23bfa03 exit: Implement kthread_exit
Bugzilla: https://bugzilla.redhat.com/2120352

commit bbda86e988d4c124e4cfa816291cbd583ae8bfb1
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Nov 22 10:27:36 2021 -0600

    exit: Implement kthread_exit

    The way the per-task_struct exit_code is used by kernel threads is not
    quite compatible with how it is used by userspace applications.  The
    low byte of the userspace exit_code value encodes the exit signal,
    while kthreads just use the value as an int holding an ordinary kernel
    function exit status like -EPERM.

    Add kthread_exit to clearly separate the two kinds of uses.
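
    At this point the new entry point is a thin wrapper (sketch):

            void __noreturn kthread_exit(long result)
            {
                    do_exit(result);
            }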

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Al Stone a8d833d2e8 exit/kthread: Have kernel threads return instead of calling do_exit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071830
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2071840
Tested: This is one of a series of patch sets to enable Arm SystemReady IR
 support in the kernel for NXP i.MX8 platforms.  At this stage, this
 has been tested by ensuring we can survive the CI/CD loop -- i.e.,
 that we have not broken anything else, and a simple boot test.  When
 sufficient drivers have been brought in for i.MX8M, we will be able
 to run further tests.

commit 111e70490d2a673730b89c010b61cea2d982d121
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Oct 20 12:43:58 2021 -0500

    exit/kthread: Have kernel threads return instead of calling do_exit

    In 2009 Oleg reworked[1] the kernel threads so that it is not
    necessary to call do_exit if you are not using kthread_stop().  Remove
    the explicit calls of do_exit and complete_and_exit (with a NULL
    completion) that were previously necessary.

    [1] 63706172f3 ("kthreads: rework kthread_stop()")
    Link: https://lkml.kernel.org/r/20211020174406.17889-12-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    (cherry picked from commit 111e70490d2a673730b89c010b61cea2d982d121)

Signed-off-by: Al Stone <ahs3@redhat.com>
2022-08-25 10:44:58 -06:00
Ming Lei dd34bed622 kthread: unexport kthread_blkcg
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit f624506f98b198e65b44da303f44974590fb16c0
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Apr 20 06:27:23 2022 +0200

    kthread: unexport kthread_blkcg

    kthread_blkcg is only used by the built-in blk-cgroup code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220420042723.1010598-16-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:58:04 +08:00
Waiman Long 695b8f5f5a kthread: add the helper function kthread_run_on_cpu()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit 800977f6f32e452cba6b04ef21d2f5383ca29209
Author: Cai Huoqing <cai.huoqing@linux.dev>
Date:   Fri, 14 Jan 2022 14:02:52 -0800

    kthread: add the helper function kthread_run_on_cpu()

    Add a new helper function kthread_run_on_cpu(), which combines
    kthread_create_on_cpu() and wake_up_process().

    In some cases, use kthread_run_on_cpu() directly instead of
    kthread_create_on_node/kthread_bind/wake_up_process() or
    kthread_create_on_cpu/wake_up_process() or
    kthreadd_create/kthread_bind/wake_up_process() to simplify the code.
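
    A sketch of the helper:

            static inline struct task_struct *
            kthread_run_on_cpu(int (*threadfn)(void *data), void *data,
                               unsigned int cpu, const char *namefmt)
            {
                    struct task_struct *p;

                    p = kthread_create_on_cpu(threadfn, data, cpu, namefmt);
                    if (!IS_ERR(p))
                            wake_up_process(p);

                    return p;
            }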

    [akpm@linux-foundation.org: export kthread_create_on_cpu to modules]

    Link: https://lkml.kernel.org/r/20211022025711.3673-2-caihuoqing@baidu.com
    Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
    Cc: Bernard Metzler <bmt@zurich.ibm.com>
    Cc: Cai Huoqing <caihuoqing@baidu.com>
    Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Doug Ledford <dledford@redhat.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: "Paul E . McKenney" <paulmck@kernel.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:26:10 -04:00
Phil Auld 1cf795c344 sched/isolation: Use single feature type while referring to housekeeping cpumask
Bugzilla: http://bugzilla.redhat.com/2065222

commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Feb 7 16:59:06 2022 +0100

    sched/isolation: Use single feature type while referring to housekeeping cpumask

    Refer to housekeeping APIs using single feature types instead of flags.
    This prevents passing multiple isolation features at once to
    housekeeping interfaces, which soon won't be possible anymore as each
    isolation feature will have its own cpumask.
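
    A typical call-site conversion, sketched (assuming the HK_TYPE_*
    naming this series introduces):

            /* before: several isolation flags could be OR'ed together */
            cpumask = housekeeping_cpumask(HK_FLAG_DOMAIN | HK_FLAG_WQ);

            /* after: exactly one feature type per query */
            cpumask = housekeeping_cpumask(HK_TYPE_DOMAIN);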

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-31 10:40:39 -04:00
Phil Auld e92f885fca kthread: Move prio/affinite change into the newly created thread
Bugzilla: http://bugzilla.redhat.com/2020279

commit 1a7243ca4074beed97b68d7235a6e34862fc2cd6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Nov 10 12:38:47 2020 +0100

    kthread: Move prio/affinite change into the newly created thread

    With threaded interrupts enabled, the nouveau driver reported the
    following:

    | Chain exists of:
    |   &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
    |
    |  Possible unsafe locking scenario:
    |
    |        CPU0                    CPU1
    |        ----                    ----
    |   lock(&cpuset_rwsem);
    |                                lock(&device->mutex);
    |                                lock(&cpuset_rwsem);
    |   lock(&mm->mmap_lock#2);

    The device->mutex is nvkm_device::mutex.

    Unblocking the lockchain at `cpuset_rwsem' is probably the easiest
    thing to do.  Move the priority reset to the start of the newly
    created thread.
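
    Sketch of the move: the setup that kthread_create() used to apply to
    the new task from the creating context now runs in the kthread() entry
    function, in the new thread's own context (the affinity mask shown is
    an assumption):

            /* in kthread(), before threadfn() is called: */
            sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
            set_cpus_allowed_ptr(current,
                                 housekeeping_cpumask(HK_FLAG_KTHREAD));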

    Fixes: 710da3c8ea ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()")
    Reported-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:45 -05:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Petr Mladek d71ba1649f kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
kthread_mod_delayed_work() might race with
kthread_cancel_delayed_work_sync() or another kthread_mod_delayed_work()
call.  The function lets the other operation win when it sees the
work->canceling counter set, and it returns @false.

But it should return @true as it is done by the related workqueue API, see
mod_delayed_work_on().

The reason is that the return value might be used for reference counting.
It has to distinguish whether the number of queued works has changed or
stayed the same.

The change is safe.  The kthread_mod_delayed_work() return value is not
checked anywhere at the moment.
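
The fix, sketched: when kthread_mod_delayed_work() loses to a canceling
caller, report the work as still pending:

        if (work->canceling) {
                /* The number of queued works does not change; match
                 * mod_delayed_work_on() and return true. */
                ret = true;
                goto out;
        }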

Link: https://lore.kernel.org/r/20210521163526.GA17916@redhat.com
Link: https://lkml.kernel.org/r/20210610133051.15337-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@google.com>
Cc: <jenhaochen@google.com>
Cc: Martin Liu <liumartin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:45 -07:00
Linus Torvalds 54a728dc5e Scheduler updates for this cycle:
- Changes to core scheduling facilities:
 
     - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
       coordinated scheduling across SMT siblings. This is a much
       requested feature for cloud computing platforms, to allow
       the flexible utilization of SMT siblings, without exposing
       untrusted domains to information leaks & side channels, plus
       to ensure more deterministic computing performance on SMT
       systems used by heterogeneous workloads.
 
       There's new prctls to set core scheduling groups, which
       allows more flexible management of workloads that can share
       siblings.
 
     - Fix task->state access anti-patterns that may result in missed
       wakeups and rename it to ->__state in the process to catch new
       abuses.
 
  - Load-balancing changes:
 
      - Tweak newidle_balance for fair-sched, to improve
        'memcache'-like workloads.
 
      - "Age" (decay) average idle time, to better track & improve workloads
        such as 'tbench'.
 
      - Fix & improve energy-aware (EAS) balancing logic & metrics.
 
      - Fix & improve the uclamp metrics.
 
      - Fix task migration (taskset) corner case on !CONFIG_CPUSET.
 
      - Fix RT and deadline utilization tracking across policy changes
 
      - Introduce a "burstable" CFS controller via cgroups, which allows
        bursty CPU-bound workloads to borrow a bit against their future
        quota to improve overall latencies & batching. Can be tweaked
        via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
 
      - Rework asymmetric topology/capacity detection & handling.
 
  - Scheduler statistics & tooling:
 
      - Disable delayacct by default, but add a sysctl to enable
        it at runtime if tooling needs it. Use static keys and
        other optimizations to make it more palatable.
 
      - Use sched_clock() in delayacct, instead of ktime_get_ns().
 
  - Misc cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
 vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
 vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
 b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
 4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
 Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
 5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
 3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
 GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
 ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
 +U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
 UmG7bt94Trk=
 =3VDr
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks & side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogeneous workloads.

      There are new prctls to set core scheduling groups, which allows
      more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
      wakeups and rename it to ->__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track & improve
      workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies & batching. Can be tweaked via
      /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

 - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
2021-06-28 12:14:19 -07:00
Petr Mladek 5fa54346ca kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
The system might hang with the following backtrace:

	schedule+0x80/0x100
	schedule_timeout+0x48/0x138
	wait_for_common+0xa4/0x134
	wait_for_completion+0x1c/0x2c
	kthread_flush_work+0x114/0x1cc
	kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
	kthread_cancel_delayed_work_sync+0x18/0x2c
	xxxx_pm_notify+0xb0/0xd8
	blocking_notifier_call_chain_robust+0x80/0x194
	pm_notifier_call_chain_robust+0x28/0x4c
	suspend_prepare+0x40/0x260
	enter_state+0x80/0x3f4
	pm_suspend+0x60/0xdc
	state_store+0x108/0x144
	kobj_attr_store+0x38/0x88
	sysfs_kf_write+0x64/0xc0
	kernfs_fop_write_iter+0x108/0x1d0
	vfs_write+0x2f4/0x368
	ksys_write+0x7c/0xec

It is caused by the following race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync():

CPU0				CPU1

Context: Thread A		Context: Thread B

kthread_mod_delayed_work()
  spin_lock()
  __kthread_cancel_work()
     spin_unlock()
     del_timer_sync()
				kthread_cancel_delayed_work_sync()
				  spin_lock()
				  __kthread_cancel_work()
				    spin_unlock()
				    del_timer_sync()
				    spin_lock()

				  work->canceling++
				  spin_unlock
     spin_lock()
   queue_delayed_work()
     // dwork is put into the worker->delayed_work_list

   spin_unlock()

				  kthread_flush_work()
     // flush_work is put at the tail of the dwork

				    wait_for_completion()

Context: IRQ

  kthread_delayed_work_timer_fn()
    spin_lock()
    list_del_init(&work->node);
    spin_unlock()

BANG: flush_work is no longer linked and will never be processed.

The problem is that kthread_mod_delayed_work() checks the work->canceling
flag before canceling the timer.

A simple solution is to (re)check work->canceling after
__kthread_cancel_work().  But then it is not clear what should be
returned when __kthread_cancel_work() removed the work from the queue
(list) and it can't queue it again with the new @delay.

The return value might be used for reference counting.  The caller has
to know whether a new work has been queued or an existing one was
replaced.

The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set.  The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.

Note that kthread_mod_delayed_work() could remove the timer and then
bail out.  It is fine.  The other canceling caller needs to cancel the
timer as well.  The important thing is that the queue (list)
manipulation is done atomically under worker->lock.
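
A sketch of the resulting kthread_mod_delayed_work() flow (the timer
helper comes from the preparation patch below):

        raw_spin_lock_irqsave(&worker->lock, flags);

        /* Stop the timer; this may temporarily drop worker->lock. */
        kthread_cancel_delayed_work_timer(work, &flags);

        /* Re-check under the lock: a canceling caller wins. */
        if (work->canceling)
                goto out;

        /* All queue (list) manipulation stays under worker->lock. */
        ret = __kthread_cancel_work(work);
        __kthread_queue_delayed_work(worker, dwork, delay);
out:
        raw_spin_unlock_irqrestore(&worker->lock, flags);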

Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
Fixes: 9a6b06c8d9 ("kthread: allow to modify delayed kthread work")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Martin Liu <liumartin@google.com>
Cc: <jenhaochen@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Petr Mladek 34b3d53447 kthread_worker: split code for canceling the delayed work timer
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync()".

This patchset fixes the race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync() including proper return value
handling.

This patch (of 2):

Simple code refactoring as a preparation step for fixing a race between
kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().

It does not modify the existing behavior.

Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: <jenhaochen@google.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
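
A sketch of the change in task_struct:

        /* before */
        volatile long state;

        /* after: accessed via READ_ONCE()/WRITE_ONCE() as appropriate */
        unsigned int __state;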

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Valentin Schneider 00b89fe019 sched: Make the idle task quack like a per-CPU kthread
For all intents and purposes, the idle task is a per-CPU kthread. It isn't
created via the same route as other pcpu kthreads however, and as a result
it is missing a few bells and whistles: it fails kthread_is_per_cpu() and
it doesn't have PF_NO_SETAFFINITY set.

Fix the former by giving the idle task a kthread struct along with the
KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle()
can be called more than once on the same idle task.
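
A sketch of the fix in init_idle() (helper names as introduced upstream):

        set_kthread_struct(idle);       /* give it a struct kthread */
        kthread_set_per_cpu(idle, cpu); /* mark it KTHREAD_IS_PER_CPU */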

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510151024.2448573-2-valentin.schneider@arm.com
2021-05-18 12:53:53 +02:00
Linus Torvalds 16b3d0cf5b Scheduler updates for this cycle are:
- Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces
    to a unified debugfs interface.
 
  - Signals: Allow caching one sigqueue object per task, to improve performance & latencies.
 
  - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups.
 
  - Improve energy-aware scheduling
 
  - Improve the PELT metrics for certain workloads
 
  - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous
    regressions
 
  - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This
    is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature,
    or the use of the resched_latency_warn_ms=xx boot parameter.
 
  - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining
    balance_push() vs. hotplug holes/races
 
  - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well
 
  - Fix/improve various load-balancing corner cases vs. capacity margins
 
  - Fix sched topology on systems with NUMA diameter of 3 or above
 
  - Fix PF_KTHREAD vs to_kthread() race
 
  - Minor rseq optimizations
 
  - Misc cleanups, optimizations, fixes and smaller updates
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJInsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i5XxAArh0b+fwXlkVGzTUly7HQjhU7lFbChnmF
 h6ToyNLi6pXoZ14VC/WoRIME+RzK3gmw9cEFaSLVPxbkbekTcyWS78kqmcg1/j2v
 kO/20QhXobiIxVskYfoMmqSavZ5mKhMWBqtFXkCuYfxwGylas0VVdh3AZLJ7N21G
 WEoFh99pVULwWnPHxM2ZQ87Ex9BkGKbsBTswxWpprCfXLqD0N2hHlABpwJP78zRf
 VniWFOcC7lslILCFawb7CqGgAwbgV85nDRS4QCuCKisrkFywvjJrEeu/W+h1NfhF
 d6ves/osNdEAM1DSALoxwEA42An8l8xh8NyJnl8JZV00LW0DM108O5/7pf5Zcryc
 RHV3RxA7skgezBh5uThvo60QzNK+kVMatI4qpQEHxLE52CaDl/fBu1Cgb/VUxnIl
 AEBfyiFbk+skHpuMFKtl30Tx3M+yJKMTzFPd4kYjHYGEDwtAcXcB3dJQW48A79i3
 H3IWcDcXpk5Rjo2UZmaXdt/qlj7mP6U0xdOUq8ZK6JOC4uY9skszVGsfuNN9QQ5u
 2E2YKKVrGFoQydl4C8R6A7axL2VzIJszHFZNipd8E3YOyW7PWRAkr02tOOkBTj8N
 dLMcNM7aPJWqEYiEIjEzGQN20pweJ1dRA29LDuOswKh+7W2bWTQFh6F2Q8Haansc
 RVg5PDzl+Mc=
 =E7mz
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and
   debugfs interfaces to a unified debugfs interface.

 - Signals: Allow caching one sigqueue object per task, to improve
   performance & latencies.

 - Improve newidle_balance() irq-off latencies on systems with a large
   number of CPU cgroups.

 - Improve energy-aware scheduling

 - Improve the PELT metrics for certain workloads

 - Reintroduce select_idle_smt() to improve load-balancing locality -
   but without the previous regressions

 - Add 'scheduler latency debugging': warn after long periods of pending
   need_resched. This is an opt-in feature that requires the enabling of
   the LATENCY_WARN scheduler feature, or the use of the
   resched_latency_warn_ms=xx boot parameter.

 - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
   remaining balance_push() vs. hotplug holes/races

 - PSI fixes, plus allow /proc/pressure/ files to be written by
   CAP_SYS_RESOURCE tasks as well

 - Fix/improve various load-balancing corner cases vs. capacity margins

 - Fix sched topology on systems with NUMA diameter of 3 or above

 - Fix PF_KTHREAD vs to_kthread() race

 - Minor rseq optimizations

 - Misc cleanups, optimizations, fixes and smaller updates

* tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  cpumask/hotplug: Fix cpu_dying() state tracking
  kthread: Fix PF_KTHREAD vs to_kthread() race
  sched/debug: Fix cgroup_path[] serialization
  sched,psi: Handle potential task count underflow bugs more gracefully
  sched: Warn on long periods of pending need_resched
  sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
  sched/debug: Rename the sched_debug parameter to sched_verbose
  sched,fair: Alternative sched_slice()
  sched: Move /proc/sched_debug to debugfs
  sched,debug: Convert sysctl sched_domains to debugfs
  debugfs: Implement debugfs_create_str()
  sched,preempt: Move preempt_dynamic to debug.c
  sched: Move SCHED_DEBUG sysctl to debugfs
  sched: Don't make LATENCYTOP select SCHED_DEBUG
  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
  cpumask: Introduce DYING mask
  cpumask: Make cpu_{online,possible,present,active}() inline
  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
  ...
2021-04-28 13:33:57 -07:00
Peter Zijlstra 3a7956e25e kthread: Fix PF_KTHREAD vs to_kthread() race
The kthread_is_per_cpu() construct relies on only being called on
PF_KTHREAD tasks (per the WARN in to_kthread). This gives rise to the
following usage pattern:

	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))

However, as reported by syzcaller, this is broken. The scenario is:

	CPU0				CPU1 (running p)

	(p->flags & PF_KTHREAD) // true

					begin_new_exec()
					  me->flags &= ~(PF_KTHREAD|...);
	kthread_is_per_cpu(p)
	  to_kthread(p)
	    WARN(!(p->flags & PF_KTHREAD) <-- *SPLAT*

Introduce __to_kthread() that omits the WARN and is sure to check both
values.

Use this to remove the problematic pattern for kthread_is_per_cpu()
and fix a number of other kthread_*() functions that have similar
issues but are currently not used in ways that would expose the
problem.

Notably kthread_func() is only ever called on 'current', while
kthread_probe_data() is only used for PF_WQ_WORKER, which implies the
task is from kthread_create*().
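
Sketch of the new helper (struct kthread still lived in set_child_tid at
the time):

        static inline struct kthread *__to_kthread(struct task_struct *p)
        {
                void *kthread = (__force void *)p->set_child_tid;

                if (kthread && !(p->flags & PF_KTHREAD))
                        kthread = NULL;
                return kthread;
        }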

Fixes: ac687e6e8c ("kthread: Extract KTHREAD_IS_PER_CPU")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/YH6WJc825C4P0FCK@hirez.programming.kicks-ass.net
2021-04-21 13:55:42 +02:00
Sami Tolvanen 0a5b412891 kthread: use WARN_ON_FUNCTION_MISMATCH
With CONFIG_CFI_CLANG, a callback function passed to
__kthread_queue_delayed_work from a module points to a jump table
entry defined in the module instead of the one used in the core
kernel, which breaks function address equality in this check:

  WARN_ON_ONCE(timer->function != kthread_delayed_work_timer_fn);

Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.
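
The macro's definition is roughly the following (a sketch):

  #if defined(CONFIG_CFI_CLANG) && defined(CONFIG_MODULES)
  /* function addresses cannot be compared across the jump table */
  # define WARN_ON_FUNCTION_MISMATCH(x, fn) ({ 0; })
  #else
  # define WARN_ON_FUNCTION_MISMATCH(x, fn) WARN_ON_ONCE((x) != (fn))
  #endif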

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-7-samitolvanen@google.com
2021-04-08 16:04:21 -07:00
Linus Torvalds 24c56ee06c - Correct the marking of kthreads which are supposed to run on a specific,
single CPU vs those which merely happen to be affine to one CPU, mark per-cpu workqueue
  threads as such and make sure that marking "survives" CPU hotplug. Fix CPU
  hotplug issues with such kthreads.
 
  - A fix to not push away tasks on CPUs coming online.
 
  - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity
    on CPU offlining so that pending workers can finish on newly arrived onlined
    CPUs too.
 
  - Dump tasks which haven't vacated a CPU which is currently being unplugged.
 
  - Register a special scale invariance callback which gets called on resume
  from RAM to read out APERF/MPERF after resume and thus make the schedutil
  scaling governor more precise.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANYCAACgkQEsHwGGHe
 VUo+OBAAjfqkijDlXiGX6lrT5gRx5NZICpeMgbWa7J13XHT1ysD/b0fMGFIUyF6k
 aszDLTl8U/S1/qGAYlzTSPAFcdZ+ENiFqQ48ozMk4jZC3p0quHTjs/PdiSG6kYBi
 +e4smht+bSyLKxsG8hN0kJ+mLEd+uIQ13kP4YkxPgWbJ9WNP/U6HHGBo0rBchtSe
 Kn6bdd8CfwmC6rSazp7kdQoFoWeQaoMI1ODX3VphK1GtL1wq8WSICzRhpg3caeyG
 3lCIddoNW9mCA9Nkc6R6HeV3uW9JGkPAjnmtTIEHDbg9pib7xNT978ieTQuqNDCi
 DlAHDGumzoaiVJZhD/1fj/RXMJr2YUHxtrXWNsXpiKJ9g8Tn+WC0UW/4+Mx2L/km
 0RSoXJlMs1fGopS2I/fObZ6RPhmg4D+gJsMCdaHQzX4NgxZAGhNNPxMckZ0IM8A0
 2NNXSHUZHVTHeJEW0E/glOcpWb5hG+vDwiBMNEWfTwYpTfrw2EEOZaKniZE7WlSL
 4ItM9rkLGl1KToJzAH4A0oUtSy3vtSCo8B1noGlc09Lj+oCIBlr81z9+C79a2oxG
 qE7Xd4X7y7Qs3JeCbRZWQa7/2Kf1v4XnjELrJJeCZC85r0ZqJDwRX8w7lkmW2XPU
 m4J2prr/DDZSqrRh23/xC1fsU+vcBKSfKUFKAH4Lg2VIaUfSUEk=
 =2DAF
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Correct the marking of kthreads which are supposed to run on a
   specific, single CPU vs those which merely happen to be affine to one
   CPU, mark per-cpu workqueue threads as such and make sure that marking
   "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.

 - A fix to not push away tasks on CPUs coming online.

 - Have workqueue CPU hotplug code use cpu_possible_mask when breaking
   affinity on CPU offlining so that pending workers can finish on newly
   arrived onlined CPUs too.

 - Dump tasks which haven't vacated a CPU which is currently being
   unplugged.

 - Register a special scale invariance callback which gets called on
   resume from RAM to read out APERF/MPERF after resume and thus make
   the schedutil scaling governor more precise.

* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Relax the set_cpus_allowed_ptr() semantics
  sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  sched: Prepare to use balance_push in ttwu()
  workqueue: Restrict affinity change to rescuer
  workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  kthread: Extract KTHREAD_IS_PER_CPU
  sched: Don't run cpu-online with balance_push() enabled
  workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  sched/core: Print out straggler tasks in sched_cpu_dying()
  x86: PM: Register syscore_ops for scale invariance
2021-01-24 10:09:20 -08:00
Peter Zijlstra ac687e6e8c kthread: Extract KTHREAD_IS_PER_CPU
There is a need to distinguish genuine per-cpu kthreads from kthreads
that happen to have a single CPU affinity.

Genuine per-cpu kthreads are kthreads that are CPU affine for
correctness; these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruin things.

However, these two things are not sufficient: PF_NO_SETAFFINITY is
also set on other tasks that have their affinities controlled through
other means, like for instance workqueues.

Therefore another bit is needed; it turns out kthread_create_on_cpu()
already has such a bit: KTHREAD_IS_PER_CPU, which is used to make
kthread_park()/kthread_unpark() work correctly.

Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.
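
The exposed interface, sketched:

        void kthread_set_per_cpu(struct task_struct *k, int cpu);
        bool kthread_is_per_cpu(struct task_struct *k);

        /* e.g. workqueue tagging a bound worker explicitly: */
        kthread_set_per_cpu(worker->task, pool->cpu);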

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
2021-01-22 15:09:42 +01:00
Yanfei Xu cb5021ca62 kthread: remove comments about old _do_fork() helper
The old _do_fork() helper has been removed in favor of kernel_clone().
Correct some comments which still refer to _do_fork().

Link: https://lore.kernel.org/r/20210111104807.18022-1-yanfei.xu@windriver.com
Cc: christian@brauner.io
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-11 15:11:56 +01:00
Linus Torvalds ac73e3dc8a Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few random little subsystems

 - almost all of the MM patches which are staged ahead of linux-next
   material. I'll trickle to post-linux-next work in as the dependents
   get merged up.

Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
  mm: cleanup kstrto*() usage
  mm: fix fall-through warnings for Clang
  mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
  mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
  mm:backing-dev: use sysfs_emit in macro defining functions
  mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
  mm: use sysfs_emit for struct kobject * uses
  mm: fix kernel-doc markups
  zram: break the strict dependency from lzo
  zram: add stat to gather incompressible pages since zram set up
  zram: support page writeback
  mm/process_vm_access: remove redundant initialization of iov_r
  mm/zsmalloc.c: rework the list_add code in insert_zspage()
  mm/zswap: move to use crypto_acomp API for hardware acceleration
  mm/zswap: fix passing zero to 'PTR_ERR' warning
  mm/zswap: make struct kernel_param_ops definitions const
  userfaultfd/selftests: hint the test runner on required privilege
  userfaultfd/selftests: fix retval check for userfaultfd_open()
  userfaultfd/selftests: always dump something in modes
  userfaultfd: selftests: make __{s,u}64 format specifiers portable
  ...
2020-12-15 12:53:37 -08:00
Petr Mladek ebb2bdcef8 kthread_worker: document CPU hotplug handling
The kthread worker API is simple.  In short, it allows one to create, use,
and destroy workers.  kthread_create_worker_on_cpu() additionally binds a
newly created worker to a given CPU.

It is up to the API user how to handle CPU hotplug.  They have to decide
how to handle pending work items, prevent queuing new ones, and restore
the functionality when the CPU goes off and on.  There are a few catches:

   + The CPU affinity gets lost when it is scheduled on an offline CPU.

   + The worker might not exist when the CPU was off when the user
     created the workers.

A good practice is to implement two CPU hotplug callbacks and
destroy/create the worker when the CPU goes down/up (sketched below).

Mention this in the function description.
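
A sketch of that practice (the my_* names are hypothetical):

        static struct kthread_worker *my_worker[NR_CPUS];

        static int my_cpu_online(unsigned int cpu)
        {
                struct kthread_worker *w;

                w = kthread_create_worker_on_cpu(cpu, 0, "my_worker/%u", cpu);
                if (IS_ERR(w))
                        return PTR_ERR(w);
                my_worker[cpu] = w;
                return 0;
        }

        static int my_cpu_offline(unsigned int cpu)
        {
                if (my_worker[cpu]) {
                        /* flushes pending work and stops the kthread */
                        kthread_destroy_worker(my_worker[cpu]);
                        my_worker[cpu] = NULL;
                }
                return 0;
        }

        /* registration, e.g. from module init: */
        cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mydrv:online",
                          my_cpu_online, my_cpu_offline);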

[akpm@linux-foundation.org: grammar tweaks]

Link: https://lore.kernel.org/r/20201028073031.4536-1-qiang.zhang@windriver.com
Link: https://lkml.kernel.org/r/20201102101039.19227-1-pmladek@suse.com
Reported-by: Zhang Qiang <Qiang.Zhang@windriver.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:36 -08:00
Rob Clark f630c7c6f1 kthread: add kthread_work tracepoints
While migrating some code from wq to kthread_worker, I found that I missed
the execute_start/end tracepoints.  So add similar tracepoints for
kthread_work.  And for completeness, queue_work tracepoint (although this
one differs slightly from the matching workqueue tracepoint).

Link: https://lkml.kernel.org/r/20201010180323.126634-1-robdclark@gmail.com
Signed-off-by: Rob Clark <robdclark@chromium.org>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Phil Auld <pauld@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Thara Gopinath <thara.gopinath@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vincent Donnefort <vincent.donnefort@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ilias Stamatis <stamatis.iliass@gmail.com>
Cc: Liang Chen <cl@rock-chips.com>
Cc: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "J. Bruce Fields" <bfields@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:36 -08:00
Ingo Molnar a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Zqiang 6993d0fdbe kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
There is a small race window while a delayed work is being canceled,
during which the work might still be queued from the timer_fn:

	CPU0						CPU1
kthread_cancel_delayed_work_sync()
   __kthread_cancel_work_sync()
     __kthread_cancel_work()
        work->canceling++;
					      kthread_delayed_work_timer_fn()
						   kthread_insert_work();

BUG: kthread_insert_work() should not get called when work->canceling is
set.
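
The fix is to re-check work->canceling under the worker lock before
inserting, so the timer function ends up roughly as follows (a lightly
condensed sketch):

	void kthread_delayed_work_timer_fn(struct timer_list *t)
	{
		struct kthread_delayed_work *dwork =
			from_timer(dwork, t, timer);
		struct kthread_work *work = &dwork->work;
		struct kthread_worker *worker = work->worker;
		unsigned long flags;

		if (WARN_ON_ONCE(!worker))
			return;

		raw_spin_lock_irqsave(&worker->lock, flags);
		/* Move the work from worker->delayed_work_list. */
		WARN_ON_ONCE(list_empty(&work->node));
		list_del_init(&work->node);
		/* The new check closing the race window: */
		if (!work->canceling)
			kthread_insert_work(worker, work, &worker->work_list);
		raw_spin_unlock_irqrestore(&worker->lock, flags);
	}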

Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-02 12:14:19 -08:00
Mathieu Desnoyers 618758ed3a sched: membarrier: cover kthread_use_mm (v4)
Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect the user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.
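
In kthread_use_mm() the change boils down to something of this shape
(sketch; membarrier_update_current_mm() is the helper this patch
introduces):

	/* inside kthread_use_mm(), with IRQs disabled: */
	tsk->mm = mm;
	membarrier_update_current_mm(mm);
	/*
	 * membarrier() requires a full barrier between storing to
	 * tsk->mm and the subsequent user-space memory accesses;
	 * switch_mm_irqs_off() provides one on the architectures
	 * where it matters.
	 */
	switch_mm_irqs_off(active_mm, mm, tsk);
	local_irq_enable();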

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
2020-10-29 11:00:31 +01:00
Randy Dunlap 7b7b8a2c95 kernel/: fix repeated words in comments
Fix multiple occurrences of duplicated words in kernel/.

Fix one typo/spello on the same line as a duplicate word.  Change one
instance of "the the" to "that the".  Otherwise just drop one of the
repeated words.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/98202fa6-8919-ef63-9efe-c0fad5ca7af1@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:19 -07:00
Christoph Hellwig 3d13f313ce uaccess: add force_uaccess_{begin,end} helpers
Add helpers to wrap the get_fs/set_fs magic for undoing any damage done
by set_fs(KERNEL_DS).  There is no real functional benefit, but this
documents the intent of these calls better, and will allow stubbing the
functions out easily for kernel builds that do not allow address space
overrides in the future.
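
The essence of the helpers is just (sketch):

	static inline mm_segment_t force_uaccess_begin(void)
	{
		mm_segment_t fs = get_fs();

		/* Make uaccess operate on user addresses again. */
		set_fs(USER_DS);
		return fs;
	}

	static inline void force_uaccess_end(mm_segment_t oldfs)
	{
		/* Restore the limit saved by force_uaccess_begin(). */
		set_fs(oldfs);
	}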

[hch@lst.de: drop two incorrect hunks, fix a commit log typo]
  Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:59 -07:00
Ilias Stamatis 4ca1085c95 kthread: remove incorrect comment in kthread_create_on_cpu()
Originally kthread_create_on_cpu() parked and woke up the new thread.
However, since commit a65d40961d ("kthread/smpboot: do not park in
kthread_create_on_cpu()") this is no longer the case.  This patch removes
the comment that has been left behind and is now incorrect / stale.

Fixes: a65d40961d ("kthread/smpboot: do not park in kthread_create_on_cpu()")
Signed-off-by: Ilias Stamatis <stamatis.iliass@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/20200611135920.240551-1-stamatis.iliass@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:21 -07:00
Peter Zijlstra 38cf307c1f mm: fix kthread_use_mm() vs TLB invalidate
For SMP systems using IPI based TLB invalidation, looking at
current->active_mm is entirely reasonable.  This then presents the
following race condition:

  CPU0			CPU1

  flush_tlb_mm(mm)	use_mm(mm)
    <send-IPI>
			  tsk->active_mm = mm;
			  <IPI>
			    if (tsk->active_mm == mm)
			      // flush TLBs
			  </IPI>
			  switch_mm(old_mm,mm,tsk);

Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
because the IPI lands before we actually switched.

Avoid this by disabling IRQs across changing ->active_mm and
switch_mm().

Of the (SMP) architectures that have IPI based TLB invalidate:

  Alpha    - checks active_mm
  ARC      - ASID specific
  IA64     - checks active_mm
  MIPS     - ASID specific flush
  OpenRISC - shoots down world
  PARISC   - shoots down world
  SH       - ASID specific
  SPARC    - ASID specific
  x86      - N/A
  xtensa   - checks active_mm

So at the very least Alpha, IA64 and Xtensa are suspect.

On top of this, for scheduler consistency we need at least preemption
disabled across changing tsk->mm and doing switch_mm(), which is
currently provided by task_lock(), but that's not sufficient for
PREEMPT_RT.
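
With the fix, the core of kthread_use_mm() looks roughly like this
(sketch, condensed from the description above):

	task_lock(tsk);
	/* Hold off TLB-flush IPIs while switching mm's. */
	local_irq_disable();
	active_mm = tsk->active_mm;
	if (active_mm != mm) {
		mmgrab(mm);
		tsk->active_mm = mm;
	}
	tsk->mm = mm;
	/*
	 * The ->mm/->active_mm update and the page-table switch are
	 * now atomic with respect to the IPI.
	 */
	switch_mm_irqs_off(active_mm, mm, tsk);
	local_irq_enable();
	task_unlock(tsk);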

[akpm@linux-foundation.org: add comment]

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200721154106.GE10769@hirez.programming.kicks-ass.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:21 -07:00
Peter Zijlstra faa2fd7cba Merge branch 'sched/urgent' 2020-07-08 11:38:59 +02:00
Christoph Hellwig fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Marcelo Tosatti 9cc5b86568 isolcpus: Affine unbound kernel threads to housekeeping cpus
This is a kernel enhancement that configures the CPU affinity of kernel
threads via the kernel boot option nohz_full=.

When this option is specified, the cpumask is immediately applied upon
kthread launch.  This does not affect kernel threads that specify a CPU
and node.

This allows CPU isolation (that is, not allowing certain threads
to execute on certain CPUs) without using the isolcpus=domain parameter,
making it possible to enable load balancing on such CPUs
during runtime (see kernel-parameters.txt).
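
The mechanism itself is a one-liner at kthread creation time, roughly
(HK_FLAG_KTHREAD being the housekeeping flag this series adds):

	/* unbound kthreads, i.e. no specific CPU/node requested: */
	set_cpus_allowed_ptr(task,
			     housekeeping_cpumask(HK_FLAG_KTHREAD));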

Note-1: this is based on Wind River's patch at
https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

Difference being that this patch is limited to modifying kernel thread
cpumask. Behaviour of other threads can be controlled via cgroups or
sched_setaffinity.

Note-2: Wind River's patch was based off Christoph Lameter's patch at
https://lwn.net/Articles/565932/ with the only difference being
the kernel parameter changed from kthread to kthread_cpus.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org
2020-06-15 14:10:03 +02:00
Marcelo Tosatti 043eb8e105 kthread: Switch to cpu_possible_mask
The next patch will switch the unbound kernel threads' mask to
housekeeping_cpumask(), a subset of cpu_possible_mask.  So, in order to
ease bisection, let's first switch the kthreads' default affinity from
cpu_all_mask to cpu_possible_mask.

It looks safe to do so, as cpu_possible_mask seems to be initialized
at setup_arch() time, way before kthreadd is created.
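
Concretely, the default-affinity sites in kernel/kthread.c change as
follows (sketch):

	/* before */
	set_cpus_allowed_ptr(task, cpu_all_mask);
	/* after */
	set_cpus_allowed_ptr(task, cpu_possible_mask);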

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-2-frederic@kernel.org
2020-06-15 14:10:03 +02:00
Linus Torvalds 623f6dc593 Merge branch 'akpm' (patches from Andrew)
Merge some more updates from Andrew Morton:

 - various hotfixes and minor things

 - hch's use_mm/unuse_mm cleanups

Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel: set USER_DS in kthread_use_mm
  kernel: better document the use_mm/unuse_mm API contract
  kernel: move use_mm/unuse_mm to kthread.c
  stacktrace: cleanup inconsistent variable type
  lib: test get_count_order/long in test_bitops.c
  mm: add comments on pglist_data zones
  ocfs2: fix spelling mistake and grammar
  mm/debug_vm_pgtable: fix kernel crash by checking for THP support
  lib: fix bitmap_parse() on 64-bit big endian archs
  checkpatch: correct check for kernel parameters doc
  nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
  lib/lz4/lz4_decompress.c: document deliberate use of `&'
  kcov: check kcov_softirq in kcov_remote_stop()
  scripts/spelling: add a few more typos
  khugepaged: selftests: fix timeout condition in wait_for_scan()
2020-06-11 13:25:53 -07:00
Christoph Hellwig 37c54f9bd4 kernel: set USER_DS in kthread_use_mm
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose
of kthread_use_mm, but others like x86 don't.  That has led to a huge
mess where some callers are fixed up once they are tested on said
architectures, while others linger around, and yet others like io_uring
try to do "clever" optimizations for what usually is just a trivial
assignment to a member in the thread_struct for most architectures.

Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.
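
The change amounts to saving and restoring the address limit in the
kthread struct, roughly:

	/* kthread_use_mm(): */
	to_kthread(tsk)->oldfs = get_fs();
	set_fs(USER_DS);

	/* kthread_unuse_mm(): */
	set_fs(to_kthread(tsk)->oldfs);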

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig f5678e7f2a kernel: better document the use_mm/unuse_mm API contract
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.
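
The added asserts state the contract directly; e.g. at the top of
kthread_use_mm() (sketch):

	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
	WARN_ON_ONCE(tsk->mm);

	/* and, symmetrically, in kthread_unuse_mm(): */
	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
	WARN_ON_ONCE(!tsk->mm);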

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
  Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
  Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig 9bf5b9eb23 kernel: move use_mm/unuse_mm to kthread.c
Patch series "improve use_mm / unuse_mm", v2.

This series improves the use_mm / unuse_mm interface by better
documenting the assumptions, and by moving the set_fs manipulations
spread over the callers into the core API.

This patch (of 3):

Use the proper API instead.

Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward.  Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.
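
After the move, the prototypes sit next to the rest of the kthread API
(sketch; the functions keep their old names until a later patch in the
series renames them):

	/* include/linux/kthread.h */
	void use_mm(struct mm_struct *mm);
	void unuse_mm(struct mm_struct *mm);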

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00