749 Commits

Author SHA1 Message Date
Radostin Stoyanov d46a537401 signal: restore the override_rlimit logic
JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271

commit 9e05e5c7ee8758141d2db7e8fea2cab34500c6ed
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Mon Nov 4 19:54:19 2024 +0000

    signal: restore the override_rlimit logic

    Prior to commit d646969055 ("Reimplement RLIMIT_SIGPENDING on top of
    ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of
    signals.  However now it's enforced unconditionally, even if
    override_rlimit is set.  This behavior change caused production issues.

    For example, if the limit is reached and a process receives a SIGSEGV
    signal, sigqueue_alloc fails to allocate the necessary resources for the
    signal delivery, preventing the signal from being delivered with siginfo.
    This prevents the process from correctly identifying the fault address and
    handling the error.  From the user-space perspective, applications are
    unaware that the limit has been reached and that the siginfo is
    effectively 'corrupted'.  This can lead to unpredictable behavior and
    crashes, as we observed with Java applications.

    Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip
    the comparison to max there if override_rlimit is set.  This effectively
    restores the old behavior.
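
The fix described above can be modeled in user space roughly as follows. This is a simplified sketch, not the kernel's code: `struct counter` and `counter_inc()` are hypothetical stand-ins for the ucounts machinery, and the real inc_rlimit_get_ucounts() charges a hierarchy of user namespaces rather than a single counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one level of the ucounts hierarchy. */
struct counter {
    long value;  /* pending signals charged so far */
    long max;    /* the RLIMIT_SIGPENDING-style limit */
};

/*
 * Charge one pending signal to the counter.
 * Returns the new value on success, or 0 if the limit would be exceeded.
 * When override_rlimit is true the comparison against max is skipped.
 */
long counter_inc(struct counter *c, bool override_rlimit)
{
    assert(c);

    long new_value = c->value + 1;

    if (!override_rlimit && new_value > c->max)
        return 0;  /* limit reached: no room for another queued signal */

    c->value = new_value;
    return new_value;
}
```

With the override set, a signal such as SIGSEGV can still be charged and queued with its siginfo even though the limit has already been reached.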

    Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev
    Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Co-developed-by: Andrei Vagin <avagin@google.com>
    Signed-off-by: Andrei Vagin <avagin@google.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Alexey Gladkov <legion@kernel.org>
    Cc: Kees Cook <kees@kernel.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-12-20 15:33:02 +00:00
Rafael Aquini e5cf0b4377 mm: suppress mm fault logging if fatal signal already pending
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 5f0bc0b042fc77ff70e14c790abdec960cde4ec1
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Jul 25 09:38:32 2023 -0700

    mm: suppress mm fault logging if fatal signal already pending

    Commit eda0047296a1 ("mm: make the page fault mmap locking killable")
    intentionally made it much easier to trigger the "page fault fails
    because a fatal signal is pending" situation, by having the mmap locking
    fail early in that case.

    We have long aborted page faults in other fatal cases when the actual IO
    for a page is interrupted by SIGKILL - which is particularly useful for
    the traditional case of NFS hanging due to network issues, but local
    filesystems could cause it too if you happened to get the SIGKILL while
    waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()).

    So aborting the page fault wasn't a new condition - but it now triggers
    earlier, before we even get to 'handle_mm_fault()'.  And as a result the
    error doesn't go through our 'fault_signal_pending()' logic, and doesn't
    get filtered away there.

    Normally you'd never even notice, because if a fatal signal is pending,
    the new SIGSEGV we send ends up being ignored anyway.

    But it turns out that there is one very noticeable exception: if you
    enable 'show_unhandled_signals', the aborted page fault will be logged
    in the kernel messages, and you'll get a scary line looking something
    like this in your logs:

      pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)

    which is rather misleading.  It's not really a segfault at all, it's
    just "the thread was killed before the page fault completed, so we
    aborted the page fault".

    Fix this by just making it clear that a pending fatal signal means that
    any new signal coming in after that is implicitly handled.  This will
    avoid the misleading logging, since now the signal isn't 'unhandled' any
    more.
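
A minimal user-space model of the decision described above might look like this. `struct task_model` and `unhandled_signal_model()` are hypothetical names, and the order of the checks is an assumption based on the description, not a copy of the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the task state this check cares about. */
struct task_model {
    bool is_global_init;        /* the init process */
    bool has_user_handler;      /* handler is neither SIG_IGN nor SIG_DFL */
    bool fatal_signal_pending;  /* e.g. SIGKILL is already queued */
    bool ptraced;               /* a tracer is attached */
};

/*
 * Returns true when a new signal counts as "unhandled", which is what
 * makes it eligible for logging under show_unhandled_signals.
 */
bool unhandled_signal_model(const struct task_model *t)
{
    assert(t);

    if (t->is_global_init)
        return true;
    if (t->has_user_handler)
        return false;
    /* A dying task implicitly handles every new signal by ignoring it. */
    if (t->fatal_signal_pending)
        return false;
    /* If ptraced, the tracer decides what happens to the signal. */
    return !t->ptraced;
}
```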

    Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com>
    Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
    Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Fixes: eda0047296a1 ("mm: make the page fault mmap locking killable")
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:05 -04:00
Waiman Long 6d0328a7cf Revert "Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8""
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only

This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-18 21:38:20 -04:00
Lucas Zampieri 08637d76a2 Revert "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
This reverts merge request !4128
2024-05-16 15:26:41 +00:00
Waiman Long 724656e7cf freezer,sched: Rewrite core freezer logic
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
 1) A merge conflict in the kernel/signal.c hunk due to the presence
    of RHEL-only commit 975d318867 ("signal: Don't disable preemption
    in ptrace_stop() on PREEMPT_RT.").
 2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
    presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
    hrtimer mode into hrtimer_sleeper").
 3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
    to the presence of upstream commit 38c8a9a52082 ("smb: move client
    and server files to common directory fs/smb").
 4) Similarly, the fs/cifs/transport.c hunk was applied to
    fs/smb/client/transport.c manually due to the presence of
    a later upstream commit d527f51331ca ("cifs: Fix UAF in
    cifs_demultiplex_thread()").

Note that all the prerequisite patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.

commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Mon, 22 Aug 2022 13:18:22 +0200

    freezer,sched: Rewrite core freezer logic

    Rewrite the core freezer to behave better wrt thawing and be simpler
    in general.

    By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
    ensured frozen tasks stay frozen until thawed and don't randomly wake
    up early, as is currently possible.

    As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
    two PF_flags (yay!).

    Specifically; the current scheme works a little like:

            freezer_do_not_count();
            schedule();
            freezer_count();

    And either the task is blocked, or it lands in try_to_freeze()
    through freezer_count(). Now, when it is blocked, the freezer
    considers it frozen and continues.

    However, on thawing, once pm_freezing is cleared, freezer_count()
    stops working, and any random/spurious wakeup will let a task run
    before its time.

    That is, thawing tries to thaw things in explicit order; kernel
    threads and workqueues before bringing SMP back before userspace
    etc. However due to the above mentioned races it is entirely possible
    for userspace tasks to thaw (by accident) before SMP is back.

    This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
    where the userspace task requires a special CPU to run.

    As said; replace this with a special task state TASK_FROZEN and add
    the following state transitions:

            TASK_FREEZABLE  -> TASK_FROZEN
            __TASK_STOPPED  -> TASK_FROZEN
            __TASK_TRACED   -> TASK_FROZEN

    The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
    (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
    is already required to deal with spurious wakeups and the freezer
    causes one such when thawing the task (since the original state is
    lost).

    The special __TASK_{STOPPED,TRACED} states *can* be restored since
    their canonical state is in ->jobctl.

    With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
    free of undue (early / spurious) wakeups.
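
The state transitions and the wakeup rule above can be sketched as a small user-space state machine. Everything here is illustrative: the `ST_*` bit values and the `*_model` functions are hypothetical and do not match the kernel's constants or scheduler code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's actual constants differ. */
enum {
    ST_RUNNING         = 0x00,
    ST_INTERRUPTIBLE   = 0x01,
    ST_UNINTERRUPTIBLE = 0x02,
    ST_STOPPED         = 0x04,
    ST_TRACED          = 0x08,
    ST_FREEZABLE       = 0x10,
    ST_FROZEN          = 0x20,
};

struct task_model {
    unsigned int state;
};

/* Freeze only from the three source states listed in the commit message. */
bool freeze_task_model(struct task_model *t)
{
    assert(t);
    if (!(t->state & (ST_FREEZABLE | ST_STOPPED | ST_TRACED)))
        return false;
    t->state = ST_FROZEN;
    return true;
}

/* A wakeup takes effect only if its mask matches the task's current state. */
bool wake_up_state_model(struct task_model *t, unsigned int mask)
{
    assert(t);
    if (!(t->state & mask))
        return false;
    t->state = ST_RUNNING;
    return true;
}
```

In this model an ordinary wakeup aimed at the interruptible or uninterruptible states leaves a frozen task untouched; only a wakeup whose mask includes `ST_FROZEN` thaws it.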

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2024-04-26 22:49:06 -04:00
Eder Zulian 269be86edd signal: Add proper comment about the preempt-disable in ptrace_stop().
JIRA: https://issues.redhat.com/browse/RHEL-3988
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git

commit bf1069f8c099d4c10e2f884dc72a83a4653fc6e4
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Aug 3 12:09:31 2023 +0200

    signal: Add proper comment about the preempt-disable in ptrace_stop().

    Commit 53da1d9456 ("fix ptrace slowness") added a preempt-disable section
    between read_unlock() and the following schedule() invocation without
    explaining why it is needed.

    Replace the comment with an explanation of why this is needed. Clarify
    that it is not needed for correctness but for performance reasons.

    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/r/20230803100932.325870-2-bigeasy@linutronix.de

Signed-off-by: Eder Zulian <ezulian@redhat.com>
2023-11-06 12:29:40 +01:00
Oleg Nesterov 213383d9db undo Revert "signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT."
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

Upstream Status: RHEL-only

Reintroduce the temporarily reverted rhel-only commit 975d318867
("signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT.").

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:32 +02:00
Oleg Nesterov 87bd1b747c signal handling: don't use BUG_ON() for debugging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit a382f8fee42ca10c9bfce0d2352d4153f931f5dc
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jul 6 12:20:59 2022 -0700

    signal handling: don't use BUG_ON() for debugging

    These are indeed "should not happen" situations, but it turns out recent
    changes made the 'task_is_stopped_or_traced()' case trigger (fix for that
    exists, is pending more testing), and the BUG_ON() makes it
    unnecessarily hard to actually debug for no good reason.

    It's been that way for a long time, but let's make it clear: BUG_ON() is
    not good for debugging, and should never be used in situations where you
    could just say "this shouldn't happen, but we can continue".

    Use WARN_ON_ONCE() instead to make sure it gets logged, and then just
    continue running, instead of making the system basically unusable
    because you crashed the machine while potentially holding some very core
    locks (eg this function is commonly called while holding 'tasklist_lock'
    for writing).
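
The "log it once and keep going" pattern can be illustrated in user space as below. This is only a sketch of the idea: `warn_on_once()`, `warnings_printed` and `checked_divide()` are hypothetical names, and the kernel's WARN_ON_ONCE() is a macro that also dumps a stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

int warnings_printed;  /* how many warnings have been emitted so far */

/*
 * Report a "should not happen" condition the first time it is seen at a
 * given call site (tracked through *warned), then return the condition so
 * the caller can recover instead of aborting.
 */
bool warn_on_once(bool cond, bool *warned, const char *what)
{
    assert(warned && what);
    if (cond && !*warned) {
        *warned = true;
        warnings_printed++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/* Example caller: log once, then continue with a safe fallback value. */
int checked_divide(int a, int b)
{
    static bool warned;  /* one flag per call site */

    if (warn_on_once(b == 0, &warned, "division by zero requested"))
        return 0;
    return a / b;
}
```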

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:32 +02:00
Oleg Nesterov c888fdf0b1 sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

Omitted-fix: 3418357a32db ("ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()")
That fix duplicates de2a34771f51 included in this series

commit 31cae1eaae4fd65095ad6a3659db467bc3c2599e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue May 3 15:57:47 2022 -0500

    sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

    Currently ptrace_stop() / do_signal_stop() rely on the special states
    TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
    state exists only in task->__state and nowhere else.

    There's two spots of bother with this:

     - PREEMPT_RT has task->saved_state which complicates matters,
       meaning task_is_{traced,stopped}() needs to check an additional
       variable.

     - An alternative freezer implementation that itself relies on a
       special TASK state would lose TASK_TRACED/TASK_STOPPED and will
       result in misbehaviour.

    As such, add additional state to task->jobctl to track this state
    outside of task->__state.
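
Why a second copy of the state helps can be shown with a toy model. The names and bit values below are hypothetical, and the thaw step is an assumption about how a freezer could use the extra state; it is not code from this commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
enum { ST_RUNNING = 0x00, ST_STOPPED = 0x04, ST_TRACED = 0x08, ST_FROZEN = 0x20 };
enum { JC_STOPPED = 0x1, JC_TRACED = 0x2 };

struct task_model {
    unsigned int state;   /* models task->__state */
    unsigned int jobctl;  /* models the extra bits kept in task->jobctl */
};

/* Entering a ptrace stop records the fact in both places. */
void enter_traced_model(struct task_model *t)
{
    assert(t);
    t->state = ST_TRACED;
    t->jobctl |= JC_TRACED;
}

/* The query reads jobctl, so it survives __state being overwritten. */
bool task_is_traced_model(const struct task_model *t)
{
    return (t->jobctl & JC_TRACED) != 0;
}

/* A freezer with its own special state clobbers __state ... */
void freeze_model(struct task_model *t)
{
    t->state = ST_FROZEN;
}

/* ... and can rebuild it from jobctl when thawing. */
void thaw_model(struct task_model *t)
{
    if (t->jobctl & JC_TRACED)
        t->state = ST_TRACED;
    else if (t->jobctl & JC_STOPPED)
        t->state = ST_STOPPED;
    else
        t->state = ST_RUNNING;
}
```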

    NOTE: this doesn't actually fix anything yet, just adds extra state.

    --EWB
      * didn't add an unnecessary newline in signal.h
      * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
        instead of in signal_wake_up_state.  This prevents the clearing
        of TASK_STOPPED and TASK_TRACED from getting lost.
      * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220421150654.757693825@infradead.org
    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-12-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:31 +02:00
Oleg Nesterov b85b393abb ptrace: Don't change __state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 29 08:43:34 2022 -0500

    ptrace: Don't change __state

    Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
    command is executing.

    Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
    implement a new jobctl flag JOBCTL_PTRACE_FROZEN.  This new flag is set
    in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
    jobctl_unfreeze_task (when ptrace_stop remains asleep).

    In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
    when the wake up is for a fatal signal.  Skip adding __TASK_TRACED
    when JOBCTL_PTRACE_FROZEN is set.  This has the same effect as
    changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
    TASK_KILLABLE go through signal_wake_up.

    Handle a ptrace_stop being called with a pending fatal signal.
    Previously it would have been handled by schedule simply failing to
    sleep.  As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
    will sleep with a fatal_signal_pending.   The code in signal_wake_up
    guarantees that the code will be awakened by any fatal signal that
    comes after TASK_TRACED is set.

    Previously the __state value of __TASK_TRACED was changed to
    TASK_RUNNING when woken up or back to TASK_TRACED when the code was
    left in ptrace_stop.  Now when woken up ptrace_stop clears
    JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreeze_traced
    clears JOBCTL_PTRACE_FROZEN.

    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:31 +02:00
Oleg Nesterov 67192f1fc6 ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 57b6de08b5f6586851c2261ef0cc16cd275615e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed May 4 13:39:58 2022 -0500

    ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs

    Long ago and far away there was a BUG_ON at the start of ptrace_stop
    that did "BUG_ON(!(current->ptrace & PT_PTRACED));" [1].  The BUG_ON
    had never triggered but examination of the code showed that the BUG_ON
    could actually trigger.  To complement removing the BUG_ON an attempt
    to better handle the race was added.

    The code detected the tracer had gone away and did not call
    do_notify_parent_cldstop.  The code also attempted to prevent
    ptrace_report_syscall from sending spurious SIGTRAPs when the tracer
    went away.

    The code to detect when the tracer had gone away before sending a
    signal to tracer was a legitimate fix and continues to work to this
    date.

    The code to prevent sending spurious SIGTRAPs is a failure.  At the
    time and until today the code only catches it when the tracer goes
    away after siglock is dropped and before read_lock is acquired.  If
    the tracer goes away after read_lock is dropped a spurious SIGTRAP can
    still be sent to the tracee.  The tracer going away after read_lock
    is dropped is the far likelier case as it is the bigger window.

    Given that the attempt to prevent the generation of a SIGTRAP was a
    failure and continues to be a failure remove the code that attempts to
    do that.  This simplifies the code in ptrace_stop and makes
    ptrace_stop much easier to reason about.

    To successfully deal with the tracer going away, all of the tracer's
    instrumentation of the child would need to be removed, and reliably
    detecting when the tracer has set a signal to continue with would need
    to be implemented.

    [1] 66519f549ae5 ("[PATCH] fix ptracer death race yielding bogus BUG_ON")
    History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-9-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:31 +02:00
Oleg Nesterov ac3c6b060a signal: Use lockdep_assert_held instead of assert_spin_locked
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit cb3c19c93d656caa6fe63d6277aabd7e570f1d03
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 29 09:16:10 2022 -0500

    signal: Use lockdep_assert_held instead of assert_spin_locked

    The distinction is that assert_spin_locked() checks if the lock is
    held *by*anyone* whereas lockdep_assert_held() asserts the current
    context holds the lock.  Also, the check goes away if you build
    without lockdep.
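
The difference between the two checks can be shown with a toy lock that remembers its owner. This is only a sketch: `struct lock_model` and the helper names are hypothetical, and real lockdep tracks held locks per task rather than storing an owner id in the lock.

```c
#include <assert.h>
#include <stdbool.h>

struct lock_model {
    bool locked;
    int owner;  /* id of the holding context; meaningful only while locked */
};

void lock_acquire(struct lock_model *l, int ctx)
{
    assert(!l->locked);
    l->locked = true;
    l->owner = ctx;
}

void lock_release(struct lock_model *l)
{
    assert(l->locked);
    l->locked = false;
}

/* assert_spin_locked()-style question: is the lock held by anyone at all? */
bool held_by_anyone(const struct lock_model *l)
{
    return l->locked;
}

/* lockdep_assert_held()-style question: does this context hold the lock? */
bool held_by_context(const struct lock_model *l, int ctx)
{
    return l->locked && l->owner == ctx;
}
```

If context 1 takes the lock, a check made from context 2 passes the first test but fails the second, which is the bug class the stricter assertion is able to catch.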

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/Ympr/+PX4XgT/UKU@hirez.programming.kicks-ass.net
    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-6-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:30 +02:00
Oleg Nesterov 6806642199 signal: Replace __group_send_sig_info with send_signal_locked
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit e71ba124078e391879e0bf111529fa2d630d106c
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 22 09:28:50 2022 -0500

    signal: Replace __group_send_sig_info with send_signal_locked

    The function __group_send_sig_info is just a light wrapper around
    send_signal_locked with one parameter fixed to a constant value.  As
    the wrapper adds no real value update the code to directly call the
    wrapped function.

    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-2-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:30 +02:00
Oleg Nesterov a4b8434f3e signal: Rename send_signal send_signal_locked
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 157cc18122b4a1456d19048e151a164216c4a704
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Apr 22 09:48:54 2022 -0500

    signal: Rename send_signal send_signal_locked

    Rename send_signal and __send_signal to send_signal_locked and
    __send_signal_locked to make send_signal usable outside of
    signal.c.

    Tested-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20220505182645.497868-1-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-07-06 15:55:30 +02:00
Oleg Nesterov 634b3e94ba ptrace: Return the signal to continue with from ptrace_stop
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 6487d1dab837214ec2fd3f0ddd5f787e63be7c20
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jan 27 12:19:13 2022 -0600

    ptrace: Return the signal to continue with from ptrace_stop

    The signal a task should continue with after a ptrace stop is
    inconsistently read, cleared, and sent.  Solve this by reading and
    clearing the signal to be sent in ptrace_stop.

    In an ideal world everything except ptrace_signal would share a common
    implementation of continuing with the signal, so ptracers could count
    on the signal they ask to continue with actually being delivered.  For
    now retain bug compatibility and just return with the signal number
    the ptracer requested the code continue with.

    Link: https://lkml.kernel.org/r/875yoe7qdp.fsf_-_@email.froward.int.ebiederm.org
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-06-23 14:09:58 +02:00
Oleg Nesterov 4916e577b7 ptrace: Move setting/clearing ptrace_message into ptrace_stop
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

commit 336d4b814bf078fa698488632c19beca47308896
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jan 27 12:15:32 2022 -0600

    ptrace: Move setting/clearing ptrace_message into ptrace_stop

    Today ptrace_message is easy to overlook as it is not a core part of
    ptrace_stop.  It has been overlooked so much that there are places
    that set ptrace_message and don't clear it, and places that never set
    it.  So if you get an unlucky sequence of events the ptracer may be
    able to read a ptrace_message that does not apply to the current
    ptrace stop.

    Move setting of ptrace_message into ptrace_stop so that it always gets
    set before the stop, and always gets cleared after the stop.  This
    prevents nonsense from being reported to userspace and makes
    ptrace_message more visible in the ptrace helper functions so that
    kernel developers can see it.
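
The set-before, clear-after discipline described above can be sketched as follows. The names are hypothetical and the "stop" is reduced to a single function call in which the tracer reads the message; the real ptrace_stop() sleeps and involves locking that this model ignores.

```c
#include <assert.h>

struct tracee_model {
    unsigned long ptrace_message;  /* value a tracer could query */
    unsigned long seen_by_tracer;  /* what the tracer read during the stop */
};

/* Stand-in for the window in which the tracer inspects the tracee. */
void tracer_inspects(struct tracee_model *t)
{
    t->seen_by_tracer = t->ptrace_message;
}

/*
 * Model of a stop that owns the message: it is always set on entry and
 * always cleared on exit, so a stale value cannot leak into a later stop.
 */
void ptrace_stop_model(struct tracee_model *t, unsigned long message)
{
    assert(t);
    t->ptrace_message = message;  /* set before the stop */
    tracer_inspects(t);           /* the stop itself */
    t->ptrace_message = 0;        /* cleared after the stop */
}
```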

    Link: https://lkml.kernel.org/r/87bky67qfv.fsf_-_@email.froward.int.ebiederm.org
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-06-23 14:09:42 +02:00
Oleg Nesterov 4203eee018 Revert "signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT."
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325

Upstream Status: RHEL-only

This reverts commit 975d318867.

Because it doesn't match upstream and thus conflicts with other necessary
changes. This fix will be re-introduced at the end of this series.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-06-23 14:07:10 +02:00
Juri Lelli 975d318867 signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT.
Bugzilla: https://bugzilla.redhat.com/2171995
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
Conflicts: Whitespace mismatch and missing series at merge commit
           67850b7bdcd2 ("Merge tag 'ptrace_stop-cleanup-for-v5.19'"),
           which seems a nice-to-have, but not essential.

commit 277af213394c063b8e2b8a712e11d41866d456d9
Author:    Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:      Wed Jun 22 11:36:17 2022 +0200

    signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT.

    Commit
       53da1d9456 ("fix ptrace slowness")

    is just a band-aid around the problem.
    The invocation of do_notify_parent_cldstop() wakes the parent and makes
    it runnable. The scheduler then wants to replace this still running task
    with the parent. With the read_lock() acquired this is not possible
    because preemption is disabled and so this is deferred until read_unlock().
    This scheduling point is undesired and is avoided by disabling preemption
    around the unlock operation and enabling it again before the schedule()
    invocation without a preemption point.
    This is only undesired because the parent sleeps a cycle in
    wait_task_inactive() until the traced task leaves the run-queue in
    schedule(). It is not a correctness issue, it is just a band-aid to avoid
    the visible delay which sums up over multiple invocations.
    The task can still be preempted if an interrupt occurs between
    preempt_enable_no_resched() and freezable_schedule() because on the IRQ-exit
    path of the interrupt scheduling _will_ happen. This is ignored since it does
    not happen very often.

    On PREEMPT_RT keeping preemption disabled during the invocation of
    cgroup_enter_frozen() becomes a problem because the function acquires
    css_set_lock which is a sleeping lock on PREEMPT_RT and must not be
    acquired with disabled preemption.

    Don't disable preemption on PREEMPT_RT. Remove the TODO regarding adding
    read_unlock_no_resched() as there is no need for it and it would cause harm.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lkml.kernel.org/r/20220720154435.232749-2-bigeasy@linutronix.de

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
2023-02-27 13:46:08 +01:00
Frantisek Hrbata e235b3c09a Merge: perf: Sync with upstream v5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1361

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123231

Signed-off-by: Michael Petlan <mpetlan@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Jerome Marchand <jmarchan@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-07 01:58:21 -05:00
Chris von Recklinghausen 6fb7c30612 task_work: Call tracehook_notify_signal from get_signal on all architectures
Bugzilla: https://bugzilla.redhat.com/2120352

commit 8ba62d37949e248c698c26e0d82d72fda5d33ebf
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Feb 9 09:51:14 2022 -0600

    task_work: Call tracehook_notify_signal from get_signal on all architectures

    Always handle TIF_NOTIFY_SIGNAL in get_signal.  With commit 35d0b389f3
    ("task_work: unconditionally run task_work from get_signal()") always
    calling task_work_run all of the work of tracehook_notify_signal is
    already happening except clearing TIF_NOTIFY_SIGNAL.

    Factor clear_notify_signal out of tracehook_notify_signal and use it in
    get_signal so that get_signal only needs one call of task_work_run.

    To keep the semantics in sync update xfer_to_guest_mode_work (which
    does not call get_signal) to call tracehook_notify_signal if either
    _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/20220309162454.123006-8-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:47 -04:00
Chris von Recklinghausen 00a98ce2d4 task_work: Introduce task_work_pending
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7f62d40d9cb50fd146fe8ff071f98fa3c1855083
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Feb 9 08:52:41 2022 -0600

    task_work: Introduce task_work_pending

    Wrap the test of task->task_works in a helper function to make
    it clear what is being tested.

    All of the other readers of task->task_works use READ_ONCE and this is
    even necessary on current as other processes can update
    task->task_works.  So for consistency I have added READ_ONCE into
    task_work_pending.
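
A user-space analogue of such a helper might look like this. It is a sketch under stated assumptions: a relaxed C11 atomic load stands in for the kernel's READ_ONCE(), and `struct task_model`, `struct work_model` and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct work_model {
    struct work_model *next;
};

struct task_model {
    /* Head of the pending-work list; other threads may update it. */
    _Atomic(struct work_model *) task_works;
};

/* One named helper that says what is being tested, using a single load. */
bool task_work_pending_model(struct task_model *task)
{
    assert(task);
    return atomic_load_explicit(&task->task_works,
                                memory_order_relaxed) != NULL;
}

/* Push one work item onto the list; this may run in another thread. */
void task_work_add_model(struct task_model *task, struct work_model *work)
{
    struct work_model *head =
        atomic_load_explicit(&task->task_works, memory_order_relaxed);

    do {
        work->next = head;  /* head is refreshed when the exchange fails */
    } while (!atomic_compare_exchange_weak_explicit(&task->task_works,
                                                    &head, work,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}
```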

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:47 -04:00
Chris von Recklinghausen f64a4f551f ptrace: Remove tracehook_signal_handler
Bugzilla: https://bugzilla.redhat.com/2120352

commit c145137dc990fd67b52fbc52faae5ba46f168cca
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jan 27 12:04:27 2022 -0600

    ptrace: Remove tracehook_signal_handler

    The two line function tracehook_signal_handler is only called from
    signal_delivered.  Expand it inline in signal_delivered and remove it.
    Just to make it easier to understand what is going on.

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/20220309162454.123006-5-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:47 -04:00
Chris von Recklinghausen e149d74e12 signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5c72263ef2fbe99596848f03758ae2dc593adf2c
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Feb 8 00:57:17 2022 -0800

    signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE

    Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
    were not being delivered to ptraced pid namespace init processes. Make
    sure the SIGNAL_UNKILLABLE doesn't get set for these cases.

    Reported-by: Robert Święcki <robert@swiecki.net>
    Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:44 -04:00
Chris von Recklinghausen a9c55ab07f signal: clean up kernel-doc comments
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6410349ea5e177f3e53c2006d2041eed47e986ae
Author: Randy Dunlap <rdunlap@infradead.org>
Date:   Tue Dec 21 19:10:27 2021 -0800

    signal: clean up kernel-doc comments

    Fix kernel-doc warnings in kernel/signal.c:

    kernel/signal.c:1830: warning: Function parameter or member 'force_coredump' not described in 'force_sig_seccomp'
    kernel/signal.c:2873: warning: missing initial short description on line:
     * signal_delivered -

    Also add a closing parenthesis to the comments in signal_delivered().

    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lkml.kernel.org/r/20211222031027.29694-1-rdunlap@infradead.org
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen e545ae66be signal: Remove the helper signal_group_exit
Bugzilla: https://bugzilla.redhat.com/2120352

commit 49697335e0b441b0553598c1b48ee9ebb053d2f1
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jun 24 02:14:30 2021 -0500

    signal: Remove the helper signal_group_exit

    This helper is misleading.  It tests for an ongoing exec as well as
    the process having received a fatal signal.

    Sometimes it is appropriate to treat an on-going exec differently than
    a process that is shutting down due to a fatal signal.  In particular
    taking the fast path out of exit_signals instead of retargeting
    signals is not appropriate during exec, and not changing the exit
    code in do_group_exit during exec.

    Removing the helper makes it more obvious what is going on as both
    cases must be coded for explicitly.

    While removing the helper fix the two cases where I have observed
    using signal_group_exit resulted in the wrong result.

    In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
    retargeted during an exec.

    In do_group_exit use 0 as the exit code during an exec as de_thread
    does not set group_exit_code.  As best as I can determine
    group_exit_code has been set to 0 most of the time during
    de_thread.  During a thread group stop group_exit_code is set to the
    stop signal and when the thread group receives SIGCONT group_exit_code
    is reset to 0.

    Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen 282c129641 signal: Remove SIGNAL_GROUP_COREDUMP
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2f824d4d197e02275562359a2ae5274177ce500c
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Jan 8 09:48:31 2022 -0600

    signal: Remove SIGNAL_GROUP_COREDUMP

    After the previous cleanups "signal->core_state" is set whenever
    SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
    whenever the code wants to know if a coredump is in progress.  The
    remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
    SIGNAL_GROUP_EXIT is set.  Similarly the only place that sets
    SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.

    This makes SIGNAL_GROUP_COREDUMP unnecessary and redundant. So stop
    setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
    remove its definition.

    With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
    longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
    by setting SIGNAL_GROUP_EXIT.

    Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen d8085e9b32 signal: Make coredump handling explicit in complete_signal
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7ba03471ac4ad2432e5ccf67d9d4ab03c177578a
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Jan 8 11:01:12 2022 -0600

    signal: Make coredump handling explicit in complete_signal

    Ever since commit 6cd8f0acae ("coredump: ensure that SIGKILL always
    kills the dumping thread") it has been possible for a SIGKILL received
    during a coredump to set SIGNAL_GROUP_EXIT and trigger a process
    shutdown (for a second time).

    Update the logic to explicitly allow coredumps so that coredumps can
    set SIGNAL_GROUP_EXIT and shutdown like an ordinary process.

    Link: https://lkml.kernel.org/r/87zgo6ytyf.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen 7e62823861 signal: Have prepare_signal detect coredumps using signal->core_state
Bugzilla: https://bugzilla.redhat.com/2120352

commit a0287db0f1d6918919203ba31fd7cda59bf889e8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Jan 8 09:34:50 2022 -0600

    signal: Have prepare_signal detect coredumps using signal->core_state

    In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
    prepare_signal to test signal->core_state instead of the flag
    SIGNAL_GROUP_COREDUMP.

    Both fields are protected by siglock and both live in signal_struct so
    there are no real tradeoffs here, just a change to which field is
    being tested.

    Link: https://lkml.kernel.org/r/20211213225350.27481-1-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/875yqu14co.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen 56644a77a7 signal: Replace force_fatal_sig with force_exit_sig when in doubt
Conflicts: drop changes to arch/m68k/kernel/traps.c,
	arch/sparc/kernel/signal_32.c, arch/sparc/kernel/windows.c -
		unsupported arches

Bugzilla: https://bugzilla.redhat.com/2120352

commit fcb116bc43c8c37c052530ead79872f8b2615711
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Nov 18 14:23:21 2021 -0600

    signal: Replace force_fatal_sig with force_exit_sig when in doubt

    Recently to prevent issues with SECCOMP_RET_KILL and similar signals
    being changed before they are delivered SA_IMMUTABLE was added.

    Unfortunately this broke debuggers[1][2] which reasonably expect
    to be able to trap synchronous SIGTRAP and SIGSEGV even when
    the target process is not configured to handle those signals.

    Add force_exit_sig and use it instead of force_fatal_sig where
    historically the code has directly called do_exit.  This has the
    implementation benefits of going through the signal exit path
    (including generating core dumps) without the danger of allowing
    userspace to ignore or change these signals.

    This avoids userspace regressions as older kernels exited with do_exit
    which debuggers also can not intercept.

    In the future it should be possible to improve the quality of
    implementation of the kernel by changing some of these force_exit_sig
    calls to force_fatal_sig.  That can be done where it matters on
    a case-by-case basis with careful analysis.

    Reported-by: Kyle Huey <me@kylehuey.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    [1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
    [2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Fixes: a3616a3c0272 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
    Fixes: 83a1f27ad773 ("signal/powerpc: On swapcontext failure force SIGSEGV")
    Fixes: 9bc508cf0791 ("signal/s390: Use force_sigsegv in default_trap_handler")
    Fixes: 086ec444f866 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
    Fixes: c317d306d550 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
    Fixes: 695dd0d634df ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
    Fixes: 1fbd60df8a85 ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
    Fixes: 941edc5bf174 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
    Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kyle Huey <khuey@kylehuey.com>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Chris von Recklinghausen 73393e9677 signal: Don't always set SA_IMMUTABLE for forced signals
Bugzilla: https://bugzilla.redhat.com/2120352

commit e349d945fac76bddc78ae1cb92a0145b427a87ce
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Nov 18 11:11:13 2021 -0600

    signal: Don't always set SA_IMMUTABLE for forced signals

    Recently to prevent issues with SECCOMP_RET_KILL and similar signals
    being changed before they are delivered SA_IMMUTABLE was added.

    Unfortunately this broke debuggers[1][2] which reasonably expect to be
    able to trap synchronous SIGTRAP and SIGSEGV even when the target
    process is not configured to handle those signals.

    Update force_sig_to_task to support both the case when we can allow
    the debugger to intercept and possibly ignore the signal and the case
    when it is not safe to let userspace know about the signal until the
    process has exited.

    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reported-by: Kyle Huey <me@kylehuey.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Cc: stable@vger.kernel.org
    [1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
    [2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Link: https://lkml.kernel.org/r/877dd5qfw5.fsf_-_@email.froward.int.ebiederm.org
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kyle Huey <khuey@kylehuey.com>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Chris von Recklinghausen beff6d154c signal: Requeue ptrace signals
Bugzilla: https://bugzilla.redhat.com/2120352

commit b171f667f3787946a8ba9644305339e93ae799c9
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Nov 15 13:49:45 2021 -0600

    signal: Requeue ptrace signals

    Kyle Huey <me@kylehuey.com> writes:

    > rr, a userspace record and replay debugger[0], uses the recorded register
    > state at PTRACE_EVENT_EXIT to find the point in time at which to cease
    > executing the program during replay.
    >
    > If a SIGKILL races with processing another signal in get_signal, it is
    > possible for the kernel to decline to notify the tracer of the original
    > signal. But if the original signal had a handler, the kernel proceeds
    > with setting up a signal handler frame as if the tracer had chosen to
    > deliver the signal unmodified to the tracee. When the kernel goes to
    > execute the signal handler that it has now modified the stack and registers
    > for, it will discover the pending SIGKILL, and terminate the tracee
    > without executing the handler. When PTRACE_EVENT_EXIT is delivered to
    > the tracer, however, the effects of handler setup will be visible to
    > the tracer.
    >
    > Because rr (the tracer) was never notified of the signal, it is not aware
    > that a signal handler frame was set up and expects the state of the program
    > at PTRACE_EVENT_EXIT to be a state that will be reconstructed naturally
    > by allowing the program to execute from the last event. When that fails
    > to happen during replay, rr will assert and die.
    >
    > The following patches add an explicit check for a newly pending SIGKILL
    > after the ptracer has been notified and the siglock has been reacquired.
    > If this happens, we stop processing the current signal and proceed
    > immediately to handling the SIGKILL. This makes the state reported at
    > PTRACE_EVENT_EXIT the unmodified state of the program, and also avoids the
    > work to set up a signal handler frame that will never be used.
    >
    > [0] https://rr-project.org/

    The problem is that while the traced process makes it into ptrace_stop,
    the tracee is killed before the tracer manages to wait for the
    tracee and discover which signal was about to be delivered.

    More generally the problem is that while siglock was dropped a signal
    with process wide effect is short circuit delivered to the entire
    process killing it, but the process continues to try and deliver another
    signal.

    In general it is impossible to avoid all cases where work is performed
    after the process has been killed.  In particular if the process is
    killed after get_signal returns the code will simply not know it has
    been killed until after delivering the signal frame to userspace.

    On the other hand when the code has already discovered the process
    has been killed and taken user space visible action that shows
    the kernel knows the process has been killed, it is just silly
    to then write the signal frame to the user space stack.

    Instead of being silly detect the process has been killed
    in ptrace_signal and requeue the signal so the code can pretend
    it was simply never dequeued for delivery.

    To test the process has been killed I use fatal_signal_pending rather
    than signal_group_exit to match the test in signal_pending_state which
    is used in schedule which is where ptrace_stop detects the process has
    been killed.

    Requeuing the signal so the code can pretend it was simply never
    dequeued improves the user space visible behavior that has been
    present since ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4").

    Kyle Huey verified this change in behavior and that it makes rr happy.

    Reported-by: Kyle Huey <khuey@kylehuey.com>
    Reported-by: Marko Mäkelä <marko.makela@mariadb.com>
    History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/87tugcd5p2.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Chris von Recklinghausen 30ab0124bb signal: Requeue signals in the appropriate queue
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5768d8906bc23d512b1a736c1e198aa833a6daa4
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Nov 15 13:47:13 2021 -0600

    signal: Requeue signals in the appropriate queue

    In the event that a tracer changes which signal needs to be delivered
    and that signal is currently blocked then the signal needs to be
    requeued for later delivery.

    With the advent of CLONE_THREAD the kernel has 2 signal queues per
    task.  The per process queue and the per task queue.  Update the code
    so that if the signal is removed from the per process queue it is
    requeued on the per process queue.  This is necessary to make it
    appear the signal was never dequeued.

    The rr debugger reasonably believes that the state of the process from
    the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
    by simply letting a process run.  If a SIGKILL interrupts a ptrace_stop
    this is not true today.

    So return signals to their original queue in ptrace_signal so that
    signals that are not delivered appear like they were never dequeued.
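
The two-queue bookkeeping described above can be sketched as a small user-space model. This is an illustration only: the names (task_model, model_dequeue, model_requeue) and the fixed-size arrays are hypothetical and do not correspond to the kernel's data structures. The point it shows is that the dequeue step records which queue the signal came from, so the requeue step can put it back in the same place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum queue_id { QUEUE_TASK, QUEUE_PROCESS };

#define QUEUE_CAP 8

struct sig_queue {
    int sigs[QUEUE_CAP];
    size_t len;
};

struct task_model {
    struct sig_queue task_q;    /* per-task (thread-private) queue */
    struct sig_queue process_q; /* per-process (shared) queue */
};

/* Append a signal number; returns 0 on success, -1 if the queue is full. */
static int queue_push(struct sig_queue *q, int sig)
{
    if (q->len == QUEUE_CAP)
        return -1;
    q->sigs[q->len++] = sig;
    return 0;
}

/* Remove and return the oldest entry, or 0 if the queue is empty. */
static int queue_pop(struct sig_queue *q)
{
    int sig;
    size_t i;

    if (q->len == 0)
        return 0;
    sig = q->sigs[0];
    for (i = 1; i < q->len; i++)
        q->sigs[i - 1] = q->sigs[i];
    q->len--;
    return sig;
}

/* Try the per-task queue first, then the shared one; record the source. */
static int model_dequeue(struct task_model *t, enum queue_id *from)
{
    int sig = queue_pop(&t->task_q);

    if (sig) {
        *from = QUEUE_TASK;
        return sig;
    }
    sig = queue_pop(&t->process_q);
    if (sig)
        *from = QUEUE_PROCESS;
    return sig;
}

/* Put an undelivered signal back on the queue it was taken from. */
static int model_requeue(struct task_model *t, int sig, enum queue_id from)
{
    struct sig_queue *q =
        (from == QUEUE_PROCESS) ? &t->process_q : &t->task_q;

    return queue_push(q, sig);
}
```

In this model a signal taken from the shared queue and then requeued ends up on the shared queue again, so any other thread in the modeled process could still pick it up.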

    Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
    History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Chris von Recklinghausen b47643344f signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
Bugzilla: https://bugzilla.redhat.com/2120352

commit 00b06da29cf9dc633cdba87acd3f57f4df3fd5c7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Oct 29 09:14:19 2021 -0500

    signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed

    Andy pointed out that there are races between
    force_sig_info_to_task and sigaction[1].  As
    Kees discovered[2] ptrace is also able to change these signals.

    In the case of seccomp killing a process with a signal it is a
    security violation to allow the signal to be caught or manipulated.

    Solve this problem by introducing a new flag SA_IMMUTABLE that
    prevents sigaction and ptrace from modifying these forced signals.
    This flag is carefully made kernel internal so that no new ABI is
    introduced.

    Longer term I think this can be solved by guaranteeing short circuit
    delivery of signals in this case.  Unfortunately reliable and
    guaranteed short circuit delivery of these signals is still a ways off
    from being implemented, tested, and merged.  So I have implemented a much
    simpler alternative for now.
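
The idea of an internal flag that blocks later modification can be sketched with a toy model. Everything here is an assumption made for illustration: the flag value, the struct, and the function names are hypothetical and are not the kernel's definitions. The sketch only shows the shape of the check: once an action has been forced and marked immutable, a later attempt to install a handler is refused.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: the flag value and all names are illustrative only. */
#define MODEL_SA_IMMUTABLE 0x1u

typedef void (*model_handler_t)(int);

struct model_action {
    model_handler_t handler; /* NULL stands for the default action */
    unsigned int flags;
};

/* Placeholder handler used when exercising the model. */
static void model_noop_handler(int sig)
{
    (void)sig;
}

/* Force the default action and mark it so it can no longer be changed. */
static void model_force_default(struct model_action *act)
{
    act->handler = NULL;
    act->flags |= MODEL_SA_IMMUTABLE;
}

/* Install a handler unless the action has been marked immutable. */
static int model_sigaction(struct model_action *act, model_handler_t h)
{
    if (act->flags & MODEL_SA_IMMUTABLE)
        return -EINVAL;
    act->handler = h;
    return 0;
}
```

Because the flag lives only inside the model's own struct, nothing about it is visible to callers except the refusal, which mirrors the commit's remark that no new ABI is introduced.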

    [1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com
    [2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook
    Cc: stable@vger.kernel.org
    Fixes: 307d522f5eb8 ("signal/seccomp: Refactor seccomp signal and coredump generation")
    Tested-by: Andrea Righi <andrea.righi@canonical.com>
    Tested-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:26 -04:00
Chris von Recklinghausen 03300292e0 signal: Implement force_fatal_sig
Bugzilla: https://bugzilla.redhat.com/2120352

commit 26d5badbccddcc063dc5174a2baffd13a23322aa
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Oct 20 12:43:59 2021 -0500

    signal: Implement force_fatal_sig

    Add a simple helper force_fatal_sig that causes a signal to be
    delivered to a process as if the signal handler was set to SIG_DFL.

    Reimplement force_sigsegv based upon this new helper.  This fixes
    force_sigsegv so that when it forces the default signal handler
    to be used the code now forces the signal to be unblocked as well.

    Reusing the tested logic in force_sig_info_to_task that was built for
    force_sig_seccomp makes the implementation trivial.

    This is interesting both because it makes force_sigsegv simpler and
    because there are a couple of buggy places in the kernel that call
    do_exit(SIGILL) or do_exit(SIGSYS) because there is no straight
    forward way today for those places to simply force the exit of a
    process with the chosen signal.  Creating force_fatal_sig allows
    those places to be implemented with normal signal exits.
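
A rough user-space analogy of "deliver the signal as if the handler were SIG_DFL, and unblock it as well" is sketched below. This is not the kernel helper: die_with_default_action and run_child_and_get_termsig are hypothetical names, and the sketch uses only ordinary POSIX calls (sigaction, sigprocmask, raise, fork, waitpid). The child first ignores and blocks the signal, then undoes both before raising it, so the parent observes a normal signal exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * User-space analogy only (hypothetical helper, not the kernel function):
 * restore the default action for sig, unblock it, then raise it.
 */
static void die_with_default_action(int sig)
{
    struct sigaction sa;
    sigset_t set;

    sa.sa_handler = SIG_DFL;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(sig, &sa, NULL);

    sigemptyset(&set);
    sigaddset(&set, sig);
    sigprocmask(SIG_UNBLOCK, &set, NULL);

    raise(sig);
    _exit(127); /* only reached if the default action does not terminate */
}

/*
 * Fork a child that ignores and blocks sig, then forces it anyway.
 * Returns the signal that terminated the child, or -1 otherwise.
 */
static int run_child_and_get_termsig(int sig)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        struct sigaction ign;
        sigset_t set;

        setrlimit(RLIMIT_CORE, &no_core); /* keep the demo from dumping core */

        ign.sa_handler = SIG_IGN;
        ign.sa_flags = 0;
        sigemptyset(&ign.sa_mask);
        sigaction(sig, &ign, NULL);

        sigemptyset(&set);
        sigaddset(&set, sig);
        sigprocmask(SIG_BLOCK, &set, NULL);

        die_with_default_action(sig);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling run_child_and_get_termsig(SIGSEGV) should report SIGSEGV as the terminating signal even though the child had ignored and blocked it beforehand.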

    Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:26 -04:00
Chris von Recklinghausen f7fb43f6b1 coredump: Don't perform any cleanups before dumping core
Bugzilla: https://bugzilla.redhat.com/2120352

commit 92307383082daff5df884a25df9e283efb7ef261
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 1 11:33:50 2021 -0500

    coredump:  Don't perform any cleanups before dumping core

    Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
    before PTRACE_EVENT_EXIT, and before any cleanup work for a task
    happens.  This ensures that an accurate copy of the process can be
    captured in the coredump as no cleanup for the process happens before
    the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
    will not be visited by any thread until the coredump is complete.

    Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
    coredump_task_exit can be recognized and ignored in zap_process.

    Now that all of the coredumping happens before exit_mm remove code to
    test for a coredump in progress from mm_release.

    Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
    The other tests in may_ptrace_stop all concern avoiding stopping
    during a coredump.  These tests are no longer necessary as it is now
    guaranteed that fatal_signal_pending will be set if the code enters
    ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
    not to stop if fatal_signal_pending returns true.

    Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
    ptrace_stop without fatal_signal_pending being true, as signals are
    dequeued in get_signal before calling do_exit.  This is no longer
    an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
    until after the coredump completes.

    Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Chris von Recklinghausen 52178dcae5 ptrace: Remove the unnecessary arguments from arch_ptrace_stop
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4f627af8e6068892cafe031df6c14e8a0aaaa426
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Sep 2 16:10:11 2021 -0500

    ptrace: Remove the unnecessary arguments from arch_ptrace_stop

    Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
    exit_code and a siginfo structure.  Neither argument is used by any of
    the implementations so just remove the unneeded arguments.

    The two architectures that implement arch_ptrace_stop are ia64 and
    sparc.  Both architectures flush their register stacks before a
    ptrace_stop so that all of the register information can be accessed
    by debuggers.

    As the question of whether a register stack needs to be flushed is
    independent of why ptrace is stopping, not needing arguments makes sense.

    Cc: David Miller <davem@davemloft.net>
    Cc: sparclinux@vger.kernel.org
    Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:24 -04:00
Chris von Recklinghausen c17e10396a signal: Remove the bogus sigkill_pending in ptrace_stop
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7d613f9f72ec8f90ddefcae038fdae5adb8404b3
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 1 13:21:34 2021 -0500

    signal: Remove the bogus sigkill_pending in ptrace_stop

    The existence of sigkill_pending is a little silly as it is
    functionally a duplicate of fatal_signal_pending that is used in
    exactly one place.

    Checking for pending fatal signals and returning early in ptrace_stop
    is actively harmful.  It causes the ptrace_stop called by
    ptrace_signal to return early before setting current->exit_code.
    Later, when ptrace_signal reads the signal number from
    current->exit_code, the value is undefined, making it unpredictable
    what will happen.

    Instead rely on the fact that schedule will not sleep if there is a
    pending signal that can awaken a task.

    Removing the explicit sigkill_pending test fixes ptrace_signal
    when ptrace_stop does not stop because current->exit_code is always
    set to signr.

    Cc: stable@vger.kernel.org
    Fixes: 3d749b9e67 ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
    Fixes: 1a669c2f16 ("Add arch_ptrace_stop")
    Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:24 -04:00
Chris von Recklinghausen ba903565a5 signal/seccomp: Refactor seccomp signal and coredump generation
Bugzilla: https://bugzilla.redhat.com/2120352

commit 307d522f5eb86cd6ac8c905f5b0577dedac54ec5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Jun 23 16:44:32 2021 -0500

    signal/seccomp: Refactor seccomp signal and coredump generation

    Factor out force_sig_seccomp from the seccomp signal generation and
    place it in kernel/signal.c.  The function force_sig_seccomp takes a
    parameter force_coredump to indicate that the sigaction field should
    be reset to SIG_DFL so that a coredump will be generated when the
    signal is delivered.

    force_sig_seccomp is then used to replace both seccomp_send_sigsys
    and seccomp_init_siginfo.

    force_sig_info_to_task gains an extra parameter to force using
    the default signal action.

    With this change seccomp is no longer a special case and there
    becomes exactly one place do_coredump is called from.

    Further it no longer becomes necessary for __seccomp_filter
    to call do_group_exit.

    Acked-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:23 -04:00
Michael Petlan 2e62db505d signal: Deliver SIGTRAP on perf event asynchronously if blocked
Bugzilla: https://bugzilla.redhat.com/2123231

upstream
========
commit 78ed93d72ded679e3caf0758357209887bda885f
Author: Marco Elver <elver@google.com>
Date: Mon Apr 4 13:12:04 2022 +0200

description
===========
With SIGTRAP on perf events, we have encountered termination of
processes due to user space attempting to block delivery of SIGTRAP.
Consider this case:

    <set up SIGTRAP on a perf event>
    ...
    sigset_t s;
    sigemptyset(&s);
    sigaddset(&s, SIGTRAP | <and others>);
    sigprocmask(SIG_BLOCK, &s, ...);
    ...
    <perf event triggers>

When the perf event triggers, while SIGTRAP is blocked, force_sig_perf()
will force the signal, but revert back to the default handler, thus
terminating the task.

This makes sense for error conditions, but not so much for explicitly
requested monitoring. However, the expectation is still that signals
generated by perf events are synchronous, which will no longer be the
case if the signal is blocked and delivered later.

To give user space the ability to clearly distinguish synchronous from
asynchronous signals, introduce siginfo_t::si_perf_flags and
TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
required in future).

The resolution to the problem is then to (a) no longer force the signal
(avoiding the terminations), but (b) tell user space via si_perf_flags
if the signal was synchronous or not, so that such signals can be
handled differently (e.g. let user space decide to ignore or consider
the data imprecise).

The alternative of making the kernel ignore SIGTRAP on perf events if
the signal is blocked may work for some usecases, but likely causes
issues in others that then have to revert back to interception of
sigprocmask() (which we want to avoid). [ A concrete example: when using
breakpoint perf events to track data-flow, in a region of code where
signals are blocked, data-flow can no longer be tracked accurately.
When a relevant asynchronous signal is received after unblocking the
signal, the data-flow tracking logic needs to know its state is
imprecise. ]
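
The deferred delivery described above can be observed from user space with plain POSIX calls. The sketch below is an assumption-laden stand-in: it uses raise(SIGTRAP) in place of a real perf event, and blocked_trap_demo is a hypothetical helper name. It shows that a SIGTRAP generated while blocked stays pending and reaches the handler only after the signal is unblocked. A handler for real perf events would additionally inspect si_code and the TRAP_PERF_FLAG_ASYNC bit in si_perf_flags; that part is omitted here because it depends on the installed header versions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t trap_seen;

/* SA_SIGINFO-style handler; a perf-aware handler would inspect info. */
static void on_trap(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)info;
    (void)ucontext;
    trap_seen = 1;
}

/*
 * Block SIGTRAP, generate one, and report what was observed before and
 * after unblocking. Returns 0 on success, -1 if a setup call failed.
 */
static int blocked_trap_demo(int *pending_while_blocked,
                             int *handled_while_blocked,
                             int *handled_after_unblock)
{
    struct sigaction sa;
    sigset_t block, pending;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_trap;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGTRAP);
    if (sigprocmask(SIG_BLOCK, &block, NULL) != 0)
        return -1;

    trap_seen = 0;
    raise(SIGTRAP); /* stand-in for the perf event; the signal stays pending */

    if (sigpending(&pending) != 0)
        return -1;
    *pending_while_blocked = sigismember(&pending, SIGTRAP);
    *handled_while_blocked = trap_seen;

    /* Unblocking lets the pending signal reach the handler. */
    if (sigprocmask(SIG_UNBLOCK, &block, NULL) != 0)
        return -1;
    *handled_after_unblock = trap_seen;
    return 0;
}
```

By the time the handler runs, the code that triggered the signal has long since moved on, which is why the commit wants user space to be told that such a delivery was asynchronous.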

Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")

Conflicts:
==========
Ignoring hunks in arm32, m68k and sparc files, since we don't support
these architectures.

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
2022-09-21 07:22:42 +02:00
Phil Auld fb68d400e6 signal: In get_signal test for signal_group_exit every time through the loop
Bugzilla: https://bugzilla.redhat.com/2120671

commit e7f7c99ba911f56bc338845c1cd72954ba591707
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Nov 15 11:55:57 2021 -0600

    signal: In get_signal test for signal_group_exit every time through the loop

    Recently while investigating a problem with rr and signals I noticed
    that siglock is dropped in ptrace_signal and get_signal does not jump
    to relock.

    Looking farther to see if the problem is anywhere else I see that
    do_signal_stop also returns if signal_group_exit is true.  I believe
    that test can now never be true, but it is a bit hard to trace
    through and be certain.

    Testing signal_group_exit is not expensive, so move the test for
    signal_group_exit into the for loop inside of get_signal to ensure
    the test is never skipped improperly.

    This has been a potential problem since the test for
    signal_group_exit was added.

    Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-09-01 09:16:55 -04:00
Herton R. Krzesinski 99b4ffe3da Merge: Enable AMX(TMUL) for Sapphire Rapids
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/174
This is the first draft of the commits which are required to support
AMX (aka TMUL) on SPR.  I suspect there are some missing commits.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2004190
Tested: Intel to perform initial testing.

v2: added missing upstream commits 21e96a2035db and 52d0b8b18776

Signed-off-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Steve Best <sbest@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-22 19:34:58 -03:00
Herton R. Krzesinski d635b9c68b Merge: mm: update generic MM code to upstream v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/201
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396
Brew URL: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=41434412
 Testing: KT1-lite + regressions and performance (scheduler, network, and fs)
          benchmarks, as documented on the BZ.

In order to provide support for several future feature requests (virtio-mem,
filesystems, core-kernel and memory management) targeted for RHEL-9,
the patchset is bringing the core-MM codebase up to upstream v5.15.

This patchset is composed of upstream cherry picks that represent the
difference between current RHEL-9 v5.14 code base and upstream v5.15 plus
their relevant follow-up fixes.

Omitted-fix: 15eb7c888e749 ("locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT")
	     already backported into RHEL9 via commit de3eb21475

Omitted-fix: 6341eb6f39bb7 ("drm/i915/selftests: exercise shmem_writeback with THP")
	     dependencies for this selftest follow up (and the follow-up itself)
	     shall be dealt with via DRM update work done by the graphics team.

Omitted-fix: f24b062607678 ("mm/damon: grammar s/works/work/")
Omitted-fix: db7a347b26fe0 ("mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation")
Omitted-fix: d78f3853f831e ("mm/damon/dbgfs: fix missed use of damon_dbgfs_lock")
	     albeit DAMON initial integration is part of v5.15, we're explicitly
             introducing it disabled in this backport. DAMON follow-ups, and future
             enablement will be dealt with via a separated (already filed) BZ ticket.

Omitted-fix: e66435936756d ("mm: fix mismerge of folio page flag manipulators")
	     folio pages are a feature integrated into v5.16, and this merge-fix
	     commit is non-relevant to this particular patchset.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: John W. Linville <linville@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Aristeu Rozanski <arozansk@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>
RH-Acked-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-15 11:00:36 -03:00
Herton R. Krzesinski f4fa2705fb Merge: hrtimer updates for RT prerequisites
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/145
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2022896
Upstream Status: Linux
Tested: Sanity tested with timer and scheduler tests

hrtimer updates for RT prerequisites

This series updates hrtimer and related code so that the RT patchset
merges more cleanly.

Signed-off-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-09 15:01:47 -03:00
David Arcari dd347f557f signal: Add an optional check for altstack size
Bugzilla: http://bugzilla.redhat.com/2004190

commit 1bdda24c4af64cd2d65dec5192ab624c5fee7ca0
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Oct 21 15:55:05 2021 -0700

    signal: Add an optional check for altstack size

    New x86 FPU features will be very large, requiring ~10k of stack in
    signal handlers.  These new features require a new approach called
    "dynamic features".

    The kernel currently tries to ensure that altstacks are reasonably
    sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
    However, that 2k is a constant. Simply raising that 2k requirement
    to >10k for the new features would break existing apps which have a
    compiled-in size of 2k.

    Instead of universally enforcing a larger stack, prohibit a process from
    using dynamic features without properly-sized altstacks. This must be
    enforced in two places:

     * A dynamic feature cannot be enabled without a large-enough altstack
       for each process thread.
     * Once a dynamic feature is enabled, any request to install a too-small
       altstack will be rejected.

    The dynamic feature enabling code must examine each thread in a
    process to ensure that the altstacks are large enough. Add a new lock
    (sigaltstack_lock()) to ensure that threads can not race and change
    their altstack after being examined.

    Add the infrastructure in the form of a config option and provide empty
    stubs for architectures which do not need dynamic altstack size checks.

    This implementation will be fleshed out for x86 in a future patch called

      x86/arch_prctl: Add controls for dynamic XSTATE components

      [dhansen: commit message. ]

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com

Signed-off-by: David Arcari <darcari@redhat.com>
2021-11-29 12:25:08 -05:00
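
The altstack size requirement described in commit dd347f557f above can be
illustrated from user space. The following is only a minimal sketch, assuming
a Linux/glibc system: it sizes the alternate stack from the larger of SIGSTKSZ
and the AT_MINSIGSTKSZ auxiliary-vector value (when the kernel provides one)
instead of relying on a compiled-in constant. The helper names
pick_altstack_size() and install_altstack() and the ALTSTACK_HEADROOM value
are hypothetical and are not part of the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>

/* Fallback for older libc headers; 51 is the value in the Linux UAPI headers. */
#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

/* Hypothetical extra room for the handler's own frames (an assumption). */
#define ALTSTACK_HEADROOM 8192UL

/* Choose an altstack size at run time rather than at compile time. */
static size_t pick_altstack_size(void)
{
    size_t size = (size_t)SIGSTKSZ;
    /* getauxval() returns 0 when the kernel does not provide the entry. */
    unsigned long kernel_min = getauxval(AT_MINSIGSTKSZ);

    if (kernel_min != 0 && kernel_min + ALTSTACK_HEADROOM > size)
        size = kernel_min + ALTSTACK_HEADROOM;
    return size;
}

/* Allocate and register an alternate signal stack; returns 0 on success. */
static int install_altstack(size_t size)
{
    stack_t ss;

    assert(size > 0);
    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        /* The kernel rejected the request, e.g. because it is too small. */
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

A caller would typically run install_altstack(pick_altstack_size()) once per
thread before installing handlers with SA_ONSTACK.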
Rafael Aquini fb61f7e0c9 memcg: enable accounting for signals
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5f58c39819ff78ca5ddbba2b3cd8ff4779b19bb5
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Thu Sep 2 14:55:35 2021 -0700

    memcg: enable accounting for signals

    When a user sends a signal to another process, it forces the kernel to
    allocate memory for 'struct sigqueue' objects.  The number of signals is
    limited by the RLIMIT_SIGPENDING resource limit, but even the default
    settings allow each user to consume up to several megabytes of memory.

    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.

    Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Yutian Yang <nglaive@gmail.com>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:21 -05:00
Phil Auld f1fe1713a3 posix-cpu-timers: Assert task sighand is locked while starting cputime counter
Bugzilla: http://bugzilla.redhat.com/2022896

commit a5dec9f82ab2ae486119f0b0820ea16db3e522c3
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Jul 26 14:55:08 2021 +0200

    posix-cpu-timers: Assert task sighand is locked while starting cputime counter

    Starting the process-wide cputime counter needs to be done in the same
    sighand locking sequence as actually arming the related timer; otherwise
    this races against concurrent timers being set or expiring in the same
    threadgroup.

    Detecting that the cputime counter is started without holding the sighand
    lock is a first step toward debugging such situations.

    Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-15 10:29:56 -05:00
Alexey Gladkov 1cbc87c091 ucounts: Fix signal ucount refcounting
Bugzilla: https://bugzilla.redhat.com/2018142
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 15bc01effefe97757ef02ca09e9d1b927ab22725
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sat Oct 16 15:59:49 2021 -0500

    ucounts: Fix signal ucount refcounting

    In commit fda31c5029 ("signal: avoid double atomic counter
    increments for user accounting") Linus made a clever optimization to
    how rlimits and the struct user_struct are handled.  Unfortunately that
    optimization does not work in the obvious way when moved to nested
    rlimits.  The problem is that the last decrement of the per user
    namespace per user sigpending counter might also be the last decrement
    of the sigpending counter in the parent user namespace, which means
    that simply freeing the leaf ucount in __free_sigqueue is not enough.

    Maintain the optimization and handle the tricky cases by introducing
    inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

    By moving the entire optimization into functions that perform all of
    the work it becomes possible to ensure that every level is handled
    properly.

    The new function inc_rlimit_get_ucounts returns 0 on failure to
    increment the ucount.  This is different from inc_rlimit_ucounts, which
    increments the ucounts and returns LONG_MAX if the ucount counter has
    exceeded its maximum or wrapped (to indicate that the counter needs to
    be decremented).

    I wish we had a single user to account all pending signals to across
    all of the threads of a process so this complexity was not necessary

    Cc: stable@vger.kernel.org
    Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
    Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
    Reviewed-by: Alexey Gladkov <legion@kernel.org>
    Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
2021-11-05 13:50:32 +01:00
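
The nested-counter problem described in commit 1cbc87c091 above can be
modelled in user space. The following is a simplified, single-threaded sketch
and not the kernel code: the kernel uses atomic counters and real reference
counting, and the names struct level, inc_get() and dec_put() are
hypothetical. It shows the idea of taking a reference at every level whose
count goes from 0 to 1, dropping it when the count returns to 0, and returning
0 when any level would exceed its maximum.

```c
#include <assert.h>
#include <stddef.h>

/* One level of a nested counter hierarchy (leaf first, root last). */
struct level {
    struct level *parent; /* NULL at the root */
    long count;           /* pending items charged to this level */
    long max;             /* limit for this level */
    long refs;            /* model of the object's reference count */
};

/* Undo one charge on every level from leaf up to, but excluding, stop. */
static void dec_put_until(struct level *leaf, struct level *stop)
{
    for (struct level *it = leaf; it != stop; it = it->parent) {
        it->count--;
        assert(it->count >= 0);
        if (it->count == 0)
            it->refs--; /* last charge gone: drop the extra reference */
    }
}

/* Charge one item to every level; returns the new leaf count, 0 on failure. */
static long inc_get(struct level *leaf)
{
    long ret = 0;

    for (struct level *it = leaf; it != NULL; it = it->parent) {
        long new_count = ++it->count;

        if (new_count > it->max) {
            it->count--;             /* undo the level that overflowed */
            dec_put_until(leaf, it); /* undo every level below it */
            return 0;
        }
        if (it == leaf)
            ret = new_count;
        if (new_count == 1)
            it->refs++; /* first charge: pin this level */
    }
    return ret;
}

/* Release one item previously charged with inc_get(). */
static void dec_put(struct level *leaf)
{
    dec_put_until(leaf, NULL);
}
```

Because the whole walk lives inside one pair of functions, a failure at a
parent level can be unwound for every level already touched, which is the
property the commit message asks for.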
Alexey Gladkov f3791f4df5 Fix UCOUNT_RLIMIT_SIGPENDING counter leak
We must properly handle errors when we increase the rlimit counter
and the ucounts reference counter. We have to do this with RCU protection
to prevent a possible use-after-free that could occur due to a concurrent
put_cred_rcu().

The following reproducer triggers the problem:

  $ cat testcase.sh
  case "${STEP:-0}" in
  0)
	ulimit -Si 1
	ulimit -Hi 1
	STEP=1 unshare -rU "$0"
	killall sleep
	;;
  1)
	for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
	;;
  esac

with the KASAN report being along the lines of

  BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
  Write of size 4 at addr ffff8880045f031c by task swapper/2/0

  CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
  Call Trace:
   <IRQ>
   put_ucounts+0x17/0xa0
   put_cred_rcu+0xd5/0x190
   rcu_core+0x3bf/0xcb0
   __do_softirq+0xe3/0x341
   irq_exit_rcu+0xbe/0xe0
   sysvec_apic_timer_interrupt+0x6a/0x90
   </IRQ>
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   default_idle_call+0x53/0x130
   do_idle+0x311/0x3c0
   cpu_startup_entry+0x14/0x20
   secondary_startup_64_no_verify+0xc2/0xcb

  Allocated by task 127:
   kasan_save_stack+0x1b/0x40
   __kasan_kmalloc+0x7c/0x90
   alloc_ucounts+0x169/0x2b0
   set_cred_ucounts+0xbb/0x170
   ksys_unshare+0x24c/0x4e0
   __x64_sys_unshare+0x16/0x20
   do_syscall_64+0x37/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xae

  Freed by task 0:
   kasan_save_stack+0x1b/0x40
   kasan_set_track+0x1c/0x30
   kasan_set_free_info+0x20/0x30
   __kasan_slab_free+0xeb/0x120
   kfree+0xaa/0x460
   put_cred_rcu+0xd5/0x190
   rcu_core+0x3bf/0xcb0
   __do_softirq+0xe3/0x341

  The buggy address belongs to the object at ffff8880045f0300
   which belongs to the cache kmalloc-192 of size 192
  The buggy address is located 28 bytes inside of
   192-byte region [ffff8880045f0300, ffff8880045f03c0)
  The buggy address belongs to the page:
  page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
  flags: 0x100000000000200(slab|node=0|zone=1)
  raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
  raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
   ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================
  Disabling lock debugging due to kernel taint

Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:43:24 -07:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Al Viro 97c885d585 x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings.  And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered.  And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.

It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e.  in signal_delivered().

Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/
Link: https://github.com/ClangBuiltLinux/linux/issues/876
Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:06 -07:00