JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271
commit 9e05e5c7ee8758141d2db7e8fea2cab34500c6ed
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date: Mon Nov 4 19:54:19 2024 +0000
signal: restore the override_rlimit logic
Prior to commit d646969055 ("Reimplement RLIMIT_SIGPENDING on top of
ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of
signals. However now it's enforced unconditionally, even if
override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV
signal, sigqueue_alloc fails to allocate the necessary resources for the
signal delivery, preventing the signal from being delivered with siginfo.
This prevents the process from correctly identifying the fault address and
handling the error. From the user-space perspective, applications are
unaware that the limit has been reached and that the siginfo is
effectively 'corrupted'. This can lead to unpredictable behavior and
crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip
the comparison to max there if override_rlimit is set. This effectively
restores the old behavior.
Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 5f0bc0b042fc77ff70e14c790abdec960cde4ec1
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Jul 25 09:38:32 2023 -0700
mm: suppress mm fault logging if fatal signal already pending
Commit eda0047296a1 ("mm: make the page fault mmap locking killable")
intentionally made it much easier to trigger the "page fault fails
because a fatal signal is pending" situation, by having the mmap locking
fail early in that case.
We have long aborted page faults in other fatal cases when the actual IO
for a page is interrupted by SIGKILL - which is particularly useful for
the traditional case of NFS hanging due to network issues, but local
filesystems could cause it too if you happened to get the SIGKILL while
waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()).
So aborting the page fault wasn't a new condition - but it now triggers
earlier, before we even get to 'handle_mm_fault()'. And as a result the
error doesn't go through our 'fault_signal_pending()' logic, and doesn't
get filtered away there.
Normally you'd never even notice, because if a fatal signal is pending,
the new SIGSEGV we send ends up being ignored anyway.
But it turns out that there is one very noticeable exception: if you
enable 'show_unhandled_signals', the aborted page fault will be logged
in the kernel messages, and you'll get a scary line looking something
like this in your logs:
pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)
which is rather misleading. It's not really a segfault at all, it's
just "the thread was killed before the page fault completed, so we
aborted the page fault".
Fix this by just making it clear that a pending fatal signal means that
any new signal coming in after that is implicitly handled. This will
avoid the misleading logging, since now the signal isn't 'unhandled' any
more.
Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: eda0047296a1 ("mm: make the page fault mmap locking killable")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-36683
Upstream Status: RHEL only
This reverts commit 08637d76a2 which is a
revert of "Merge: cgroup: Backport upstream cgroup commits up to v6.8"
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-34600
Conflicts:
1) A merge conflict in the kernel/signal.c hunk due to the presence
of RHEL-only commit 975d318867 ("signal: Don't disable preemption
in ptrace_stop() on PREEMPT_RT.").
2) A merge conflict in the kernel/time/hrtimer.c hunk due to the
presence of RHEL-only commit 5f76194136 ("time/hrtimer: Embed
hrtimer mode into hrtimer_sleeper").
3) The fs/cifs/inode.c hunk was applied to fs/smb/client/inode.c due
to the presence of upstream commit 38c8a9a52082 ("smb: move client
and server files to common directory fs/smb").
4) Similarly, the fs/cifs/transport.c hunk was applied to
fs/smb/client/transport.c manually due to the presence of
a later upstream commit d527f51331ca ("cifs: Fix UAF in
cifs_demultiplex_thread()").
Note that all the prerequiste patches in the same patch series
(https://lore.kernel.org/lkml/20220822111816.760285417@infradead.org/)
had already been merged into RHEL9.
commit f5d39b020809146cc28e6e73369bf8065e0310aa
Author: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 22 Aug 2022 13:18:22 +0200
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count();
schedule();
freezer_count();
And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN
__TASK_STOPPED -> TASK_FROZEN
__TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3988
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
commit bf1069f8c099d4c10e2f884dc72a83a4653fc6e4
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Thu Aug 3 12:09:31 2023 +0200
signal: Add proper comment about the preempt-disable in ptrace_stop().
Commit 53da1d9456 ("fix ptrace slowness") added a preempt-disable section
between read_unlock() and the following schedule() invocation without
explaining why it is needed.
Replace the comment with an explanation why this is needed. Clarify that
it is needed for correctness but for performance reasons.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20230803100932.325870-2-bigeasy@linutronix.de
Signed-off-by: Eder Zulian <ezulian@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit a382f8fee42ca10c9bfce0d2352d4153f931f5dc
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Jul 6 12:20:59 2022 -0700
signal handling: don't use BUG_ON() for debugging
These are indeed "should not happen" situations, but it turns out recent
changes made the 'task_is_stopped_or_trace()' case trigger (fix for that
exists, is pending more testing), and the BUG_ON() makes it
unnecessarily hard to actually debug for no good reason.
It's been that way for a long time, but let's make it clear: BUG_ON() is
not good for debugging, and should never be used in situations where you
could just say "this shouldn't happen, but we can continue".
Use WARN_ON_ONCE() instead to make sure it gets logged, and then just
continue running. Instead of making the system basically unusuable
because you crashed the machine while potentially holding some very core
locks (eg this function is commonly called while holding 'tasklist_lock'
for writing).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
Omitted-fix: 3418357a32db ("ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()")
That fix duplicates de2a34771f51 included in this series
commit 31cae1eaae4fd65095ad6a3659db467bc3c2599e
Author: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 3 15:57:47 2022 -0500
sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
Currently ptrace_stop() / do_signal_stop() rely on the special states
TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
state exists only in task->__state and nowhere else.
There's two spots of bother with this:
- PREEMPT_RT has task->saved_state which complicates matters,
meaning task_is_{traced,stopped}() needs to check an additional
variable.
- An alternative freezer implementation that itself relies on a
special TASK state would loose TASK_TRACED/TASK_STOPPED and will
result in misbehaviour.
As such, add additional state to task->jobctl to track this state
outside of task->__state.
NOTE: this doesn't actually fix anything yet, just adds extra state.
--EWB
* didn't add a unnecessary newline in signal.h
* Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
instead of in signal_wake_up_state. This prevents the clearing
of TASK_STOPPED and TASK_TRACED from getting lost.
* Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220421150654.757693825@infradead.org
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-12-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 2500ad1c7fa42ad734677853961a3a8bec0772c5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Apr 29 08:43:34 2022 -0500
ptrace: Don't change __state
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.
Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and
implement a new jobctl flag TASK_PTRACE_FROZEN. This new flag is set
in jobctl_freeze_task and cleared when ptrace_stop is awoken or in
jobctl_unfreeze_task (when ptrace_stop remains asleep).
In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when the wake up is for a fatal signal. Skip adding __TASK_TRACED
when TASK_PTRACE_FROZEN is not set. This has the same effect as
changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use
TASK_KILLABLE go through signal_wake_up.
Handle a ptrace_stop being called with a pending fatal signal.
Previously it would have been handled by schedule simply failing to
sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED schedule
will sleep with a fatal_signal_pending. The code in signal_wake_up
guarantees that the code will be awaked by any fatal signal that
codes after TASK_TRACED is set.
Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop. Now when woken up ptrace_stop now clears
JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
clears JOBCTL_PTRACE_FROZEN.
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 57b6de08b5f6586851c2261ef0cc16cd275615e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed May 4 13:39:58 2022 -0500
ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
Long ago and far away there was a BUG_ON at the start of ptrace_stop
that did "BUG_ON(!(current->ptrace & PT_PTRACED));" [1]. The BUG_ON
had never triggered but examination of the code showed that the BUG_ON
could actually trigger. To complement removing the BUG_ON an attempt
to better handle the race was added.
The code detected the tracer had gone away and did not call
do_notify_parent_cldstop. The code also attempted to prevent
ptrace_report_syscall from sending spurious SIGTRAPs when the tracer
went away.
The code to detect when the tracer had gone away before sending a
signal to tracer was a legitimate fix and continues to work to this
date.
The code to prevent sending spurious SIGTRAPs is a failure. At the
time and until today the code only catches it when the tracer goes
away after siglock is dropped and before read_lock is acquired. If
the tracer goes away after read_lock is dropped a spurious SIGTRAP can
still be sent to the tracee. The tracer going away after read_lock
is dropped is the far likelier case as it is the bigger window.
Given that the attempt to prevent the generation of a SIGTRAP was a
failure and continues to be a failure remove the code that attempts to
do that. This simplifies the code in ptrace_stop and makes
ptrace_stop much easier to reason about.
To successfully deal with the tracer going away, all of the tracer's
instrumentation of the child would need to be removed, and reliably
detecting when the tracer has set a signal to continue with would need
to be implemented.
[1] 66519f549ae5 ("[PATCH] fix ptracer death race yielding bogus BUG_ON")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit cb3c19c93d656caa6fe63d6277aabd7e570f1d03
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Apr 29 09:16:10 2022 -0500
signal: Use lockdep_assert_held instead of assert_spin_locked
The distinction is that assert_spin_locked() checks if the lock is
held *by*anyone* whereas lockdep_assert_held() asserts the current
context holds the lock. Also, the check goes away if you build
without lockdep.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Ympr/+PX4XgT/UKU@hirez.programming.kicks-ass.net
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-6-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit e71ba124078e391879e0bf111529fa2d630d106c
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Apr 22 09:28:50 2022 -0500
signal: Replace __group_send_sig_info with send_signal_locked
The function __group_send_sig_info is just a light wrapper around
send_signal_locked with one parameter fixed to a constant value. As
the wrapper adds no real value update the code to directly call the
wrapped function.
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 6487d1dab837214ec2fd3f0ddd5f787e63be7c20
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Jan 27 12:19:13 2022 -0600
ptrace: Return the signal to continue with from ptrace_stop
The signal a task should continue with after a ptrace stop is
inconsistently read, cleared, and sent. Solve this by reading and
clearing the signal to be sent in ptrace_stop.
In an ideal world everything except ptrace_signal would share a common
implementation of continuing with the signal, so ptracers could count
on the signal they ask to continue with actually being delivered. For
now retain bug compatibility and just return with the signal number
the ptracer requested the code continue with.
Link: https://lkml.kernel.org/r/875yoe7qdp.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
commit 336d4b814bf078fa698488632c19beca47308896
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Jan 27 12:15:32 2022 -0600
ptrace: Move setting/clearing ptrace_message into ptrace_stop
Today ptrace_message is easy to overlook as it not a core part of
ptrace_stop. It has been overlooked so much that there are places
that set ptrace_message and don't clear it, and places that never set
it. So if you get an unlucky sequence of events the ptracer may be
able to read a ptrace_message that does not apply to the current
ptrace stop.
Move setting of ptrace_message into ptrace_stop so that it always gets
set before the stop, and always gets cleared after the stop. This
prevents non-sense from being reported to userspace and makes
ptrace_message more visible in the ptrace helper functions so that
kernel developers can see it.
Link: https://lkml.kernel.org/r/87bky67qfv.fsf_-_@email.froward.int.ebiederm.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174325
Upstream Status: RHEL-only
This reverts commit 975d318867.
Because it doesn't match upstream and thus conflicts with other necessary
changes. This fix will be re-introduced at the end of this series.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2171995
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
Conflicts: Whitespace mismatch and missing series at merge commit
67850b7bdcd2 ("Merge tag 'ptrace_stop-cleanup-for-v5.19'"),
which seems a nice-have, but not essential.
commit 277af213394c063b8e2b8a712e11d41866d456d9
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed Jun 22 11:36:17 2022 +0200
signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT.
Commit
53da1d9456 ("fix ptrace slowness")
is just band aid around the problem.
The invocation of do_notify_parent_cldstop() wakes the parent and makes
it runnable. The scheduler then wants to replace this still running task
with the parent. With the read_lock() acquired this is not possible
because preemption is disabled and so this is deferred until read_unlock().
This scheduling point is undesired and is avoided by disabling preemption
around the unlock operation enabled again before the schedule() invocation
without a preemption point.
This is only undesired because the parent sleeps a cycle in
wait_task_inactive() until the traced task leaves the run-queue in
schedule(). It is not a correctness issue, it is just band aid to avoid the
visbile delay which sums up over multiple invocations.
The task can still be preempted if an interrupt occurs between
preempt_enable_no_resched() and freezable_schedule() because on the IRQ-exit
path of the interrupt scheduling _will_ happen. This is ignored since it does
not happen very often.
On PREEMPT_RT keeping preemption disabled during the invocation of
cgroup_enter_frozen() becomes a problem because the function acquires
css_set_lock which is a sleeping lock on PREEMPT_RT and must not be
acquired with disabled preemption.
Don't disable preemption on PREEMPT_RT. Remove the TODO regarding adding
read_unlock_no_resched() as there is no need for it and will cause harm.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220720154435.232749-2-bigeasy@linutronix.de
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 8ba62d37949e248c698c26e0d82d72fda5d33ebf
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Feb 9 09:51:14 2022 -0600
task_work: Call tracehook_notify_signal from get_signal on all architectures
Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3
("task_work: unconditionally run task_work from get_signal()") always
calling task_work_run all of the work of tracehook_notify_signal is
already happening except clearing TIF_NOTIFY_SIGNAL.
Factor clear_notify_signal out of tracehook_notify_signal and use it in
get_signal so that get_signal only needs one call of task_work_run.
To keep the semantics in sync update xfer_to_guest_mode_work (which
does not call get_signal) to call tracehook_notify_signal if either
_TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 7f62d40d9cb50fd146fe8ff071f98fa3c1855083
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Feb 9 08:52:41 2022 -0600
task_work: Introduce task_work_pending
Wrap the test of task->task_works in a helper function to make
it clear what is being tested.
All of the other readers of task->task_work use READ_ONCE and this is
even necessary on current as other processes can update
task->task_work. So for consistency I have added READ_ONCE into
task_work_pending.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit c145137dc990fd67b52fbc52faae5ba46f168cca
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Jan 27 12:04:27 2022 -0600
ptrace: Remove tracehook_signal_handler
The two line function tracehook_signal_handler is only called from
signal_delivered. Expand it inline in signal_delivered and remove it.
Just to make it easier to understand what is going on.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 5c72263ef2fbe99596848f03758ae2dc593adf2c
Author: Kees Cook <keescook@chromium.org>
Date: Tue Feb 8 00:57:17 2022 -0800
signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
were not being delivered to ptraced pid namespace init processes. Make
sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
Reported-by: Robert Święcki <robert@swiecki.net>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 6410349ea5e177f3e53c2006d2041eed47e986ae
Author: Randy Dunlap <rdunlap@infradead.org>
Date: Tue Dec 21 19:10:27 2021 -0800
signal: clean up kernel-doc comments
Fix kernel-doc warnings in kernel/signal.c:
kernel/signal.c:1830: warning: Function parameter or member 'force_coredump' not described in 'force_sig_seccomp'
kernel/signal.c:2873: warning: missing initial short description on line:
* signal_delivered -
Also add a closing parenthesis to the comments in signal_delivered().
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Richard Weinberger <richard@nod.at>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20211222031027.29694-1-rdunlap@infradead.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 49697335e0b441b0553598c1b48ee9ebb053d2f1
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Jun 24 02:14:30 2021 -0500
signal: Remove the helper signal_group_exit
This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.
Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the the exit
code in do_group_exit during exec.
Removing the helper makes it more obvious what is going on as both
cases must be coded for explicitly.
While removing the helper fix the two cases where I have observed
using signal_group_exit resulted in the wrong result.
In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
retargetted during an exec.
In do_group_exit use 0 as the exit code during an exec as de_thread
does not set group_exit_code. As best as I can determine
group_exit_code has been is set to 0 most of the time during
de_thread. During a thread group stop group_exit_code is set to the
stop signal and when the thread group receives SIGCONT group_exit_code
is reset to 0.
Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 2f824d4d197e02275562359a2ae5274177ce500c
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Sat Jan 8 09:48:31 2022 -0600
signal: Remove SIGNAL_GROUP_COREDUMP
After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenver the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.
Which makes SIGNAL_GROUP_COREDUMP unecessary and redundant. So stop
setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
remove it's definition.
With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.
Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 7ba03471ac4ad2432e5ccf67d9d4ab03c177578a
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Sat Jan 8 11:01:12 2022 -0600
signal: Make coredump handling explicit in complete_signal
Ever since commit 6cd8f0acae ("coredump: ensure that SIGKILL always
kills the dumping thread") it has been possible for a SIGKILL received
during a coredump to set SIGNAL_GROUP_EXIT and trigger a process
shutdown (for a second time).
Update the logic to explicitly allow coredumps so that coredumps can
set SIGNAL_GROUP_EXIT and shutdown like an ordinary process.
Link: https://lkml.kernel.org/r/87zgo6ytyf.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit a0287db0f1d6918919203ba31fd7cda59bf889e8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Sat Jan 8 09:34:50 2022 -0600
signal: Have prepare_signal detect coredumps using signal->core_state
In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
prepare_signal to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.
Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.
Link: https://lkml.kernel.org/r/20211213225350.27481-1-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/875yqu14co.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Conflicts: drop changes to arch/m68k/kernel/traps.c,
arch/sparc/kernel/signal_32.c, arch/sparc/kernel/windows.c -
unsupported arches
Bugzilla: https://bugzilla.redhat.com/2120352
commit fcb116bc43c8c37c052530ead79872f8b2615711
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Nov 18 14:23:21 2021 -0600
signal: Replace force_fatal_sig with force_exit_sig when in doubt
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do n
ot get changed")
Fixes: a3616a3c0272 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
")
Fixes: 83a1f27ad773 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf0791 ("signal/s390: Use force_sigsegv in default_trap_handler
")
Fixes: 086ec444f866 ("signal/sparc32: In setup_rt_frame and setup_fram use f
orce_fatal_sig")
Fixes: c317d306d550 ("signal/sparc32: Exit with a fatal signal when try_to_c
lear_window_buffer fails")
Fixes: 695dd0d634df ("signal/x86: In emulate_vsyscall force a signal instead
of calling do_exit")
Fixes: 1fbd60df8a85 ("signal/vm86_32: Properly send SIGSEGV when the vm86 st
ate cannot be saved.")
Fixes: 941edc5bf174 ("exit/syscall_user_dispatch: Send ordinary signals on f
ailure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm
.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit e349d945fac76bddc78ae1cb92a0145b427a87ce
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Nov 18 11:11:13 2021 -0600
signal: Don't always set SA_IMMUTABLE for forced signals
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect to be
able to trap synchronous SIGTRAP and SIGSEGV even when the target
process is not configured to handle those signals.
Update force_sig_to_task to support both the case when we can allow
the debugger to intercept and possibly ignore the signal and the case
when it is not safe to let userspace know about the signal until the
process has exited.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: stable@vger.kernel.org
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Link: https://lkml.kernel.org/r/877dd5qfw5.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit b171f667f3787946a8ba9644305339e93ae799c9
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon Nov 15 13:49:45 2021 -0600
signal: Requeue ptrace signals
Kyle Huey <me@kylehuey.com> writes:
> rr, a userspace record and replay debugger[0], uses the recorded register
> state at PTRACE_EVENT_EXIT to find the point in time at which to cease
> executing the program during replay.
>
> If a SIGKILL races with processing another signal in get_signal, it is
> possible for the kernel to decline to notify the tracer of the original
> signal. But if the original signal had a handler, the kernel proceeds
> with setting up a signal handler frame as if the tracer had chosen to
> deliver the signal unmodified to the tracee. When the kernel goes to
> execute the signal handler that it has now modified the stack and registers
> for, it will discover the pending SIGKILL, and terminate the tracee
> without executing the handler. When PTRACE_EVENT_EXIT is delivered to
> the tracer, however, the effects of handler setup will be visible to
> the tracer.
>
> Because rr (the tracer) was never notified of the signal, it is not aware
> that a signal handler frame was set up and expects the state of the program
> at PTRACE_EVENT_EXIT to be a state that will be reconstructed naturally
> by allowing the program to execute from the last event. When that fails
> to happen during replay, rr will assert and die.
>
> The following patches add an explicit check for a newly pending SIGKILL
> after the ptracer has been notified and the siglock has been reacquired.
> If this happens, we stop processing the current signal and proceed
> immediately to handling the SIGKILL. This makes the state reported at
> PTRACE_EVENT_EXIT the unmodified state of the program, and also avoids the
> work to set up a signal handler frame that will never be used.
>
> [0] https://rr-project.org/
The problem is that while the traced process makes it into ptrace_stop,
the tracee is killed before the tracer manages to wait for the
tracee and discover which signal was about to be delivered.
More generally the problem is that while siglock was dropped a signal
with process wide effect is short cirucit delivered to the entire
process killing it, but the process continues to try and deliver another
signal.
In general it impossible to avoid all cases where work is performed
after the process has been killed. In particular if the process is
killed after get_signal returns the code will simply not know it has
been killed until after delivering the signal frame to userspace.
On the other hand when the code has already discovered the process
has been killed and taken user space visible action that shows
the kernel knows the process has been killed, it is just silly
to then write the signal frame to the user space stack.
Instead of being silly detect the process has been killed
in ptrace_signal and requeue the signal so the code can pretend
it was simply never dequeued for delivery.
To test the process has been killed I use fatal_signal_pending rather
than signal_group_exit to match the test in signal_pending_state which
is used in schedule which is where ptrace_stop detects the process has
been killed.
Requeuing the signal so the code can pretend it was simply never
dequeued improves the user space visible behavior that has been
present since ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4").
Kyle Huey verified that this change in behavior and makes rr happy.
Reported-by: Kyle Huey <khuey@kylehuey.com>
Reported-by: Marko Mäkelä <marko.makela@mariadb.com>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87tugcd5p2.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 5768d8906bc23d512b1a736c1e198aa833a6daa4
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon Nov 15 13:47:13 2021 -0600
signal: Requeue signals in the appropriate queue
In the event that a tracer changes which signal needs to be delivered
and that signal is currently blocked then the signal needs to be
requeued for later delivery.
With the advent of CLONE_THREAD the kernel has 2 signal queues per
task. The per process queue and the per task queue. Update the code
so that if the signal is removed from the per process queue it is
requeued on the per process queue. This is necessary to make it
appear the signal was never dequeued.
The rr debugger reasonably believes that the state of the process from
the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
by simply letting a process run. If a SIGKILL interrupts a ptrace_stop
this is not true today.
So return signals to their original queue in ptrace_signal so that
signals that are not delivered appear like they were never dequeued.
Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 00b06da29cf9dc633cdba87acd3f57f4df3fd5c7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri Oct 29 09:14:19 2021 -0500
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
As Andy pointed out that there are races between
force_sig_info_to_task and sigaction[1] when force_sig_info_task. As
Kees discovered[2] ptrace is also able to change these signals.
In the case of seeccomp killing a process with a signal it is a
security violation to allow the signal to be caught or manipulated.
Solve this problem by introducing a new flag SA_IMMUTABLE that
prevents sigaction and ptrace from modifying these forced signals.
This flag is carefully made kernel internal so that no new ABI is
introduced.
Longer term I think this can be solved by guaranteeing short circuit
delivery of signals in this case. Unfortunately reliable and
guaranteed short circuit delivery of these signals is still a ways off
from being implemented, tested, and merged. So I have implemented a much
simpler alternative for now.
[1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com
[2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook
Cc: stable@vger.kernel.org
Fixes: 307d522f5eb8 ("signal/seccomp: Refactor seccomp signal and coredump generation")
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 26d5badbccddcc063dc5174a2baffd13a23322aa
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Oct 20 12:43:59 2021 -0500
signal: Implement force_fatal_sig
Add a simple helper force_fatal_sig that causes a signal to be
delivered to a process as if the signal handler was set to SIG_DFL.
Reimplement force_sigsegv based upon this new helper. This fixes
force_sigsegv so that when it forces the default signal handler
to be used the code now forces the signal to be unblocked as well.
Reusing the tested logic in force_sig_info_to_task that was built for
force_sig_seccomp this makes the implementation trivial.
This is interesting both because it makes force_sigsegv simpler and
because there are a couple of buggy places in the kernel that call
do_exit(SIGILL) or do_exit(SIGSYS) because there is no straight
forward way today for those places to simply force the exit of a
process with the chosen signal. Creating force_fatal_sig allows
those places to be implemented with normal signal exits.
Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 92307383082daff5df884a25df9e283efb7ef261
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Sep 1 11:33:50 2021 -0500
coredump: Don't perform any cleanups before dumping core
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
before PTRACE_EVENT_EXIT, and before any cleanup work for a task
happens. This ensures that an accurate copy of the process can be
captured in the coredump as no cleanup for the process happens before
the coredump completes. This also ensures that PTRACE_EVENT_EXIT
will not be visited by any thread until the coredump is complete.
Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
coredump_task_exit can be recognized and ignored in zap_process.
Now that all of the coredumping happens before exit_mm remove code to
test for a coredump in progress from mm_release.
Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
The other tests in may_ptrace_stop all concern avoiding stopping
during a coredump. These tests are no longer necessary as it is now
guaranteed that fatal_signal_pending will be set if the code enters
ptrace_stop during a coredump. The code in ptrace_stop is guaranteed
not to stop if fatal_signal_pending returns true.
Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
ptrace_stop without fatal_signal_pending being true, as signals are
dequeued in get_signal before calling do_exit. This is no longer
an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
until after the coredump completes.
Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 4f627af8e6068892cafe031df6c14e8a0aaaa426
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Thu Sep 2 16:10:11 2021 -0500
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
exit_code and a siginfo structure. Neither argument is used by any of
the implementations so just remove the unneeded arguments.
The two arechitectures that implement arch_ptrace_stop are ia64 and
sparc. Both architectures flush their register stacks before a
ptrace_stack so that all of the register information can be accessed
by debuggers.
As the question of if a register stack needs to be flushed is
independent of why ptrace is stopping not needing arguments make sense.
Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 7d613f9f72ec8f90ddefcae038fdae5adb8404b3
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Sep 1 13:21:34 2021 -0500
signal: Remove the bogus sigkill_pending in ptrace_stop
The existence of sigkill_pending is a little silly as it is
functionally a duplicate of fatal_signal_pending that is used in
exactly one place.
Checking for pending fatal signals and returning early in ptrace_stop
is actively harmful. It casues the ptrace_stop called by
ptrace_signal to return early before setting current->exit_code.
Later when ptrace_signal reads the signal number from
current->exit_code is undefined, making it unpredictable what will
happen.
Instead rely on the fact that schedule will not sleep if there is a
pending signal that can awaken a task.
Removing the explict sigkill_pending test fixes fixes ptrace_signal
when ptrace_stop does not stop because current->exit_code is always
set to to signr.
Cc: stable@vger.kernel.org
Fixes: 3d749b9e67 ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
Fixes: 1a669c2f16 ("Add arch_ptrace_stop")
Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120352
commit 307d522f5eb86cd6ac8c905f5b0577dedac54ec5
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Wed Jun 23 16:44:32 2021 -0500
signal/seccomp: Refactor seccomp signal and coredump generation
Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c. The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should
be reset to SIGDFL so that a coredump will be generated when the
signal is delivered.
force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.
force_sig_info_to_task gains an extra parameter to force using
the default signal action.
With this change seccomp is no longer a special case and there
becomes exactly one place do_coredump is called from.
Further it no longer becomes necessary for __seccomp_filter
to call do_group_exit.
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2123231
upstream
========
commit 78ed93d72ded679e3caf0758357209887bda885f
Author: Marco Elver <elver@google.com>
Date: Mon Apr 4 13:12:04 2022 +0200
description
===========
With SIGTRAP on perf events, we have encountered termination of
processes due to user space attempting to block delivery of SIGTRAP.
Consider this case:
<set up SIGTRAP on a perf event>
...
sigset_t s;
sigemptyset(&s);
sigaddset(&s, SIGTRAP | <and others>);
sigprocmask(SIG_BLOCK, &s, ...);
...
<perf event triggers>
When the perf event triggers, while SIGTRAP is blocked, force_sig_perf()
will force the signal, but revert back to the default handler, thus
terminating the task.
This makes sense for error conditions, but not so much for explicitly
requested monitoring. However, the expectation is still that signals
generated by perf events are synchronous, which will no longer be the
case if the signal is blocked and delivered later.
To give user space the ability to clearly distinguish synchronous from
asynchronous signals, introduce siginfo_t::si_perf_flags and
TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
required in future).
The resolution to the problem is then to (a) no longer force the signal
(avoiding the terminations), but (b) tell user space via si_perf_flags
if the signal was synchronous or not, so that such signals can be
handled differently (e.g. let user space decide to ignore or consider
the data imprecise).
The alternative of making the kernel ignore SIGTRAP on perf events if
the signal is blocked may work for some usecases, but likely causes
issues in others that then have to revert back to interception of
sigprocmask() (which we want to avoid). [ A concrete example: when using
breakpoint perf events to track data-flow, in a region of code where
signals are blocked, data-flow can no longer be tracked accurately.
When a relevant asynchronous signal is received after unblocking the
signal, the data-flow tracking logic needs to know its state is
imprecise. ]
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Conflicts:
==========
Ignoring hunks in arm32, m68k and sparc files, since we don't support
these architectures.
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2120671
commit e7f7c99ba911f56bc338845c1cd72954ba591707
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon Nov 15 11:55:57 2021 -0600
signal: In get_signal test for signal_group_exit every time through the loop
Recently while investigating a problem with rr and signals I noticed
that siglock is dropped in ptrace_signal and get_signal does not jump
to relock.
Looking farther to see if the problem is anywhere else I see that
do_signal_stop also returns if signal_group_exit is true. I believe
that test can now never be true, but it is a bit hard to trace
through and be certain.
Testing signal_group_exit is not expensive, so move the test for
signal_group_exit into the for loop inside of get_signal to ensure
the test is never skipped improperly.
This has been a potential problem since I added the test for
signal_group_exit was added.
Fixes: 35634ffa17 ("signal: Always notice exiting tasks")
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/174
This is the first draft of the commits which are required to support
AMX (aka TMUL) on SPR. I suspect there are some missing commits.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2004190
Tested: Intel to perform initial testing.
v2: added missing upstream commits 21e96a2035db and 52d0b8b18776
Signed-off-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Steve Best <sbest@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/201
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396
Brew URL: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=41434412
Testing: KT1-lite + regressions and performance (scheduler, network, and fs)
benchmarks, as documented on the BZ.
In order to provide support for several future feature requests (virtio-mem,
filesystems, core-kernel and memory management) targeted for RHEL-9,
the patchset is bringing the core-MM codebase up to upstream v5.15.
This patchset is composed of upstream cherry picks that represent the
difference between current RHEL-9 v5.14 code base and upstream v5.15 plus
their relevant follow-up fixes.
Omitted-fix: 15eb7c888e749 ("locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT")
already backported into RHEL9 via commit de3eb21475
Omitted-fix: 6341eb6f39bb7 ("drm/i915/selftests: exercise shmem_writeback with THP")
dependencies for this selftest follow up (and the follow-up itself)
shall be dealt with via DRM update work done by the graphics team.
Omitted-fix: f24b062607678 ("mm/damon: grammar s/works/work/")
Omitted-fix: db7a347b26fe0 ("mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation")
Omitted-fix: d78f3853f831e ("mm/damon/dbgfs: fix missed use of damon_dbgfs_lock")
albeit DAMON initial integration is part of v5.15, we're explicitly
introducing it disabled in this backport. DAMON follow-ups, and future
enablement will be dealt with via a separated (already filed) BZ ticket.
Omitted-fix: e66435936756d ("mm: fix mismerge of folio page flag manipulators")
folio pages are a feature integrated into v5.16, and this merge-fix
commit is non-relevant to this particular patchset.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: John W. Linville <linville@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Aristeu Rozanski <arozansk@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>
RH-Acked-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/145
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2022896
Upstream Status: Linux
Tested: Sanity tested with timer and scheduler tests
hrtimer updates for RT prerequisites
This is a series for hrtimer and related code that enable the
RT patchset to merge more cleanly.
Signed-off-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Bugzilla: http://bugzilla.redhat.com/2004190
commit 1bdda24c4af64cd2d65dec5192ab624c5fee7ca0
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Thu Oct 21 15:55:05 2021 -0700
signal: Add an optional check for altstack size
New x86 FPU features will be very large, requiring ~10k of stack in
signal handlers. These new features require a new approach called
"dynamic features".
The kernel currently tries to ensure that altstacks are reasonably
sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
However, that 2k is a constant. Simply raising that 2k requirement
to >10k for the new features would break existing apps which have a
compiled-in size of 2k.
Instead of universally enforcing a larger stack, prohibit a process from
using dynamic features without properly-sized altstacks. This must be
enforced in two places:
* A dynamic feature can not be enabled without an large-enough altstack
for each process thread.
* Once a dynamic feature is enabled, any request to install a too-small
altstack will be rejected
The dynamic feature enabling code must examine each thread in a
process to ensure that the altstacks are large enough. Add a new lock
(sigaltstack_lock()) to ensure that threads can not race and change
their altstack after being examined.
Add the infrastructure in form of a config option and provide empty
stubs for architectures which do not need dynamic altstack size checks.
This implementation will be fleshed out for x86 in a future patch called
x86/arch_prctl: Add controls for dynamic XSTATE components
[dhansen: commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com
Signed-off-by: David Arcari <darcari@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396
This patch is a backport of the following upstream commit:
commit 5f58c39819ff78ca5ddbba2b3cd8ff4779b19bb5
Author: Vasily Averin <vvs@virtuozzo.com>
Date: Thu Sep 2 14:55:35 2021 -0700
memcg: enable accounting for signals
When a user send a signal to any another processes it forces the kernel to
allocate memory for 'struct sigqueue' objects. The number of signals is
limited by RLIMIT_SIGPENDING resource limit, but even the default settings
allow each user to consume up to several megabytes of memory.
It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Bugzilla: http://bugzilla.redhat.com/2022896
commit a5dec9f82ab2ae486119f0b0820ea16db3e522c3
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Mon Jul 26 14:55:08 2021 +0200
posix-cpu-timers: Assert task sighand is locked while starting cputime counter
Starting the process wide cputime counter needs to be done in the same
sighand locking sequence than actually arming the related timer otherwise
this races against concurrent timers setting/expiring in the same
threadgroup.
Detecting that the cputime counter is started without holding the sighand
lock is a first step toward debugging such situations.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2018142
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
commit 15bc01effefe97757ef02ca09e9d1b927ab22725
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Sat Oct 16 15:59:49 2021 -0500
ucounts: Fix signal ucount refcounting
In commit fda31c5029 ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and the struct user_struct. Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits. The problem is that the last decrement of the per user
namespace per user sigpending counter might also be the last decrement
of the sigpending counter in the parent user namespace as well. Which
means that simply freeing the leaf ucount in __free_sigqueue is not
enough.
Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.
The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount. This is different than inc_rlimit_ucounts which
increments the ucounts and returns LONG_MAX if the ucount counter has
exceeded it's maximum or it wrapped (to indicate the counter needs to
decremented).
I wish we had a single user to account all pending signals to across
all of the threads of a process so this complexity was not necessary
Cc: stable@vger.kernel.org
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
Tested-by: Yu Zhao <yuzhao@google.com>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
We must properly handle an errors when we increase the rlimit counter
and the ucounts reference counter. We have to this with RCU protection
to prevent possible use-after-free that could occur due to concurrent
put_cred_rcu().
The following reproducer triggers the problem:
$ cat testcase.sh
case "${STEP:-0}" in
0)
ulimit -Si 1
ulimit -Hi 1
STEP=1 unshare -rU "$0"
killall sleep
;;
1)
for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
;;
esac
with the KASAN report being along the lines of
BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
Write of size 4 at addr ffff8880045f031c by task swapper/2/0
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
Call Trace:
<IRQ>
put_ucounts+0x17/0xa0
put_cred_rcu+0xd5/0x190
rcu_core+0x3bf/0xcb0
__do_softirq+0xe3/0x341
irq_exit_rcu+0xbe/0xe0
sysvec_apic_timer_interrupt+0x6a/0x90
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20
default_idle_call+0x53/0x130
do_idle+0x311/0x3c0
cpu_startup_entry+0x14/0x20
secondary_startup_64_no_verify+0xc2/0xcb
Allocated by task 127:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc+0x7c/0x90
alloc_ucounts+0x169/0x2b0
set_cred_ucounts+0xbb/0x170
ksys_unshare+0x24c/0x4e0
__x64_sys_unshare+0x16/0x20
do_syscall_64+0x37/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 0:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0xeb/0x120
kfree+0xaa/0x460
put_cred_rcu+0xd5/0x190
rcu_core+0x3bf/0xcb0
__do_softirq+0xe3/0x341
The buggy address belongs to the object at ffff8880045f0300
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 28 bytes inside of
192-byte region [ffff8880045f0300, ffff8880045f03c0)
The buggy address belongs to the page:
page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
flags: 0x100000000000200(slab|node=0|zone=1)
raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Disabling lock debugging due to kernel taint
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings. And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered. And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.
It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e. in signal_delivered().
Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/
Link: https://github.com/ClangBuiltLinux/linux/issues/876
Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>