.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context when interrupts are enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or run with interrupts
disabled.
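
For example, a driver that shares data between its (threaded) interrupt handler
and process context can keep using the usual spinlock_t API. The sketch below
uses hypothetical names; on a non-PREEMPT_RT kernel the _irqsave() variant
disables interrupts, while on PREEMPT_RT the same code stays preemptible and
interrupts remain enabled:

.. code-block:: c

   #include <linux/spinlock.h>

   static DEFINE_SPINLOCK(my_lock);        /* hypothetical lock */
   static unsigned int my_counter;         /* hypothetical shared data */

   /* Called from both process context and the interrupt handler. */
   static void my_update(void)
   {
           unsigned long flags;

           spin_lock_irqsave(&my_lock, flags);
           my_counter++;
           spin_unlock_irqrestore(&my_lock, flags);
   }

Core code that genuinely needs a non-preemptible critical section would instead
declare the lock with DEFINE_RAW_SPINLOCK() and use the raw_spin_lock_*()
variants.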

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.
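
A minimal sketch of such a registration, using hypothetical names, might look
like this. With no primary handler and IRQF_ONESHOT set, the interrupt line
stays masked until the threaded handler has run:

.. code-block:: c

   #include <linux/interrupt.h>

   /* Threaded handler: runs in process context and may sleep. */
   static irqreturn_t my_irq_thread(int irq, void *dev_id)
   {
           /* Access the device, take sleeping locks if needed. */
           return IRQ_HANDLED;
   }

   static int my_request_irq(unsigned int irq, void *dev_id)
   {
           /*
            * NULL primary handler: the default hard IRQ handler only
            * wakes the thread; IRQF_ONESHOT keeps the line masked until
            * my_irq_thread() completes.
            */
           return request_threaded_irq(irq, NULL, my_irq_thread,
                                       IRQF_ONESHOT, "my-device", dev_id);
   }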

Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.
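
A minimal sketch, assuming a hypothetical per-CPU structure, could look like
this:

.. code-block:: c

   #include <linux/local_lock.h>
   #include <linux/percpu.h>

   struct my_pcpu_data {                   /* hypothetical per-CPU data */
           local_lock_t bh_lock;
           unsigned long counter;
   };

   static DEFINE_PER_CPU(struct my_pcpu_data, my_pcpu_data) = {
           .bh_lock = INIT_LOCAL_LOCK(bh_lock),
   };

   /* Called with bottom halves disabled (e.g. from softirq context). */
   static void my_bh_update(void)
   {
           /*
            * A lockdep annotation on non-PREEMPT_RT, a real per-CPU
            * lock on PREEMPT_RT.
            */
           local_lock_nested_bh(&my_pcpu_data.bh_lock);
           this_cpu_inc(my_pcpu_data.counter);
           local_unlock_nested_bh(&my_pcpu_data.bh_lock);
   }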

per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.
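
A minimal sketch of a local_lock_t protecting per-CPU data, again with
hypothetical names, might look like this:

.. code-block:: c

   #include <linux/local_lock.h>
   #include <linux/percpu.h>

   struct my_stats {                       /* hypothetical per-CPU data */
           local_lock_t lock;
           unsigned long packets;
   };

   static DEFINE_PER_CPU(struct my_stats, my_stats) = {
           .lock = INIT_LOCAL_LOCK(lock),
   };

   static void my_stats_inc(void)
   {
           /*
            * Disables preemption on non-PREEMPT_RT; acquires a per-CPU
            * spinlock_t on PREEMPT_RT while staying preemptible.
            */
           local_lock(&my_stats.lock);
           this_cpu_inc(my_stats.packets);
           local_unlock(&my_stats.lock);
   }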

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and if local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.
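
As a sketch, assume a hypothetical per-CPU counter that is updated from a
section which already holds a spinlock_t, and which is therefore
non-preemptible on non-PREEMPT_RT kernels only:

.. code-block:: c

   #include <linux/percpu.h>
   #include <linux/preempt.h>

   static DEFINE_PER_CPU(unsigned int, my_events);  /* hypothetical counter */

   /* Caller already holds a spinlock_t protecting the surrounding state. */
   static void my_account_event(void)
   {
           /*
            * Non-PREEMPT_RT: lockdep asserts preemption is already off.
            * PREEMPT_RT: explicitly disables preemption for this update.
            */
           preempt_disable_nested();
           __this_cpu_inc(my_events);
           preempt_enable_nested();
   }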

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
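
A minimal sketch, using hypothetical names, of a timer that explicitly requests
hard interrupt context:

.. code-block:: c

   #include <linux/hrtimer.h>
   #include <linux/ktime.h>

   static struct hrtimer my_timer;         /* hypothetical timer */

   static enum hrtimer_restart my_timer_fn(struct hrtimer *timer)
   {
           /* Runs in hard interrupt context, even on PREEMPT_RT. */
           return HRTIMER_NORESTART;
   }

   static void my_timer_start(void)
   {
           /*
            * Without the _HARD mode a PREEMPT_RT kernel would expire
            * this timer in softirq context instead.
            */
           hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
           my_timer.function = my_timer_fn;
           hrtimer_start(&my_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);
   }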

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
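
A minimal sketch of this pattern, with hypothetical names, allocates before
entering the non-preemptible section:

.. code-block:: c

   #include <linux/errno.h>
   #include <linux/list.h>
   #include <linux/slab.h>
   #include <linux/spinlock.h>

   static DEFINE_RAW_SPINLOCK(my_list_lock);       /* hypothetical lock */
   static LIST_HEAD(my_list);                      /* hypothetical list */

   struct my_item {
           struct list_head node;
           int value;
   };

   static int my_add_item(int value)
   {
           struct my_item *item;

           /* Allocate while still preemptible. */
           item = kmalloc(sizeof(*item), GFP_KERNEL);
           if (!item)
                   return -ENOMEM;
           item->value = value;

           /* The raw spinlock section stays non-preemptible and short. */
           raw_spin_lock(&my_list_lock);
           list_add(&item->node, &my_list);
           raw_spin_unlock(&my_list_lock);

           return 0;
   }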

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exception is work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work thread.
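
A minimal sketch, assuming a hypothetical callback, might look like this:

.. code-block:: c

   #include <linux/irq_work.h>

   static void my_irq_work_fn(struct irq_work *work)
   {
           /*
            * Hard interrupt context on non-PREEMPT_RT; the per-CPU
            * irq_work thread on PREEMPT_RT. Initializing the item with
            * IRQ_WORK_INIT_HARD() instead keeps it in hard interrupt
            * context on PREEMPT_RT as well.
            */
   }

   static struct irq_work my_work = IRQ_WORK_INIT(my_irq_work_fn);

   /* May be called from NMI or scheduler context. */
   static void my_poke(void)
   {
           irq_work_queue(&my_work);
   }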

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.
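
For example, on a non-PREEMPT_RT kernel this is selected by appending the
parameter to the kernel command line::

   rcutree.use_softirq=0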

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.
Examples include hrtimer_cancel() and timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.
To solve this problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.
Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer’s priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
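
A minimal sketch of a seqcount_spinlock_t protecting two values, using
hypothetical names, might look like this:

.. code-block:: c

   #include <linux/seqlock.h>
   #include <linux/spinlock.h>

   static DEFINE_SPINLOCK(my_lock);        /* hypothetical writer lock */
   static seqcount_spinlock_t my_seq = SEQCNT_SPINLOCK_ZERO(my_seq, &my_lock);
   static u64 my_a, my_b;                  /* hypothetical protected data */

   static void my_write(u64 a, u64 b)
   {
           spin_lock(&my_lock);
           write_seqcount_begin(&my_seq);
           my_a = a;
           my_b = b;
           write_seqcount_end(&my_seq);
           spin_unlock(&my_lock);
   }

   static u64 my_read(void)
   {
           unsigned int seq;
           u64 sum;

           do {
                   seq = read_seqcount_begin(&my_seq);
                   sum = my_a + my_b;
           } while (read_seqcount_retry(&my_seq, seq));

           return sum;
   }

On a PREEMPT_RT kernel, a reader that observes a write in progress briefly
acquires and releases my_lock instead of spinning, which boosts a preempted
writer through the lock's priority inheritance and allows it to complete its
update.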

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started and does not need to serialize against it,
then using the plain seqcount_t is reasonable.