Commit Graph

177 Commits

Waiman Long a2870859f1 clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
JIRA: https://issues.redhat.com/browse/RHEL-76143
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

commit 6bb05a33337b2c842373857b63de5c9bf1ae2a09
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 31 Jan 2025 12:33:23 -0500

    clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context

    The following bug report happened with a PREEMPT_RT kernel:

      BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
      in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
      preempt_count: 1, expected: 0
      RCU nest depth: 0, expected: 0
      get_random_u32+0x4f/0x110
      clocksource_verify_choose_cpus+0xab/0x1a0
      clocksource_verify_percpu.part.0+0x6b/0x330
      clocksource_watchdog_kthread+0x193/0x1a0

    It is due to the fact that clocksource_verify_choose_cpus() is invoked with
    preemption disabled.  This function invokes get_random_u32() to obtain
    random numbers for choosing CPUs.  The batched_entropy_32 local lock and/or
    the base_crng.lock spinlock in driver/char/random.c will be acquired during
    the call. In PREEMPT_RT kernel, they are both sleeping locks and so cannot
    be acquired in atomic context.

    Fix this problem by using migrate_disable() to allow smp_processor_id() to
    be reliably used without introducing atomic context. preempt_disable() is
    then called after clocksource_verify_choose_cpus() but before the
    clocksource measurement is being run to avoid introducing unexpected
    latency.

    Fixes: 7560c02bdf ("clocksource: Check per-CPU clock synchronization when marked unstable")
    Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/all/20250131173323.891943-2-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:21:04 -05:00
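A minimal sketch of the pattern this fix describes (the helper names are illustrative, not the upstream diff): migrate_disable() keeps smp_processor_id() stable while get_random_u32() may still sleep on PREEMPT_RT, and preemption is disabled only around the measurement itself.

    static void clocksource_verify_sketch(void)
    {
            /* Pin the task to this CPU; sleeping is still allowed. */
            migrate_disable();

            /* May call get_random_u32(), which can sleep on PREEMPT_RT. */
            choose_cpus_to_verify();        /* hypothetical helper */

            /* Enter atomic context only for the timing-sensitive part. */
            preempt_disable();
            run_clocksource_measurement();  /* hypothetical helper */
            preempt_enable();

            migrate_enable();
    }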
Waiman Long 188f7b8ee3 clocksource: Use pr_info() for "Checking clocksource synchronization" message
JIRA: https://issues.redhat.com/browse/RHEL-76143
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

commit 1f566840a82982141f94086061927a90e79440e5
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 24 Jan 2025 20:54:41 -0500

    clocksource: Use pr_info() for "Checking clocksource synchronization" message

    The "Checking clocksource synchronization" message is normally printed
    when clocksource_verify_percpu() is called for a given clocksource if
    both the CLOCK_SOURCE_UNSTABLE and CLOCK_SOURCE_VERIFY_PERCPU flags
    are set.

    It is an informational message and so pr_info() is the correct choice.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: John Stultz <jstultz@google.com>
    Link: https://lore.kernel.org/all/20250125015442.3740588-1-longman@redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:21:02 -05:00
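For reference, a hedged sketch of the message at its new level; the format string follows the boot logs quoted elsewhere in this history, and the variable names are assumptions:

    pr_info("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n",
            cs->name, testcpu, cpumask_pr_args(&cpus_chosen));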
Waiman Long 64e753d2d6 clocksource: Make watchdog and suspend-timing multiplication overflow safe
JIRA: https://issues.redhat.com/browse/RHEL-76143

commit d0304569fb019d1bcfbbbce1ce6df6b96f04079b
Author: Adrian Hunter <adrian.hunter@intel.com>
Date:   Mon, 25 Mar 2024 08:40:23 +0200

    clocksource: Make watchdog and suspend-timing multiplication overflow safe

    Kernel timekeeping is designed to keep the change in cycles (since the last
    timer interrupt) below max_cycles, which prevents multiplication overflow
    when converting cycles to nanoseconds. However, if timer interrupts stop,
    the clocksource_cyc2ns() calculation will eventually overflow.

    Add protection against that. Simplify by folding together
    clocksource_delta() and clocksource_cyc2ns() into cycles_to_nsec_safe().
    Check against max_cycles, falling back to a slower higher precision
    calculation.

    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20240325064023.2997-20-adrian.hunter@intel.com

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:59 -05:00
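A sketch of the folded helper the commit describes, assuming the fast path keeps the usual (delta * mult) >> shift conversion and the fallback uses the higher-precision mul_u64_u32_shr():

    static u64 cycles_to_nsec_safe(struct clocksource *cs, u64 start, u64 end)
    {
            u64 delta = clocksource_delta(end, start, cs->mask);

            /* Fast path: the multiplication cannot overflow 64 bits. */
            if (likely(delta < cs->max_cycles))
                    return clocksource_cyc2ns(delta, cs->mult, cs->shift);

            /* Slow path: higher-precision multiply for oversized deltas. */
            return mul_u64_u32_shr(delta, cs->mult, cs->shift);
    }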
Waiman Long 8b6c3917c0 clocksource: Scale the watchdog read retries automatically
JIRA: https://issues.redhat.com/browse/RHEL-76143
Conflicts: A context diff in the include/linux/clocksource.h hunk due
	   to the presence of later upstream commit 6b2e29977518
	   ("timekeeping: Provide infrastructure for converting to/from
	   a base clock").

commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9
Author: Feng Tang <feng.tang@intel.com>
Date:   Wed, 21 Feb 2024 14:08:59 +0800

    clocksource: Scale the watchdog read retries automatically

    On an 8-socket server the TSC is wrongly marked as 'unstable' and disabled
    during boot time on about one out of 120 boot attempts:

        clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
        wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
        tsc: Marking TSC unstable due to clocksource watchdog
        TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
        sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
        clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
        clocksource: Switched to clocksource hpet

    The reason is that for platforms with a large number of CPUs, there are
    sporadic big or huge read latencies while reading the watchdog/clocksource
    during boot or when the system is under a stressful workload, and the
    frequency and maximum value of the latency go up with the number of online
    CPUs.

    The current code already has logic to detect and filter such high-latency
    cases by reading the watchdog twice and checking the two deltas. Due to the
    randomness of the latency, there is a low probability that the first delta
    (latency) is big, but the second delta is small and looks valid. The
    watchdog code retries the readouts by default twice, which is not
    necessarily sufficient for systems with a large number of CPUs.

    There is a command line parameter 'max_cswd_read_retries' which allows
    increasing the number of retries, but that's not user friendly as it needs to
    be tweaked per system. As the number of required retries is proportional to
    the number of online CPUs, this parameter can be calculated at runtime.

    Scale and enlarge the number of retries according to the number of online
    CPUs and remove the command line parameter completely.

    [ tglx: Massaged change log and comments ]

    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Jin Wang <jin1.wang@intel.com>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:56 -05:00
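A sketch of the runtime scaling the commit describes; the exact formula is an assumption, but the point is a retry count that grows logarithmically with the number of online CPUs instead of a fixed boot-time value:

    static inline unsigned int clocksource_get_max_watchdog_retry(void)
    {
            /*
             * Read latencies grow with the CPU count, so scale the number
             * of retries with ilog2() of the number of online CPUs.
             */
            return (ilog2(num_online_cpus()) / 2) + 1;
    }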
Waiman Long 6dff788e53 clocksource: Skip watchdog check for large watchdog intervals
JIRA: https://issues.redhat.com/browse/RHEL-76143

commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5
Author: Jiri Wiesner <jwiesner@suse.de>
Date:   Mon, 22 Jan 2024 18:23:50 +0100

    clocksource: Skip watchdog check for large watchdog intervals

    There have been reports of the watchdog marking clocksources unstable on
    machines with 8 NUMA nodes:

      clocksource: timekeeping watchdog on CPU373:
      Marking clocksource 'tsc' as unstable because the skew is too large:
      clocksource:   'hpet' wd_nsec: 14523447520
      clocksource:   'tsc'  cs_nsec: 14524115132

    The measured clocksource skew - the absolute difference between cs_nsec
    and wd_nsec - was 668 microseconds:

      cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

    The kernel used 200 microseconds for the uncertainty_margin of both the
    clocksource and watchdog, resulting in a threshold of 400 microseconds (the
    md variable). Both the cs_nsec and the wd_nsec value indicate that the
    readout interval was circa 14.5 seconds.  The observed behaviour is that
    watchdog checks failed for large readout intervals on 8 NUMA node
    machines. This indicates that the size of the skew was directly proportional
    to the length of the readout interval on those machines. The measured
    clocksource skew, 668 microseconds, was evaluated against a threshold (the
    md variable) that is suited for readout intervals of roughly
    WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.

    The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
    threshold") was to tighten the threshold for evaluating skew and set the
    lower bound for the uncertainty_margin of clocksources to twice
    WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
    watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
    125 microseconds to fit the limit of NTP, which is able to use a
    clocksource that suffers from up to 500 microseconds of skew per second.
    Both the TSC and the HPET use default uncertainty_margin. When the
    readout interval gets stretched the default uncertainty_margin is no
    longer a suitable lower bound for evaluating skew - it imposes a limit
    that is far stricter than the skew with which NTP can deal.

    The root causes of the skew being directly proportional to the length of
    the readout interval are:

      * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
      * the conversion to nanoseconds is imprecise for large readout intervals

    Prevent this by skipping the current watchdog check if the readout
    interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
    interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
    (of the TSC and HPET) corresponds to a limit on clocksource skew of 250
    ppm (microseconds of skew per second).  To keep the limit imposed by NTP
    (500 microseconds of skew per second) for all possible readout intervals,
    the margins would have to be scaled so that the threshold value is
    proportional to the length of the actual readout interval.

    As for why the readout interval may get stretched: Since the watchdog is
    executed in softirq context the expiration of the watchdog timer can get
    severely delayed on account of a ksoftirqd thread not getting to run in a
    timely manner. Surely, a system with such belated softirq execution is not
    working well and the scheduling issue should be looked into but the
    clocksource watchdog should be able to deal with it accordingly.

    Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
    Suggested-by: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Feng Tang <feng.tang@intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240122172350.GA740@incl

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:54 -05:00
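A sketch of the added check, assuming a cap of 2 * WATCHDOG_INTERVAL expressed in nanoseconds (WATCHDOG_INTERVAL being HZ >> 1 as noted above):

    #define WATCHDOG_INTERVAL        (HZ >> 1)
    #define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))

            /*
             * In clocksource_watchdog(): the readout interval was stretched
             * (e.g. by belated softirq execution), so the fixed uncertainty
             * margin does not apply; skip the skew check for this round.
             */
            if (wd_nsec > WATCHDOG_INTERVAL_MAX_NS) {
                    pr_warn("timekeeping watchdog on CPU%d: excessive readout interval, skipping watchdog check\n",
                            smp_processor_id());
                    continue;
            }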
Waiman Long 223c07af9b clocksource: Suspend the watchdog temporarily when high read latency detected
JIRA: https://issues.redhat.com/browse/RHEL-76143

commit b7082cdfc464bf9231300605d03eebf943dda307
Author: Feng Tang <feng.tang@intel.com>
Date:   Tue, 20 Dec 2022 16:25:12 +0800

    clocksource: Suspend the watchdog temporarily when high read latency detected

    Bugs have been reported on 8-socket x86 machines in which the TSC was
    wrongly disabled when the system was under a heavy workload.

     [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
     [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
     [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
     [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
     [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
     [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
     [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
     [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
     [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
     [ 821.067990] clocksource: Switched to clocksource hpet

    This can be reproduced by running memory intensive 'stream' tests,
    or some of the stress-ng subcases such as 'ioport'.

    The reason for these issues is that, when the system is under heavy load, the
    read latency of the clocksources can be very high.  Even lightweight TSC
    reads can show high latencies, and latencies are much worse for external
    clocksources such as HPET or the APIC PM timer.  These latencies can
    result in false-positive clocksource-unstable determinations.

    These issues were initially reported by a customer running on a production
    system, and this problem was reproduced on several generations of Xeon
    servers, especially when running the stress-ng test.  These Xeon servers
    were not production systems, but they did have the latest steppings
    and firmware.

    Given that the clocksource watchdog is a continual diagnostic check with a
    frequency of twice a second, there is no need to rush it when the system
    is under heavy load.  Therefore, when high clocksource read latencies
    are detected, suspend the watchdog timer for 5 minutes.

    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Acked-by: Waiman Long <longman@redhat.com>
    Cc: John Stultz <jstultz@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Stephen Boyd <sboyd@kernel.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:48 -05:00
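A sketch of the back-off described above, assuming the read helper reports a skip status when its read-back delay test fails and that suspension is just a longer re-arm of the watchdog timer:

    u32 extra_wait = 0;

    if (cs_watchdog_read(cs, &csnow, &wdnow) == WD_READ_SKIP) {
            /*
             * Clock reads are too noisy to judge right now; give the
             * system some space and suspend the check for 5 minutes.
             */
            extra_wait = HZ * 300;
    }

    /* When re-arming: delay the next run by the extra wait, if any. */
    watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL + extra_wait;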
Waiman Long 4fe6d9cb54 clocksource: Loosen clocksource watchdog constraints
JIRA: https://issues.redhat.com/browse/RHEL-76143

commit c37e85c135cead4256dc8860073c468d8925c3df
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 6 Dec 2022 19:36:10 -0800

    clocksource: Loosen clocksource watchdog constraints

    Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked
    reasonably well.  However, NTP is willing to tolerate 500 microseconds
    of skew per second, and a clocksource that is good enough for NTP should
    be good enough for the clocksource watchdog.  The watchdog's skew is
    controlled by MAX_SKEW_USEC and the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
    Kconfig option.  However, these values are doubled before being associated
    with a clocksource's ->uncertainty_margin, and the ->uncertainty_margin
    values of the pair of clocksources being compared are summed before
    checking against the skew.

    Therefore, set both MAX_SKEW_USEC and the default for the
    CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option to 125 microseconds of
    skew per second, resulting in 500 microseconds of skew per second in
    the clocksource watchdog's skew comparison.

    Suggested-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:46 -05:00
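Restating the arithmetic from the text above:

    125 us   MAX_SKEW_USEC / CLOCKSOURCE_WATCHDOG_MAX_SKEW_US default
    250 us   doubled into each clocksource's ->uncertainty_margin
    500 us   the two compared ->uncertainty_margin values summed

i.e. 500 microseconds of skew per second, the limit NTP tolerates.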
Waiman Long 2135ae82f0 clocksource: Replace cpumask_weight() with cpumask_empty()
JIRA: https://issues.redhat.com/browse/RHEL-76143

commit 8afbcaf8690dac19ebf570a4e4fef9c59c75bf8e
Author: Yury Norov <yury.norov@gmail.com>
Date:   Thu, 10 Feb 2022 14:49:07 -0800

    clocksource: Replace cpumask_weight() with cpumask_empty()

    clocksource_verify_percpu() calls cpumask_weight() to check if any bit of a
    given cpumask is set.

    This can be done more efficiently with cpumask_empty() because
    cpumask_empty() stops traversing the cpumask as soon as it finds first set
    bit, while cpumask_weight() counts all bits unconditionally.

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20220210224933.379149-24-yury.norov@gmail.com

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:43 -05:00
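The shape of the change, as a hedged before/after sketch (the mask name follows its use elsewhere in this history):

    /* Before: counts every set bit in the mask. */
    if (cpumask_weight(&cpus_chosen) == 0)
            return;

    /* After: stops at the first set bit it finds. */
    if (cpumask_empty(&cpus_chosen))
            return;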
Waiman Long c48ecee560 clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
JIRA: https://issues.redhat.com/browse/RHEL-76143
Conflicts: Add a Kconfig file for CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US.

commit fc153c1c58cb8c3bb3b443b4d7dc3211ff5f65fc
Author: Waiman Long <longman@redhat.com>
Date:   Sun, 5 Dec 2021 22:38:15 -0500

    clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

    A watchdog maximum skew of 100us may still be too small for
    some systems or archs. It may also be too small when some kernel
    debug config options are enabled.  So add a new Kconfig option
    CLOCKSOURCE_WATCHDOG_MAX_SKEW_US to allow kernel builders to have more
    control over the threshold for marking a clocksource as unstable.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2025-02-04 13:20:21 -05:00
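A sketch of how such an option typically reaches the C side, assuming the Kconfig value simply overrides the former hard-coded default:

    #ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
    #define MAX_SKEW_USEC   CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
    #else
    #define MAX_SKEW_USEC   100
    #endif

    #define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)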
Prarit Bhargava d9df050187 clocksource: Print clocksource name when clocksource is tested unstable
JIRA: https://issues.redhat.com/browse/RHEL-19589

commit beaa1ffe551c330d8ea23de158432ecaad6c0410
Author: Yunying Sun <yunying.sun@intel.com>
Date:   Wed Nov 16 16:22:21 2022 +0800

    clocksource: Print clocksource name when clocksource is tested unstable

    Some "TSC fall back to HPET" messages appear on systems having more than
    2 NUMA nodes:

    clocksource: timekeeping watchdog on CPU168: hpet read-back delay of 4296200ns, attempt 4, marking unstable

    The "hpet" here is misleading the clocksource watchdog is really
    doing repeated reads of "hpet" in order to check for unrelated delays.
    Therefore, print the name of the clocksource under test, prefixed by
    "wd-" and suffixed by "-wd", for example, "wd-tsc-wd".

    Signed-off-by: Yunying Sun <yunying.sun@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2023-12-15 11:01:12 -05:00
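A hedged sketch of the resulting message, mirroring the log lines quoted earlier in this history; the variable names are assumptions:

    pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd read-back delay of %lldns, attempt %d, marking unstable\n",
            smp_processor_id(), cs->name, wd_delay, nretries);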
Nico Pache 9e4078029c cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
commit 9b51d9d866482a703646fd4c07e433c3d9d88efd
Author: Yury Norov <yury.norov@gmail.com>
Date:   Sat Aug 14 14:17:05 2021 -0700

    cpumask: replace cpumask_next_* with cpumask_first_* where appropriate

    cpumask_first() is a more efficient analogue of the 'next' version when
    n == -1 (which means start == 0). This patch replaces 'next' with 'first' where
    things look trivial.

    There's no cpumask_first_zero() function, so create it.

    Signed-off-by: Yury Norov <yury.norov@gmail.com>
    Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168378
Signed-off-by: Nico Pache <npache@redhat.com>
2023-04-17 11:59:20 -06:00
Waiman Long 3ff4df22d4 clocksource: Reduce the default clocksource_watchdog() retries to 2
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2027463

commit 1a5620671a1b6fd9cc08761677d050f1702f910c
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 18 Nov 2021 14:14:37 -0500

    clocksource: Reduce the default clocksource_watchdog() retries to 2

    With the previous patch, there is an extra watchdog read in each retry.
    Now the total number of clocksource reads is increased to 4 per iteration.
    In order to avoid increasing the clock skew check overhead, the default
    maximum number of retries is reduced from 3 to 2 to maintain the same 12
    clocksource reads in the worst case.

    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-01-18 14:00:12 -05:00
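The worst-case read budget, restated:

    before: 3 reads per attempt x (1 initial + 3 retries) = 12 clocksource reads
    after:  4 reads per attempt x (1 initial + 2 retries) = 12 clocksource reads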
Waiman Long 445ef6210d clocksource: Avoid accidental unstable marking of clocksources
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2027463

commit c86ff8c55b8ae68837b2fa59dc0c203907e9a15f
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 18 Nov 2021 14:14:36 -0500

    clocksource: Avoid accidental unstable marking of clocksources

    Since commit db3a34e174 ("clocksource: Retry clock read if long delays
    detected") and commit 2e27e793e2 ("clocksource: Reduce clocksource-skew
    threshold"), it is found that tsc clocksource fallback to hpet can
    sometimes happen on both Intel and AMD systems especially when they are
    running stressful benchmarking workloads. Of the 23 systems tested with
    a v5.14 kernel, 10 of them have switched to hpet clock source during
    the test run.

    The result of falling back to hpet is a drastic reduction of performance
    when running benchmarks. For example, the fio performance tests can
    drop up to 70% whereas the iperf3 performance can drop up to 80%.

    4 hpet fallbacks happened during bootup. They were:

      [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
      [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
      [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
      [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable

    Other fallbacks happen when the systems were running stressful
    benchmarks. For example:

      [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
      [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable

    Commit 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold"),
    changed the skew margin from 100us to 50us. I think this is too small
    and can easily be exceeded when running some stressful workloads on a
    thermally stressed system.  So it is switched back to 100us.

    Even a maximum skew margin of 100us may be too small for some systems
    when booting up, especially if those systems are under thermal stress. To
    eliminate the case where the large skew is due to the system being too
    busy, slowing down the reading of both the watchdog and the clocksource,
    an extra consecutive read of the watchdog clock is done to check this.

    The consecutive watchdog read delay is compared against
    WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
    the system is just too busy. A warning will be printed to the console
    and the clock skew check is skipped for this round.

    Fixes: db3a34e174 ("clocksource: Retry clock read if long delays detected")
    Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-01-18 14:00:12 -05:00
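A condensed sketch of the read sequence described above; the skip status name follows the related commits, the rest is illustrative:

    local_irq_disable();
    wd_start = watchdog->read(watchdog);   /* watchdog, before */
    csnow    = cs->read(cs);               /* clock under test */
    wd_end   = watchdog->read(watchdog);   /* watchdog, after  */
    wd_end2  = watchdog->read(watchdog);   /* extra consecutive read */
    local_irq_enable();

    /* Two back-to-back watchdog reads far apart: system too busy. */
    wd_seq_delay = clocksource_cyc2ns(wd_end2 - wd_end,
                                      watchdog->mult, watchdog->shift);
    if (wd_seq_delay > WATCHDOG_MAX_SKEW / 2)
            return WD_READ_SKIP;   /* warn and skip the skew check */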
Waiman Long 80b60eaa96 Revert "clocksource: Increase WATCHDOG_MAX_SKEW"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2027463
Upstream Status: RHEL only

Revert RHEL9 only commit a8e16c0d8f ("clocksource: Increase
WATCHDOG_MAX_SKEW") before applying upstream fix for the same problem.

Signed-off-by: Waiman Long <longman@redhat.com>
2022-01-18 13:56:02 -05:00
Herton R. Krzesinski 666dc945c2 Merge: Backport v5.15 rcu/locking/cgroup dependencies for kernel-rt
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/130
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2022806
Omitted-fix: fb5d69a5cd78 ("csky: bitops: Remove duplicate __clear_bit define")
             csky is not a supported arch.

This patch series backports a number of v5.15 locking, rcu, cgroup and
some other miscellaneous patches that are dependencies required for
kernel-rt to move up to v5.15-rt release for RHEL9.0. The last 2 patches
are v5.16 fixes.

All the patches applied cleanly without conflicts.

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (99):
  refscale: Add measurement of clock readout
  torture: Add clocksource-watchdog testing to torture.sh
  torture: Make torture.sh accept --do-all and --donone
  rcu: Fix to include first blocked task in stall warning
  rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
  rcutorture: Preempt rather than block when testing task stalls
  rcu/nocb: Start moving nocb code to its own plugin file
  rcu/nocb: Remove NOCB deferred wakeup from rcutree_dead_cpu()
  rcu: Remove special bit at the bottom of the ->dynticks counter
  rcu: Weaken ->dynticks accesses and updates
  Documentation/RCU: Fix emphasis markers
  rcu: Mark accesses to ->rcu_read_lock_nesting
  Documentation/RCU: Fix nested inline markup
  rcu: Mark accesses in tree_stall.h
  rculist: Unify documentation about missing list_empty_rcu()
  rcu/tree: Handle VM stoppage in stall detection
  rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()
  rcu: Start timing stall repetitions after warning complete
  rcu-tasks: Add comments explaining task_struct strategy
  rcu-tasks: Mark ->trc_reader_nesting data races
  rcu-tasks: Mark ->trc_reader_special.b.need_qs data races
  docs: Fix a typo in Documentation/RCU/stallwarn.rst
  locktorture: Mark statistics data races
  locktorture: Count lock readers
  srcutiny: Mark read-side data races
  rcu: Mark lockless ->qsmask read in rcu_check_boost_fail()
  torture: Enable KCSAN summaries over groups of torture-test runs
  torture: Create KCSAN summaries for torture.sh runs
  rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack
  rcu/doc: Add a quick quiz to explain further why we need
    smp_mb__after_unlock_lock()
  torture: Make kvm-recheck-scf.sh tolerate qemu-cmd comments
  torture: Make kvm-recheck-lock.sh tolerate qemu-cmd comments
  torture: Log more kvm-remote.sh information
  torture: Protect kvm-remote.sh directory trees from /tmp reaping
  rcuscale: Console output claims too few grace periods
  rcu-tasks: Fix synchronize_rcu_rude() typo in comment
  torture: Make kvm-recheck.sh skip kcsan.sum for build-only runs
  torture: Move parse-console.sh call to PATH-aware scripts
  tools: include: nolibc: Fix a typo occured to occurred in the file
    nolibc.h
  tools/nolibc: Implement msleep()
  scftorture: Add RPC-like IPI tests
  rcu: Remove useless "ret" update in rcu_gp_fqs_loop()
  rcu: Use per_cpu_ptr to get the pointer of per_cpu variable
  Documentation/atomic_t: Document cmpxchg() vs try_cmpxchg()
  rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU
  locking/rwsem: Remove an unused parameter of rwsem_wake()
  torture: Put kvm.sh batch-creation awk script into a temp file
  torture: Make kvm.sh select per-scenario affinity masks
  torture: Don't redirect qemu-cmd comment lines
  torture: Make kvm-test-1-run-qemu.sh apply affinity
  rcutorture: Upgrade two-CPU scenarios to four CPUs
  torture: Use numeric taskset argument in jitter.sh
  torture: Consistently name "qemu*" test output files
  torture: Make kvm-test-1-run-batch.sh select per-scenario affinity
    masks
  torture: Don't use "test" command's "-a" argument
  torture: Add timestamps to kvm-test-1-run-qemu.sh output
  torture: Make kvm-test-1-run-qemu.sh check for reboot loops
  scftorture: Avoid NULL pointer exception on early exit
  rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
  locking/atomic: simplify ifdef generation
  locking/atomic: remove ARCH_ATOMIC remanants
  locking/atomic: centralize generated headers
  locking/atomic: add arch_atomic_long*()
  locking/atomic: add generic arch_*() bitops
  doc: Update stallwarn.rst with recent changes
  cgroup: remove cgroup_mount from comments
  rcu: Print human-readable message for schedule() in RCU reader
  rcu: Mark accesses to rcu_state.n_force_qs
  cgroup/cpuset: Miscellaneous code cleanup
  cgroup/cpuset: Fix a partition bug with hotplug
  cgroup/cpuset: Fix violation of cpuset locking rule
  locking/atomic: simplify non-atomic wrappers
  eventfd: Make signal recursion protection a task bit
  Documentation/atomic_t: Document forward progress expectations
  Documentation: Replace deprecated CPU-hotplug functions.
  perf/x86/intel: Replace deprecated CPU-hotplug functions
  perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
  md/raid5: Replace deprecated CPU-hotplug functions.
  thermal: Replace deprecated CPU-hotplug functions.
  mm: Replace deprecated CPU-hotplug functions.
  cgroup: Replace deprecated CPU-hotplug functions.
  genirq/affinity: Replace deprecated CPU-hotplug functions.
  rcu: Replace deprecated CPU-hotplug functions
  smpboot: Replace deprecated CPU-hotplug functions.
  clocksource: Replace deprecated CPU-hotplug functions.
  torture: Replace deprecated CPU-hotplug functions.
  static_call: Update API documentation
  locking/semaphore: Add might_sleep() to down_*() family
  cgroup: cgroup-v1: clean up kernel-doc notation
  cgroup/cpuset: Enable event notification when partition state changes
  debugobjects: Make them PREEMPT_RT aware
  media/atomisp: Use lockdep instead of *mutex_is_locked()
  cgroup: Avoid compiler warnings with no subsystems
  futex: Clarify comment for requeue_pi_wake_futex()
  futex: Avoid redundant task lookup
  docs/core-api: Modify document layout
  Documentation: core-api/cpuhotplug: Rewrite the API section
  efi: Change down_interruptible() in virt_efi_reset_system() to
    down_trylock()
  rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr

 .../Tree-RCU-Memory-Ordering.rst              |   29 +
 .../RCU/Design/Requirements/Requirements.rst  |    8 +-
 Documentation/RCU/checklist.rst               |   24 +-
 Documentation/RCU/rcu_dereference.rst         |    6 +-
 Documentation/RCU/stallwarn.rst               |   31 +-
 Documentation/atomic_t.txt                    |   94 +
 Documentation/core-api/cpu_hotplug.rst        |  599 +++++--
 Documentation/trace/ftrace.rst                |    2 +-
 arch/x86/events/intel/core.c                  |    8 +-
 arch/x86/events/intel/pt.c                    |    4 +-
 drivers/firmware/efi/runtime-wrappers.c       |    2 +-
 drivers/md/raid5.c                            |    4 +-
 .../staging/media/atomisp/pci/atomisp_ioctl.c |    4 +-
 drivers/thermal/intel/intel_powerclamp.c      |    4 +-
 fs/aio.c                                      |    2 +-
 fs/eventfd.c                                  |   12 +-
 include/asm-generic/atomic-long.h             | 1014 -----------
 include/asm-generic/bitops/atomic.h           |   32 +-
 include/asm-generic/bitops/lock.h             |   39 +-
 include/asm-generic/bitops/non-atomic.h       |   39 +-
 include/linux/atomic.h                        |    7 +-
 .../linux/{ => atomic}/atomic-arch-fallback.h |    0
 .../atomic}/atomic-instrumented.h             |  586 ++++++-
 include/linux/atomic/atomic-long.h            | 1014 +++++++++++
 include/linux/cpuhotplug.h                    |  132 +-
 include/linux/eventfd.h                       |   11 +-
 include/linux/rculist.h                       |   35 +-
 include/linux/rcupdate.h                      |    4 +-
 include/linux/rcutiny.h                       |    3 -
 include/linux/sched.h                         |    4 +
 include/linux/srcutiny.h                      |    8 +-
 include/linux/static_call.h                   |   33 +
 kernel/cgroup/cgroup-v1.c                     |    8 +-
 kernel/cgroup/cgroup.c                        |   27 +-
 kernel/cgroup/cpuset.c                        |  151 +-
 kernel/events/hw_breakpoint.c                 |    4 +-
 kernel/futex.c                                |   93 +-
 kernel/irq/affinity.c                         |    8 +-
 kernel/locking/locktorture.c                  |   25 +-
 kernel/locking/rwsem.c                        |    6 +-
 kernel/locking/semaphore.c                    |    4 +
 kernel/rcu/rcuscale.c                         |    4 +-
 kernel/rcu/rcutorture.c                       |    7 +-
 kernel/rcu/refscale.c                         |   36 +-
 kernel/rcu/srcutiny.c                         |    2 +-
 kernel/rcu/tasks.h                            |   36 +-
 kernel/rcu/tree.c                             |  117 +-
 kernel/rcu/tree_nocb.h                        | 1496 ++++++++++++++++
 kernel/rcu/tree_plugin.h                      | 1506 +----------------
 kernel/rcu/tree_stall.h                       |  111 +-
 kernel/scftorture.c                           |   78 +-
 kernel/sched/core.c                           |   11 +
 kernel/smpboot.c                              |    8 +-
 kernel/time/clocksource.c                     |    6 +-
 kernel/torture.c                              |    6 +-
 lib/debugobjects.c                            |    7 +-
 mm/swap_slots.c                               |    4 +-
 mm/vmstat.c                                   |   12 +-
 scripts/atomic/check-atomics.sh               |    6 +-
 scripts/atomic/fallbacks/acquire              |    4 +-
 scripts/atomic/fallbacks/add_negative         |    6 +-
 scripts/atomic/fallbacks/add_unless           |    6 +-
 scripts/atomic/fallbacks/andnot               |    4 +-
 scripts/atomic/fallbacks/dec                  |    4 +-
 scripts/atomic/fallbacks/dec_and_test         |    6 +-
 scripts/atomic/fallbacks/dec_if_positive      |    6 +-
 scripts/atomic/fallbacks/dec_unless_positive  |    6 +-
 scripts/atomic/fallbacks/fence                |    4 +-
 scripts/atomic/fallbacks/fetch_add_unless     |    8 +-
 scripts/atomic/fallbacks/inc                  |    4 +-
 scripts/atomic/fallbacks/inc_and_test         |    6 +-
 scripts/atomic/fallbacks/inc_not_zero         |    6 +-
 scripts/atomic/fallbacks/inc_unless_negative  |    6 +-
 scripts/atomic/fallbacks/read_acquire         |    2 +-
 scripts/atomic/fallbacks/release              |    4 +-
 scripts/atomic/fallbacks/set_release          |    2 +-
 scripts/atomic/fallbacks/sub_and_test         |    6 +-
 scripts/atomic/fallbacks/try_cmpxchg          |    4 +-
 scripts/atomic/gen-atomic-fallback.sh         |   68 +-
 scripts/atomic/gen-atomic-instrumented.sh     |   11 +-
 scripts/atomic/gen-atomic-long.sh             |   10 +-
 scripts/atomic/gen-atomics.sh                 |    6 +-
 tools/include/nolibc/nolibc.h                 |   15 +-
 .../selftests/rcutorture/bin/jitter.sh        |   10 +-
 .../rcutorture/bin/kcsan-collapse.sh          |    2 +-
 .../selftests/rcutorture/bin/kvm-again.sh     |    4 +-
 .../rcutorture/bin/kvm-assign-cpus.sh         |  106 ++
 .../rcutorture/bin/kvm-get-cpus-script.sh     |   88 +
 .../rcutorture/bin/kvm-recheck-lock.sh        |    2 +-
 .../rcutorture/bin/kvm-recheck-scf.sh         |    2 +-
 .../selftests/rcutorture/bin/kvm-recheck.sh   |    5 +-
 .../rcutorture/bin/kvm-remote-noreap.sh       |   30 +
 .../selftests/rcutorture/bin/kvm-remote.sh    |   20 +-
 .../rcutorture/bin/kvm-test-1-run-batch.sh    |   24 +
 .../rcutorture/bin/kvm-test-1-run-qemu.sh     |   49 +-
 .../rcutorture/bin/kvm-test-1-run.sh          |    2 +
 tools/testing/selftests/rcutorture/bin/kvm.sh |   39 +-
 .../selftests/rcutorture/bin/torture.sh       |   37 +-
 .../selftests/rcutorture/configs/rcu/RUDE01   |    2 +-
 .../selftests/rcutorture/configs/rcu/TASKS01  |    2 +-
 .../selftests/rcutorture/configs/rcu/TASKS03  |    2 +-
 101 files changed, 4977 insertions(+), 3226 deletions(-)
 delete mode 100644 include/asm-generic/atomic-long.h
 rename include/linux/{ => atomic}/atomic-arch-fallback.h (100%)
 rename include/{asm-generic => linux/atomic}/atomic-instrumented.h (68%)
 create mode 100644 include/linux/atomic/atomic-long.h
 create mode 100644 kernel/rcu/tree_nocb.h
 create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-assign-cpus.sh
 create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-get-cpus-script.sh
 create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-remote-noreap.sh

RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Chris von Recklinghausen <crecklin@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: Juri Lelli <juri.lelli@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-08 19:33:06 -03:00
Waiman Long a8e16c0d8f clocksource: Increase WATCHDOG_MAX_SKEW
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2017164
Upstream Status: RHEL only

Temporarily increase the WATCHDOG_MAX_SKEW threshold from 50us to 400us
which should be big enough to avoid triggering the hpet clocksource
fallback problem documented in the BZ for now. This change will be
reverted once an official upstream patch that address the problem
is backported.

Signed-off-by: Waiman Long <longman@redhat.com>
2021-11-18 10:54:57 -05:00
Waiman Long 3a313ec6e9 clocksource: Replace deprecated CPU-hotplug functions.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2022806

commit 698429f9d0e54ce3964151adff886ee5fc59714b
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue, 3 Aug 2021 16:16:17 +0200

    clocksource: Replace deprecated CPU-hotplug functions.

    The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de

Signed-off-by: Waiman Long <longman@redhat.com>
2021-11-12 14:23:23 -05:00
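The mechanical shape of the replacement (call site illustrative):

    /* Deprecated: */
    get_online_cpus();
    clocksource_verify_work();
    put_online_cpus();

    /* Official API; behavior unchanged: */
    cpus_read_lock();
    clocksource_verify_work();
    cpus_read_unlock();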
Feng Tang 22a2238337 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
Currently when an unstable clocksource is detected, the raw counters of
that clocksource and watchdog will be printed, which can only be understood
after some math calculation.

So print the delta in nanoseconds as well to make it easier for humans to
check the results.

[ paulmck: Fix typo. ]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210527190124.440372-6-paulmck@kernel.org
2021-06-22 16:53:17 +02:00
Paul E. McKenney 1253b9b87e clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks.  It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.

Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option.  This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel.  Very large systems
that boot slowly may need to increase the value of this module parameter.

This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications.  This module first verifies that the ->uncertainty_margin
fields of the clocksource structures are set sanely.  It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed.  Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.

This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
2021-06-22 16:53:17 +02:00
Paul E. McKenney 2e27e793e2 clocksource: Reduce clocksource-skew threshold
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL.  This requires that clocks be skewed
by more than 12.5% in order to be marked unstable.  Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left.  And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.

Therefore, add a new uncertainty_margin field to the clocksource structure
that contains the maximum uncertainty in nanoseconds for the corresponding
clock.  This field may be initialized manually, as it is for
clocksource_tsc_early and clocksource_jiffies, which is copied to
refined_jiffies.  If the field is not initialized manually, it will be
computed at clock-registry time as the period of the clock in question
based on the scale and freq parameters to the __clocksource_update_freq_scale()
function.  If either of those two parameters is zero, the
tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative
to dividing by zero.  No matter how the uncertainty_margin field is
calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100
microseconds.

Note that manually initialized uncertainty_margin fields are not adjusted,
but there is a WARN_ON_ONCE() that triggers if any such field is less than
twice WATCHDOG_MAX_SKEW.  This WARN_ON_ONCE() is intended to discourage
production use of the one-nanosecond uncertainty_margin values that are
used to test the clock-skew code itself.

The actual clock-skew check uses the sum of the uncertainty_margin fields
of the two clocksource structures being compared.  Integer overflow is
avoided because the largest computed value of the uncertainty_margin
fields is one billion (10^9), and double that value fits into an
unsigned int.  However, if someone manually specifies (say) UINT_MAX,
they will get what they deserve.

Note that the refined_jiffies uncertainty_margin field is initialized to
TICK_NSEC, which means that skew checks involving this clocksource will
be sufficiently forgiving.  In a similar vein, the clocksource_tsc_early
uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which
replicates the current behavior and allows custom setting if needed
in order to address the rare skews detected for this clocksource in
current mainline.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-4-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
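A sketch of the registry-time computation described above, assumed to live in the freq/scale update path; the bounds and fallbacks follow the text:

    if (!cs->uncertainty_margin) {
            if (scale && freq) {
                    /* One period of the clock, in nanoseconds. */
                    cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
            } else {
                    /* No usable frequency: cowardly fallback. */
                    cs->uncertainty_margin = WATCHDOG_THRESHOLD;
            }
            /* Bounded below by twice WATCHDOG_MAX_SKEW (100 us). */
            if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
                    cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
    }
    /* Manually set margins are kept, but tiny test values warn. */
    WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);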
Paul E. McKenney fa218f1cce clocksource: Limit number of CPUs checked for clock synchronization
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs.  This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.

Therefore, by default check only up to eight randomly chosen CPUs.  Also
provide a new clocksource.verify_n_cpus kernel boot parameter.  A value of
-1 says to check all of the CPUs, and a non-negative value says to randomly
select that number of CPUs, without concern about selecting the same CPU
multiple times.  However, make use of a cpumask so that a given CPU will be
checked at most once.

Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
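A sketch of the selection logic; the boot parameter semantics and the at-most-once cpumask are from the text, the loop details are illustrative:

    if (verify_n_cpus < 0) {
            /* -1: check all online CPUs. */
            cpumask_copy(&cpus_chosen, cpu_online_mask);
    } else {
            cpumask_clear(&cpus_chosen);
            for (i = 0; i < verify_n_cpus; i++) {
                    /* Snap a random index to an online CPU. */
                    cpu = cpumask_next((get_random_u32() % nr_cpu_ids) - 1,
                                       cpu_online_mask);
                    if (cpu < nr_cpu_ids)
                            cpumask_set_cpu(cpu, &cpus_chosen);
                    /* Repeats set the same bit: each CPU checked once. */
            }
    }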
Paul E. McKenney 7560c02bdf clocksource: Check per-CPU clock synchronization when marked unstable
Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other.  However, this problem has purportedly been
solved in the past ten years.  Except that it is all too possible that the
problem has instead simply been made less likely, which might mean that
some of the occasional "Marking clocksource 'tsc' as unstable" messages
might be due to desynchronization.  How would anyone know?

Therefore, apply CPU-to-CPU synchronization checking to newly unstable
clocksources that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong.  Just like in real life.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Paul E. McKenney db3a34e174 clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks.  Yes, interrupts are disabled
across those two reads, but there are no shortage of things that can delay
interrupts-disabled regions of code ranging from SMI handlers to vCPU
preemption.  It would be good to have some indication as to why the clock
was marked unstable.

Therefore, re-read the watchdog clock on either side of the read from the
clock under test.  If the watchdog clock shows an excessive time delta
between its pair of reads, the reads are retried.

The maximum number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that is, up to
four reads, one initial and up to three retries.  If more than one retry
was required, a message is printed on the console (the occasional single
retry is expected behavior, especially in guest OSes).  If the maximum
number of retries is exceeded, the clock under test will be marked
unstable.  However, the probability of this happening due to various sorts
of delays is quite small.  In addition, the reason (clock-read delays) for
the unstable marking will be apparent.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Linus Torvalds 152d32aa84 ARM:
- Stage-2 isolation for the host kernel when running in protected mode
 
 - Guest SVE support when running in nVHE mode
 
 - Force W^X hypervisor mappings in nVHE mode
 
 - ITS save/restore for guests using direct injection with GICv4.1
 
 - nVHE panics now produce readable backtraces
 
 - Guest support for PTP using the ptp_kvm driver
 
 - Performance improvements in the S2 fault handler
 
 x86:
 
 - Optimizations and cleanup of nested SVM code
 
 - AMD: Support for virtual SPEC_CTRL
 
 - Optimizations of the new MMU code: fast invalidation,
   zap under read lock, enable/disable dirty page logging under
   read lock
 
 - /dev/kvm API for AMD SEV live migration (guest API coming soon)
 
 - support SEV virtual machines sharing the same encryption context
 
 - support SGX in virtual machines
 
 - add a few more statistics
 
 - improved directed yield heuristics
 
 - Lots and lots of cleanups
 
 Generic:
 
 - Rework of MMU notifier interface, simplifying and optimizing
 the architecture-specific code
 
 - Some selftests improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
 y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
 c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
 Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
 +2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
 M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
 =AXUi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This is a large update by KVM standards, including AMD PSP (Platform
  Security Processor, aka "AMD Secure Technology") and ARM CoreSight
  (debug and trace) changes.

  ARM:

   - CoreSight: Add support for ETE and TRBE

   - Stage-2 isolation for the host kernel when running in protected
     mode

   - Guest SVE support when running in nVHE mode

   - Force W^X hypervisor mappings in nVHE mode

   - ITS save/restore for guests using direct injection with GICv4.1

   - nVHE panics now produce readable backtraces

   - Guest support for PTP using the ptp_kvm driver

   - Performance improvements in the S2 fault handler

  x86:

   - AMD PSP driver changes

   - Optimizations and cleanup of nested SVM code

   - AMD: Support for virtual SPEC_CTRL

   - Optimizations of the new MMU code: fast invalidation, zap under
     read lock, enable/disable dirty page logging under read lock

   - /dev/kvm API for AMD SEV live migration (guest API coming soon)

   - support SEV virtual machines sharing the same encryption context

   - support SGX in virtual machines

   - add a few more statistics

   - improved directed yield heuristics

   - Lots and lots of cleanups

  Generic:

   - Rework of MMU notifier interface, simplifying and optimizing the
     architecture-specific code

   - a handful of "Get rid of oprofile leftovers" patches

   - Some selftests improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
  KVM: selftests: Speed up set_memory_region_test
  selftests: kvm: Fix the check of return value
  KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
  KVM: SVM: Skip SEV cache flush if no ASIDs have been used
  KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
  KVM: SVM: Drop redundant svm_sev_enabled() helper
  KVM: SVM: Move SEV VMCB tracking allocation to sev.c
  KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
  KVM: SVM: Unconditionally invoke sev_hardware_teardown()
  KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
  KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
  KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
  KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
  KVM: SVM: Move SEV module params/variables to sev.c
  KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
  KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
  KVM: SVM: Zero out the VMCB array used to track SEV ASID association
  x86/sev: Drop redundant and potentially misleading 'sev_enabled'
  KVM: x86: Move reverse CPUID helpers to separate header file
  KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
  ...
2021-05-01 10:14:08 -07:00
Thomas Gleixner b2c67cbe9f time: Add mechanism to recognize clocksource in time_get_snapshot
System time snapshots do not convey information about the clocksource
which was used, but callers like the PTP KVM guest implementation need
to evaluate the clocksource type to select the appropriate mechanism.

Introduce a clocksource id field in struct clocksource which is by default
set to CSID_GENERIC (0). Clocksource implementations can set that field to
a value which allows identifying the clocksource.

Store the clocksource id of the current clocksource in the
system_time_snapshot so callers can evaluate which clocksource was used to
take the snapshot and act accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201209060932.212364-5-jianyong.wu@arm.com
2021-04-07 16:33:20 +01:00
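A sketch of the plumbing described above; CSID_GENERIC = 0 is from the text, the other enumerator is illustrative:

    enum clocksource_ids {
            CSID_GENERIC = 0,       /* default for all clocksources */
            CSID_EXAMPLE_COUNTER,   /* illustrative identifying value */
            CSID_MAX,
    };

    struct clocksource {
            /* ... */
            enum clocksource_ids id;    /* set by the clocksource driver */
    };

    struct system_time_snapshot {
            /* ... */
            enum clocksource_ids cs_id; /* clocksource used for the snapshot */
    };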
Ingo Molnar 4bf07f6562 timekeeping, clocksource: Fix various typos in comments
Fix ~56 single-word typos in timekeeping & clocksource code comments.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-kernel@vger.kernel.org
2021-03-22 23:06:48 +01:00
Arnd Bergmann 77f6c0b874 timekeeping: remove arch_gettimeoffset
With Arm EBSA110 gone, nothing uses it any more, so the corresponding
code and the Kconfig option can be removed.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-10-30 21:57:04 +01:00
Thomas Gleixner c7f3d43b62 clocksource: Remove obsolete ifdef
CONFIG_GENERIC_VDSO_CLOCK_MODE was a transitional config switch which got
removed after all architectures got converted to the new storage model.

But the removal forgot to remove the #ifdef which guards the
vdso_clock_mode sanity check, which effectively disables the sanity check.

Remove it now.

Fixes: f86fd32db7 ("lib/vdso: Cleanup clock mode storage leftovers")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200606221531.845475036@linutronix.de
2020-06-09 16:36:47 +02:00
Thomas Gleixner 5d51bee725 clocksource: Add common vdso clock mode storage
All architectures which use the generic VDSO code have their own storage
for the VDSO clock mode. That's pointless and just requires duplicate code.

Provide generic storage for it. The new Kconfig symbol is intermediate and
will be removed once all architectures are converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/20200207124403.028046322@linutronix.de
2020-02-17 14:40:23 +01:00
Konstantin Khlebnikov febac332a8 clocksource: Prevent double add_timer_on() for watchdog_timer
Kernel crashes inside QEMU/KVM are observed:

  kernel BUG at kernel/time/timer.c:1154!
  BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().

At the same time another cpu got:

  general protection fault: 0000 [#1] SMP PTI of poison pointer 0xdead000000000200 in:

  __hlist_del at include/linux/list.h:681
  (inlined by) detach_timer at kernel/time/timer.c:818
  (inlined by) expire_timers at kernel/time/timer.c:1355
  (inlined by) __run_timers at kernel/time/timer.c:1686
  (inlined by) run_timer_softirq at kernel/time/timer.c:1699

Unfortunately kernel logs are badly scrambled, stacktraces are lost.

Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().

The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():

expire_timers()
 detach_timer(timer, true);
  timer->entry.pprev = NULL;
 raw_spin_unlock_irq(&base->lock);
 call_timer_fn
  clocksource_watchdog()

					clocksource_watchdog_kthread() or
					clocksource_unbind()

					spin_lock_irqsave(&watchdog_lock, flags);
					clocksource_stop_watchdog();
					 del_timer(&watchdog_timer);
					 watchdog_running = 0;
					spin_unlock_irqrestore(&watchdog_lock, flags);

					spin_lock_irqsave(&watchdog_lock, flags);
					clocksource_start_watchdog();
					 add_timer_on(&watchdog_timer, ...);
					 watchdog_running = 1;
					spin_unlock_irqrestore(&watchdog_lock, flags);

  spin_lock(&watchdog_lock);
  add_timer_on(&watchdog_timer, ...);
   BUG_ON(timer_pending(timer) || !timer->function);
    timer_pending() -> true
    BUG()

I.e. inside clocksource_watchdog() the watchdog_timer could already be armed.

Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.
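
A sketch of the resulting re-arm path in clocksource_watchdog() (next_cpu
as computed earlier in the function):

    /* Re-arm for the next cycle only if a concurrent
     * clocksource_start_watchdog() did not already do so; with
     * watchdog_lock held, timer_pending() is a reliable test. */
    if (!timer_pending(&watchdog_timer)) {
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);
    }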

Fixes: 75c5158f70 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
2020-02-01 11:07:56 +01:00
Mathieu Malaterre 0f48b41f59 clocksource: Move inline keyword to the beginning of function declarations
The inline keyword was not at the beginning of the function declarations.
Fix the following warnings triggered when using W=1:

  kernel/time/clocksource.c:108:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
  kernel/time/clocksource.c:113:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
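
For illustration, the fix is purely a keyword-order change (the function
names here are made up):

    /* before: 'inline' after the return type warns under W=1 */
    static u64 inline clocksource_delta_old(u64 a, u64 b) { return a - b; }

    /* after: 'inline' moved to the beginning of the declaration */
    static inline u64 clocksource_delta_new(u64 a, u64 b) { return a - b; }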

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Cc: kernel-janitors@vger.kernel.org
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190524103339.28787-1-malat@debian.org
2019-06-14 17:04:03 +02:00
Thomas Gleixner 6c7811c628 time: Remove license boilerplate
The SPDX identifier already defines the license of the files. No need for
the boilerplate.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: David Riley <davidriley@chromium.org>
Cc: Colin Cross <ccross@android.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20181031182253.132458951@linutronix.de
2018-11-23 11:51:21 +01:00
Thomas Gleixner 35728b8209 time: Add SPDX license identifiers
Update the time(r) core files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of the
full boilerplate text.
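
For a GPL-2.0-only kernel source file, the whole license boilerplate
collapses into a single line at the top of the file:

    // SPDX-License-Identifier: GPL-2.0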

This work is based on a script and data from Philippe Ombredanne, Kate
Stewart and myself. The data has been created with two independent license
scanners and manual inspection.

The following files do not contain any direct license information and have
been omitted from the big initial SPDX changes:

  timeconst.bc: The .bc files were not touched
  time.c, timer.c, timekeeping.c: License was deduced from EXPORT_SYMBOL_GPL

As those files do not contain direct license references they fall under the
project license, i.e. GPL V2 only.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: David Riley <davidriley@chromium.org>
Cc: Colin Cross <ccross@android.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20181031182252.879109557@linutronix.de
2018-11-23 11:51:20 +01:00
Thomas Gleixner 58c5fc2b96 time: Remove useless filenames in top level comments
Remove the pointless filenames in the top level comments. They have no
value at all and just occupy space. While at it tidy up some of the
comments and remove a stale one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Riley <davidriley@chromium.org>
Cc: Colin Cross <ccross@android.com>
Cc: Mark Brown <broonie@kernel.org>
Link: https://lkml.kernel.org/r/20181031182252.794898238@linutronix.de
2018-11-23 11:51:20 +01:00
Thomas Gleixner d67f34c19a clocksource: Provide clocksource_arch_init()
Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations available
for this. Add support for that.
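
A hedged sketch of the hook's shape: architectures that need it provide
the function behind a config symbol, everyone else gets an empty stub
(the symbol name is an assumption):

    #ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
    void clocksource_arch_init(struct clocksource *cs);
    #else
    static inline void clocksource_arch_init(struct clocksource *cs) { }
    #endif

    /* Called from the registration path, so archdata can be checked
     * and initialized before the clocksource goes live. */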

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matt Rickard <matt@softrans.com.au>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: devel@linuxdriverproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/20180917130706.973042587@linutronix.de
2018-10-04 23:00:24 +02:00
Peter Zijlstra e2c631ba75 clocksource: Revert "Remove kthread"
It turns out that the silly spawn-kthread-from-worker construct was actually needed.

clocksource_watchdog_kthread() cannot be called directly from
clocksource_watchdog_work(), because clocksource_select() calls
timekeeping_notify() which uses stop_machine(). One cannot use
stop_machine() from a workqueue due to lock inversions wrt CPU hotplug.

Revert the patch but add a comment that explains why we jump through such
apparently silly hoops.
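
The restored shape, roughly:

    static void clocksource_watchdog_work(struct work_struct *work)
    {
        /*
         * clocksource_select() -> timekeeping_notify() -> stop_machine()
         * must not run from a workqueue due to lock inversions against
         * CPU hotplug, so punt the real work to a transient kthread.
         */
        kthread_run(clocksource_watchdog_kthread, NULL, "kwatchdog");
    }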

Fixes: 7197e77abc ("clocksource: Remove kthread")
Reported-by: Siegfried Metz <frame@mailbox.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Niklas Cassel <niklas.cassel@linaro.org>
Tested-by: Kevin Shanahan <kevin@shanahan.id.au>
Tested-by: viktor_jaegerskuepper@freenet.de
Tested-by: Siegfried Metz <frame@mailbox.org>
Cc: rafael.j.wysocki@intel.com
Cc: len.brown@intel.com
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: bjorn.andersson@linaro.org
Link: https://lkml.kernel.org/r/20180905084158.GR24124@hirez.programming.kicks-ass.net
2018-09-06 23:38:35 +02:00
Baolin Wang 39232ed5a1 time: Introduce one suspend clocksource to compensate the suspend time
On some hardware with multiple clocksources, we have coarse-grained
clocksources that support the CLOCK_SOURCE_SUSPEND_NONSTOP flag but are
less than ideal for timekeeping, whereas other clocksources can be better
candidates but halt on suspend.

Currently, the timekeeping core only supports timing suspend using
CLOCK_SOURCE_SUSPEND_NONSTOP clocksources if that clocksource is the
current clocksource for timekeeping.

As a result, some architectures try to implement read_persistent_clock64()
using those non-stop clocksources, but this isn't really ideal, as it
introduces more duplicate code. To fix this, provide logic to allow a
registered SUSPEND_NONSTOP clocksource, which isn't the current
clocksource, to be used to calculate the suspend time.
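
A simplified sketch of the selection this adds (function and variable
names are assumptions based on the description):

    /* Track the best nonstop clocksource for suspend timing, even if
     * it is not the clocksource used for regular timekeeping. */
    static void clocksource_suspend_select(void)
    {
        struct clocksource *cs;

        list_for_each_entry(cs, &clocksource_list, list) {
            if (!(cs->flags & CLOCK_SOURCE_SUSPEND_NONSTOP))
                continue;
            if (!suspend_clocksource ||
                cs->rating > suspend_clocksource->rating)
                suspend_clocksource = cs;
        }
    }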

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
[jstultz: minor tweaks to merge with previous resume changes]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2018-07-19 17:08:52 -07:00
Mathieu Malaterre db6f9e55c8 clocksource: Move inline keyword to the beginning of function declarations
The inline keyword was not at the beginning of the function declarations.
Fix the following warnings triggered when using W=1:

  kernel/time/clocksource.c:456:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
  kernel/time/clocksource.c:457:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180516195943.31924-1-malat@debian.org
2018-05-16 22:21:32 +02:00
Peter Zijlstra 7197e77abc clocksource: Remove kthread
The clocksource watchdog uses a work to spawn a kthread to run the
watchdog. That is about as silly as it sounds; run the watchdog
directly from the work.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Link: https://lkml.kernel.org/r/20180430100344.713862818@infradead.org
2018-05-02 16:11:46 +02:00
Peter Zijlstra 7dba33c634 clocksource: Rework stale comment
AFAICS the hotplug code no longer uses this function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Link: https://lkml.kernel.org/r/20180430100344.656525644@infradead.org
2018-05-02 16:10:41 +02:00
Peter Zijlstra cd2af07d82 clocksource: Consistent de-rate when marking unstable
When a registered clocksource gets marked unstable the watchdog_kthread
will de-rate and re-select the clocksource. Ensure it also de-rates
when called on an unregistered clocksource.
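
A sketch of the unified path in __clocksource_unstable(), using the
list_empty(&cs->list) registration test introduced by the related change
below:

    static void __clocksource_unstable(struct clocksource *cs)
    {
        cs->flags &= ~(CLOCK_SOURCE_VALID_FOR_HRES | CLOCK_SOURCE_WATCHDOG);
        cs->flags |= CLOCK_SOURCE_UNSTABLE;

        /* Unregistered clocksources are never re-rated by the watchdog
         * kthread, so de-rate them right here. */
        if (list_empty(&cs->list)) {
            cs->rating = 0;
            return;
        }
        ...
    }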

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180430100344.594904898@infradead.org
2018-05-02 16:10:41 +02:00
Peter Zijlstra 5b9e886a4a clocksource: Initialize cs->wd_list
A number of places rely on list_empty(&cs->wd_list), but the
list_head does not get initialized. Do so upon registration, such that
thereafter it is possible to rely on list_empty() correctly reflecting
the list membership status.
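
The fix itself is a one-liner in the registration path, roughly:

    /* Make list_empty(&cs->wd_list) meaningful before first enqueue. */
    INIT_LIST_HEAD(&cs->wd_list);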

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Diego Viola <diego.viola@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Link: https://lkml.kernel.org/r/20180430100344.472662715@infradead.org
2018-05-02 16:10:40 +02:00
Peter Zijlstra 2aae7bcfa4 clocksource: Allow clocksource_mark_unstable() on unregistered clocksources
Because of how the code flips between tsc-early and tsc clocksources
it might need to mark one or both unstable. The current code in
mark_tsc_unstable() only worked because previously it registered the
tsc clocksource once and then never touched it.

Since it now unregisters the tsc-early clocksource, it needs to know
if a clocksource got unregistered and the current cs->mult test
doesn't work for that. Instead use list_empty(&cs->list) to test for
registration.

Furthermore, since clocksource_mark_unstable() needs to place the cs
on the wd_list, this ties the serialization of cs->list and cs->wd_list
together. It must not see a clocksource registered (!empty cs->list) but
already
past dequeue_watchdog(). So place {en,de}queue{,_watchdog}() under the
same lock.

Provided cs->list is initialized to empty, this then allows us to
unconditionally use clocksource_mark_unstable(), regardless of the
registration state.
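
Putting the pieces together, clocksource_mark_unstable() then looks
roughly like this:

    void clocksource_mark_unstable(struct clocksource *cs)
    {
        unsigned long flags;

        spin_lock_irqsave(&watchdog_lock, flags);
        if (!(cs->flags & CLOCK_SOURCE_UNSTABLE)) {
            /* Only enqueue registered, not-yet-watched clocksources. */
            if (!list_empty(&cs->list) && list_empty(&cs->wd_list))
                list_add(&cs->wd_list, &watchdog_list);
            __clocksource_unstable(cs);
        }
        spin_unlock_irqrestore(&watchdog_lock, flags);
    }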

Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Diego Viola <diego.viola@gmail.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180502135312.GS12217@hirez.programming.kicks-ass.net
2018-05-02 16:10:40 +02:00
Baolin Wang 27263e8dc0 clocksource: Use ATTRIBUTE_GROUPS
Use ATTRIBUTE_GROUPS instead of manually creating the individual device
files.
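
A sketch of the pattern; ATTRIBUTE_GROUPS(clocksource) generates a
clocksource_groups array that can be hung off the device instead of
creating each file by hand (attribute names per the clocksource sysfs
interface):

    static struct attribute *clocksource_attrs[] = {
        &dev_attr_current_clocksource.attr,
        &dev_attr_unbind_clocksource.attr,
        &dev_attr_available_clocksource.attr,
        NULL
    };
    ATTRIBUTE_GROUPS(clocksource);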

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: sboyd@codeaurora.org
Cc: broonie@kernel.org
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/d80dccb981dc2461781ebb8d71a32ccdc1b0e6f9.1516167691.git.baolin.wang@linaro.org
2018-02-28 14:05:07 +01:00
Baolin Wang e87821d18c clocksource: Use DEVICE_ATTR_RW/RO/WO to define device attributes
Convert DEVICE_ATTR to DEVICE_ATTR_RW/RO/WO, which is the preferred and
simpler way to implement device attributes.
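
The convention: DEVICE_ATTR_RW(name) wires up name_show() and
name_store() automatically. A sketch for the current_clocksource
attribute:

    static ssize_t current_clocksource_show(struct device *dev,
                                            struct device_attribute *attr,
                                            char *buf)
    {
        /* ... format the current clocksource name into buf ... */
    }

    static ssize_t current_clocksource_store(struct device *dev,
                                             struct device_attribute *attr,
                                             const char *buf, size_t count)
    {
        /* ... parse the name and request the override ... */
    }

    static DEVICE_ATTR_RW(current_clocksource);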

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: sboyd@codeaurora.org
Cc: broonie@kernel.org
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/8f35c77e753e957b61187e8e7b2e4a3d61e4a72b.1516167691.git.baolin.wang@linaro.org
2018-02-28 14:04:52 +01:00
Baolin Wang 7f852afe44 clocksource: Don't walk the clocksource list for empty override
If the override clocksource name is empty, there is no point in walking the
clocksource list for a match.
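
The short-circuit is a trivial early check in the selection path
(sketched; the exact control flow upstream may differ):

    /* No override requested -> skip the list walk entirely. */
    if (!strlen(override_name))
        return;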

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: sboyd@codeaurora.org
Cc: broonie@kernel.org
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/069ce2a605546bcad6552968cff755f0a03f9f10.1516167691.git.baolin.wang@linaro.org
2018-02-28 14:04:52 +01:00
Kees Cook e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Kees Cook b9eaf18722 treewide: init_timer() -> setup_timer()
This mechanically converts all remaining cases of ancient open-coded timer
setup with the old setup_timer() API, which is the first step in timer
conversions. This has no behavioral changes, since it ultimately just
changes the order of assignment to fields of struct timer_list when
finding variations of:

    init_timer(&t);
    t.function = timer_callback;
    t.data = timer_callback_arg;

to be converted into:

    setup_timer(&t, timer_callback, timer_callback_arg);

The conversion is done with the following Coccinelle script, which
is an improved version of scripts/cocci/api/setup_timer.cocci, in the
following ways:
 - assignments-before-init_timer() cases
 - limit the .data case removal to the specific struct timer_list instance
 - handling calls by dereference (timer->field vs timer.field)

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/setup_timer.cocci

@fix_address_of@
expression e;
@@

 init_timer(
-&(e)
+&e
 , ...)

// Match the common cases first to avoid Coccinelle parsing loops with
// "... when" clauses.

@match_immediate_function_data_after_init_timer@
expression e, func, da;
@@

-init_timer
+setup_timer
 ( \(&e\|e\)
+, func, da
 );
(
-\(e.function\|e->function\) = func;
-\(e.data\|e->data\) = da;
|
-\(e.data\|e->data\) = da;
-\(e.function\|e->function\) = func;
)

@match_immediate_function_data_before_init_timer@
expression e, func, da;
@@

(
-\(e.function\|e->function\) = func;
-\(e.data\|e->data\) = da;
|
-\(e.data\|e->data\) = da;
-\(e.function\|e->function\) = func;
)
-init_timer
+setup_timer
 ( \(&e\|e\)
+, func, da
 );

@match_function_and_data_after_init_timer@
expression e, e2, e3, e4, e5, func, da;
@@

-init_timer
+setup_timer
 ( \(&e\|e\)
+, func, da
 );
 ... when != func = e2
     when != da = e3
(
-e.function = func;
... when != da = e4
-e.data = da;
|
-e->function = func;
... when != da = e4
-e->data = da;
|
-e.data = da;
... when != func = e5
-e.function = func;
|
-e->data = da;
... when != func = e5
-e->function = func;
)

@match_function_and_data_before_init_timer@
expression e, e2, e3, e4, e5, func, da;
@@
(
-e.function = func;
... when != da = e4
-e.data = da;
|
-e->function = func;
... when != da = e4
-e->data = da;
|
-e.data = da;
... when != func = e5
-e.function = func;
|
-e->data = da;
... when != func = e5
-e->function = func;
)
... when != func = e2
    when != da = e3
-init_timer
+setup_timer
 ( \(&e\|e\)
+, func, da
 );

@r1 exists@
expression t;
identifier f;
position p;
@@

f(...) { ... when any
  init_timer@p(\(&t\|t\))
  ... when any
}

@r2 exists@
expression r1.t;
identifier g != r1.f;
expression e8;
@@

g(...) { ... when any
  \(t.data\|t->data\) = e8
  ... when any
}

// It is dangerous to use setup_timer if data field is initialized
// in another function.
@script:python depends on r2@
p << r1.p;
@@

cocci.include_match(False)

@r3@
expression r1.t, func, e7;
position r1.p;
@@

(
-init_timer@p(&t);
+setup_timer(&t, func, 0UL);
... when != func = e7
-t.function = func;
|
-t.function = func;
... when != func = e7
-init_timer@p(&t);
+setup_timer(&t, func, 0UL);
|
-init_timer@p(t);
+setup_timer(t, func, 0UL);
... when != func = e7
-t->function = func;
|
-t->function = func;
... when != func = e7
-init_timer@p(t);
+setup_timer(t, func, 0UL);
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:06 -08:00
Peter Zijlstra b421b22b00 x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points
Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-CPU state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-CPU state is also borked.

So hook into the clocksource watchdog and call a method after we've
found it to still be stable.

There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, by definition, fully avoid funny clock
values.

And since the watchdog runs less often than the tick, this is also an
optimization.
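
A hedged sketch of the hook in the watchdog loop, assuming a
cs->tick_stable() style callback:

    /* After a clocksource passed the watchdog check: let the current
     * clocksource run its stable-tick action, e.g. the TSC uses this
     * to resync the sched_clock per-CPU state. */
    if (cs == curr_clocksource && cs->tick_stable)
        cs->tick_stable(cs);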

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-15 10:15:18 +02:00
Thomas Gleixner 12907fbb1a sched/clock, clocksource: Add optional cs::mark_unstable() method
PeterZ reported that we'd fail to mark the TSC unstable when the
clocksource watchdog finds it unsuitable.

Allow a clocksource to run a custom action when it is being marked
unstable and hook up the TSC unstable code.
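
A sketch of the optional method and its invocation:

    struct clocksource {
        /* ... */
        void (*mark_unstable)(struct clocksource *cs);  /* optional */
    };

    /* In the mark-unstable path: give the clocksource a chance to run
     * its own action, e.g. the TSC code marks the TSC unstable. */
    if (cs->mark_unstable)
        cs->mark_unstable(cs);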

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:29:43 +01:00