Commit Graph

298 Commits

Author SHA1 Message Date
Waiman Long 6b484a545e rcu: Support direct wake-up of synchronize_rcu() users
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 462df2f543ae360e79fcaa1b498d2a1a0c2a5b63
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:07 +0100

    rcu: Support direct wake-up of synchronize_rcu() users

    This patch introduces a small enhancement that allows a direct
    wake-up of synchronize_rcu() callers. The wake-up occurs after a
    grace period completes and is therefore performed by the GP kthread.

    The number of directly awakened clients is limited by a hard-coded
    threshold. The remaining callers, if any, are deferred to a main
    worker.

    Link: https://lore.kernel.org/lkml/Zd0ZtNu+Rt0qXkfS@lothringen/

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:38 -04:00
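For illustration, the sketch below shows the shape of the mechanism described above: a GP-completion path that wakes a bounded number of synchronize_rcu() waiters directly and defers the rest to a workqueue worker. This is a minimal sketch, not the upstream code; the names sr_wait_node, sr_waiters, sr_cleanup_work, and SR_MAX_USERS_WAKE_FROM_GP are assumptions for illustration only.

    /* Illustrative sketch only, not the upstream implementation. */
    #include <linux/completion.h>
    #include <linux/llist.h>
    #include <linux/workqueue.h>

    #define SR_MAX_USERS_WAKE_FROM_GP 5   /* assumed hard-coded threshold */

    struct sr_wait_node {
            struct llist_node node;
            struct completion done;
    };

    static LLIST_HEAD(sr_waiters);              /* synchronize_rcu() callers */
    static struct work_struct sr_cleanup_work;  /* assumed to be INIT_WORK()ed elsewhere */

    /* Called by the GP kthread once a grace period has completed. */
    static void sr_normal_gp_cleanup(void)
    {
            struct llist_node *node;
            int done = 0;

            /* Wake up to the threshold number of waiters directly. */
            while (done < SR_MAX_USERS_WAKE_FROM_GP &&
                   (node = llist_del_first(&sr_waiters))) {
                    complete(&llist_entry(node, struct sr_wait_node, node)->done);
                    done++;
            }

            /* Defer any remaining waiters to the main worker. */
            if (!llist_empty(&sr_waiters))
                    queue_work(system_highpri_wq, &sr_cleanup_work);
    }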
Waiman Long 3097ec69ae rcu: Add data structures for synchronize_rcu()
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit dfd458a95d78ce31855fe06bbfde4f4fe60c40db
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:04 +0100

    rcu: Add data structures for synchronize_rcu()

    The synchronize_rcu() call is going to be reworked, thus
    this patch adds dedicated fields into the rcu_state structure.

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:36 -04:00
Waiman Long 7804cac54b rcu: Make hotplug operations track GP state, not flags
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit ae2b217ab542d0db0ca1a6de4f442201a1982f00
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 8 Mar 2024 11:15:01 -0800

    rcu: Make hotplug operations track GP state, not flags

    Currently, there are rcu_data structure fields named ->rcu_onl_gp_flags
    and ->rcu_ofl_gp_flags that track the rcu_state.gp_flags field at the
    time of the corresponding CPU's last online or offline operation,
    respectively.  However, this information is not particularly useful.
    It would be better to instead track the grace period state kept
    in rcu_state.gp_state.  This would also be consistent with the
    initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED
    (an rcu_state.gp_state value), and also with the diagnostics in
    rcu_implicit_dynticks_qs(), whose format is consistent with an integer,
    not a bitmask.

    This commit therefore makes this change and changes the names to
    ->rcu_onl_gp_state and ->rcu_ofl_gp_state, respectively.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:32 -04:00
Waiman Long c0a9325f29 rcu/exp: Remove rcu_par_gp_wq
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 23da2ad64dbe9f3fab10af90484fe41e144337b1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:21 +0100

    rcu/exp: Remove rcu_par_gp_wq

    TREE04 running on short iterations can produce writer stalls of the
    following kind:

     ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
     task:rcu_torture_wri state:D stack:14568 pid:83    ppid:2      flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2de/0x850
      ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
      schedule+0x4f/0x90
      synchronize_rcu_expedited+0x430/0x670
      ? __pfx_autoremove_wake_function+0x10/0x10
      ? __pfx_synchronize_rcu_expedited+0x10/0x10
      do_rtws_sync.constprop.0+0xde/0x230
      rcu_torture_writer+0x4b4/0xcd0
      ? __pfx_rcu_torture_writer+0x10/0x10
      kthread+0xc7/0xf0
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x2f/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1b/0x30
      </TASK>

    Waiting for an expedited grace period and polling for an expedited
    grace period are both operations that internally rely on the same
    workqueue performing the necessary asynchronous work.

    However, a dependency chain is involved between those two operations,
    as depicted below:

           ====== CPU 0 =======                          ====== CPU 1 =======

                                                         synchronize_rcu_expedited()
                                                             exp_funnel_lock()
                                                                 mutex_lock(&rcu_state.exp_mutex);
        start_poll_synchronize_rcu_expedited
            queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                             synchronize_rcu_expedited_queue_work()
                                                                 queue_work(rcu_gp_wq, &rew->rew_work);
                                                             wait_event() // A, wait for &rew->rew_work completion
                                                             mutex_unlock() // B
        //======> switch to kworker

        sync_rcu_do_polled_gp() {
            synchronize_rcu_expedited()
                exp_funnel_lock()
                    mutex_lock(&rcu_state.exp_mutex); // C, wait B
                    ....
        } // D

    Since workqueues are usually implemented on top of several kworkers
    handling the queue concurrently, the above situation wouldn't deadlock
    most of the time because A then doesn't depend on D. But under memory
    stress, a single kworker may end up handling all the work items alone,
    in serialized fashion. In that case the above layout becomes a problem
    because A then waits for D, closing a circular dependency:

            A -> D -> C -> B -> A

    This, however, only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed,
    synchronize_rcu_expedited() is otherwise implemented on top of a kthread
    worker while polling still relies on the rcu_gp_wq workqueue, breaking
    the above circular dependency chain.

    Fix this by making the expedited grace period always rely on kthread
    workers. The workqueue-based implementation is essentially a duplicate
    anyway now that the per-node initialization is performed by per-node
    kthread workers.

    Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
    manage the scheduler policy of these kthread workers.

    Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
    Reported-by: Thomas Gleixner <tglx@linutronix.de>
    Suggested-by: Joel Fernandes <joel@joelfernandes.org>
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:15 -04:00
Waiman Long d5ad8ad294 rcu/exp: Make parallel exp gp kworker per rcu node
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 8e5e621566485a3e160c0d8bfba206cb1d6b980d
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:19 +0100

    rcu/exp: Make parallel exp gp kworker per rcu node

    When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period's per-node
    initialization is performed in parallel via workqueues (one work item
    per node).

    However, with CONFIG_RCU_EXP_KTHREAD=y, this per-node initialization is
    performed by a single kworker that serializes the node initializations
    (one work item for all nodes).

    The latter is certainly less scalable and less efficient beyond a single
    leaf node.

    To improve this, expand the single kworker into per-node kworkers. This
    new layout is eventually intended to replace the workqueue-based
    implementation, which will then essentially be duplicate code.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:13 -04:00
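A minimal sketch of the per-node layout described above, assuming a kthread_worker pointer hanging off each leaf rcu_node; the exp_kworker field and function names are assumptions for illustration, not the upstream identifiers.

    /* Illustrative sketch only, not the upstream code. */
    #include <linux/err.h>
    #include <linux/kthread.h>

    /* Spawn one kthread_worker per leaf rcu_node instead of sharing one. */
    static void sketch_spawn_exp_par_worker(struct rcu_node *rnp)
    {
            struct kthread_worker *kworker;

            kworker = kthread_create_worker(0, "rcu_exp_par_gp_kworker/%d",
                                            rnp->grplo);
            if (IS_ERR_OR_NULL(kworker))
                    return;
            WRITE_ONCE(rnp->exp_kworker, kworker);   /* assumed field */
    }

    /* Queue that node's expedited-GP initialization to its own worker. */
    static void sketch_queue_exp_par_init(struct rcu_node *rnp,
                                          struct kthread_work *work)
    {
            kthread_queue_work(READ_ONCE(rnp->exp_kworker), work);
    }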
Waiman Long 119acfe64c rcu: s/boost_kthread_mutex/kthread_mutex
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 7836b270607676ed1c0c6a4a840a2ede9437a6a1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:17 +0100

    rcu: s/boost_kthread_mutex/kthread_mutex

    This mutex currently protects per-node boost kthread creation and
    affinity setting across CPU hotplug operations.

    Since the expedited kworkers will soon be split per node as well, they
    will be subject to the same concurrency constraints against hotplug.

    Therefore their creation and affinity tuning will be grouped with those
    of the boost kthreads and rely on the same mutex.

    To prepare for that, generalize its name.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:12 -04:00
Waiman Long be64d66573 rcu/nocb: Re-arrange call_rcu() NOCB specific code
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit afd4e6964745ed98b74cacdcce21d73280a0a253
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 9 Jan 2024 23:24:01 +0100

    rcu/nocb: Re-arrange call_rcu() NOCB specific code

    Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
    code in a complicated way such that:

    * The bypass enqueue code may or may not have enqueued and may or may
      not have locked the ->nocb_lock. Everything that follows is in a
      Schrödinger locking state for the unwary reviewer's eyes.

    * The was_alldone local variable is always set but is used only in
      NOCB-related code.

    * The NOCB wake-up is only distantly related to the locking hopefully
      performed by the bypass enqueue code that did not enqueue on the
      bypass list.

    Unconfuse the whole thing by gathering the NOCB and !NOCB specific
    enqueue code into their own functions.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:09 -04:00
Waiman Long b91a2b524c rcu/tree: Defer setting of jiffies during stall reset
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit b96e7a5fa0ba9cda32888e04f8f4bac42d49a7f8
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Tue, 5 Sep 2023 00:02:11 +0000

    rcu/tree: Defer setting of jiffies during stall reset

    There are instances where rcu_cpu_stall_reset() is called when jiffies
    has not had a chance to update for a long time. Before jiffies is
    updated, the CPU stall detector can go off, triggering false positives
    where a just-started grace period appears to be ages old. In the past,
    stall detection was disabled in rcu_cpu_stall_reset(), but this was
    changed [1]. This results in false positives in the KGDB use case [2].

    Fix this by deferring the update of jiffies to the third run of the FQS
    loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
    just before jiffies is read, we would end up pushing out the jiffies
    read by 3 more FQS loops. Meanwhile the CPU stall detection will be
    delayed and we will not get any false positives.

    [1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/
    [2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/

    Tested with rcutorture.cpu_stall option as well to verify stall behavior
    with/without patch.

    Tested-by: Huacai Chen <chenhuacai@loongson.cn>
    Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
    Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
    Suggested-by: Paul  McKenney <paulmck@kernel.org>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:18 -04:00
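A condensed sketch of the deferral described above, under the assumption that rcu_state carries a small countdown (here called nr_fqs_jiffies_stall) that the FQS loop consumes before it refreshes the stall timeout; the field name and helper placement are illustrative assumptions.

    /* Illustrative sketch only, not the upstream implementation. */

    /* Called e.g. by KGDB paths when jiffies may be long out of date. */
    static void sketch_cpu_stall_reset(void)
    {
            /* Re-arm the deferral: skip the next few FQS passes. */
            WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, 3);  /* assumed field */
            WRITE_ONCE(rcu_state.jiffies_stall, ULONG_MAX);
    }

    /* In the force-quiescent-state (FQS) loop: */
    static void sketch_fqs_check_stall(void)
    {
            int left = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);

            if (left > 0) {
                    WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, left - 1);
                    if (left == 1)
                            /* Third pass: jiffies has had time to advance. */
                            WRITE_ONCE(rcu_state.jiffies_stall,
                                       jiffies + rcu_jiffies_till_stall_check());
            }
    }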
Waiman Long 576c46a4e4 rcu: Add RCU stall diagnosis information
JIRA: https://issues.redhat.com/browse/RHEL-5228
Conflicts: Upstream merge conflicts in kernel/rcu/{rcu.h,update.c}
	   and Documentation/admin-guide/kernel-parameters.txt with
	   commit 92987fe8bdd1 ("rcu: Allow expedited RCU CPU stall
	   warnings to dump task stacks").  Resolved according
	   to upstream merge commit bba8d3d17dc2 ("Merge branch
	   'stall.2023.01.09a' into HEAD").

commit be42f00b73a0f50710d16eb7cb4efda0cce062dd
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Sat, 19 Nov 2022 17:25:06 +0800

    rcu: Add RCU stall diagnosis information

    Because RCU CPU stall warnings are driven from the scheduling-clock
    interrupt handler, a workload consisting of a very large number of
    short-duration hardware interrupts can result in misleading stall-warning
    messages.  On systems supporting only a single level of interrupts,
    that is, where interrupt handlers cannot be interrupted, this can
    produce misleading diagnostics.  The stack traces will show the
    innocent-bystander interrupted task, not the interrupts that are
    at the very least exacerbating the stall.

    This situation can be improved by displaying the number of interrupts
    and the CPU time that they have consumed.  Diagnosing other types
    of stalls can be eased by also providing the count of softirqs and
    the CPU time that they consumed as well as the number of context
    switches and the task-level CPU time consumed.

    Consider the following output given this change:

    rcu: INFO: rcu_preempt self-detected stall on CPU
    rcu:     0-....: (1250 ticks this GP) <omitted>
    rcu:          hardirqs   softirqs   csw/system
    rcu:  number:      624         45            0
    rcu: cputime:       69          1         2425   ==> 2500(ms)

    This output shows that the number of hard and soft interrupts is small,
    there are no context switches, and almost all of the CPU time is
    consumed at the system level. This indicates that the current task is
    looping with preemption disabled.

    The impact on system performance is negligible because the snapshot is
    recorded only once per continuous RCU stall.

    This added debugging information is suppressed by default and can be
    enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or
    by booting with rcupdate.rcu_cpu_stall_cputime=1.

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-09-22 13:21:38 -04:00
Waiman Long 3f47536644 rcu: Make call_rcu() lazy to save power
JIRA: https://issues.redhat.com/browse/RHEL-5228

commit 3cb278e73be58bfb780ecd55129296d2f74c1fb7
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sun, 16 Oct 2022 16:22:54 +0000

    rcu: Make call_rcu() lazy to save power

    Implement timer-based RCU callback batching (also known as lazy
    callbacks). With this we save about 5-10% of the power consumed due
    to RCU requests that happen when the system is lightly loaded or idle.

    By default, all async callbacks (queued via call_rcu) are marked
    lazy. An alternate API call_rcu_hurry() is provided for the few users,
    for example synchronize_rcu(), that need the old behavior.

    The batch is flushed whenever a certain amount of time has passed, or
    the batch on a particular CPU grows too big. A future patch will also
    flush it under memory pressure.

    To handle several corner cases automagically (such as rcu_barrier() and
    hotplug), we re-use bypass lists which were originally introduced to
    address lock contention, to handle lazy CBs as well. The bypass list
    length has the lazy CB length included in it. A separate lazy CB length
    counter is also introduced to keep track of the number of lazy CBs.

    [ paulmck: Fix formatting of inline call_rcu_lazy() definition. ]
    [ paulmck: Apply Zqiang feedback. ]
    [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

    Suggested-by: Paul McKenney <paulmck@kernel.org>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-09-22 13:21:15 -04:00
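call_rcu() and call_rcu_hurry() are the public APIs mentioned above. The fragment below is a small usage sketch; the foo structure and its release helpers are hypothetical and stand in for any RCU-protected object.

    /* Usage sketch: hypothetical data structure freed after a grace period. */
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            int value;
            struct rcu_head rh;
    };

    static void foo_free_rcu(struct rcu_head *rh)
    {
            kfree(container_of(rh, struct foo, rh));
    }

    static void foo_release(struct foo *fp)
    {
            /* Default: may be batched (lazy) to save power on an idle system. */
            call_rcu(&fp->rh, foo_free_rcu);
    }

    static void foo_release_urgent(struct foo *fp)
    {
            /* Caller needs the callback soon (e.g. it will wait on it),
             * so bypass the lazy batching. */
            call_rcu_hurry(&fp->rh, foo_free_rcu);
    }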
Waiman Long 703b79599b rcu: Fix missing nocb gp wake on rcu_barrier()
JIRA: https://issues.redhat.com/browse/RHEL-5228

commit b8f7aca3f0e0e6223094ba2662bac90353674b04
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Sun, 16 Oct 2022 16:22:53 +0000

    rcu: Fix missing nocb gp wake on rcu_barrier()

    In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
    needed after an entrain.  This change prevents the RCU barrier callback
    from waiting in the queue for several seconds before the lazy callbacks
    in front of it are serviced.

    Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-09-22 13:21:10 -04:00
Waiman Long e580bb0d98 rcu: Add polled expedited grace-period primitives
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit d96c52fe4907c68adc5e61a0bef7aec0933223d5
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 15 Apr 2022 10:55:42 -0700

    rcu: Add polled expedited grace-period primitives

    This commit adds expedited grace-period functionality to RCU's polled
    grace-period API, adding start_poll_synchronize_rcu_expedited() and
    cond_synchronize_rcu_expedited(), which are similar to the existing
    start_poll_synchronize_rcu() and cond_synchronize_rcu() functions,
    respectively.

    Note that although start_poll_synchronize_rcu_expedited() can be invoked
    very early, the resulting expedited grace periods are not guaranteed
    to start until after workqueues are fully initialized.  On the other
    hand, both synchronize_rcu() and synchronize_rcu_expedited() can also
    be invoked very early, and the resulting grace periods will be taken
    into account as they occur.

    [ paulmck: Apply feedback from Neeraj Upadhyay. ]

    Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
    Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
    Cc: Brian Foster <bfoster@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Ian Kent <raven@themaw.net>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:47:52 -04:00
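A short usage sketch of the two primitives named above, alongside the pre-existing poll_state_synchronize_rcu(); the begin/finish helpers and the single shared cookie are illustrative only (a real user would typically store the cookie per object).

    /* Usage sketch for the polled expedited grace-period API. */
    #include <linux/rcupdate.h>

    static unsigned long snap;   /* illustrative: one global cookie */

    static void begin_update(void)
    {
            /* Snapshot the GP state and kick off an expedited grace period. */
            snap = start_poll_synchronize_rcu_expedited();
    }

    static void finish_update(void)
    {
            /* Fast path: the grace period may already have elapsed. */
            if (poll_state_synchronize_rcu(snap))
                    return;

            /* Otherwise wait, using an expedited grace period if needed. */
            cond_synchronize_rcu_expedited(snap);
    }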
Waiman Long ce330fc3bc rcu: Make polled grace-period API account for expedited grace periods
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit dd04140531b5d38b77ad9ff7b18117654be5bf5c
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 14 Apr 2022 06:56:35 -0700

    rcu: Make polled grace-period API account for expedited grace periods

    Currently, this code could splat:

            oldstate = get_state_synchronize_rcu();
            synchronize_rcu_expedited();
            WARN_ON_ONCE(!poll_state_synchronize_rcu(oldstate));

    This situation is counter-intuitive and user-unfriendly.  After all, there
    really was a perfectly valid full grace period right after the call to
    get_state_synchronize_rcu(), so why shouldn't poll_state_synchronize_rcu()
    know about it?

    This commit therefore makes the polled grace-period API aware of expedited
    grace periods in addition to the normal grace periods that it is already
    aware of.  With this change, the above code is guaranteed not to splat.

    Please note that the above code can still splat due to counter wrap on the
    one hand and situations involving partially overlapping normal/expedited
    grace periods on the other.  On 64-bit systems, the second is of course
    much more likely than the first.  It is possible to modify this approach
    to prevent overlapping grace periods from causing splats, but only at
    the expense of greatly increasing the probability of counter wrap, as
    in within milliseconds on 32-bit systems and within minutes on 64-bit
    systems.

    This commit is in preparation for polled expedited grace periods.

    Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
    Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
    Cc: Brian Foster <bfoster@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Ian Kent <raven@themaw.net>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:47:51 -04:00
Waiman Long 7df8a78b55 rcu: Switch polled grace-period APIs to ->gp_seq_polled
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit bf95b2bc3e42f11f4d7a5e8a98376c2b4a2aa82f
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed, 13 Apr 2022 17:46:15 -0700

    rcu: Switch polled grace-period APIs to ->gp_seq_polled

    This commit switches the existing polled grace-period APIs to use a
    new ->gp_seq_polled counter in the rcu_state structure.  An additional
    ->gp_seq_polled_snap counter in that same structure allows the normal
    grace period kthread to interact properly with the !SMP !PREEMPT fastpath
    through synchronize_rcu().  The first of the two to note the end of a
    given grace period will make knowledge of this transition available to
    the polled API.

    This commit is in preparation for polled expedited grace periods.

    [ paulmck: Fix use of rcu_state.gp_seq_polled to start normal grace period. ]

    Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
    Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
    Cc: Brian Foster <bfoster@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Ian Kent <raven@themaw.net>
    Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:47:51 -04:00
Waiman Long 3cd6c37180 rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 5103850654fdc651f0a7076ac753b958f018bb85
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Fri, 29 Apr 2022 20:42:22 +0800

    rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()

    Callbacks are invoked in RCU kthreads when callbacks are offloaded
    (rcu_nocbs boot parameter) or when RCU's softirq handler has been
    offloaded to rcuc kthreads (use_softirq==0).  The current code allows
    for the rcu_nocbs case but not the use_softirq case.  This commit adds
    support for the use_softirq case.

    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:22 -04:00
Waiman Long 8c6af96a89 rcu/nocb: Add/del rdp to iterate from rcuog itself
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 1598f4a4762be0ea6a1bcd229c2c9ff1ebb212bb
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 19 Apr 2022 14:23:18 +0200

    rcu/nocb: Add/del rdp to iterate from rcuog itself

    NOCB rdps are part of a group whose list is iterated by the
    corresponding rdp leader.

    This list is RCU traversed because an rdp can be either added or
    deleted concurrently. Upon addition, a new iteration to the list after
    a synchronization point (a pair of LOCK/UNLOCK ->nocb_gp_lock) is forced
    to make sure:

    1) we didn't miss a new element added in the middle of an iteration
    2) we didn't ignore a whole subset of the list due to an element being
       quickly deleted and then re-added.
    3) we guard against other possible surprises...

    Although this layout is expected to be safe, it doesn't help anybody
    sleep well.

    Instead, simplify the nocb state toggling by moving the list
    modification from the nocb (de-)offloading workqueue to the rcuog
    kthreads.

    Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP
    flag of a target rdp, the latter is queued so that the leader handles
    the flag flip along with adding the target rdp to, or deleting it from,
    the list to iterate. This way the list modification and iteration happen
    from the same kthread, and those operations cannot race with each other.

    As a bonus, the flags for each rdp don't need to be checked locklessly
    before each iteration, which is one less opportunity to produce
    nightmares.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:21 -04:00
Waiman Long 5b925bf582 rcu/context-tracking: Move RCU-dynticks internal functions to context_tracking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 172114552701b85d5c3b1a089a73ee85d0d7786b
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 8 Jun 2022 16:40:33 +0200

    rcu/context-tracking: Move RCU-dynticks internal functions to context_tracking

    Move the core RCU eqs/dynticks functions to context tracking so that
    we can later merge all that code within context tracking.

    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
    Cc: Yu Liao <liaoyu15@huawei.com>
    Cc: Phil Auld <pauld@redhat.com>
    Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
    Cc: Alex Belits <abelits@marvell.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:18 -04:00
Waiman Long e0440c243a rcu/context_tracking: Move dynticks_nmi_nesting to context tracking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 95e04f48ec0a634e2f221081f5fa1a904755f326
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 8 Jun 2022 16:40:31 +0200

    rcu/context_tracking: Move dynticks_nmi_nesting to context tracking

    The RCU eqs tracking is going to be performed by the context tracking
    subsystem. The related nesting counters thus need to be moved to the
    context tracking structure.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
    Cc: Yu Liao <liaoyu15@huawei.com>
    Cc: Phil Auld <pauld@redhat.com>
    Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
    Cc: Alex Belits <abelits@marvell.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:17 -04:00
Waiman Long c1013cee1d rcu/context_tracking: Move dynticks_nesting to context tracking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 904e600e60f46f92eb4bcfb95788b1fedf7e8237
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 8 Jun 2022 16:40:30 +0200

    rcu/context_tracking: Move dynticks_nesting to context tracking

    The RCU eqs tracking is going to be performed by the context tracking
    subsystem. The related nesting counters thus need to be moved to the
    context tracking structure.

    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
    Cc: Yu Liao <liaoyu15@huawei.com>
    Cc: Phil Auld <pauld@redhat.com>
    Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
    Cc: Alex Belits <abelits@marvell.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:17 -04:00
Waiman Long 8640b64310 rcu/context_tracking: Move dynticks counter to context tracking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2169516

commit 62e2412df4b90ae6706ce1f1a9649b789b2e44ef
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 8 Jun 2022 16:40:29 +0200

    rcu/context_tracking: Move dynticks counter to context tracking

    In order to prepare for merging RCU dynticks counter into the context
    tracking state, move the rcu_data's dynticks field to the context
    tracking structure. It will later be mixed within the context tracking
    state itself.

    [ paulmck: Move enum ctx_state into global scope. ]

    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
    Cc: Yu Liao <liaoyu15@huawei.com>
    Cc: Phil Auld <pauld@redhat.com>
    Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
    Cc: Alex Belits <abelits@marvell.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
    Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-03-30 08:36:17 -04:00
Waiman Long d45fbffb5b rcu: Move expedited grace period (GP) work to RT kthread_worker
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491
Conflicts:
 1) A merge conflict in kernel/rcu/rcu.h due to upstream merge conflict
    with commit 99d6a2acb895 ("rcutorture: Suppress debugging grace
    period delays during flooding"). Manually merge according to upstream
    merge commit ce13389053a3.
 2) A fuzz in kernel/rcu/tree.c due to upstream merge conflict with
    commit 87c5adf06bfb ("rcu/nocb: Initialize nocb kthreads only
    for boot CPU prior SMP initialization") and commit 3352911fa9b4
    ("rcu: Initialize boost kthread only for boot node prior SMP
    initialization"). See upstream merge commit ce13389053a3.

commit 9621fbee44df940e2e1b94b0676460a538dffefa
Author: Kalesh Singh <kaleshsingh@google.com>
Date:   Fri, 8 Apr 2022 17:35:27 -0700

    rcu: Move expedited grace period (GP) work to RT kthread_worker

    Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
    latency because its workqueues run at SCHED_OTHER, and thus can be
    delayed by normal processes.  This commit avoids these delays by moving
    the expedited GP work items to a real-time-priority kthread_worker.

    This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by
    default on PREEMPT_RT=y kernels which disable expedited grace periods
    after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1.

    The results were evaluated on arm64 Android devices (6 GB RAM) running
    a 5.10 kernel, capturing trace data in critical user-level code.

    The table below shows the resulting order-of-magnitude improvements
    in synchronize_rcu_expedited() latency:

    ------------------------------------------------------------------------
    |                          |   workqueues  |  kthread_worker |  Diff   |
    ------------------------------------------------------------------------
    | Count                    |          725  |            688  |         |
    ------------------------------------------------------------------------
    | Min Duration       (ns)  |          326  |            447  |  37.12% |
    ------------------------------------------------------------------------
    | Q1                 (ns)  |       39,428  |         38,971  |  -1.16% |
    ------------------------------------------------------------------------
    | Q2 - Median        (ns)  |       98,225  |         69,743  | -29.00% |
    ------------------------------------------------------------------------
    | Q3                 (ns)  |      342,122  |        126,638  | -62.98% |
    ------------------------------------------------------------------------
    | Max Duration       (ns)  |  372,766,967  |      2,329,671  | -99.38% |
    ------------------------------------------------------------------------
    | Avg Duration       (ns)  |    2,746,353  |        151,242  | -94.49% |
    ------------------------------------------------------------------------
    | Standard Deviation (ns)  |   19,327,765  |        294,408  |         |
    ------------------------------------------------------------------------

    The table below shows the range of maximums/minimums for
    synchronize_rcu_expedited() latency from all experiments:

    ------------------------------------------------------------------------
    |                          |   workqueues  |  kthread_worker |  Diff   |
    ------------------------------------------------------------------------
    | Total No. of Experiments |           25  |             23  |         |
    ------------------------------------------------------------------------
    | Largest  Maximum   (ns)  |  372,766,967  |      2,329,671  | -99.38% |
    ------------------------------------------------------------------------
    | Smallest Maximum   (ns)  |       38,819  |         86,954  | 124.00% |
    ------------------------------------------------------------------------
    | Range of Maximums  (ns)  |  372,728,148  |      2,242,717  |         |
    ------------------------------------------------------------------------
    | Largest  Minimum   (ns)  |       88,623  |         27,588  | -68.87% |
    ------------------------------------------------------------------------
    | Smallest Minimum   (ns)  |          326  |            447  |  37.12% |
    ------------------------------------------------------------------------
    | Range of Minimums  (ns)  |       88,297  |         27,141  |         |
    ------------------------------------------------------------------------

    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Reported-by: Tim Murray <timmurray@google.com>
    Reported-by: Wei Wang <wvw@google.com>
    Tested-by: Kyle Lin <kylelin@google.com>
    Tested-by: Chunwei Lu <chunweilu@google.com>
    Tested-by: Lulu Wang <luluw@google.com>
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:38:28 -04:00
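A minimal sketch of the mechanism the commit describes: running the work on a real-time-priority kthread_worker instead of a SCHED_OTHER workqueue. The worker name, variable names, and setup function are assumptions for illustration.

    /* Illustrative sketch only, not the upstream code. */
    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    static struct kthread_worker *exp_gp_kworker;   /* assumed name */

    static int __init sketch_start_exp_worker(void)
    {
            exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kworker");
            if (IS_ERR_OR_NULL(exp_gp_kworker))
                    return -ENOMEM;

            /* Give the backing kthread SCHED_FIFO so that ordinary
             * SCHED_OTHER tasks cannot delay the expedited GP work. */
            sched_set_fifo(exp_gp_kworker->task);
            return 0;
    }

    static void sketch_queue_exp_work(struct kthread_work *work)
    {
            kthread_queue_work(exp_gp_kworker, work);
    }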
Waiman Long f12dfd4e5c rcu: Check for jiffies going backwards
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491

commit c708b08c65a0dfae127b9ee33b0fb73535a5e066
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed, 23 Feb 2022 17:29:37 -0800

    rcu: Check for jiffies going backwards

    A report of a 12-jiffy normal RCU CPU stall warning raises interesting
    questions about the nature of time on the offending system.  This commit
    instruments rcu_sched_clock_irq(), which is RCU's hook into the
    scheduling-clock interrupt, checking for the jiffies counter going
    backwards.

    Reported-by: Saravanan D <sarvanand@fb.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:22:11 -04:00
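The check itself is conceptually tiny. The sketch below only compares jiffies against a snapshot taken on the previous invocation, which captures the "jiffies going backwards" idea; the snapshot variable and hook name are assumptions, and the upstream instrumentation in rcu_sched_clock_irq() may compare against other time sources as well.

    /* Illustrative sketch only: warn once if jiffies ever moves backwards
     * between consecutive scheduling-clock interrupts. */
    #include <linux/bug.h>
    #include <linux/jiffies.h>

    static unsigned long last_jiffies_snap;   /* assumed bookkeeping */

    static void sketch_sched_clock_hook(void)
    {
            unsigned long j = jiffies;

            WARN_ON_ONCE(time_before(j, last_jiffies_snap));
            last_jiffies_snap = j;
    }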
Waiman Long c54a776b65 rcu/nocb: Initialize nocb kthreads only for boot CPU prior SMP initialization
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491

commit 87c5adf06bfbf14c9d13e59d5d174ff5f2aafc0e
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 16 Feb 2022 16:42:08 +0100

    rcu/nocb: Initialize nocb kthreads only for boot CPU prior SMP initialization

    The rcu_spawn_gp_kthread() function is called as an early initcall, which
    means that SMP initialization hasn't happened yet and only the boot CPU is
    online. Therefore, create only the NOCB kthreads related to the boot CPU.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:22:01 -04:00
Waiman Long b19ed13b34 rcu: Initialize boost kthread only for boot node prior SMP initialization
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491

commit 3352911fa9b47a90165e5c6fed440048c55146d1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 16 Feb 2022 16:42:07 +0100

    rcu: Initialize boost kthread only for boot node prior SMP initialization

    The rcu_spawn_gp_kthread() function is called as an early initcall,
    which means that SMP initialization hasn't happened yet and only the
    boot CPU is online.  Therefore, create only the boost kthread for the
    leaf node of the boot CPU.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:22:01 -04:00
Waiman Long 28c195fff4 rcu/nocb: Move rcu_nocb_is_setup to rcu_state
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2117491

commit 8d2aaa9b7c290e766a41f29c71ec72192851d538
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon, 14 Feb 2022 14:23:39 +0100

    rcu/nocb: Move rcu_nocb_is_setup to rcu_state

    This commit moves the RCU nocb initialization witness within rcu_state
    to consolidate RCU's global state.

    Reported-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-08-30 17:22:00 -04:00
Waiman Long a9408fae13 rcu: Add per-CPU rcuc task dumps to RCU CPU stall warnings
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit c9515875850fefcc79492c5189fe8431e75ddec5
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue, 25 Jan 2022 10:47:44 +0800

    rcu: Add per-CPU rcuc task dumps to RCU CPU stall warnings

    When the rcutree.use_softirq kernel boot parameter is set to zero, all
    RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
    If these kthreads are being starved, quiescent states will not be
    reported, which in turn means that the grace period will not end, which
    can in turn trigger RCU CPU stall warnings.  This commit therefore dumps
    stack traces of stalled CPUs' rcuc kthreads, which can help identify
    what is preventing those kthreads from running.

    Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
    Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:30:04 -04:00
Waiman Long 4c01b1af26 rcu: Make rcu_barrier() no longer block CPU-hotplug operations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit 80b3fd474c91b3ecfd845b4a0bfb58706b877ba5
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 14 Dec 2021 13:35:17 -0800

    rcu: Make rcu_barrier() no longer block CPU-hotplug operations

    This commit removes the cpus_read_lock() and cpus_read_unlock() calls
    from rcu_barrier(), thus allowing CPUs to come and go during the course
    of rcu_barrier() execution.  Posting of the ->barrier_head callbacks does
    synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
    are held for short time periods on both sides.  Thus, full CPU-hotplug
    operations could both start and finish during the execution of a given
    rcu_barrier() invocation.

    Additional synchronization is provided by a global ->barrier_lock.
    Since the ->barrier_lock is only used during rcu_barrier() execution and
    during onlining/offlining a CPU, the contention for this lock should
    be low.  It might be tempting to make use of a per-CPU lock just on
    general principles, but straightforward attempts to do this have the
    problems shown below.

    Initial state: 3 CPUs present, CPU0 and CPU1 do not have
    any callbacks, and CPU2 has callbacks.

    1. CPU0 calls rcu_barrier().

    2. CPU1 starts offlining for CPU2. CPU1 calls
       rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
       from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
       It does not entrain ->barrier_head for CPU2, as rcu_barrier()
       on CPU0 hasn't started the barrier sequence (by calling
       rcu_seq_start(&rcu_state.barrier_sequence)) yet.

    3. CPU0 starts new barrier sequence. It iterates over
       CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
       and finds 0 segcblist length. It updates ->barrier_seq_snap
       for CPU0 and CPU1 and continues loop iteration to CPU2.

        for_each_possible_cpu(cpu) {
            raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
            if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
                WRITE_ONCE(rdp->barrier_seq_snap, gseq);
                raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
                rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
                continue;
            }

    4. rcutree_migrate_callbacks() completes execution on CPU1.
       Segcblist len for CPU2 becomes 0.

    5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
       for CPU2 and completes the loop iteration after setting
       ->barrier_seq_snap.

    6. As there isn't any ->barrier_head callback entrained, at
       this point rcu_barrier() on CPU0 returns.

    7. The callbacks, which migrated from CPU2 to CPU1, execute.

    Straightforward per-CPU locking is also subject to the following race
    condition noted by Boqun Feng:

    1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
       rcu_seq_start() and init_completion(), but does not yet initialize
       rcu_state.barrier_cpu_count.

    2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
       which in turn calls rcu_barrier_entrain() holding CPU2's
       rdp->barrier_lock.  It then entrains ->barrier_head for CPU2
       and atomically increments rcu_state.barrier_cpu_count, which is
       unfortunately not yet initialized to the value 2.

    3. The just-entrained RCU callback is invoked.  It atomically
       decrements rcu_state.barrier_cpu_count and sees that it is
       now zero.  This callback therefore invokes complete().

    4. CPU0 continues executing rcu_barrier(), but is not blocked
       by its call to wait_for_completion().  This results in rcu_barrier()
       returning before all pre-existing callbacks have been invoked,
       which is a bug.

    Therefore, synchronization is provided by rcu_state.barrier_lock,
    which is also held across the initialization sequence, especially the
    rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
    to the value 2.  In addition, this lock is held when entraining the
    rcu_barrier() callback, when deciding whether or not a CPU has callbacks
    that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
    incoming CPUs, and when migrating callbacks from a CPU that is going
    offline.

    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:25:57 -04:00
Waiman Long 6d38f5233d rcu: Rework rcu_barrier() and callback-migration logic
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit a16578dd5e3a44b53ca0699ac2971679dab97484
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 14 Dec 2021 13:15:18 -0800

    rcu: Rework rcu_barrier() and callback-migration logic

    This commit reworks rcu_barrier() and callback-migration logic to
    permit allowing rcu_barrier() to run concurrently with CPU-hotplug
    operations.  The key trick is for callback migration to check to see if
    an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
    callback on its behalf.

    This commit adds synchronization with RCU's CPU-hotplug notifiers.  Taken
    together, this will permit a later commit to remove the cpus_read_lock()
    and cpus_read_unlock() calls from rcu_barrier().

    [ paulmck: Updated per kbuild test robot feedback. ]
    [ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:25:56 -04:00
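The key trick mentioned above can be condensed to a few lines. A sketch of the migration-time check, assuming the global ->barrier_lock and the rcu_barrier_entrain() helper referenced elsewhere in this series; the surrounding function is illustrative only.

    /* Illustrative sketch only: while migrating callbacks away from an
     * outgoing CPU, entrain a ->barrier_head callback on its behalf if an
     * rcu_barrier() is currently in flight. */
    static void sketch_migrate_callbacks(struct rcu_data *rdp)
    {
            raw_spin_lock(&rcu_state.barrier_lock);
            /* Nonzero sequence state means a barrier is in flight, so post
             * ->barrier_head now so the barrier also waits for the callbacks
             * about to be migrated. */
            if (rcu_seq_state(rcu_state.barrier_sequence))
                    rcu_barrier_entrain(rdp);
            raw_spin_unlock(&rcu_state.barrier_lock);

            /* ... actual callback migration happens here ... */
    }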
Waiman Long 5bef7666bb rcu: Remove unused rcu_state.boost
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713
Conflicts: Fuzz in rcu_spawn_one_boost_kthread() due to upstream commit
	   conflict as shown in merge commit d5578190bed3.

commit eae9f147a4b02e132187a2d88a403b9ccc28212a
Author: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Date:   Mon, 13 Dec 2021 12:32:09 +0530

    rcu: Remove unused rcu_state.boost

    Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:25:55 -04:00
Waiman Long 8c2518377e rcu/nocb: Handle concurrent nocb kthreads creation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit 02e3024175274ed4bf7912e7a1281b300cec76b5
Author: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Date:   Sat, 11 Dec 2021 22:31:39 +0530

    rcu/nocb: Handle concurrent nocb kthreads creation

    When multiple CPUs in the same nocb gp/cb group concurrently
    come online, they might try to concurrently create the same
    rcuog kthread. Fix this by using the nocb gp CPU's spawn mutex to
    provide mutual exclusion for the rcuog kthread creation code.

    [ paulmck: Whitespace fixes per kernel test robot feedback. ]

    Acked-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:25:54 -04:00
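A sketch of the mutual exclusion described above, assuming a per-group mutex embedded in the rcuog leader CPU's rcu_data; the field names, kthread function, and helper are assumptions for illustration.

    /* Illustrative sketch only: serialize rcuog kthread creation for CPUs
     * sharing the same nocb gp/cb group. */
    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/mutex.h>

    static int sketch_nocb_gp_kthread_fn(void *arg);   /* assumed kthread body */

    static void sketch_spawn_rcuog(struct rcu_data *rdp_gp, int cpu)
    {
            struct task_struct *t;

            mutex_lock(&rdp_gp->nocb_gp_kthread_mutex);    /* assumed field */
            if (!rdp_gp->nocb_gp_kthread) {                /* not spawned yet? */
                    t = kthread_run(sketch_nocb_gp_kthread_fn, rdp_gp,
                                    "rcuog/%d", cpu);
                    if (!IS_ERR(t))
                            WRITE_ONCE(rdp_gp->nocb_gp_kthread, t);
            }
            mutex_unlock(&rdp_gp->nocb_gp_kthread_mutex);
    }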
Waiman Long ba1bfcb746 rcu: Add mutex for rcu boost kthread spawning and affinity setting
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713
Conflicts: A fuzz in rcu_boost_kthread_setaffinity() of
	   kernel/rcu/tree_plugin.h due to the presence of a later
	   upstream commit 04d4e665a609 ("sched/isolation: Use single
	   feature type while referring to housekeeping cpumask").

commit 218b957a6959a2fb5b3967fc824072bb89ac2611
Author: David Woodhouse <dwmw@amazon.co.uk>
Date:   Wed, 8 Dec 2021 23:41:53 +0000

    rcu: Add mutex for rcu boost kthread spawning and affinity setting

    As we handle parallel CPU bringup, we will need to take care to avoid
    spawning multiple boost threads, or race conditions when setting their
    affinity. Spotted by Paul McKenney.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:25:17 -04:00
Waiman Long 5824fc0262 rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076713

commit 82980b1622d97017053c6792382469d7dc26a486
Author: David Woodhouse <dwmw@amazon.co.uk>
Date:   Tue, 16 Feb 2021 15:04:34 +0000

    rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion

    If we allow architectures to bring APs online in parallel, then we end
    up requiring rcu_cpu_starting() to be reentrant. But currently, the
    manipulation of rnp->ofl_seq is not thread-safe.

    However, rnp->ofl_seq is also fairly much pointless anyway since both
    rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
    fairly much the whole time that rnp->ofl_seq is set to an odd number
    to indicate that an operation is in progress.

    So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.

    This has a couple of minor complexities: lockdep will complain when we
    take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
    an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
    avoid that false positive complaint. Since we're killing rnp->ofl_seq
    of course that 'excuse' has to be changed too, so make it check for
    arch_spin_is_locked(rcu_state.ofl_lock).

    There's no arch_spin_lock_irqsave() so we have to manually save and
    restore local interrupts around the locking.

    At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
    wait but *exclude* any CPU online/offline activity, which was fairly
    much true already by virtue of it holding rcu_state.ofl_lock.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-05-12 08:19:35 -04:00
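Since there is no arch_spin_lock_irqsave(), the locking pattern the commit describes looks roughly like the sketch below; rcu_state.ofl_lock is the arch_spinlock_t mentioned above, and the surrounding function body is illustrative only.

    /* Illustrative sketch only: take an arch_spinlock_t with interrupts
     * disabled, saving and restoring the local IRQ state by hand because
     * there is no arch_spin_lock_irqsave() helper. */
    static void sketch_cpu_starting(void)
    {
            unsigned long flags;

            local_irq_save(flags);
            arch_spin_lock(&rcu_state.ofl_lock);
            /* ... record online state, update ->qsmaskinitnext, etc. ...
             * Lockdep's "excuse" can then be arch_spin_is_locked(). */
            arch_spin_unlock(&rcu_state.ofl_lock);
            local_irq_restore(flags);
    }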
Patrick Talbert ea38048f36 Merge: rcu: Backport upstream RCU related commits up to v5.17
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/602

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065994

This patch series backports upstream RCU and various torture-test commits up
to the v5.17 kernel. Aside from patch 10, which has a merge conflict due to an
upstream merge conflict, all other patches applied cleanly without any issue.

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (112):
  torture: Apply CONFIG_KCSAN_STRICT to kvm.sh --kcsan argument
  torture: Make torture.sh print the number of files to be compressed
  rcu-nocb: Fix a couple of tree_nocb code-style nits
  rcu: Eliminate rcu_implicit_dynticks_qs() local variable rnhqp
  rcu: Eliminate rcu_implicit_dynticks_qs() local variable ruqp
  doc: Add another stall-warning root cause in stallwarn.rst
  rcu: Fix undefined Kconfig macros
  rcu: Comment rcu_gp_init() code waiting for CPU-hotplug operations
  rcu-tasks: Simplify trc_read_check_handler() atomic operations
  rcu-tasks: Add trc_inspect_reader() checks for exiting critical
    section
  rcu-tasks: Remove second argument of rcu_read_unlock_trace_special()
  rcu: Move rcu_dynticks_eqs_online() to rcu_cpu_starting()
  rcu: Simplify rcu_report_dead() call to rcu_report_exp_rdp()
  rcu: Make rcutree_dying_cpu() use its "cpu" parameter
  rcu-tasks: Wait for trc_read_check_handler() IPIs
  rcutorture: Suppressing read-exit testing is not an error
  rcu-tasks: Fix s/instruction/instructions/ typo in comment
  rcutorture: Warn on individual rcu_torture_init() error conditions
  locktorture: Warn on individual lock_torture_init() error conditions
  rcuscale: Warn on individual rcu_scale_init() error conditions
  rcutorture: Don't cpuhp_remove_state() if cpuhp_setup_state() failed
  rcu: Make rcu_normal_after_boot writable again
  rcu: Make rcu update module parameters world-readable
  rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
  rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
  rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
  rcu-tasks: Correct comparisons for CPU numbers in
    show_stalled_task_trace
  rcu-tasks: Clarify read side section info for rcu_tasks_rude GP
    primitives
  rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
  rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
  rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
  rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
  rcu: Replace ________p1 and _________p1 with __UNIQUE_ID(rcu)
  rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
  rcu: Ignore rdp.cpu_no_qs.b.exp on preemptible RCU's rcu_qs()
  rcu: Move rcu_data.cpu_no_qs.b.exp reset to rcu_export_exp_rdp()
  rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu
    no_qs.b.exp
  rcu-tasks: Don't remove tasks with pending IPIs from holdout list
  torture: Catch kvm.sh help text up with actual options
  rcutorture: Sanitize RCUTORTURE_RDR_MASK
  rcutorture: More thoroughly test nested readers
  srcu: Prevent redundant __srcu_read_unlock() wakeup
  rcutorture: Suppress pi-lock-across read-unlock testing for Tiny SRCU
  doc: Remove obsolete kernel-per-CPU-kthreads RCU_FAST_NO_HZ advice
  rcu: in_irq() cleanup
  rcu: Always inline rcu_dynticks_task*_{enter,exit}()
  rcu: Mark sync_sched_exp_online_cleanup() ->cpu_no_qs.b.exp load
  rcu: Prevent expedited GP from enabling tick on offline CPU
  rcu: Make idle entry report expedited quiescent states
  rcu/nocb: Make local rcu_nocb_lock_irqsave() safe against concurrent
    deoffloading
  rcu/nocb: Prepare state machine for a new step
  rcu/nocb: Invoke rcu_core() at the start of deoffloading
  rcu/nocb: Make rcu_core() callbacks acceleration preempt-safe
  rcu/nocb: Make rcu_core() callbacks acceleration (de-)offloading safe
  rcu/nocb: Check a stable offloaded state to manipulate
    qlen_last_fqs_check
  rcu/nocb: Use appropriate rcu_nocb_lock_irqsave()
  rcu/nocb: Limit number of softirq callbacks only on softirq
  rcu: Fix callbacks processing time limit retaining cond_resched()
  rcu: Apply callbacks processing time limit only on softirq
  rcu/nocb: Don't invoke local rcu core on callback overload from nocb
    kthread
  rcu: Improve tree_plugin.h comments and add code cleanups
  refscale: Simplify the errexit checkpoint
  refscale: Prevent buffer to pr_alert() being too long
  refscale: Always log the error message
  doc: Add refcount analogy to What is RCU
  refscale: Add missing '\n' to flush message
  scftorture: Add missing '\n' to flush message
  scftorture: Remove unused SCFTORTOUT
  scftorture: Account for weight_resched when checking for all zeroes
  rcuscale: Always log error message
  doc: RCU: Avoid 'Symbol' font-family in SVG figures
  scftorture: Always log error message
  locktorture,rcutorture,torture: Always log error message
  rcu-tasks: Create per-CPU callback lists
  rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue
    selection
  rcu-tasks: Convert grace-period counter to grace-period sequence
    number
  rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
  rcu-tasks: Use spin_lock_rcu_node() and friends
  rcu-tasks: Inspect stalled task's trc state in locked state
  rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
  rcu-tasks: Abstract checking of callback lists
  rcu-tasks: Abstract invocations of callbacks
  rcutorture: Avoid soft lockup during cpu stall
  torture: Make kvm-find-errors.sh report link-time undefined symbols
  rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs()
    invocations
  rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
  rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
  rcutorture: Test RCU-tasks multiqueue callback queueing
  rcu: Avoid running boost kthreads on isolated CPUs
  rcu: Avoid alloc_pages() when recording stack
  rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios
  torture: Retry download once before giving up
  rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
  rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded
  rcu/nocb: Prepare nocb_cb_wait() to start with a non-offloaded rdp
  rcu/nocb: Optimize kthreads and rdp initialization
  rcu/nocb: Create kthreads on all CPUs if "rcu_nocbs=" or "nohz_full="
    are passed
  rcu/nocb: Allow empty "rcu_nocbs" kernel parameter
  rcu/nocb: Merge rcu_spawn_cpu_nocb_kthread() and
    rcu_spawn_one_nocb_kthread()
  rcutorture: Enable multiple concurrent callback-flood kthreads
  rcutorture: Cause TREE02 and TREE10 scenarios to do more callback
    flooding
  rcutorture: Add ability to limit callback-flood intensity
  rcutorture: Combine n_max_cbs from all kthreads in a callback flood
  rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
  rcu-tasks: Use more callback queues if contention encountered
  rcutorture: Test RCU Tasks lock-contention detection
  rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
  rcu-tasks: Use fewer callbacks queues if callback flood ends
  rcu/exp: Mark current CPU as exp-QS in IPI loop second pass
  torture: Fix incorrectly redirected "exit" in kvm-remote.sh
  torture: Properly redirect kvm-remote.sh "echo" commands
  rcu-tasks: Fix computation of CPU-to-list shift counts

 .../Expedited-Grace-Periods/Funnel0.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel1.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel2.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel3.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel4.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel5.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel6.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel7.svg       |   4 +-
 .../Expedited-Grace-Periods/Funnel8.svg       |   4 +-
 .../Tree-RCU-Memory-Ordering.rst              |  69 +--
 .../Requirements/GPpartitionReaders1.svg      |  36 +-
 .../Requirements/ReadersPartitionGP1.svg      |  62 +-
 Documentation/RCU/stallwarn.rst               |  10 +
 Documentation/RCU/whatisRCU.rst               |  90 ++-
 .../admin-guide/kernel-parameters.txt         |  66 +-
 .../admin-guide/kernel-per-CPU-kthreads.rst   |   2 +-
 arch/sh/configs/sdk7786_defconfig             |   1 -
 arch/xtensa/configs/nommu_kc705_defconfig     |   1 -
 include/linux/rcu_segcblist.h                 |  51 +-
 include/linux/rcupdate.h                      |  50 +-
 include/linux/rcupdate_trace.h                |   5 +-
 include/linux/rcutiny.h                       |   2 +-
 include/linux/srcu.h                          |   3 +-
 include/linux/torture.h                       |  17 +-
 kernel/locking/locktorture.c                  |  18 +-
 kernel/rcu/Kconfig                            |   2 +-
 kernel/rcu/rcu_segcblist.c                    |  10 +-
 kernel/rcu/rcu_segcblist.h                    |  12 +-
 kernel/rcu/rcuscale.c                         |  24 +-
 kernel/rcu/rcutorture.c                       | 320 +++++++---
 kernel/rcu/refscale.c                         |  50 +-
 kernel/rcu/srcutiny.c                         |   2 +-
 kernel/rcu/tasks.h                            | 583 ++++++++++++++----
 kernel/rcu/tree.c                             | 119 ++--
 kernel/rcu/tree.h                             |  24 +-
 kernel/rcu/tree_exp.h                         |  15 +-
 kernel/rcu/tree_nocb.h                        | 162 +++--
 kernel/rcu/tree_plugin.h                      |  61 +-
 kernel/rcu/update.c                           |   8 +-
 kernel/scftorture.c                           |  20 +-
 kernel/torture.c                              |   4 +-
 .../rcutorture/bin/kvm-find-errors.sh         |   4 +-
 .../rcutorture/bin/kvm-recheck-rcu.sh         |   2 +-
 .../selftests/rcutorture/bin/kvm-remote.sh    |  23 +-
 tools/testing/selftests/rcutorture/bin/kvm.sh |  11 +-
 .../selftests/rcutorture/bin/parse-build.sh   |   3 +-
 .../selftests/rcutorture/bin/torture.sh       |   9 +-
 .../selftests/rcutorture/configs/rcu/SRCU-T   |   1 +
 .../selftests/rcutorture/configs/rcu/SRCU-U   |   1 +
 .../rcutorture/configs/rcu/TASKS01.boot       |   1 +
 .../selftests/rcutorture/configs/rcu/TINY01   |   1 +
 .../selftests/rcutorture/configs/rcu/TINY02   |   1 +
 .../rcutorture/configs/rcu/TRACE01.boot       |   1 +
 .../rcutorture/configs/rcu/TRACE02.boot       |   1 +
 .../rcutorture/configs/rcu/TREE02.boot        |   1 +
 .../rcutorture/configs/rcu/TREE10.boot        |   1 +
 .../rcutorture/configs/rcuscale/TINY          |   1 +
 57 files changed, 1360 insertions(+), 637 deletions(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE02.boot
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE10.boot

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-04-19 12:23:21 +02:00
Waiman Long 026f852e1e rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065994

commit 2ebc45c44c4f3cc4c757430b2409ece4f976892e
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 23 Nov 2021 01:37:03 +0100

    rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded

    The nocb_gp_wait() function iterates over all CPUs in its group,
    including even those CPUs that have been de-offloaded.  This is of
    course suboptimal, especially if none of the CPUs within the group are
    currently offloaded.  This will become even more of a problem once a
    nocb kthread is created for all possible CPUs.

    Therefore use a standard double linked list to link all the offloaded
    rcu_data structures and safely add or delete these structure as we
    offload or de-offload them, respectively.
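
    A minimal sketch of that bookkeeping (the field and helper names here are
    illustrative, not the exact upstream diff):

        /* Each offloaded rdp sits on its rcuog group leader's list. */
        static void nocb_link_rdp(struct rcu_data *rdp)
        {
                list_add_tail(&rdp->nocb_entry_rdp,
                              &rdp->nocb_gp_rdp->nocb_head_rdp);
        }

        static void nocb_unlink_rdp(struct rcu_data *rdp)
        {
                list_del(&rdp->nocb_entry_rdp);
        }

        /* nocb_gp_wait() then walks only the currently offloaded rdps: */
        list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
                /* ... advance callbacks and request grace periods ... */
        }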

    Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Tested-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-03-24 17:16:20 -04:00
Waiman Long 400d40f7b0 rcu/nocb: Make local rcu_nocb_lock_irqsave() safe against concurrent deoffloading
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065994

commit 118e0d4a1bc85d4ecea0427e440a72d21ffbfa6a
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon, 11 Oct 2021 16:51:30 +0200

    rcu/nocb: Make local rcu_nocb_lock_irqsave() safe against concurrent deoffloading

    rcu_nocb_lock_irqsave() can be preempted between the call to
    rcu_segcblist_is_offloaded() and the actual locking. This matters now
    that rcu_core() is preemptible on PREEMPT_RT and the (de-)offloading
    process can interrupt the softirq or the rcuc kthread.

    As a result we may locklessly call into code that requires nocb locking.
    In practice this is a problem while we accelerate callbacks on rcu_core().

    Simply disabling interrupts before (instead of after) checking the NOCB
    offload state fixes the issue.
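
    In rough outline, the fix reorders the two steps of the locking helper
    (a sketch of the idea, not the exact rcu_nocb_lock_irqsave() macro):

        /* Racy: (de-)offloading can preempt us between check and lock. */
        if (rcu_segcblist_is_offloaded(&rdp->cblist))
                raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
        else
                local_irq_save(flags);

        /* Fixed: interrupts off first, then check the now-stable state. */
        local_irq_save(flags);
        if (rcu_segcblist_is_offloaded(&rdp->cblist))
                raw_spin_lock(&rdp->nocb_lock);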

    Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Valentin Schneider <valentin.schneider@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-03-24 17:15:59 -04:00
Waiman Long c9b4dd21b8 rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu no_qs.b.exp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2065994

commit 6120b72e25e195b6fa15b0a674479a38166c392a
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Thu, 16 Sep 2021 14:10:48 +0200

    rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu no_qs.b.exp

    Having two fields for the same purpose with subtle differences on
    different RCU flavours is confusing, especially when both fields always
    exist on both RCU flavours.

    Fortunately, it is now safe for preemptible RCU to rely on the rcu_data
    structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU.
    This commit therefore removes the ad-hoc ->exp_deferred_qs field.
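
    For reference, ->cpu_no_qs is the small per-CPU union of quiescent-state
    flags along the lines of the definition in kernel/rcu/tree.h (shown here
    as a sketch; exact comments differ):

        union rcu_noqs {
                struct {
                        u8 norm;  /* Normal GP still needs a QS from this CPU. */
                        u8 exp;   /* Expedited GP still needs a QS from this CPU. */
                } b;              /* Individual flags. */
                u16 s;            /* Both flags at once, for quick tests. */
        };

        /* With ->exp_deferred_qs gone, both flavours simply use: */
        rdp->cpu_no_qs.b.exp = true;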

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-03-24 17:15:53 -04:00
Desnes A. Nunes do Rosario 9814a162d4 rcu: Remove the RCU_FAST_NO_HZ Kconfig option
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059555
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=e2c73a6860bdf54f2c6bf8cddc34ddc91a1343e1

commit e2c73a6860bdf54f2c6bf8cddc34ddc91a1343e1
Author: "Paul E. McKenney" <paulmck@kernel.org>
Date: Mon, 27 Sep 2021 14:18:51 -0700

  All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
  systems with RCU callbacks offloaded.  In this situation, all that this
  Kconfig option does is slow down idle entry/exit with an additional
  always-taken early exit.  If this is the only use case, then this
  Kconfig option is nothing but an attractive nuisance that needs to go away.

  This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.

  Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Desnes A. Nunes do Rosario <drosario@redhat.com>
2022-03-24 14:39:57 -04:00
Paul E. McKenney 641faf1b90 Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', 'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD
bitmaprange.2021.05.10c: Allow "all" for bitmap ranges.
doc.2021.05.10c: Documentation updates.
fixes.2021.05.13a: Miscellaneous fixes.
kvfree_rcu.2021.05.10c: kvfree_rcu() updates.
mmdumpobj.2021.05.10c: mem_dump_obj() updates.
nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading.
srcu.2021.05.12a: SRCU updates.
tasks.2021.05.18a: Tasks-RCU updates.
torture.2021.05.10c: Torture-test updates.
2021-05-18 10:56:19 -07:00
Ingo Molnar a616aec9aa rcu: Fix various typos in comments
Fix ~12 single-word typos in RCU code comments.

[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:11:05 -07:00
Frederic Weisbecker e75bcd48e2 rcu/nocb: Unify timers
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together.  A new RCU_NOCB_WAKE_BYPASS wake level
is introduced.  As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.

The timer also unconditionally executes a full barrier so as to order
timer_pending() against callback enqueue, although the RCU_NOCB_WAKE_FORCE
path that makes use of this barrier is debatable: it should also
test against the rdp leader instead of the current rdp.

This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.
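
The resulting deferred-wakeup levels then look roughly like this (values as
of this series, shown for illustration):

   /* Deferred rcuog wakeup levels, in increasing order of urgency. */
   #define RCU_NOCB_WAKE_NOT    0  /* Nothing deferred. */
   #define RCU_NOCB_WAKE_BYPASS 1  /* Only the (merged) timer may act on this. */
   #define RCU_NOCB_WAKE        2  /* Any deferred-wakeup callsite may act. */
   #define RCU_NOCB_WAKE_FORCE  3  /* As above, but wake rcuog unconditionally. */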

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Frederic Weisbecker 870905169d rcu/nocb: Prepare for fine-grained deferred wakeup
Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:

* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc

All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.

In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.

To prepare for that, this commit specifies the per-callsite wakeup
level/limit.
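
Concretely, "per-callsite level/limit" means something like the following
(a sketch; the upstream helper signatures may differ slightly):

   /* Each callsite states the lowest deferred level it will service. */
   static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
   {
           return READ_ONCE(rdp->nocb_defer_wakeup) >= level;
   }

   /* ->nocb_timer handler: may service anything, later including the
    * planned RCU_NOCB_WAKE_BYPASS level. */
   if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE_BYPASS))
           do_nocb_deferred_wakeup(rdp);

   /* user/idle/guest entry, CPU down, softirq/rcuc: RCU_NOCB_WAKE and up. */
   if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
           do_nocb_deferred_wakeup(rdp);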

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12 12:10:23 -07:00
Paul E. McKenney 3ef5a1c382 rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one.  Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread.  Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure.  There is no point in checking because
we know there is a CPU on its way in.

The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time.  Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.

This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate.  In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.

With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Paul E. McKenney 396eba65f6 rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() output
This commit adds each rcu_node structure's ->qsmask and "bBEG" output
indicating whether: (1) There is a boost kthread, (2) A reader needs
to be (or is in the process of being) boosted, (3) A reader is blocking
an expedited grace period, and (4) A reader is blocking a normal grace
period.  This helps diagnose RCU priority boosting failures.
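
The "bBEG" output uses the usual kernel flag-character idiom: each position
prints its letter when its condition holds and '.' otherwise. An illustrative
sketch (not the exact upstream format string):

   pr_info("rcu_node %d:%d ->qsmask %#lx %c%c%c%c\n",
           rnp->grplo, rnp->grphi, READ_ONCE(rnp->qsmask),
           ".b"[!!rnp->boost_kthread_task],     /* boost kthread exists  */
           ".B"[!!READ_ONCE(rnp->boost_tasks)], /* reader needs boosting */
           ".E"[!!READ_ONCE(rnp->exp_tasks)],   /* blocking expedited GP */
           ".G"[!!READ_ONCE(rnp->gp_tasks)]);   /* blocking normal GP    */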

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:22:54 -07:00
Frederic Weisbecker d76e0926d8 rcu/nocb: Use the rcuog CPU's ->nocb_timer
Currently each CPU has its own ->nocb_timer queued when the nocb_gp
wakeup must be deferred.  This approach has many drawbacks, compared to
a solution based on a single timer per NOCB group:

* There are a lot of timers to maintain.

* The per-rdp ->nocb_lock must be held to queue and cancel the timer
  and this lock can already be heavily contended.

* One timer firing doesn't cancel the other timers in the same group:
  - These other timers can thus cause spurious wakeups
  - Each rdp that queued a timer must lock both ->nocb_lock and then
    ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode.

* We can't cancel all of them if we detect an unflushed bypass in
  nocb_gp_wait(). In fact currently we only ever cancel the ->nocb_timer
  of the leader group.

* The leader group's nocb_timer is cancelled without locking ->nocb_lock
  in nocb_gp_wait().  This currently appears to be safe but is an
  accident waiting to happen.

* Since the timer acquires ->nocb_lock, it requires extra care in the
  NOCB (de-)offloading process, requiring that it be either enabled or
  disabled and then flushed.

This commit therefore uses the rcuog kthread's CPU's ->nocb_timer instead.
It is protected by nocb_gp_lock, which is _way_ less contended and
remains so even after this change.  As a matter of fact, the nocb_timer
almost never fires and the deferred wakeup is mostly carried out upon
idle/user/guest entry.  Now the early check performed at this point in
do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which
is of course racy.  However, this raciness is harmless because we only
need the guarantee that the timer is queued if we were the last one to
queue it.  Any other situation (another CPU has queued it and we either
see it or not) is fine.

This solves all the issues listed above.
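
A sketch of the resulting deferred-wakeup path, with the single timer owned
by the group leader's (rcuog CPU's) rdp and protected by ->nocb_gp_lock
(illustrative, not the exact upstream code):

   static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype)
   {
           struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;  /* rcuog kthread's rdp */
           unsigned long flags;

           raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
           if (rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
                   mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
           if (rdp_gp->nocb_defer_wakeup < waketype)
                   WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
           raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
   }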

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 16:02:44 -07:00
Linus Torvalds 657bd90c93 Scheduler updates for v5.12:
[ NOTE: unfortunately this tree had to be freshly rebased today,
         it's a same-content tree of 82891be90f3c (-next published)
         merged with v5.11.
 
         The main reason for the rebase was an authorship misattribution
         problem with a new commit, which we noticed in the last minute,
         and which we didn't want to be merged upstream. The offending
         commit was deep in the tree, and dependent commits had to be
         rebased as well. ]
 
 - Core scheduler updates:
 
   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full),
     to allow distros to build a PREEMPT kernel but fall back to
     close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling
     behavior via a boot time selection.
 
     There's also the /debug/sched_debug switch to do this runtime.
 
     This feature is implemented via runtime patching (a new variant of static calls).
 
     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.
 
     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )
 
     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority
     of distributions, are supposed to be unaffected.
 
   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.
 
     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address
     the underlying issue of missed preemption events. These are the
     initial fixes that should fix current incarnations of the bug.
 
   - Clean up rbtree usage in the scheduler, by providing & using the following
     consistent set of rbtree APIs:
 
      partial-order; less() based:
        - rb_add(): add a new entry to the rbtree
        - rb_add_cached(): like rb_add(), but for a rb_root_cached
 
      total-order; cmp() based:
        - rb_find(): find an entry in an rbtree
        - rb_find_add(): find an entry, and add if not found
 
        - rb_find_first(): find the first (leftmost) matching entry
        - rb_next_match(): continue from rb_find_first()
        - rb_for_each(): iterate a sub-tree using the previous two
 
   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass.
     This is a 4-commit series where each commit improves one aspect of the idle
     sibling scan logic.
 
   - Improve the cpufreq cooling driver by getting the effective CPU utilization
     metrics from the scheduler
 
   - Improve the fair scheduler's active load-balancing logic by reducing the number
     of active LB attempts & lengthen the load-balancing interval. This improves
     stress-ng mmapfork performance.
 
   - Fix CFS's estimated utilization (util_est) calculation bug that can result in
     too high utilization values
 
 - Misc updates & fixes:
 
    - Fix the HRTICK reprogramming & optimization feature
    - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code
    - Reduce dl_add_task_root_domain() overhead
    - Fix uprobes refcount bug
    - Process pending softirqs in flush_smp_call_function_from_idle()
    - Clean up task priority related defines, remove *USER_*PRIO and
      USER_PRIO()
    - Simplify the sched_init_numa() deduplication sort
    - Documentation updates
    - Fix EAS bug in update_misfit_status(), which degraded the quality
      of energy-balancing
    - Smaller cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAtHBsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1itgg/+NGed12pgPjYBzesdou60Lvx7LZLGjfOt
 M1F1EnmQGn/hEH2fCY6ZoqIZQTVltm7GIcBNabzYTzlaHZsdtyuDUJBZyj19vTlk
 zekcj7WVt+qvfjChaNwEJhQ9nnOM/eohMgEOHMAAJd9zlnQvve7NOLQ56UDM+kn/
 9taFJ5ZPvb4avP6C5p3KivvKex6Bjof/Tl0m3utpNyPpI/qK3FyGxwdgCxU0yepT
 ABWQX5ZQCufFvo1bgnBPfqyzab4MqhoM3bNKBsLQfuAlssG1xRv4KQOev4dRwrt9
 pXJikV5C9yez5d2lGe5p0ltH5IZS/l9x2yI/ZQj3OUDTFyV1ic6WfFAqJgDzVF8E
 i/vvA4NPQiI241Bkps+ErcCw4aVOgiY6TWli74cHjLUIX0+As6aHrFWXGSxUmiHB
 WR+B8KmdfzRTTlhOxMA+cvlpZcKCfxWkJJmXzr/lDZzIuKPqM3QCE2wD9sixkfVo
 JNICT0IvZghWOdbMEfZba8Psh/e2LVI9RzdpEiuYJz1ZrVlt1hO0M6jBxY0hMz9n
 k54z81xODw0a8P2FHMtpmB1vhAeqCmvwA6DO8z0Oxs0DFi+KM2bLf2efHsCKafI+
 Bm5v9YFaOk/55R76hJVh+aYLlyFgFkKd+P/niJTPDnxOk3SqJuXvTrql1HeGHkNr
 kYgQa23dsZk=
 =pyaG
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address the
     underlying issue of missed preemption events. These are the initial
     fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two (usage sketch below)

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthen the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"
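
As a usage sketch of the new rbtree helpers listed above (assuming the
less()/cmp() callback conventions of <linux/rbtree.h>; the struct, callbacks,
and variables here are illustrative):

   struct item {
           u64 key;
           struct rb_node node;
   };

   static bool item_less(struct rb_node *a, const struct rb_node *b)
   {
           return rb_entry(a, struct item, node)->key <
                  rb_entry(b, struct item, node)->key;
   }

   static int item_cmp(const void *key, const struct rb_node *n)
   {
           u64 k = *(const u64 *)key, nk = rb_entry(n, struct item, node)->key;

           return k < nk ? -1 : k > nk ? 1 : 0;
   }

   /* Insert (partial order, less() based): */
   rb_add(&it->node, &root, item_less);

   /* Lookup (total order, cmp() based): */
   struct rb_node *n = rb_find(&key, &root, item_cmp);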

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...
2021-02-21 12:35:04 -08:00
Frederic Weisbecker f8bb5cae96 rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsites is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that wants a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.
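
A sketch of the self-IPI idea using the generic irq_work machinery
(illustrative; the exact hook points and helper names in the upstream patch
may differ):

   static void late_wakeup_func(struct irq_work *work)
   {
           /* Intentionally empty: the IRQ tail re-evaluates need_resched(). */
   }

   static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
           IRQ_WORK_INIT(late_wakeup_func);

   /* Called on the resume path, already past the last resched check. */
   static void rcu_irq_work_resched(void)
   {
           struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

           if (do_nocb_deferred_wakeup(rdp) && need_resched())
                   irq_work_queue(this_cpu_ptr(&late_wakeup_work));
   }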

Fixes: 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Frederic Weisbecker 69cdea873c rcu/nocb: Shutdown nocb timer on de-offloading
This commit ensures that the nocb timer is shut down before reaching the
final de-offloaded state.  The key goal is to prevent the timer handler
from manipulating the callbacks without the protection of the nocb locks.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-06 16:24:59 -08:00
Frederic Weisbecker d97b078182 rcu/nocb: De-offloading CB kthread
To de-offload callback processing back onto a CPU, it is necessary to
clear SEGCBLIST_OFFLOAD and notify the nocb CB kthread, which will then
clear its own bit flag and go to sleep to stop handling callbacks.  This
commit makes that change.  It will also be necessary to notify the nocb
GP kthread in this same way, which is the subject of a follow-on commit.
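
Schematically, the handshake with the CB kthread looks like this (flag and
field names follow the rcu_segcblist/rcu_data conventions; treat the code as
an illustrative sketch rather than the exact patch):

   /* De-offload side: clear the offloaded state and poke the CB kthread. */
   rcu_nocb_lock_irqsave(rdp, flags);
   rcu_segcblist_offload(&rdp->cblist, false);   /* drops SEGCBLIST_OFFLOADED */
   rcu_nocb_unlock_irqrestore(rdp, flags);
   swake_up_one(&rdp->nocb_cb_wq);               /* wake the rcuo CB kthread */

   /* CB kthread side, in its wait loop: notice the change, ack it, sleep. */
   if (!rcu_segcblist_test_flags(&rdp->cblist, SEGCBLIST_OFFLOADED)) {
           rcu_segcblist_clear_flags(&rdp->cblist, SEGCBLIST_KTHREAD_CB);
           /* ... then park on rdp->nocb_cb_wq until offloading resumes ... */
   }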

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Inspired-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Add export per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-06 16:24:19 -08:00
Paul E. McKenney 4d60b475f8 rcu: Prevent lockdep-RCU splats on lock acquisition/release
The rcu_cpu_starting() and rcu_report_dead() functions transition the
current CPU between online and offline state from an RCU perspective.
Unfortunately, this means that the rcu_cpu_starting() function's lock
acquisition and the rcu_report_dead() function's lock releases happen
while the CPU is offline from an RCU perspective, which can result
in lockdep-RCU splats about using RCU from an offline CPU.  And this
situation can also result in too-short grace periods, especially in
guest OSes that are subject to vCPU preemption.

This commit therefore uses sequence-count-like synchronization to forgive
use of RCU while RCU thinks a CPU is offline across the full extent of
the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions
and releases.

One approach would have been to use the actual sequence-count primitives
provided by the Linux kernel.  Unfortunately, the resulting code looks
completely broken and wrong, and is likely to result in patches that
break RCU in an attempt to address this appearance of broken wrongness.
Plus there is no net savings in lines of code, given the additional
explicit memory barriers required.

Therefore, this sequence count is instead implemented by a new ->ofl_seq
field in the rcu_node structure.  If this counter's value is an odd
number, RCU forgives RCU read-side critical sections on other CPUs covered
by the same rcu_node structure, even if those CPUs are offline from
an RCU perspective.  In addition, if a given leaf rcu_node structure's
->ofl_seq counter value is an odd number, rcu_gp_init() delays starting
the grace period until that counter value changes.

[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Neeraj Upadhyay ed73860cec rcu: Fix single-CPU check in rcu_blocking_is_gp()
Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
num_online_cpus() to determine whether there is only one CPU online.  When
there is only a single CPU online, the simple fact that synchronize_rcu()
could be legally called implies that a full grace period has elapsed.
Therefore, in the single-CPU case, synchronize_rcu() simply returns
immediately.  Unfortunately, num_online_cpus() is unreliable while a
CPU-hotplug operation is transitioning to or from single-CPU operation
because:

1.	num_online_cpus() uses atomic_read(&__num_online_cpus) to
	locklessly sample the number of online CPUs.  The hotplug locks
	are not held, which means that an incoming CPU can concurrently
	update this count.  This in turn means that an RCU read-side
	critical section on the incoming CPU might observe updates
	prior to the grace period, but also that this critical section
	might extend beyond the end of the optimized synchronize_rcu().
	This breaks RCU's fundamental guarantee.

2.	In addition, num_online_cpus() does no ordering, thus providing
	another way that RCU's fundamental guarantee can be broken by
	the current code.

3.	The most probable failure mode happens on outgoing CPUs.
	The outgoing CPU updates the count of online CPUs in the
	CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
	and of itself due to preemption being disabled at the call
	to num_online_cpus().  Unfortunately, after that stop-machine
	handler returns, the CPU takes one last trip through the
	scheduler (which has RCU readers) and, after the resulting
	context switch, one final dive into the idle loop.  During this
	time, RCU needs to keep track of two CPUs, but num_online_cpus()
	will say that there is only one, which in turn means that the
	surviving CPU will incorrectly ignore the outgoing CPU's RCU
	read-side critical sections.

This problem is illustrated by the following litmus test in which P0()
corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
The herd7 tool confirms that the "exists" clause can be satisfied,
thus demonstrating that this breakage can happen according to the Linux
kernel memory model.

   {
     int x = 0;
     atomic_t numonline = ATOMIC_INIT(1);
   }

   P0(int *x, atomic_t *numonline)
   {
     int r0;
     WRITE_ONCE(*x, 1);
     r0 = atomic_read(numonline);
     if (r0 == 1) {
       smp_mb();
     } else {
       synchronize_rcu();
     }
     WRITE_ONCE(*x, 2);
   }

   P1(int *x, atomic_t *numonline)
   {
     int r0; int r1;

     atomic_inc(numonline);
     smp_mb();
     rcu_read_lock();
     r0 = READ_ONCE(*x);
     smp_rmb();
     r1 = READ_ONCE(*x);
     rcu_read_unlock();
   }

   locations [x;numonline;]

   exists (1:r0=0 /\ 1:r1=2)

It is important to note that these problems arise only when the system
is transitioning to or from single-CPU operation.

One solution would be to hold the CPU-hotplug locks while sampling
num_online_cpus(), which was in fact the intent of the (redundant)
preempt_disable() and preempt_enable() surrounding this call to
num_online_cpus().  Actually blocking CPU hotplug would not only result
in excessive overhead, but would also unnecessarily impede CPU-hotplug
operations.

This commit therefore follows long-standing RCU tradition by maintaining
a separate RCU-specific set of CPU-hotplug books.

This separate set of books is implemented by a new ->n_online_cpus field
in the rcu_state structure that maintains RCU's count of the online CPUs.
This count is incremented early in the CPU-online process, so that
the critical transition away from single-CPU operation will occur when
there is only a single CPU.  Similarly for the critical transition to
single-CPU operation, the counter is decremented late in the CPU-offline
process, again while there is only a single CPU.  Because there is only
ever a single CPU when the ->n_online_cpus field undergoes the critical
1->2 and 2->1 transitions, full memory ordering and mutual exclusion are
provided implicitly and, better yet, for free.

In the case where the CPU is coming online, nothing will happen until
the current CPU helps it come online.  Therefore, the new CPU will see
all accesses prior to the optimized grace period, which means that RCU
does not need to further delay this new CPU.  In the case where the CPU
is going offline, the outgoing CPU is totally out of the picture before
the optimized grace period starts, which means that this outgoing CPU
cannot see any of the accesses following that grace period.  Again,
RCU needs no further interaction with the outgoing CPU.

This does mean that synchronize_rcu() will unnecessarily do a few grace
periods the hard way just before the second CPU comes online and just
after the second-to-last CPU goes offline, but it is not worth optimizing
this uncommon case.
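
With that field in place, the single-CPU fast path can consult RCU's own
count instead of num_online_cpus() (a sketch of the idea):

   static int rcu_blocking_is_gp(void)
   {
           if (IS_ENABLED(CONFIG_PREEMPTION))
                   return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
           might_sleep();  /* Also checks for RCU read-side critical section. */
           /*
            * The 1->2 transition happens before the incoming CPU runs
            * anything, and the 2->1 transition happens after the outgoing
            * CPU's last RCU reader, so no hotplug locking is needed.
            */
           return READ_ONCE(rcu_state.n_online_cpus) <= 1;
   }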

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00