Commit Graph

1008 Commits

Rafael Aquini 89edcfe6e5 rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu
JIRA: https://issues.redhat.com/browse/RHEL-72196
CVE: CVE-2024-53160

This patch is a backport of the following upstream commit:
commit a23da88c6c80e41e0503e0b481a22c9eea63f263
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Oct 22 12:53:07 2024 +0200

    rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu

    KCSAN reports a data race when accessing the krcp->monitor_work.timer.expires
    variable in the schedule_delayed_monitor_work() function:

    <snip>
    BUG: KCSAN: data-race in __mod_timer / kvfree_call_rcu

    read to 0xffff888237d1cce8 of 8 bytes by task 10149 on cpu 1:
     schedule_delayed_monitor_work kernel/rcu/tree.c:3520 [inline]
     kvfree_call_rcu+0x3b8/0x510 kernel/rcu/tree.c:3839
     trie_update_elem+0x47c/0x620 kernel/bpf/lpm_trie.c:441
     bpf_map_update_value+0x324/0x350 kernel/bpf/syscall.c:203
     generic_map_update_batch+0x401/0x520 kernel/bpf/syscall.c:1849
     bpf_map_do_batch+0x28c/0x3f0 kernel/bpf/syscall.c:5143
     __sys_bpf+0x2e5/0x7a0
     __do_sys_bpf kernel/bpf/syscall.c:5741 [inline]
     __se_sys_bpf kernel/bpf/syscall.c:5739 [inline]
     __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5739
     x64_sys_call+0x2625/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:322
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

    write to 0xffff888237d1cce8 of 8 bytes by task 56 on cpu 0:
     __mod_timer+0x578/0x7f0 kernel/time/timer.c:1173
     add_timer_global+0x51/0x70 kernel/time/timer.c:1330
     __queue_delayed_work+0x127/0x1a0 kernel/workqueue.c:2523
     queue_delayed_work_on+0xdf/0x190 kernel/workqueue.c:2552
     queue_delayed_work include/linux/workqueue.h:677 [inline]
     schedule_delayed_monitor_work kernel/rcu/tree.c:3525 [inline]
     kfree_rcu_monitor+0x5e8/0x660 kernel/rcu/tree.c:3643
     process_one_work kernel/workqueue.c:3229 [inline]
     process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3310
     worker_thread+0x51d/0x6f0 kernel/workqueue.c:3391
     kthread+0x1d1/0x210 kernel/kthread.c:389
     ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 UID: 0 PID: 56 Comm: kworker/u8:4 Not tainted 6.12.0-rc2-syzkaller-00050-g5b7c893ed5ed #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
    Workqueue: events_unbound kfree_rcu_monitor
    <snip>

    kfree_rcu_monitor() rearms the work if a "krcp" still has to be
    offloaded, and it does so without holding krcp->lock, whereas
    kvfree_call_rcu() holds it.

    Fix it by acquiring "krcp->lock" in kfree_rcu_monitor() so that
    the two functions no longer race.

    Reported-by: syzbot+061d370693bdd99f9d34@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/lkml/ZxZ68KmHDQYU0yfD@pc636/T/
    Fixes: 8fc5494ad5fa ("rcu/kvfree: Move need_offload_krc() out of krcp->lock")
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-01-08 17:48:13 -05:00
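
For illustration, here is a minimal C sketch of the locking pattern the fix above describes, using the kfree_rcu_cpu/need_offload_krc/schedule_delayed_monitor_work names from kernel/rcu/tree.c; it is an illustrative sketch, not the verbatim patch:

    /*
     * Sketch: rearm the monitor work only while holding krcp->lock so that
     * the __mod_timer() update of monitor_work.timer.expires cannot race
     * with the lockless read done from
     * kvfree_call_rcu() -> schedule_delayed_monitor_work().
     */
    static void kfree_rcu_monitor(struct work_struct *work)
    {
            struct kfree_rcu_cpu *krcp = container_of(work,
                            struct kfree_rcu_cpu, monitor_work.work);
            unsigned long flags;

            raw_spin_lock_irqsave(&krcp->lock, flags);

            /* ... attempt to offload the ready channels ... */

            /* Rearm under krcp->lock: this closes the KCSAN-reported race. */
            if (need_offload_krc(krcp))
                    schedule_delayed_monitor_work(krcp);

            raw_spin_unlock_irqrestore(&krcp->lock, flags);
    }
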
Rado Vrbovsky 9d55dcd124 Merge: rcu: Use system_unbound_wq to avoid disturbing isolated CPUs
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5353

JIRA: https://issues.redhat.com/browse/RHEL-50220

commit 0aac9daef6763e6efef398faff71f8c593651cce    
Author: Waiman Long <longman@redhat.com>    
Date:   Tue, 23 Jul 2024 14:10:25 -0400

    rcu: Use system_unbound_wq to avoid disturbing isolated CPUs

    It was discovered that isolated CPUs could sometimes be disturbed by
    kworkers processing kfree_rcu() work items, causing higher than expected
    latency. This is because the RCU core uses "system_wq", which does not
    have the WQ_UNBOUND flag, to handle all its work items. Fix this violation
    of latency limits by using "system_unbound_wq" in the RCU core instead.
    This ensures that those work items will not be run on CPUs marked
    as isolated.

    Besides the WQ_UNBOUND flag, the other major difference between system_wq
    and system_unbound_wq is their max_active count. The system_unbound_wq
    has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active
    is WQ_DFL_ACTIVE (256), which is half of WQ_MAX_ACTIVE.

    Reported-by: Vratislav Bendel <vbendel@redhat.com>
    Closes: https://issues.redhat.com/browse/RHEL-50220
    Signed-off-by: Waiman Long <longman@redhat.com>
    Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
    Tested-by: Breno Leitao <leitao@debian.org>
    Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Sterling Alexander <stalexan@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-25 16:29:24 +00:00
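
For illustration, a minimal sketch of the kind of call-site change described above, with hypothetical work-item names; the actual patch switches several queue_work()/queue_delayed_work() call sites in kernel/rcu/ from system_wq to system_unbound_wq:

    #include <linux/workqueue.h>

    static void monitor_fn(struct work_struct *work) { /* ... */ }
    static DECLARE_DELAYED_WORK(monitor_work, monitor_fn);  /* hypothetical */

    static void schedule_monitor(unsigned long delay)
    {
            /* Before: per-CPU system_wq, which may run on an isolated CPU. */
            /* queue_delayed_work(system_wq, &monitor_work, delay); */

            /* After: WQ_UNBOUND queue, kept off CPUs marked as isolated. */
            queue_delayed_work(system_unbound_wq, &monitor_work, delay);
    }
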
Rado Vrbovsky 16bf54f108 Merge: Fix RCUC latency issue
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5165

JIRA: https://issues.redhat.com/browse/RHEL-20288

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Signed-off-by: Leonardo Bras <leobras@redhat.com>

Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Marcelo Tosatti <mtosatti@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-25 16:26:53 +00:00
Leonardo Bras 483ecb54c6 rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
JIRA: https://issues.redhat.com/browse/RHEL-20288

commit 68d124b0999919015e6d23008eafea106ec6bb40
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   2024-05-08 20:11:58 -0700

    rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter

    If a CPU is running either a userspace application or a guest OS in
    nohz_full mode, it is possible for a system call to occur just as an
    RCU grace period is starting.  If that CPU also has the scheduling-clock
    tick enabled for any reason (such as a second runnable task), and if the
    system was booted with rcutree.use_softirq=0, then RCU can add insult to
    injury by awakening that CPU's rcuc kthread, resulting in yet another
    task and yet more OS jitter due to switching to that task, running it,
    and switching back.

    In addition, in the common case where that system call is not of
    excessively long duration, awakening the rcuc task is pointless.
    This pointlessness is due to the fact that the CPU will enter an extended
    quiescent state upon returning to the userspace application or guest OS.
    In this case, the rcuc kthread cannot do anything that the main RCU
    grace-period kthread cannot do on its behalf, at least if it is given
    a few additional milliseconds (for example, given the time duration
    specified by rcutree.jiffies_till_first_fqs, give or take scheduling
    delays).

    This commit therefore adds a rcutree.nohz_full_patience_delay kernel
    boot parameter that specifies the grace period age (in milliseconds,
    rounded to jiffies) before which RCU will refrain from awakening the
    rcuc kthread.  Preliminary experimentation suggests a value of 1000,
    that is, one second.  Increasing rcutree.nohz_full_patience_delay will
    increase grace-period latency and in turn increase memory footprint,
    so systems with constrained memory might choose a smaller value.
    Systems with less-aggressive OS-jitter requirements might choose the
    default value of zero, which keeps the traditional immediate-wakeup
    behavior, thus avoiding increases in grace-period latency.

    [ paulmck: Apply Leonardo Bras feedback.  ]

    Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/

    Reported-by: Leonardo Bras <leobras@redhat.com>
    Suggested-by: Leonardo Bras <leobras@redhat.com>
    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Leonardo Bras <leobras@redhat.com>

Signed-off-by: Leonardo Bras <leobras@redhat.com>
2024-10-08 18:52:03 -03:00
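
For illustration, a rough sketch of the decision this parameter controls, with invented helper and variable names; only the zero-means-immediate-wakeup behavior and the grace-period-age comparison come from the commit message above:

    /* Illustrative sketch; names and placement are simplified assumptions. */
    static bool rcu_gp_is_still_young(void)
    {
            /* A patience delay of 0 keeps the traditional immediate wakeup. */
            if (!nohz_full_patience_delay_jiffies)
                    return false;

            return time_before(jiffies,
                               READ_ONCE(rcu_state.gp_start) +
                               nohz_full_patience_delay_jiffies);
    }

    /* The rcuc kthread wakeup is skipped while this returns true. */
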
Waiman Long 02d3ab178b rcu: Use system_unbound_wq to avoid disturbing isolated CPUs
JIRA: https://issues.redhat.com/browse/RHEL-50220

commit 0aac9daef6763e6efef398faff71f8c593651cce
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 23 Jul 2024 14:10:25 -0400

    rcu: Use system_unbound_wq to avoid disturbing isolated CPUs

    It was discovered that isolated CPUs could sometimes be disturbed by
    kworkers processing kfree_rcu() work items, causing higher than expected
    latency. This is because the RCU core uses "system_wq", which does not
    have the WQ_UNBOUND flag, to handle all its work items. Fix this violation
    of latency limits by using "system_unbound_wq" in the RCU core instead.
    This ensures that those work items will not be run on CPUs marked
    as isolated.

    Besides the WQ_UNBOUND flag, the other major difference between system_wq
    and system_unbound_wq is their max_active count. The system_unbound_wq
    has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active
    is WQ_DFL_ACTIVE (256), which is half of WQ_MAX_ACTIVE.

    Reported-by: Vratislav Bendel <vbendel@redhat.com>
    Closes: https://issues.redhat.com/browse/RHEL-50220
    Signed-off-by: Waiman Long <longman@redhat.com>
    Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
    Tested-by: Breno Leitao <leitao@debian.org>
    Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-10-02 11:30:58 -04:00
Waiman Long dfd6ba19b1 rcutorture: Make rcutorture support print rcu-tasks gp state
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit dddcddef1414be3ebc37a40d13fcc0f6a672ba9f
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Mon, 18 Mar 2024 17:34:11 +0800

    rcutorture: Make rcutorture support print rcu-tasks gp state

    This commit makes the rcu-tasks-related rcutorture tests support
    printing of the rcu-tasks grace-period state when a writer stall occurs
    or at the end of the rcutorture test, and adds an rcu_ops->get_gp_data()
    operation to simplify acquiring the grace-period state for the different
    types of rcutorture tests.

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:47 -04:00
Waiman Long df9c61a685 rcu: Allocate WQ with WQ_MEM_RECLAIM bit set
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 0fd210baa07a9e3f15df1bc687293eafb119283a
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:09 +0100

    rcu: Allocate WQ with WQ_MEM_RECLAIM bit set

    synchronize_rcu() users have to be processed regardless
    of memory pressure, so our private workqueue needs to have at least
    one execution context, which is what the WQ_MEM_RECLAIM flag guarantees.

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:39 -04:00
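
For illustration, a minimal sketch of allocating a workqueue with WQ_MEM_RECLAIM, using a hypothetical queue name; the flag guarantees a rescuer thread, i.e. at least one execution context even under memory pressure:

    #include <linux/workqueue.h>

    static struct workqueue_struct *sr_wq;  /* hypothetical name */

    static int __init sr_wq_init(void)
    {
            /*
             * WQ_MEM_RECLAIM provides a rescuer, so queued work keeps being
             * processed even when no new kworker can be spawned.
             */
            sr_wq = alloc_workqueue("sr_wq", WQ_MEM_RECLAIM, 0);
            return sr_wq ? 0 : -ENOMEM;
    }
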
Waiman Long 6b484a545e rcu: Support direct wake-up of synchronize_rcu() users
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 462df2f543ae360e79fcaa1b498d2a1a0c2a5b63
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:07 +0100

    rcu: Support direct wake-up of synchronize_rcu() users

    This patch introduces a small enhancement which allows the gp-kthread
    to directly wake up synchronize_rcu() callers after a grace period
    completes.

    The number of users woken this way is limited by a hard-coded maximum
    threshold. The remaining part, if any, is deferred to a main worker.

    Link: https://lore.kernel.org/lkml/Zd0ZtNu+Rt0qXkfS@lothringen/

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:38 -04:00
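
For illustration, a rough sketch of the scheme described above, with hypothetical helper names: the grace-period kthread wakes a bounded number of waiters directly and defers any remainder to a worker:

    #define SR_MAX_USERS_WAKE_FROM_GP 5  /* illustrative hard-coded threshold */

    static void sr_cleanup_fn(struct work_struct *unused) { /* wake the rest */ }
    static DECLARE_WORK(sr_cleanup_work, sr_cleanup_fn);

    static void rcu_sr_normal_gp_cleanup(void)
    {
            struct completion *done;
            int count = 0;

            /* Direct wake-ups, performed by the gp-kthread itself. */
            while (count < SR_MAX_USERS_WAKE_FROM_GP &&
                   (done = sr_next_waiter()) != NULL) {  /* hypothetical helper */
                    complete(done);
                    count++;
            }

            /* The remaining part, if any, is deferred to the main worker. */
            if (sr_have_waiters())                        /* hypothetical helper */
                    queue_work(system_highpri_wq, &sr_cleanup_work);
    }
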
Waiman Long b3fce2b662 rcu: Add a trace event for synchronize_rcu_normal()
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 2053937a310a3982de9d33af3db2dbd2b32b66e4
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:06 +0100

    rcu: Add a trace event for synchronize_rcu_normal()

    Add an rcu_sr_normal() trace event. It takes three arguments: the
    first is the name of the RCU flavour, the second is the user id that
    triggered synchronize_rcu_normal(), and the last is an event.

    There are two trace points in synchronize_rcu_normal(): one on entry,
    when a new request is registered, and one on exit, when the request
    is completed.

    Please note that CONFIG_RCU_TRACE=y is required to activate the traces.

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:37 -04:00
Waiman Long e62041bc08 rcu: Reduce synchronize_rcu() latency
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 988f569ae041ccc93a79d98d1b0043dff4d7e9b7
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri, 8 Mar 2024 18:34:05 +0100

    rcu: Reduce synchronize_rcu() latency

    A call to synchronize_rcu() can be optimized from a latency
    point of view. Workloads which depend on it can benefit from this.

    The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
    depends on several factors:

    - how fast the offloading process is started. Combination of:
        - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
        - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
        - other.
    - once started, whether the invoking path is interrupted due to:
        - time limit;
        - need_resched();
        - a limit being reached.
    - where in the nocb list the callback is located;
    - how fast previous callbacks complete.

    Example:

    1. On our embedded devices I can easily trigger a scenario in which
    the wakeup callback is the last in a list of ~3600 callbacks:

    <snip>
      <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
    ...
      <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
      <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
      <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
      <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
      <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
      <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
      <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
    <snip>

    2. We use cpuset/cgroup to classify tasks and assign them into
    different cgroups. For example, a "background" group binds tasks
    only to little CPUs, whereas a "foreground" group makes use of all CPUs.
    Tasks can be migrated between groups on request if an acceleration
    is needed.

    See below an example of how the "surfaceflinger" task gets migrated.
    Initially it is located in the "system-background" cgroup, which
    allows it to run only on little cores. In order to speed it up, it
    can be temporarily moved into the "foreground" cgroup, which allows
    it to use big/all CPUs:

    cgroup_attach_task():
     -> cgroup_migrate_execute()
       -> cpuset_can_attach()
         -> percpu_down_write()
           -> rcu_sync_enter()
             -> synchronize_rcu()
       -> now move tasks to the new cgroup.
     -> cgroup_migrate_finish()

    <snip>
             rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
        PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
       TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
    <snip>

    "Boosting a task" depends on synchronize_rcu() latency:

    - first trace shows a completion of synchronize_rcu();
    - second shows attaching a task to a new group;
    - last shows a final step when migration occurs.

    3. To address this drawback, maintain a separate track that consists
    of synchronize_rcu() callers only. After completion of a grace period,
    those users are deferred to a dedicated worker that processes the requests.

    4. This patch reduces the latency of synchronize_rcu() by approximately
    30-40% on synthetic tests. The real test case, camera launch time,
    shows (time is in milliseconds):

    1-run 542 vs 489 improvement 9%
    2-run 540 vs 466 improvement 13%
    3-run 518 vs 468 improvement 9%
    4-run 531 vs 457 improvement 13%
    5-run 548 vs 475 improvement 13%
    6-run 509 vs 484 improvement 4%

    Synthetic test(no "noise" from other callbacks):
    Hardware: x86_64 64 CPUs, 64GB of memory
    Linux-6.6

    - 10K tasks(simultaneous);
    - each task does(1000 loops)
         synchronize_rcu();
         kfree(p);

    default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
    patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.

    Running 60K tasks gives approximately the same results on my setup. Please
    note that this is without any interaction with other types of callbacks;
    otherwise that would heavily impact the default case.

    5. By default it is disabled. To enable it, use one of the
    following:

    echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
    or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
    Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:37 -04:00
Waiman Long 7804cac54b rcu: Make hotplug operations track GP state, not flags
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit ae2b217ab542d0db0ca1a6de4f442201a1982f00
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 8 Mar 2024 11:15:01 -0800

    rcu: Make hotplug operations track GP state, not flags

    Currently, there are rcu_data structure fields named ->rcu_onl_gp_seq
    and ->rcu_ofl_gp_seq that track the rcu_state.gp_flags field at the
    time of the corresponding CPU's last online or offline operation,
    respectively.  However, this information is not particularly useful.
    It would be better to instead track the grace period state kept
    in rcu_state.gp_state.  This would also be consistent with the
    initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED
    (an rcu_state.gp_state value), and also with the diagnostics in
    rcu_implicit_dynticks_qs(), whose format is consistent with an integer,
    not a bitmask.

    This commit therefore makes this change and changes the names to
    ->rcu_onl_gp_flags and ->rcu_ofl_gp_flags, respectively.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:32 -04:00
Waiman Long 20bc7962e4 rcu: Mark loads from rcu_state.n_online_cpus
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 09e077cf22c4302ab4ca7932f56c5a8b20c9e32b
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 7 Mar 2024 20:14:55 -0800

    rcu: Mark loads from rcu_state.n_online_cpus

    The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug
    operations, which are serialized.  However, this value is read locklessly.
    This commit therefore marks those reads.  While in the area, it also
    adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug
    becomes a thing.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:32 -04:00
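
For illustration, the marking pattern described above in compressed form, with hypothetical function names around the real READ_ONCE()/WRITE_ONCE()/ASSERT_EXCLUSIVE_WRITER() primitives:

    static int n_online_cpus_lockless(void)
    {
            /* Lockless reader (e.g. diagnostics): mark the racy load. */
            return READ_ONCE(rcu_state.n_online_cpus);
    }

    static void note_cpu_online(void)  /* serialized CPU-hotplug path */
    {
            /* Document that CPU hotplug is the only writer. */
            ASSERT_EXCLUSIVE_WRITER(rcu_state.n_online_cpus);
            WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1);
    }
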
Waiman Long 9b77bfb165 rcu: Remove redundant READ_ONCE() of rcu_state.gp_flags in tree.c
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 62bb24c4b022b9ba9cf2e4a72f6cd8c3086f0cf8
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 7 Mar 2024 15:01:54 -0800

    rcu: Remove redundant READ_ONCE() of rcu_state.gp_flags in tree.c

    Although it is functionally OK to do READ_ONCE() of a variable that
    cannot change, it is confusing and at best an accident waiting to happen.
    This commit therefore removes a number of READ_ONCE(rcu_state.gp_flags)
    instances from kernel/rcu/tree.c that are not needed due to updates
    to this field being excluded by virtue of holding the root rcu_node
    structure's ->lock.

    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#mccb23c2a4902da4d3c750165329f8de056903c58
    Reported-by: Julia Lawall <julia.lawall@inria.fr>
    Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#md1b5c026584f9c3c7b0fbc9240dd7de584597b73
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:30 -04:00
Waiman Long 84edc80b1a rcu: Add lockdep checks and kernel-doc header to rcu_softirq_qs()
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 3183059ad82a0daa8292daf43c325bac57daceb5
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 27 Feb 2024 15:07:25 -0800

    rcu: Add lockdep checks and kernel-doc header to rcu_softirq_qs()

    There are some indications that rcu_softirq_qs() might be more generally
    used than anticipated.  This commit therefore adds some lockdep assertions
    and some cautionary tales in a new kernel-doc header.

    Link: https://lore.kernel.org/all/Zd4DXTyCf17lcTfq@debian.debian/

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Yan Zhai <yan@cloudflare.com>
    Cc: <netdev@vger.kernel.org>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:25 -04:00
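
For illustration, the sort of lockdep assertion the commit describes, sketched from the existing RCU_LOCKDEP_WARN()/lock_is_held() primitives (the exact checks and wording in the upstream patch may differ):

    void rcu_softirq_qs(void)
    {
            RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
                             lock_is_held(&rcu_lock_map) ||
                             lock_is_held(&rcu_sched_lock_map),
                             "Illegal rcu_softirq_qs() in RCU read-side critical section");

            /* ... report the quiescent states as before ... */
    }
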
Waiman Long a7ee6faa72 rcu: Provide a boot time parameter to control lazy RCU
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 7f66f099de4dc4b1a66a3f94e6db16409924a6f8
Author: Qais Yousef <qyousef@layalina.io>
Date:   Sun, 3 Dec 2023 01:12:52 +0000

    rcu: Provide a boot time parameter to control lazy RCU

    To allow more flexible arrangements while still providing a single kernel
    for distros, provide a boot time parameter to enable/disable lazy RCU.

    Specify:

            rcutree.enable_rcu_lazy=[y|1|n|0]

    Which also requires

            rcu_nocbs=all

    at boot time to enable/disable lazy RCU.

    To disable it by default at build time when CONFIG_RCU_LAZY=y, the new
    CONFIG_RCU_LAZY_DEFAULT_OFF can be used.

    Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
    Tested-by: Andrea Righi <andrea.righi@canonical.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:22 -04:00
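
For illustration, a compressed sketch of how such a boot-time switch is typically wired up, assuming the decision is simply passed down to the common enqueue path; treat it as a sketch rather than the exact upstream diff:

    /*
     * Default follows CONFIG_RCU_LAZY_DEFAULT_OFF; overridable at boot with
     * rcutree.enable_rcu_lazy=[y|1|n|0] (lazy RCU also needs rcu_nocbs=all).
     */
    static bool enable_rcu_lazy __read_mostly =
            !IS_ENABLED(CONFIG_RCU_LAZY_DEFAULT_OFF);
    module_param(enable_rcu_lazy, bool, 0444);

    void call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
            __call_rcu_common(head, func,
                              IS_ENABLED(CONFIG_RCU_LAZY) && enable_rcu_lazy);
    }
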
Waiman Long 34abfbfa3e rcu-tasks: Initialize callback lists at rcu_init() time
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 30ef09635b9ed3ebca4f677495332a2e444a5cda
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 22 Feb 2024 12:29:54 -0800

    rcu-tasks: Initialize callback lists at rcu_init() time

    In order for RCU Tasks to reliably maintain per-CPU lists of exiting
    tasks, those lists must be initialized before it is possible for tasks
    to exit, especially given that the boot CPU is not necessarily CPU 0
    (an example being powerpc kexec() kernels).  And at the time that
    rcu_init_tasks_generic() is called, a task could potentially exit,
    unconventional though that sort of thing might be.

    This commit therefore moves the calls to cblist_init_generic() from
    functions called from rcu_init_tasks_generic() to a new function named
    tasks_cblist_init_generic() that is invoked from rcu_init().

    This constituted a bug in a commit that never went to mainline, so
    there is no need for any backporting to -stable.

    Reported-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:17 -04:00
Waiman Long c0a9325f29 rcu/exp: Remove rcu_par_gp_wq
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 23da2ad64dbe9f3fab10af90484fe41e144337b1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:21 +0100

    rcu/exp: Remove rcu_par_gp_wq

    TREE04 running on short iterations can produce writer stalls of the
    following kind:

     ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
     task:rcu_torture_wri state:D stack:14568 pid:83    ppid:2      flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2de/0x850
      ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
      schedule+0x4f/0x90
      synchronize_rcu_expedited+0x430/0x670
      ? __pfx_autoremove_wake_function+0x10/0x10
      ? __pfx_synchronize_rcu_expedited+0x10/0x10
      do_rtws_sync.constprop.0+0xde/0x230
      rcu_torture_writer+0x4b4/0xcd0
      ? __pfx_rcu_torture_writer+0x10/0x10
      kthread+0xc7/0xf0
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x2f/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1b/0x30
      </TASK>

    Waiting for an expedited grace period and polling for an expedited
    grace period both are operations that internally rely on the same
    workqueue performing necessary asynchronous work.

    However, a dependency chain is involved between those two operations,
    as depicted below:

           ====== CPU 0 =======                          ====== CPU 1 =======

                                                         synchronize_rcu_expedited()
                                                             exp_funnel_lock()
                                                                 mutex_lock(&rcu_state.exp_mutex);
        start_poll_synchronize_rcu_expedited
            queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                             synchronize_rcu_expedited_queue_work()
                                                                 queue_work(rcu_gp_wq, &rew->rew_work);
                                                             wait_event() // A, wait for &rew->rew_work completion
                                                             mutex_unlock() // B
        //======> switch to kworker

        sync_rcu_do_polled_gp() {
            synchronize_rcu_expedited()
                exp_funnel_lock()
                    mutex_lock(&rcu_state.exp_mutex); // C, wait B
                    ....
        } // D

    Since workqueues are usually implemented on top of several kworkers
    handling the queue concurrently, the above situation wouldn't deadlock
    most of the time because A then doesn't depend on D. But in case of
    memory stress, a single kworker may end up handling alone all the works
    in a serialized way. In that case the above layout becomes a problem
    because A then waits for D, closing a circular dependency:

            A -> D -> C -> B -> A

    This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed
    synchronize_rcu_expedited() is otherwise implemented on top of a kthread
    worker while polling still relies on rcu_gp_wq workqueue, breaking the
    above circular dependency chain.

    Fix this by making the expedited grace period always rely on the kthread
    worker. The workqueue-based implementation is essentially a duplicate
    anyway, now that the per-node initialization is performed by per-node
    kthread workers.

    Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
    manage the scheduler policy of these kthread workers.

    Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
    Reported-by: Thomas Gleixner <tglx@linutronix.de>
    Suggested-by: Joel Fernandes <joel@joelfernandes.org>
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:15 -04:00
Waiman Long 2f23c68f4a rcu/exp: Handle parallel exp gp kworkers affinity
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit b67cffcbbf9dc759d95d330a5af5d1480af2b1f1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:20 +0100

    rcu/exp: Handle parallel exp gp kworkers affinity

    Affine the parallel expedited gp kworkers to their respective RCU node
    in order to make them close to the cache they are playing with.

    This reuses the boost kthreads machinery, which probes into CPU hotplug
    operations such that the kthreads become/stay affine to their respective
    nodes as soon/long as those nodes contain online CPUs. Otherwise, if the
    CPU going down was the last one online on the leaf node, the related
    kthread is affined to the housekeeping CPUs.

    In the long run, this affinity-vs-CPU-hotplug game should
    probably be implemented at the generic kthread level.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    [boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:14 -04:00
Waiman Long d5ad8ad294 rcu/exp: Make parallel exp gp kworker per rcu node
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 8e5e621566485a3e160c0d8bfba206cb1d6b980d
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:19 +0100

    rcu/exp: Make parallel exp gp kworker per rcu node

    When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node
    initialization is performed in parallel via workqueues (one work per
    node).

    However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is
    performed by a single kworker serializing each node initialization (one
    work for all nodes).

    The second part is certainly less scalable and efficient beyond a single
    leaf node.

    To improve this, expand this single kworker into per-node kworkers. This
    new layout is eventually intended to remove the workqueues based
    implementation since it will essentially now become duplicate code.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:13 -04:00
Waiman Long 55b6c7a36d rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit c19e5d3b497a3036f800edf751dc7814e3e887e1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:18 +0100

    rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()

    The expedited kthread worker performing the per node initialization is
    going to be split into per node kthreads. As such, the future per node
    kthread creation will need to be called from CPU hotplug callbacks
    instead of an initcall, right beside the per node boost kthread
    creation.

    To prepare for that, move the kthread worker creation above
    rcutree_prepare_cpu() as a first step to make the review smoother for
    the upcoming modifications.

    No intended functional change.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:12 -04:00
Waiman Long 119acfe64c rcu: s/boost_kthread_mutex/kthread_mutex
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 7836b270607676ed1c0c6a4a840a2ede9437a6a1
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:17 +0100

    rcu: s/boost_kthread_mutex/kthread_mutex

    This mutex currently protects per-node boost kthread creation and
    affinity setting across CPU hotplug operations.

    Since the expedited kworkers will soon be split per node as well, they
    will be subject to the same concurrency constraints against hotplug.

    Therefore their creation and affinity tuning operations will be grouped
    with those of the boost kthreads and will then rely on the same mutex.

    To prepare for that, generalize its name.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:12 -04:00
Waiman Long be64d66573 rcu/nocb: Re-arrange call_rcu() NOCB specific code
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit afd4e6964745ed98b74cacdcce21d73280a0a253
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 9 Jan 2024 23:24:01 +0100

    rcu/nocb: Re-arrange call_rcu() NOCB specific code

    Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
    code in a complicated way such that:

    * The bypass enqueue code may or may not have enqueued and may or may
      not have locked the ->nocb_lock. Everything that follows is in a
      Schrödinger locking state for the unwary reviewer's eyes.

    * The was_alldone flag is always set but is only used in NOCB-related code.

    * The NOCB wake up is distantly related to the locking hopefully
      performed by the bypass enqueue code that did not enqueue on the
      bypass list.

    Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to
    their own functions.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:09 -04:00
Waiman Long 7a42078785 rcu/nocb: Make IRQs disablement symmetric
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit b913c3fe685e0aec80130975b0f330fd709ff324
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 9 Jan 2024 23:24:00 +0100

    rcu/nocb: Make IRQs disablement symmetric

    Currently IRQs are disabled on call_rcu() and then depending on the
    context:

    * If the CPU is in nocb mode:

       - If the callback is enqueued in the bypass list, IRQs are re-enabled
         implicitly by rcu_nocb_try_bypass()

       - If the callback is enqueued in the normal list, IRQs are re-enabled
         implicitly by __call_rcu_nocb_wake()

    * If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu()

    This makes the code a bit hard to follow, especially as it interleaves
    with nocb locking.

    To make the IRQ flags coverage clearer and also in order to prepare for
    moving all the nocb enqueue code to its own function, always re-enable
    the IRQ flags explicitly from call_rcu().

    Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:08 -04:00
Waiman Long 5c060a1f72 rcu/nocb: Remove needless full barrier after callback advancing
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit 1e8e6951a5774c8dd9d1f14af9c5b7d66130d96f
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 15 Nov 2023 14:11:28 -0500

    rcu/nocb: Remove needless full barrier after callback advancing

    A full barrier is issued from nocb_gp_wait() upon callbacks advancing
    to order grace period completion with callbacks execution.

    However these two events are already ordered by the
    smp_mb__after_unlock_lock() barrier within the call to
    raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
    happen.

    The following litmus test shows the kind of guarantee that this barrier
    provides:

            C smp_mb__after_unlock_lock

            {}

            // rcu_gp_cleanup()
            P0(spinlock_t *rnp_lock, int *gpnum)
            {
                    // Grace period cleanup increase gp sequence number
                    spin_lock(rnp_lock);
                    WRITE_ONCE(*gpnum, 1);
                    spin_unlock(rnp_lock);
            }

            // nocb_gp_wait()
            P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
            {
                    int r1;

                    // Call rcu_advance_cbs() from nocb_gp_wait()
                    spin_lock(nocb_lock);
                    spin_lock(rnp_lock);
                    smp_mb__after_unlock_lock();
                    r1 = READ_ONCE(*gpnum);
                    WRITE_ONCE(*cb_ready, 1);
                    spin_unlock(rnp_lock);
                    spin_unlock(nocb_lock);
            }

            // nocb_cb_wait()
            P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
            {
                    int r2;

                    // rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
                    spin_lock(nocb_lock);
                    r2 = READ_ONCE(*cb_ready);
                    spin_unlock(nocb_lock);

                    // Actual callback execution
                    WRITE_ONCE(*cb_executed, 1);
            }

            P3(int *cb_executed, int *gpnum)
            {
                    int r3;

                    WRITE_ONCE(*cb_executed, 2);
                    smp_mb();
                    r3 = READ_ONCE(*gpnum);
            }

            exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)

    Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
    removed. This barrier orders the grace period completion against
    callbacks advancing and even later callbacks invocation, thanks to the
    opportunistic propagation via the ->nocb_lock to nocb_cb_wait().

    Therefore the smp_mb() placed after callbacks advancing can be safely
    removed.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:08 -04:00
Waiman Long bb50b0eb40 rcu: Defer RCU kthreads wakeup when CPU is dying
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit e787644caf7628ad3269c1fbd321c3255cf51710
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue, 19 Dec 2023 00:19:15 +0100

    rcu: Defer RCU kthreads wakeup when CPU is dying

    When the CPU goes idle for the last time during the CPU down hotplug
    process, RCU reports a final quiescent state for the current CPU. If
    this quiescent state propagates up to the top, some tasks may then be
    woken up to complete the grace period: the main grace period kthread
    and/or the expedited main workqueue (or kworker).

    If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
    arm the RT bandwidth timer on the local offline CPU. Since this happens
    after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
    timer gets ignored. Therefore if the RCU kthreads are waiting for RT
    bandwidth to be available, they may never be actually scheduled.

    This triggers TREE03 rcutorture hangs:

             rcu: INFO: rcu_preempt self-detected stall on CPU
             rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
             rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
             rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
             rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
             rcu: RCU grace-period kthread stack dump:
             task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
             Call Trace:
              <TASK>
              __schedule+0x2eb/0xa80
              schedule+0x1f/0x90
              schedule_timeout+0x163/0x270
              ? __pfx_process_timeout+0x10/0x10
              rcu_gp_fqs_loop+0x37c/0x5b0
              ? __pfx_rcu_gp_kthread+0x10/0x10
              rcu_gp_kthread+0x17c/0x200
              kthread+0xde/0x110
              ? __pfx_kthread+0x10/0x10
              ret_from_fork+0x2b/0x40
              ? __pfx_kthread+0x10/0x10
              ret_from_fork_asm+0x1b/0x30
              </TASK>

    The situation can't be solved with just unpinning the timer. The hrtimer
    infrastructure and the nohz heuristics involved in finding the best
    remote target for an unpinned timer would then also need to handle
    enqueues from an offline CPU in the most horrendous way.

    So fix this on the RCU side instead and defer the wake up to an online
    CPU if it's too late for the local one.

    Reported-by: Paul E. McKenney <paulmck@kernel.org>
    Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:06 -04:00
Waiman Long 50a1d64f59 rcu: Force quiescent states only for ongoing grace period
JIRA: https://issues.redhat.com/browse/RHEL-55557

commit dee39c0c1e9624f925da4ca0bece46bdc7427257
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Wed, 1 Nov 2023 11:35:07 +0800

    rcu: Force quiescent states only for ongoing grace period

    If an rcutorture test scenario creates an fqs_task kthread, it will
    periodically invoke rcu_force_quiescent_state() in order to start
    force-quiescent-state (FQS) operations.  However, an FQS operation
    will be started even if there is no RCU grace period in progress.
    Although testing FQS operations startup when there is no grace period in
    progress is necessary, it need not happen all that often.  This commit
    therefore causes rcu_force_quiescent_state() to take an early exit
    if there is no grace period in progress.

    Note that there will still be attempts to start an FQS scan in the
    absence of a grace period because the grace period might end right
    after the rcu_force_quiescent_state() function's check.  In actual
    testing, this happens about once every ten minutes, which should
    provide adequate testing.

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-08-26 10:57:02 -04:00
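
For illustration, a minimal sketch of the early exit described above:

    void rcu_force_quiescent_state(void)
    {
            /* No grace period in progress, so there is nothing to force. */
            if (!rcu_gp_in_progress())
                    return;

            /* ... existing funnel-locking and FQS-request logic ... */
    }
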
Waiman Long 8636b4de2f rcu/exp: Handle RCU expedited grace period kworker allocation failure
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit e7539ffc9a770f36bacedcf0fbfb4bf2f244f4a5
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:16 +0100

    rcu/exp: Handle RCU expedited grace period kworker allocation failure

    Just as is done for the kworker performing the nodes initialization,
    gracefully handle the possible allocation failure of the RCU expedited
    grace period main kworker.

    While at it, rename the related checking functions to better
    reflect the expedited specifics.

    Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
    Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:18 -04:00
Waiman Long 75fafe02a2 rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit a636c5e6f8fc34be520277e69c7c6ee1d4fc1d17
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 12 Jan 2024 16:46:15 +0100

    rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery

    Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
    grace periods is queued to a kworker. However, if the allocation of that
    kworker failed, the nodes initialization is performed synchronously by
    the caller instead.

    Now the check for kworker initialization failure relies on the kworker
    pointer being NULL, while its value might actually encapsulate an
    allocation failure error.

    Make sure to handle this case.

    Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
    Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:18 -04:00
Waiman Long 3d0e0b40e3 rcu: Break rcu_node_0 --> &rq->__lock order
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 85d68222ddc5f4522e456d97d201166acb50f716
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue, 31 Oct 2023 09:53:08 +0100

    rcu: Break rcu_node_0 --> &rq->__lock order

    Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
    do_set_cpus_allowed()") added a kfree() call to free any user
    provided affinity mask, if present. It was changed later to use
    kfree_rcu() in commit 9a5418bc48ba ("sched/core: Use kfree_rcu()
    in do_set_cpus_allowed()") to avoid a circular locking dependency
    problem.

    It turns out that even kfree_rcu() isn't safe for avoiding
    circular locking problem. As reported by kernel test robot,
    the following circular locking dependency now exists:

      &rdp->nocb_lock --> rcu_node_0 --> &rq->__lock

    Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving
    the resched_cpu() out from under rcu_node lock.

    [peterz: heavily borrowed from Waiman's Changelog]
    [paulmck: applied Z qiang feedback]

    Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Acked-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:18 -04:00
Waiman Long b91a2b524c rcu/tree: Defer setting of jiffies during stall reset
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit b96e7a5fa0ba9cda32888e04f8f4bac42d49a7f8
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Tue, 5 Sep 2023 00:02:11 +0000

    rcu/tree: Defer setting of jiffies during stall reset

    There are instances where rcu_cpu_stall_reset() is called when jiffies
    did not get a chance to update for a long time. Before jiffies is
    updated, the CPU stall detector can go off triggering false-positives
    where a just-started grace period appears to be ages old. In the past,
    we disabled stall detection in rcu_cpu_stall_reset(); however, this got
    changed [1]. This results in false positives in the KGDB use case [2].

    Fix this by deferring the update of jiffies to the third run of the FQS
    loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
    just before jiffies is read, we would end up pushing out the jiffies
    read by 3 more FQS loops. Meanwhile the CPU stall detection will be
    delayed and we will not get any false positives.

    [1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/
    [2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/

    Tested with rcutorture.cpu_stall option as well to verify stall behavior
    with/without patch.

    Tested-by: Huacai Chen <chenhuacai@loongson.cn>
    Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
    Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
    Suggested-by: Paul  McKenney <paulmck@kernel.org>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:18 -04:00
Waiman Long 7836f9aae1 rcu: Conditionally build CPU-hotplug teardown callbacks
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 2cb1f6e9a743af58a23cf14563b5eada1e0d3fde
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 8 Sep 2023 22:36:00 +0200

    rcu: Conditionally build CPU-hotplug teardown callbacks

    Among the three CPU-hotplug teardown RCU callbacks, two of them exit
    early if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case,
    all of them have an implementation when CONFIG_HOTPLUG_CPU=n.

    Align instead with the common way to deal with CPU-hotplug teardown
    callbacks and provide a proper stub when they are not supported.

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:17 -04:00
Waiman Long 5cc479d257 rcu: Assume rcu_report_dead() is always called locally
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit c964c1f5ee96e1460606d44f80a47bdacd8fe568
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 8 Sep 2023 22:35:59 +0200

    rcu: Assume rcu_report_dead() is always called locally

    rcu_report_dead() has to be called locally by the CPU that is going to
    exit the RCU state machine. Passing a cpu argument here is error-prone
    and leaves the possibility for a racy remote call.

    Use local access instead.

    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:17 -04:00
Waiman Long 824c887325 rcu: Assume IRQS disabled from rcu_report_dead()
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 358662a9616c5078dc4d389d6bceeb5974f4aa97
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Fri, 8 Sep 2023 22:35:58 +0200

    rcu: Assume IRQS disabled from rcu_report_dead()

    rcu_report_dead() is the last RCU word from the CPU down through the
    hotplug path. It is called in the idle loop right before the CPU shuts
    down for good. Because it removes the CPU from the grace period state
    machine and reports an ultimate quiescent state if necessary, no further
    use of RCU is allowed. Therefore it is expected that IRQs are disabled
    upon calling this function and are not to be re-enabled again until the
    CPU shuts down.

    Remove the IRQs disablement from that function and verify instead that
    it is actually called with IRQs disabled as it is expected at that
    special point in the idle path.

    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:17 -04:00
Waiman Long 683e6f9676 rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 5f98fd034ca6fd1ab8c91a3488968a0e9caaabf6
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Sat, 30 Sep 2023 17:46:56 +0000

    rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects

    Since the actual slab freeing is deferred when calling kvfree_rcu(), so
    is the kmemleak_free() callback informing kmemleak of the object
    deletion. From the perspective of the kvfree_rcu() caller, the object is
    freed and it may remove any references to it. Since kmemleak does not
    scan RCU internal data storing the pointer, it will report such objects
    as leaks during the grace period.

    Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
    that the tiny RCU implementation does not have such issue since the
    objects can be tracked from the rcu_ctrlblk structure.

    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Reported-by: Christoph Paasch <cpaasch@apple.com>
    Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
    Cc: <stable@vger.kernel.org>
    Tested-by: Christoph Paasch <cpaasch@apple.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:17 -04:00
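
For illustration, a compressed sketch of the idea: once an object has been handed to kvfree_rcu(), tell kmemleak to stop treating it as a potential leak while it sits in RCU's internal pointer storage (the placement inside kvfree_call_rcu() is simplified here):

    #include <linux/kmemleak.h>

    void kvfree_call_rcu(struct rcu_head *head, void *ptr)
    {
            /* ... existing queueing of ptr for deferred freeing ... */

            /*
             * kmemleak cannot scan RCU's internal pointer arrays, so without
             * this it would report ptr as leaked until the grace period ends
             * and the object is actually freed.
             */
            kmemleak_ignore(ptr);
    }
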
Waiman Long baa1b0fbd8 rcu: Eliminate rcu_gp_slow_unregister() false positive
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 0ae9942f03d0d034fdb0a4f44fc99f62a3107987
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 18 Aug 2023 08:53:58 -0700

    rcu: Eliminate rcu_gp_slow_unregister() false positive

    When using rcutorture as a module, there are a number of conditions that
    can abort the modprobe operation, for example, when attempting to run
    both RCU CPU stall warning tests and forward-progress tests.  This can
    cause rcu_torture_cleanup() to be invoked on the unwind path out of
    rcu_torture_init(), which will mean that rcu_gp_slow_unregister()
    is invoked without a matching rcu_gp_slow_register().  This will cause
    a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay,
    which does not match a NULL pointer.

    This commit therefore forgives a mismatch involving a NULL pointer, thus
    avoiding this false-positive splat.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:16 -04:00
Waiman Long 2ef5742a82 rcu: Dump memory object info if callback function is invalid
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 2cbc482d325ee58001472c4359b311958c4efdd1
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Sat, 5 Aug 2023 11:17:26 +0800

    rcu: Dump memory object info if callback function is invalid

    When a structure containing an RCU callback rhp is (incorrectly) freed
    and reallocated after rhp is passed to call_rcu(), it is not unusual for
    rhp->func to be set to NULL. This defeats the debugging prints used by
    __call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
    which expect to identify the offending code using the identity of this
    function.

    And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
    are even worse, as can be seen from this splat:

    Unable to handle kernel NULL pointer dereference at virtual address 0
    ... ...
    PC is at 0x0
    LR is at rcu_do_batch+0x1c0/0x3b8
    ... ...
     (rcu_do_batch) from (rcu_core+0x1d4/0x284)
     (rcu_core) from (__do_softirq+0x24c/0x344)
     (__do_softirq) from (__irq_exit_rcu+0x64/0x108)
     (__irq_exit_rcu) from (irq_exit+0x8/0x10)
     (irq_exit) from (__handle_domain_irq+0x74/0x9c)
     (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
     (gic_handle_irq) from (__irq_svc+0x5c/0x94)
     (__irq_svc) from (arch_cpu_idle+0x20/0x3c)
     (arch_cpu_idle) from (default_idle_call+0x4c/0x78)
     (default_idle_call) from (do_idle+0xf8/0x150)
     (do_idle) from (cpu_startup_entry+0x18/0x20)
     (cpu_startup_entry) from (0xc01530)

    This commit therefore adds calls to mem_dump_obj(rhp) to output some
    information, for example:

      slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256

    This provides the rough size of the memory block and the offset of the
    rcu_head structure, which at least provides a few clues to help
    locate the problem. If the problem is reproducible, additional slab
    debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
    provide significantly more information.
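
    A sketch of what such a diagnostic can look like; mem_dump_obj() is
    the existing helper that prints slab/vmalloc provenance for a pointer,
    and the surrounding function is a hypothetical stand-in for the
    callback sanity check:

      #include <linux/mm.h>           /* mem_dump_obj() */
      #include <linux/printk.h>
      #include <linux/rcupdate.h>

      static void report_invalid_rcu_callback(struct rcu_head *rhp)
      {
              pr_err("Suspicious RCU callback %pS for rcu_head %px\n",
                     rhp->func, rhp);
              /* Prints e.g. "slab kmalloc-256 start ... offset 0 size 256". */
              mem_dump_obj(rhp);
      }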

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:16 -04:00
Waiman Long 76702ef926 rcu: Add sysfs to provide throttled access to rcu_barrier()
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 16128b1f8c823438dcd3f3b4a57cbe7267bcf82f
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 1 Aug 2023 17:15:25 -0700

    rcu: Add sysfs to provide throttled access to rcu_barrier()

    When running a series of stress tests all making heavy use of RCU,
    it is all too possible to OOM the system when the prior test's RCU
    callbacks don't get invoked until after the subsequent test starts.
    One way of handling this is just a timed wait, but this fails when a
    given CPU has so many callbacks queued that they take longer to invoke
    than allowed for by that timed wait.

    This commit therefore adds an rcutree.do_rcu_barrier module parameter that
    is accessible from sysfs.  Writing one of the many synonyms for boolean
    "true" will cause an rcu_barrier() to be invoked, but will guarantee that
    no more than one rcu_barrier() will be invoked per sixteenth of a second
    via this mechanism.  The flip side is that a given request might wait a
    second or three longer than absolutely necessary, but only when there are
    multiple uses of rcutree.do_rcu_barrier within a one-second time interval.

    This commit unnecessarily serializes the rcu_barrier() machinery, given
    that serialization is already provided by procfs.  This has the advantage
    of allowing throttled rcu_barrier() from other sources within the kernel.
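
    A simplified sketch of such a throttled module-parameter handler; the
    function name and the "wait out the window" policy shown here are
    illustrative, not the exact upstream logic:

      static DEFINE_MUTEX(do_rcu_barrier_mutex);
      static unsigned long rcu_barrier_next_allowed;  /* 0 until first use */

      static int param_set_do_rcu_barrier(const char *val,
                                          const struct kernel_param *kp)
      {
              bool req;
              int ret = kstrtobool(val, &req);

              if (ret || !req)
                      return ret;

              mutex_lock(&do_rcu_barrier_mutex);
              /* Keep successive rcu_barrier() calls at least HZ/16 apart. */
              if (rcu_barrier_next_allowed &&
                  time_before(jiffies, rcu_barrier_next_allowed))
                      schedule_timeout_uninterruptible(rcu_barrier_next_allowed - jiffies);
              rcu_barrier();
              rcu_barrier_next_allowed = jiffies + HZ / 16;
              mutex_unlock(&do_rcu_barrier_mutex);
              return 0;
      }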

    Reported-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:16 -04:00
Waiman Long bd5c010cc7 rcu/tree: Remove superfluous return from void call_rcu* functions
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 4502138acc8f4139c94ce0ec0eee926f8805fbbc
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sat, 29 Jul 2023 14:27:36 +0000

    rcu/tree: Remove superfluous return from void call_rcu* functions

    The return keyword is not needed here.
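
    For illustration only (the third argument is a placeholder for the
    laziness flag, which varies by kernel version):

      /* Before: a "return" of a void-typed expression. */
      void call_rcu(struct rcu_head *head, rcu_callback_t func)
      {
              return __call_rcu_common(head, func, false);
      }

      /* After: the superfluous "return" keyword is dropped. */
      void call_rcu(struct rcu_head *head, rcu_callback_t func)
      {
              __call_rcu_common(head, func, false);
      }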

    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:16 -04:00
Waiman Long 3eb0b39ca1 rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 343640cb5b4ed349b3656ee4b100b36e2ae7e2da
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon, 26 Jun 2023 18:37:05 -0700

    rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load

    The rcu_request_urgent_qs_task() function does a cross-CPU store
    to ->rcu_urgent_qs, so this commit marks the load in
    __rcu_irq_enter_check_tick() with READ_ONCE().
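
    Roughly (simplified from __rcu_irq_enter_check_tick(); the surrounding
    conditions are omitted), the change is just marking the racy load:

      struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

      /*
       * ->rcu_urgent_qs can be stored to from another CPU by
       * rcu_request_urgent_qs_task(), so use a marked load rather than
       * a plain C-language read.
       */
      if (!READ_ONCE(rdp->rcu_urgent_qs))
              return;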

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:14 -04:00
Waiman Long 4beb7953b9 rcu: Clarify rcu_is_watching() kernel-doc comment
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit c924bf5a43e47d5591d527d39b474f8ca9e63c0e
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 23 May 2023 02:22:10 -0700

    rcu: Clarify rcu_is_watching() kernel-doc comment

    Make it clear that this function always returns either true or false
    without other planned failure modes.

    Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:14 -04:00
Waiman Long ffbeaff742 rcu/kvfree: Make drain_page_cache() take early return if cache is disabled
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 6b706e5603c44ff0b6f43c2e26e0d590e1d265f8
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue, 18 Apr 2023 20:27:02 +0800

    rcu/kvfree: Make drain_page_cache() take early return if cache is disabled

    If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
    then krcp->page_cache_work will never be triggered to fill page cache.
    In addition, the put_cached_bnode() will not fill page cache.  As a
    result krcp->bkvcache will always be empty, so there is no need to acquire
    krcp->lock to get a page from krcp->bkvcache.  This commit therefore makes
    drain_page_cache() return immediately if rcu_min_cached_objs is zero.
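
    A sketch of the early return, using the names from kernel/rcu/tree.c
    but with the drain loop itself elided:

      static int drain_page_cache(struct kfree_rcu_cpu *krcp)
      {
              int freed = 0;

              /* Cache disabled at boot: it can never hold any pages. */
              if (!rcu_min_cached_objs)
                      return 0;

              /*
               * ... otherwise take krcp->lock, pop each page from
               * krcp->bkvcache, free it, and count it in "freed" ...
               */
              return freed;
      }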

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long b3a6fe7cb6 rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objs
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 60888b77a06ea16665e4df980bb86b418253e268
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Wed, 12 Apr 2023 22:31:27 +0800

    rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objs

    When the fill_page_cache_func() function is invoked, it assumes that
    the cache of pages is completely empty.  However, there can be some time
    between triggering execution of this function and its actual invocation.
    During this time, kfree_rcu_work() might run, and might fill in part or
    all of this cache of pages, thus invalidating the fill_page_cache_func()
    function's assumption.

    This will not overfill the cache because put_cached_bnode() will reject
    the extra page.  However, it will result in a needless allocation and
    freeing of one extra page, which might not be helpful under lowish-memory
    conditions.

    This commit therefore causes the fill_page_cache_func() to explicitly
    account for pages that have been placed into the cache shortly before
    it starts running.
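
    A sketch of the adjusted fill loop (simplified; the GFP flags and the
    low-memory backoff handling of the real fill_page_cache_func() are
    omitted):

      for (i = READ_ONCE(krcp->nr_bkv_objs); i < pages_to_fill; i++) {
              bnode = (struct kvfree_rcu_bulk_data *)
                              __get_free_page(GFP_KERNEL | __GFP_NOWARN);

              /* Stop if allocation fails or the cache filled up meanwhile. */
              if (!bnode || !put_cached_bnode(krcp, bnode)) {
                      free_page((unsigned long)bnode);
                      break;
              }
      }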

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long 1b40a8c903 rcu/kvfree: Do not run a page work if a cache is disabled
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 021a5ff8474379cd6c23e9b0e97aa27e5ff66a8b
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue, 11 Apr 2023 15:13:41 +0200

    rcu/kvfree: Do not run a page work if a cache is disabled

    By default the cache size is 5 pages per CPU, but it can be disabled at
    boot time by setting the rcu_min_cached_objs to zero.  When that happens,
    the current code will uselessly set an hrtimer to schedule refilling this
    cache with zero pages.  This commit therefore streamlines this process
    by simply refusing to set the hrtimer when rcu_min_cached_objs is zero.

    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long 05dcb1d94b rcu/kvfree: Use consistent krcp when growing kfree_rcu() page cache
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 309a4316507767f8078d30c9681dc76f4299b0f1
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Sat, 8 Apr 2023 22:25:30 +0800

    rcu/kvfree: Use consistent krcp when growing kfree_rcu() page cache

    The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
    kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
    The kfree_rcu_cpu structure's lock is used to protect this operation,
    except that this lock must be momentarily dropped when allocating memory.
    It is clearly important that the lock that is reacquired be the same
    lock that was acquired initially via krc_this_cpu_lock().

    Unfortunately, this same krc_this_cpu_lock() function is used to
    re-acquire this lock, and if the task migrated to some other CPU during
    the memory allocation, this will result in the kvfree_rcu_bulk_data
    structure being added to the wrong CPU's kfree_rcu_cpu structure.

    This commit therefore replaces that second call to krc_this_cpu_lock()
    with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
    the correct kfree_rcu_cpu structure, thus keeping things straight even
    when the task migrates.
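
    The shape of the fix, sketched with a hypothetical allocation helper;
    the important part is that the lock is reacquired on the saved krcp
    explicitly rather than by asking for "this CPU's" lock a second time:

      /* Drop the lock of the krcp we started with ... */
      krc_this_cpu_unlock(*krcp, *flags);

      bnode = alloc_bulk_page_sketch();       /* hypothetical; may sleep and migrate */

      /* ... and take back that same krcp's lock, wherever we now run. */
      raw_spin_lock_irqsave(&(*krcp)->lock, *flags);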

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long 27dcdec03f rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snap
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 1e237994d9c9a5ae47ae13030585a413a29469e6
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Wed, 5 Apr 2023 10:13:59 +0800

    rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snap

    If kvfree_rcu_bulk() sees that the required grace period has failed to
    elapse, it leaks the memory because readers might still be using it.
    But in that case, the debug-objects subsystem still marks the relevant
    structures as having been freed, even though they are instead being
    leaked.

    This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
    only when we are actually going to free the objects.

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long 5d315e18cc rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu list
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit f32276a37652a9ce05db27cdfb40ac3e3fc98f9f
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue, 4 Apr 2023 16:13:00 +0200

    rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu list

    Under low-memory conditions, kvfree_rcu() will use each object's
    rcu_head structure to queue objects in a singly linked list headed by
    the kfree_rcu_cpu structure's ->head field.  This list is passed to
    call_rcu() as a unit, but there is no indication of which grace period
    this list needs to wait for.  This in turn prevents adding debug checks
    in the kfree_rcu_work() as was done for the two page-of-pointers channels
    in the kfree_rcu_cpu structure.

    This commit therefore adds a ->head_free_gp_snap field to the
    kfree_rcu_cpu_work structure to record this grace-period number.  It also
    adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure
    that the required grace period has in fact elapsed.
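
    In sketch form (field name as described above, check placement
    illustrative), using the existing grace-period cookie API:

      /* When kfree_rcu_monitor() hands the ->head list over for freeing: */
      krwp->head_free_gp_snap = get_state_synchronize_rcu();

      /* When that list is later freed, the grace period must have passed: */
      WARN_ON_ONCE(!poll_state_synchronize_rcu(krwp->head_free_gp_snap));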

    [ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ]

    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long af78d7bb28 rcu/kvfree: Add debug to check grace periods
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit cdfa0f6fa6b7183c062046043b649b9a91e3ac52
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon, 3 Apr 2023 16:49:14 -0700

    rcu/kvfree: Add debug to check grace periods

    This commit adds debugging checks to verify that the required RCU
    grace period has elapsed for each kvfree_rcu_bulk_data structure that
    arrives at the kvfree_rcu_bulk() function.  These checks make use
    of that structure's ->gp_snap field, which has been upgraded from an
    unsigned long to an rcu_gp_oldstate structure.  This upgrade reduces
    the chances of false positives to nearly zero, even on 32-bit systems,
    for which this structure carries 64 bits of state.
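
    The upgraded snapshot uses the "full" polled grace-period API, roughly
    as follows (structure layout abbreviated):

      struct kvfree_rcu_bulk_data {
              /* ... */
              struct rcu_gp_oldstate gp_snap;         /* was: unsigned long */
      };

      /* Record the grace-period state when the block is queued ... */
      get_state_synchronize_rcu_full(&bnode->gp_snap);

      /* ... and check it in kvfree_rcu_bulk() before freeing anything. */
      WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&bnode->gp_snap));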

    Cc: Ziwei Dai <ziwei.dai@unisoc.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:13 -04:00
Waiman Long 8e06ba31df rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 401b0de3ae4fa49d1014c8941e26d9a25f37e7cf
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed, 26 Apr 2023 11:11:29 -0700

    rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs

    The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently
    fall back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
    the queue_work_on() function's silent fallback mechanism relies on that
    CPU having been online at some time in the past.  When queue_work_on()
    is passed a CPU that has never been online, workqueue lockups ensue,
    which can be bad for your kernel's general health and well-being.

    This commit therefore checks whether a given CPU has ever been online,
    and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
    queue_work_on().  Why not simply omit the queue_work_on() call entirely?
    Because this function is flooding callback-invocation notifications
    to all CPUs, and must deal with possibilities that include a sparse
    cpu_possible_mask.

    This commit also moves the setting of the rcu_data structure's
    ->beenonline field to rcu_cpu_starting(), which executes on the
    incoming CPU before that CPU has ever enabled interrupts.  This ensures
    that the required workqueues are present.  In addition, because the
    incoming CPU has not yet enabled its interrupts, there cannot yet have
    been any softirq handlers running on this CPU, which means that the
    WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot
    have triggered yet.
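
    The resulting queuing decision looks roughly like the following; the
    "has this CPU ever been fully online" helper is the one backed by the
    rcu_data structure's ->beenonline flag mentioned above (name
    approximate):

      int chosen_cpu;

      /* A never-onlined CPU has no per-CPU workqueue pool to run on. */
      chosen_cpu = rcu_cpu_beenfullyonline(cpu) ? cpu : WORK_CPU_UNBOUND;
      queue_work_on(chosen_cpu, system_wq, &rtpcp_next->rtp_work);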

    Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
    Reported-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:12 -04:00
Waiman Long 3afd6f4cfd rcu: Make rcu_cpu_starting() rely on interrupts being disabled
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit 15d44dfa40305da1648de4bf001e91cc63148725
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu, 27 Apr 2023 10:50:47 -0700

    rcu: Make rcu_cpu_starting() rely on interrupts being disabled

    Currently, rcu_cpu_starting() is written so that it might be invoked
    with interrupts enabled.  However, it is always called when interrupts
    are disabled, either by rcu_init(), notify_cpu_starting(), or from a
    call point prior to the call to notify_cpu_starting().

    But why bother requiring that interrupts be disabled?  The purpose is
    to allow the rcu_data structure's ->beenonline flag to be set after all
    early processing has completed for the incoming CPU, thus allowing this
    flag to be used to determine when workqueues have been set up for the
    incoming CPU, while still allowing this flag to be used as a diagnostic
    within rcu_core().

    This commit therefore makes rcu_cpu_starting() rely on interrupts being
    disabled.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:12 -04:00
Waiman Long b0b5f3fea7 rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
JIRA: https://issues.redhat.com/browse/RHEL-34076

commit a24c1aab652ebacf9ea62470a166514174c96fe1
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 7 Apr 2023 16:47:34 -0700

    rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work

    The rcu_data structure's ->rcu_cpu_has_work field can be modified by
    any CPU attempting to wake up the rcuc kthread.  Therefore, this commit
    marks accesses to this field from the rcu_cpu_kthread() function.

    This data race was reported by KCSAN.  Not appropriate for backporting
    due to failure being unlikely.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-31 10:56:12 -04:00