Commit Graph

875 Commits

Author SHA1 Message Date
Waiman Long 9734ca08bc workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 15930da42f8981dc42c19038042947b475b19f47
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 30 Jan 2024 18:55:55 -1000

    workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()

    For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
    going down. The function was incorrectly calling cpumask_test_cpu() with -1
    CPU leading to oopses like the following on some archs:

      Unable to handle kernel paging request at virtual address ffff0002100296e0
      ..
      pc : wq_update_node_max_active+0x50/0x1fc
      lr : wq_update_node_max_active+0x1f0/0x1fc
      ...
      Call trace:
        wq_update_node_max_active+0x50/0x1fc
        apply_wqattrs_commit+0xf0/0x114
        apply_workqueue_attrs_locked+0x58/0xa0
        alloc_workqueue+0x5ac/0x774
        workqueue_init_early+0x460/0x540
        start_kernel+0x258/0x684
        __primary_switched+0xb8/0xc0
      Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
      ---[ end trace 0000000000000000 ]---
      Kernel panic - not syncing: Attempted to kill the idle task!
      ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

    Fix it.
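
    A guard along the following lines (a sketch only, assuming the workqueue's
    effective cpumask lives in a local variable named "effective"; not the
    exact hunk) avoids consulting the cpumask when no CPU is going down:

        /* sketch: @off_cpu == -1 means no CPU is going down */
        if (off_cpu >= 0 && !cpumask_test_cpu(off_cpu, effective))
                off_cpu = -1;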

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Reported-by: Nathan Chancellor <nathan@kernel.org>
    Tested-by: Nathan Chancellor <nathan@kernel.org>
    Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
    Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 31c9838ff4 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 5797b1c18919cd9c289ded7954383e499f729ce0
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:25 -1000

    workqueue: Implement system-wide nr_active enforcement for unbound workqueues

    A pool_workqueue (pwq) represents the connection between a workqueue and a
    worker_pool. One of the roles that a pwq plays is enforcement of the
    max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
    workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU for
    per-cpu workqueues and one per NUMA node for unbound workqueues, which was a
    natural result of per-cpu workqueues being served by per-cpu pools and
    unbound workqueues by per-NUMA pools.

    In terms of max_active enforcement, this was, while not perfect, workable.
    For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great
    in that NUMA machines would get a max_active multiplied by the number of
    nodes, but it didn't cause huge problems because NUMA machines are relatively
    rare and the node count is usually pretty low.

    However, cache layouts are more complex now and sharing a worker pool across
    a whole node didn't really work well for unbound workqueues. Thus, a series
    of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
    affinity scope for unbound workqueues") implemented a more flexible affinity
    mechanism for unbound workqueues which enables using e.g. last-level-cache
    aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
    workqueues to use per-cpu pool_workqueues") made unbound workqueues use
    per-cpu pwqs like per-cpu workqueues.

    While the change was necessary to enable more flexible affinity scopes, this
    came with the side effect of blowing up the effective max_active for unbound
    workqueues. Before, the effective max_active for unbound workqueues was
    multiplied by the number of nodes. After, by the number of CPUs.

    636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
    pool_workqueues") claims that this should generally be okay. It is okay for
    users which self-regulate their concurrency level, which are the vast
    majority; however, there are enough use cases which actually depend on
    max_active to prevent the level of concurrency from going bonkers, including
    several IO handling workqueues that can issue a work item for each
    in-flight IO. With
    targeted benchmarks, the misbehavior can easily be exposed as reported in
    http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

    Unfortunately, there is no way to express what these use cases need using
    per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
    want to set max_active too low, but as soon as we increase max_active a bit,
    we can end up with an unreasonable number of in-flight work items when many
    CPUs issue IOs at the same time, i.e. the lowest acceptable max_active is
    higher than the highest acceptable max_active.

    Ideally, max_active for an unbound workqueue should be system-wide so that
    the users can regulate the total level of concurrency regardless of node and
    cache layout. The reasons workqueue hasn't implemented that yet are:

    - Once max_active enforcement decouples from pool boundaries, chaining
      execution after a work item finishes requires inter-pool operations which
      would require lock dancing, which is nasty.

    - Sharing a single nr_active count across the whole system can be pretty
      expensive on NUMA machines.

    - Per-pwq enforcement had been more or less okay while we were using
      per-node pools.

    It looks like we no longer can avoid decoupling max_active enforcement from
    pool boundaries. This patch implements system-wide nr_active mechanism with
    the following design characteristics:

    - To avoid sharing a single counter across multiple nodes, the configured
      max_active is split across nodes according to the proportion of each
      workqueue's online effective CPUs per node, e.g. a node with twice as many
      online effective CPUs will get twice the portion of max_active (a rough
      sketch of this split follows this list).

    - Workqueue used to be able to process a chain of interdependent work items
      as long as max_active. We can't do this anymore as max_active is
      distributed across the nodes. Instead, a new parameter min_active is
      introduced which determines the minimum level of concurrency within a node
      regardless of how max_active distribution comes out to be.

      It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
      This can lead to a higher effective max_active than configured and also to
      deadlocks if a workqueue was depending on being able to handle chains of
      interdependent work items that are longer than 8.

      I believe these should be fine given that the number of CPUs in each NUMA
      node is usually higher than 8 and work item chain longer than 8 is pretty
      unlikely. However, if these assumptions turn out to be wrong, we'll need
      to add an interface to adjust min_active.

    - Each unbound wq has an array of struct wq_node_nr_active which tracks
      per-node nr_active. When its pwq wants to run a work item, it has to
      obtain the matching node's nr_active. If over the node's max_active, the
      pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
      the completion path round-robins the pending pwqs activating the first
      inactive work item of each, which involves some pool lock dancing and
      kicking other pools. It's not the simplest code but doesn't look too bad.
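
    As a rough illustration of the proportional split described in the first
    point above (a sketch; the variable names and rounding are assumptions, not
    the exact upstream code):

        /*
         * Sketch: give each node a share of max_active proportional to the
         * workqueue's online effective CPUs on that node, clamped to
         * [min_active, max_active].
         */
        node_max_active = clamp(DIV_ROUND_UP(max_active * node_cpus, total_cpus),
                                min_active, max_active);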

    v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

        - wq_adjust_max_active() is now protected by wq->mutex instead of
          wq_pool_mutex.

    v3: - wq_node_max_active() used to calculate per-node max_active on the fly
          based on system-wide CPU online states. Lai pointed out that this can
          lead to skewed distributions for workqueues with restricted cpumasks.
          Update the max_active distribution to use per-workqueue effective
          online CPU counts instead of system-wide and cache the calculation
          results in node_nr_active->max.

    v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
    Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
    Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 020c3805d1 workqueue: Introduce struct wq_node_nr_active
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 91ccc6e7233bb10a9c176aa4cc70d6f432a441a5
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Introduce struct wq_node_nr_active

    Currently, for both percpu and unbound workqueues, max_active applies
    per-cpu, which is a recent change for unbound workqueues. The change for
    unbound workqueues was a significant departure from the previous behavior of
    per-node application. It made some use cases create undesirable number of
    concurrent work items and left no good way of fixing them. To address the
    problem, workqueue is implementing a NUMA node segmented global nr_active
    mechanism, which will be explained further in the next patch.

    As a preparation, this patch introduces struct wq_node_nr_active. It's a
    data structure allocated for each workqueue and NUMA node pair and
    currently only tracks the workqueue's number of active work items on the
    node. This is split out from the next patch to make it easier to understand
    and review.

    Note that there is an extra wq_node_nr_active allocated for the invalid node
    nr_node_ids which is used to track nr_active for pools which don't have an
    associated NUMA node, such as the default fallback system-wide pool.

    This doesn't cause any behavior changes visible to userland yet. The next
    patch will expand to implement the control mechanism on top.
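
    A minimal sketch of the structure as introduced here (the next patch grows
    it with a per-node max and a pending-pwqs list):

        struct wq_node_nr_active {
                atomic_t        nr;     /* number of active work items on this node */
        };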

    v4: - Fixed out-of-bound access when freeing per-cpu workqueues.

    v3: - Use flexible array for wq->node_nr_active as suggested by Lai.

    v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

        - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
          pwq->max_active check. Restored. As the next patch replaces the
          max_active enforcement mechanism, this doesn't change the end result.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long d91862ff64 workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit dd6c3c5441263723305a9c52c5ccc899a4653000
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling

    The planned shared nr_active handling for unbound workqueues will make
    pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
    other pool locks, which is necessary as retirement of an nr_active count
    from one pool may need to kick off an inactive work item in another pool.

    This patch moves the pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
    end of work item handling so that work item state changes stay atomic.
    process_one_work() which is the other user of pwq_dec_nr_in_flight() already
    calls it at the end of work item handling. Comments are added to both call
    sites and pwq_dec_nr_in_flight().

    This shouldn't cause any behavior changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 02ccecf34b workqueue: RCU protect wq->dfl_pwq and implement accessors for it
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 9f66cff212bb3c1cd25996aaa0dfd0c9e9d8baab
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: RCU protect wq->dfl_pwq and implement accessors for it

    wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
    currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq,
    which doesn't require RCU access. However, we want to be able to access
    wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
    can be made easier to read by making the two pwq fields behave in the same
    way.

    - Make wq->dfl_pwq RCU protected.

    - Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
      and ->cpu_pwq. The former returns the double pointer that can be used to
      access and update the pwqs. The latter performs a locking check and
      dereferences the double pointer.

    - pwq accesses and updates are converted to use unbound_pwq[_slot]().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long f90f83b7b8 workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5404d4e6df6faba1007544b5f4e62c7c14416dd
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Make wq_adjust_max_active() round-robin pwqs while activating

    wq_adjust_max_active() needs to activate work items after max_active is
    increased. Previously, it did that by visiting each pwq once, activating all
    that could be activated. While this makes sense with per-pwq nr_active,
    nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
    want to round-robin through pwqs to be fairer.

    In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
    while activating. While the activation ordering changes, this shouldn't
    cause user-noticeable behavior changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 29cc16fece workqueue: Move nr_active handling into helpers
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 1c270b79ce0b8290f146255ea9057243f6dd3c17
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Move nr_active handling into helpers

    __queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
    open-coding nr_active handling, which is fine given that the operations are
    trivial. However, the planned unbound nr_active update will make them more
    complicated, so let's move them into helpers.

    - pwq_tryinc_nr_active() is added. It increments nr_active if under the
      max_active limit and returns a boolean indicating whether the increment
      was successful. Note that the function is structured to accommodate future
      changes. __queue_work() is updated to use the new helper (a sketch of the
      helper follows this list).

    - pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
      thus no longer assumes that nr_active is under max_active and returns a
      boolean to indicate whether a work item has been activated.

    - wq_adjust_max_active() no longer tests directly whether a work item can be
      activated. Instead, it's updated to use the return value of
      pwq_activate_first_inactive() to tell whether a work item has been
      activated.

    - nr_active decrement and activating the first inactive work item is
      factored into pwq_dec_nr_active().
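
    A sketch of what the tryinc helper looks like at this point in the series
    (simplified; later patches rework it for shared, per-node nr_active):

        static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
        {
                struct workqueue_struct *wq = pwq->wq;
                struct worker_pool *pool = pwq->pool;
                bool obtained;

                lockdep_assert_held(&pool->lock);

                obtained = pwq->nr_active < READ_ONCE(wq->max_active);
                if (obtained)
                        pwq->nr_active++;
                return obtained;
        }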

    v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
          now we're calling the function unconditionally from
          pwq_activate_first_inactive().

    v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 86a54e0586 workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4c6380305d21e36581b451f7337a36c93b64e050
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()

    To prepare for unbound nr_active handling improvements, move work activation
    part of pwq_activate_inactive_work() into __pwq_activate_work() and add
    pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.

    pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
    pwq_activate_work(). The latter conversion is functionally identical. For
    the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
    test. This is temporary and will be removed by the next patch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long ba03c5aa0f workqueue: Factor out pwq_is_empty()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit afa87ce85379e2d93863fce595afdb5771a84004
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Factor out pwq_is_empty()

    "!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
    multiple times. Let's factor it out into pwq_is_empty().
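
    The resulting helper is essentially the quoted test wrapped in a function:

        static bool pwq_is_empty(struct pool_workqueue *pwq)
        {
                return !pwq->nr_active && list_empty(&pwq->inactive_works);
        }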

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 60f17f20d1 workqueue: Move pwq->max_active to wq->max_active
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit a045a272d887575da17ad86d6573e82871b50c27
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 29 Jan 2024 08:11:24 -1000

    workqueue: Move pwq->max_active to wq->max_active

    max_active is a workqueue-wide setting and the configured value is stored in
    wq->saved_max_active; however, the effective value was stored in
    pwq->max_active. While this is harmless, it makes the max_active update process
    more complicated and gets in the way of the planned max_active semantic
    updates for unbound workqueues.

    This patch moves pwq->max_active to wq->max_active. This simplifies the
    code and makes freezing and noop max_active updates cheaper too. No
    user-visible behavior change is intended.

    As wq->max_active is updated while holding wq mutex but read without any
    locking, it now uses WRITE/READ_ONCE(). A new locking rule, WO, is added
    for it.

    v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 07412404d9 workqueue: Break up enum definitions and give names to the types
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit e563d0a7cdc1890ff36bb177b5c8c2854d881e4d
Author: Tejun Heo <tj@kernel.org>
Date:   Fri, 26 Jan 2024 11:55:50 -1000

    workqueue: Break up enum definitions and give names to the types

    workqueue is collecting different sorts of enums into a single unnamed enum
    type which can increase confusion around enum width. Also, unnamed enums
    can't be accessed from BPF. Let's break up enum definitions according to
    their purposes and give them type names.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 1ff06fc607 workqueue: Drop unnecessary kick_pool() in create_worker()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 6a229b0e2ff6143b65ba4ef42bd71e29ffc2c16d
Author: Tejun Heo <tj@kernel.org>
Date:   Fri, 26 Jan 2024 11:55:46 -1000

    workqueue: Drop unnecessary kick_pool() in create_worker()

    After creating a new worker, create_worker() is calling kick_pool() to wake
    up the new worker task. However, as kick_pool() doesn't do anything if there
    is no work pending, create_worker() also calls wake_up_process() explicitly.
    There's no reason to call kick_pool() at all; wake_up_process() is enough by
    itself.
    Drop the unnecessary kick_pool() call.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 843633d967 workqueue: mark power efficient workqueue as unbounded if nohz_full enabled
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 7bd20b6b87183db2ebf789bcf9d0aa6d06a0defb
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Fri, 19 Jan 2024 12:54:39 -0300

    workqueue: mark power efficient workqueue as unbounded if nohz_full enabled

    A customer using nohz_full has experienced the following interruption:

    oslat-1004510 [018] timer_cancel:         timer=0xffff90a7ca663cf8
    oslat-1004510 [018] timer_expire_entry:   timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn now=4709188240 baseclk=4709188240
    oslat-1004510 [018] workqueue_queue_work: work struct=0xffff90a7ca663cd8 function=fb_flashcursor workqueue=events_power_efficient req_cpu=8192 cpu=18
    oslat-1004510 [018] workqueue_activate_work: work struct 0xffff90a7ca663cd8
    oslat-1004510 [018] sched_wakeup:         kworker/18:1:326 [120] CPU:018
    oslat-1004510 [018] timer_expire_exit:    timer=0xffff90a7ca663cf8
    oslat-1004510 [018] irq_work_entry:       vector=246
    oslat-1004510 [018] irq_work_exit:        vector=246
    oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
    oslat-1004510 [018] hrtimer_start:        hrtimer=0xffff90a70009cb00 function=tick_sched_timer/0x0 ...
    oslat-1004510 [018] softirq_exit:         vec=1 [action=TIMER]
    oslat-1004510 [018] softirq_entry:        vec=7 [action=SCHED]
    oslat-1004510 [018] softirq_exit:         vec=7 [action=SCHED]
    oslat-1004510 [018] tick_stop:            success=0 dependency=SCHED
    oslat-1004510 [018] sched_switch:         oslat:1004510 [120] R ==> kworker/18:1:326 [120]
    kworker/18:1-326 [018] workqueue_execute_start: work struct 0xffff90a7ca663cd8: function fb_flashcursor
    kworker/18:1-326 [018] workqueue_queue_work: work struct=0xffff9078f119eed0 function=drm_fb_helper_damage_work workqueue=events req_cpu=8192 cpu=18
    kworker/18:1-326 [018] workqueue_activate_work: work struct 0xffff9078f119eed0
    kworker/18:1-326 [018] timer_start:          timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn ...

    Set wq_power_efficient to true in case nohz_full is enabled.
    This makes the power efficient workqueue unbound, which allows work
    items queued there to be moved to housekeeping (HK) CPUs.

    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 5bb6b4870b workqueue: Add rcu lock check at the end of work item execution
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 1a65a6d17cbc58e1aeffb2be962acce49efbef9c
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Wed, 10 Jan 2024 11:27:24 +0800

    workqueue: Add rcu lock check at the end of work item execution

    Currently the workqueue just checks the atomic and locking states after work
    execution ends. However, sometimes a work item may not unlock RCU after
    acquiring rcu_read_lock(). As a result, it would cause an RCU stall, but the
    RCU stall warning cannot dump the work func because the work has already
    finished.

    In order to quickly discover work items that do not call rcu_read_unlock()
    after rcu_read_lock(), add an RCU lock check.

    Use rcu_preempt_depth() to check the work's RCU status. Normally, this value
    is 0. If it is bigger than 0, the work is still holding an RCU lock. If so,
    print error info and the work func (a sketch of the check follows).
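
    A sketch of the extended post-execution check (the message text and the
    surrounding atomic/lockdep checks are approximations of the real code):

        /* sketch: at the end of process_one_work(), after the callback returns */
        if (unlikely(in_atomic() || lockdep_depth(current) > 0 ||
                     rcu_preempt_depth() > 0)) {
                pr_err("BUG: workqueue leaked atomic, lock or RCU: %s[%d]\n"
                       "     last function: %ps\n",
                       current->comm, task_pid_nr(current),
                       worker->current_func);
                debug_show_held_locks(current);
                dump_stack();
        }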

    tj: Reworded the description for clarity. Minor formatting tweak.

    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 14bcf3aa22 kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 85f0ab43f9de62a4b9c1b503b07f1c33e5a6d2ab
Author: Juri Lelli <juri.lelli@redhat.com>
Date:   Tue, 16 Jan 2024 17:19:27 +0100

    kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND

    At the time they are created, unbound workqueues' rescuers currently use
    cpu_possible_mask as their affinity, but this can be too wide in case a
    workqueue's unbound mask has been set as a subset of cpu_possible_mask.

    Make new rescuers use their associated workqueue's unbound cpumask from
    the start.

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 9f2706138e Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit aac8a59537dfc704ff344f1aacfd143c089ee20f
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 15:43:41 -1000

    Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"

    This reverts commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7.

    The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
    on now-removed implicitly ordered workqueues. This was incorrect in that a
    system-wide config change shouldn't break the ordering properties of all
    workqueues. The reason why the apply_workqueue_attrs() path was allowed to do so
    was because it was targeting the specific workqueue - either the workqueue
    had WQ_SYSFS set or the workqueue user specifically tried to change
    max_active, both of which indicate that the workqueue doesn't need to be
    ordered.

    The implicitly ordered workqueue promotion was removed by the previous
    commit 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/
    @max_active==1 ordered"). However, it didn't update this path and broke
    build. Let's revert the commit which was incorrect in the first place which
    also fixes build.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
    Fixes: ca10d851b9ad ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
    Cc: stable@vger.kernel.org # v6.6+
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:29 -04:00
Waiman Long 5dcd3d30aa workqueue: Provide one lock class key per work_on_cpu() callsite
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Sun, 24 Sep 2023 17:07:02 +0200

    workqueue: Provide one lock class key per work_on_cpu() callsite

    All callers of work_on_cpu() share the same lock class key for all the
    functions queued. As a result the workqueue related locking scenario for
    a function A may be spuriously accounted as an inversion against the
    locking scenario of function B such as in the following model:

            long A(void *arg)
            {
                    mutex_lock(&mutex);
                    mutex_unlock(&mutex);
            }

            long B(void *arg)
            {
            }

            void launchA(void)
            {
                    work_on_cpu(0, A, NULL);
            }

            void launchB(void)
            {
                    mutex_lock(&mutex);
                    work_on_cpu(1, B, NULL);
                    mutex_unlock(&mutex);
            }

    launchA and launchB running concurrently have no chance to deadlock.
    However the above can be reported by lockdep as a possible locking
    inversion because the works containing A() and B() are treated as
    belonging to the same locking class.

    The following shows an existing example of such a spurious lockdep splat:

             ======================================================
             WARNING: possible circular locking dependency detected
             6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
             ------------------------------------------------------
             kworker/0:1/9 is trying to acquire lock:
             ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0

             but task is already holding lock:
             ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500

             which lock already depends on the new lock.

             the existing dependency chain (in reverse order) is:

             -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
                            __flush_work+0x83/0x4e0
                            work_on_cpu+0x97/0xc0
                            rcu_nocb_cpu_offload+0x62/0xb0
                            rcu_nocb_toggle+0xd0/0x1d0
                            kthread+0xe6/0x120
                            ret_from_fork+0x2f/0x40
                            ret_from_fork_asm+0x1b/0x30

             -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
                            __mutex_lock+0x81/0xc80
                            rcu_nocb_cpu_deoffload+0x38/0xb0
                            rcu_nocb_toggle+0x144/0x1d0
                            kthread+0xe6/0x120
                            ret_from_fork+0x2f/0x40
                            ret_from_fork_asm+0x1b/0x30

             -> #0 (cpu_hotplug_lock){++++}-{0:0}:
                            __lock_acquire+0x1538/0x2500
                            lock_acquire+0xbf/0x2a0
                            percpu_down_write+0x31/0x200
                            _cpu_down+0x57/0x2b0
                            __cpu_down_maps_locked+0x10/0x20
                            work_for_cpu_fn+0x15/0x20
                            process_scheduled_works+0x2a7/0x500
                            worker_thread+0x173/0x330
                            kthread+0xe6/0x120
                            ret_from_fork+0x2f/0x40
                            ret_from_fork_asm+0x1b/0x30

             other info that might help us debug this:

             Chain exists of:
               cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)

              Possible unsafe locking scenario:

                            CPU0                    CPU1
                            ----                    ----
               lock((work_completion)(&wfc.work));
                                                                            lock(rcu_state.barrier_mutex);
                                                                            lock((work_completion)(&wfc.work));
               lock(cpu_hotplug_lock);

              *** DEADLOCK ***

             2 locks held by kworker/0:1/9:
              #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
              #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500

             stack backtrace:
             CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
             Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
             Workqueue: events work_for_cpu_fn
             Call Trace:
             rcu-torture: rcu_torture_read_exit: Start of episode
              <TASK>
              dump_stack_lvl+0x4a/0x80
              check_noncircular+0x132/0x150
              __lock_acquire+0x1538/0x2500
              lock_acquire+0xbf/0x2a0
              ? _cpu_down+0x57/0x2b0
              percpu_down_write+0x31/0x200
              ? _cpu_down+0x57/0x2b0
              _cpu_down+0x57/0x2b0
              __cpu_down_maps_locked+0x10/0x20
              work_for_cpu_fn+0x15/0x20
              process_scheduled_works+0x2a7/0x500
              worker_thread+0x173/0x330
              ? __pfx_worker_thread+0x10/0x10
              kthread+0xe6/0x120
              ? __pfx_kthread+0x10/0x10
              ret_from_fork+0x2f/0x40
              ? __pfx_kthread+0x10/0x10
              ret_from_fork_asm+0x1b/0x30
              </TASK

    Fix this by providing one lock class key per work_on_cpu() caller.
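
    The general shape of such a fix (a sketch; work_on_cpu_key() is the assumed
    name of a variant taking an explicit key) is to turn work_on_cpu() into a
    macro so that each callsite gets its own static lock_class_key:

        #define work_on_cpu(_cpu, _fn, _arg)                    \
        ({                                                      \
                static struct lock_class_key __key;             \
                                                                \
                work_on_cpu_key(_cpu, _fn, _arg, &__key);       \
        })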

    Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long d0f98c8671 workqueue: fix -Wformat-truncation in create_worker
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 5d9c7a1e3e8e18db8e10c546de648cda2a57be52
Author: Lucy Mielke <lucymielke@icloud.com>
Date:   Mon, 9 Oct 2023 19:09:46 +0200

    workqueue: fix -Wformat-truncation in create_worker

    Compiling with W=1 emitted the following warning
    (Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
    "Treat warnings as errors" turned off):

    kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
            truncated writing between 1 and 10 bytes into a region of size
            between 5 and 14 [-Wformat-truncation=]
    kernel/workqueue.c:2188:50: note: directive argument in the range
            [0, 2147483647]
    kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
            into a destination of size 16

    setting "id_buf" to size 23 will silence the warning, since GCC
    determines snprintf's output to be max. 23 bytes in line 2188.
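
    The worst case is easy to check by hand, assuming a format along the lines
    of "u%d:%d" with two full-width int arguments (a sketch, not necessarily
    the exact format string used in create_worker()):

        /* "u" + 10 digits + ':' + 10 digits + NUL = 23 bytes worst case */
        char id_buf[23];

        snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);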

    Please let me know if there are any mistakes in my patch!

    Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 419b96f858 workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 7b42f401fc6571b6604441789d892d440829e33c
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Wed, 11 Oct 2023 16:27:59 +0800

    workqueue: Use the kmem_cache_free() instead of kfree() to release pwq

    Currently, kfree() is used for pwq objects allocated with
    kmem_cache_alloc() in alloc_and_link_pwqs(). This isn't wrong, but
    "trace_kmem_cache_alloc/trace_kmem_cache_free" are usually used to track
    such allocations and frees. This commit therefore uses kmem_cache_free()
    instead of kfree() in alloc_and_link_pwqs(), which is also consistent with
    the release of the pwq in rcu_free_pwq().

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long b5d207d9d1 workqueue: Fix UAF report by KASAN in pwq_release_workfn()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 643445531829d89dc5ddbe0c5ee4ff8f84ce8687
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Wed, 20 Sep 2023 14:07:04 +0800

    workqueue: Fix UAF report by KASAN in pwq_release_workfn()

    Currently, for an UNBOUND wq, if apply_wqattrs_prepare() returns an error,
    apply_wqattrs_cleanup() is called and uses the pwq_release_worker kthread
    to release resources asynchronously. However, kfree(wq) is invoked directly
    in the failure path of alloc_workqueue(); if kfree(wq) has already been
    executed when pwq_release_workfn() accesses wq, this leads to the following
    scenario:

    BUG: KASAN: slab-use-after-free in pwq_release_workfn+0x339/0x380 kernel/workqueue.c:4124
    Read of size 4 at addr ffff888027b831c0 by task pool_workqueue_/3

    CPU: 0 PID: 3 Comm: pool_workqueue_ Not tainted 6.5.0-rc7-next-20230825-syzkaller #0
    Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
     print_address_description mm/kasan/report.c:364 [inline]
     print_report+0xc4/0x620 mm/kasan/report.c:475
     kasan_report+0xda/0x110 mm/kasan/report.c:588
     pwq_release_workfn+0x339/0x380 kernel/workqueue.c:4124
     kthread_worker_fn+0x2fc/0xa80 kernel/kthread.c:823
     kthread+0x33a/0x430 kernel/kthread.c:388
     ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
     </TASK>

    Allocated by task 5054:
     kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
     kasan_set_track+0x25/0x30 mm/kasan/common.c:52
     ____kasan_kmalloc mm/kasan/common.c:374 [inline]
     __kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383
     kmalloc include/linux/slab.h:599 [inline]
     kzalloc include/linux/slab.h:720 [inline]
     alloc_workqueue+0x16f/0x1490 kernel/workqueue.c:4684
     kvm_mmu_init_tdp_mmu+0x23/0x100 arch/x86/kvm/mmu/tdp_mmu.c:19
     kvm_mmu_init_vm+0x248/0x2e0 arch/x86/kvm/mmu/mmu.c:6180
     kvm_arch_init_vm+0x39/0x720 arch/x86/kvm/x86.c:12311
     kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1222 [inline]
     kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:5089 [inline]
     kvm_dev_ioctl+0xa31/0x1c20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5131
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:871 [inline]
     __se_sys_ioctl fs/ioctl.c:857 [inline]
     __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Freed by task 5054:
     kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
     kasan_set_track+0x25/0x30 mm/kasan/common.c:52
     kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522
     ____kasan_slab_free mm/kasan/common.c:236 [inline]
     ____kasan_slab_free+0x15b/0x1b0 mm/kasan/common.c:200
     kasan_slab_free include/linux/kasan.h:164 [inline]
     slab_free_hook mm/slub.c:1800 [inline]
     slab_free_freelist_hook+0x114/0x1e0 mm/slub.c:1826
     slab_free mm/slub.c:3809 [inline]
     __kmem_cache_free+0xb8/0x2f0 mm/slub.c:3822
     alloc_workqueue+0xe76/0x1490 kernel/workqueue.c:4746
     kvm_mmu_init_tdp_mmu+0x23/0x100 arch/x86/kvm/mmu/tdp_mmu.c:19
     kvm_mmu_init_vm+0x248/0x2e0 arch/x86/kvm/mmu/mmu.c:6180
     kvm_arch_init_vm+0x39/0x720 arch/x86/kvm/x86.c:12311
     kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1222 [inline]
     kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:5089 [inline]
     kvm_dev_ioctl+0xa31/0x1c20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5131
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:871 [inline]
     __se_sys_ioctl fs/ioctl.c:857 [inline]
     __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    This commit therefore flushes pwq_release_worker in alloc_and_link_pwqs()
    before invoking kfree(wq).
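
    Roughly (a sketch of the shape of the fix, not the exact hunk):

        /*
         * Error path: make sure no asynchronously queued pwq_release_workfn()
         * can still dereference @wq before it is freed.
         */
        if (wq->flags & WQ_UNBOUND)
                kthread_flush_worker(pwq_release_worker);
        kfree(wq);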

    Reported-by: syzbot+60db9f652c92d5bacba4@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=60db9f652c92d5bacba4
    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 933254a1a8 workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit dd64c873ed11cdae340be06dcd2364870fd3e4fc
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Mon, 11 Sep 2023 16:27:22 +0800

    workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()

    Currently, if wq_cpu_intensive_thresh_us is set to a specific value,
    wq_cpu_intensive_thresh_init() exits early and the creation of
    pwq_release_worker is missed. This commit therefore creates
    pwq_release_worker in advance, before checking wq_cpu_intensive_thresh_us.
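
    A sketch of the reordering (simplified; assumes ULONG_MAX marks "not set by
    the user", which may not match the exact sentinel used upstream):

        static void __init wq_cpu_intensive_thresh_init(void)
        {
                /* create the release worker first so the early return below
                 * can no longer skip it                                      */
                pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");
                BUG_ON(IS_ERR(pwq_release_worker));

                /* if the user set the threshold explicitly, keep it */
                if (wq_cpu_intensive_thresh_us != ULONG_MAX)
                        return;

                /* ... otherwise derive the threshold from BogoMIPS ... */
        }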

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long ef34ded5ae workqueue: Removed double allocation of wq_update_pod_attrs_buf
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit a6828214480e2f00a8a7e64c7a55fc42b0f54e1c
Author: Steven Rostedt (Google) <rostedt@goodmis.org>
Date:   Tue, 5 Sep 2023 17:49:35 -0400

    workqueue: Removed double allocation of wq_update_pod_attrs_buf

    First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in
    the boot") added the initialization of wq_update_pod_attrs_buf to
    workqueue_init_early(), and then later on, commit 84193c07105c6
    ("workqueue: Generalize unbound CPU pods") added it as well. This appeared
    in a kmemleak run where the second allocation made the first allocation
    leak.

    Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods")
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 0b15385184 workqueue: fix data race with the pwq->stats[] increment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit fe48ba7daefe75bbbefa2426deddc05f2d530d2d
Author: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Date:   Sat, 26 Aug 2023 16:51:03 +0200

    workqueue: fix data race with the pwq->stats[] increment

    KCSAN has discovered a data race in kernel/workqueue.c:2598:

    [ 1863.554079] ==================================================================
    [ 1863.554118] BUG: KCSAN: data-race in process_one_work / process_one_work

    [ 1863.554142] write to 0xffff963d99d79998 of 8 bytes by task 5394 on cpu 27:
    [ 1863.554154] process_one_work (kernel/workqueue.c:2598)
    [ 1863.554166] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
    [ 1863.554177] kthread (kernel/kthread.c:389)
    [ 1863.554186] ret_from_fork (arch/x86/kernel/process.c:145)
    [ 1863.554197] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

    [ 1863.554213] read to 0xffff963d99d79998 of 8 bytes by task 5450 on cpu 12:
    [ 1863.554224] process_one_work (kernel/workqueue.c:2598)
    [ 1863.554235] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
    [ 1863.554247] kthread (kernel/kthread.c:389)
    [ 1863.554255] ret_from_fork (arch/x86/kernel/process.c:145)
    [ 1863.554266] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

    [ 1863.554280] value changed: 0x0000000000001766 -> 0x000000000000176a

    [ 1863.554295] Reported by Kernel Concurrency Sanitizer on:
    [ 1863.554303] CPU: 12 PID: 5450 Comm: kworker/u64:1 Tainted: G             L     6.5.0-rc6+ #44
    [ 1863.554314] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
    [ 1863.554322] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
    [ 1863.554941] ==================================================================

        lockdep_invariant_state(true);
    →   pwq->stats[PWQ_STAT_STARTED]++;
        trace_workqueue_execute_start(work);
        worker->current_func(work);

    Moving pwq->stats[PWQ_STAT_STARTED]++; before the line

        raw_spin_unlock_irq(&pool->lock);

    resolves the data race without performance penalty.
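
    That is, the fixed ordering (a sketch) keeps the increment under pool->lock:

        pwq->stats[PWQ_STAT_STARTED]++;         /* moved: now under pool->lock */
        raw_spin_unlock_irq(&pool->lock);
        /* ... */
        lockdep_invariant_state(true);
        trace_workqueue_execute_start(work);
        worker->current_func(work);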

    KCSAN detected at least one additional data race:

    [  157.834751] ==================================================================
    [  157.834770] BUG: KCSAN: data-race in process_one_work / process_one_work

    [  157.834793] write to 0xffff9934453f77a0 of 8 bytes by task 468 on cpu 29:
    [  157.834804] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
    [  157.834815] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
    [  157.834826] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
    [  157.834834] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
    [  157.834845] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

    [  157.834859] read to 0xffff9934453f77a0 of 8 bytes by task 214 on cpu 7:
    [  157.834868] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
    [  157.834879] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
    [  157.834890] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
    [  157.834897] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
    [  157.834907] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

    [  157.834920] value changed: 0x000000000000052a -> 0x0000000000000532

    [  157.834933] Reported by Kernel Concurrency Sanitizer on:
    [  157.834941] CPU: 7 PID: 214 Comm: kworker/u64:2 Tainted: G             L     6.5.0-rc7-kcsan-00169-g81eaf55a60fc #4
    [  157.834951] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
    [  157.834958] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
    [  157.835567] ==================================================================

    in code:

            trace_workqueue_execute_end(work, worker->current_func);
    →       pwq->stats[PWQ_STAT_COMPLETED]++;
            lock_map_release(&lockdep_map);
            lock_map_release(&pwq->wq->lockdep_map);

    which needs to be resolved separately.

    Fixes: 725e8ec59c56c ("workqueue: Add pwq->stats[] and a monitoring script")
    Cc: Tejun Heo <tj@kernel.org>
    Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Link: https://lore.kernel.org/lkml/20230818194448.29672-1-mirsad.todorovac@alu.unizg.hr/
    Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long dc384b64b6 workqueue: Rename rescuer kworker
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit b6a46f7263bd8ba0e545d79bd034c412f32b5875
Author: Aaron Tomlin <atomlin@atomlin.com>
Date:   Tue, 8 Aug 2023 13:03:29 +0100

    workqueue: Rename rescuer kworker

    Each CPU-specific and unbound kworker kthread conforms to a particular
    naming scheme. However, this does not extend to the rescuer kworker.
    At present, a rescuer kworker is simply named according to its
    workqueue's name. This can be cryptic.

    This patch modifies a rescuer to follow the kworker naming scheme.
    The "R" is indicative of a rescuer and after "-" is its workqueue's
    name e.g. "kworker/R-ext4-rsv-conver".

    tj: Use "R" instead of "r" as the prefix to make it more distinctive and
        consistent with how highpri pools are marked.

    Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 150f0fd88d workqueue: Make default affinity_scope dynamically updatable
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 523a301e66afd1ea9856660bcf3cee3a7c84c6dd
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Make default affinity_scope dynamically updatable

    While workqueue.default_affinity_scope is writable, it only affects
    workqueues which are created afterwards and isn't very useful. Instead,
    let's introduce explicit "default" scope and update the effective scope
    dynamically when workqueue.default_affinity_scope is changed.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:28 -04:00
Waiman Long 87b2fa1d0e workqueue: Implement non-strict affinity scope for unbound workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8639ecebc9b1796d7074751a350462f5e1c61cd4
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Implement non-strict affinity scope for unbound workqueues

    An unbound workqueue can be served by multiple worker_pools to improve
    locality. The segmentation is achieved by grouping CPUs into pods. By
    default, the cache boundaries according to cpus_share_cache() define the
    CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the
    system has two L3 caches. The workqueue would be mapped to two worker_pools
    each serving one L3 cache domains.

    While this improves locality, because the pod boundaries are strict, it
    limits the total bandwidth a given issuer can consume. For example, let's
    say there is a thread pinned to a CPU issuing enough work items to saturate
    the whole machine. With the machine segmented into two pods, no matter how
    many work items it issues, it can only use half of the CPUs on the system.

    While this limitation has existed for a very long time, it wasn't very
    pronounced because the affinity grouping used to be always by NUMA nodes.
    With cache boundaries as the default and support for even finer grained
    scopes (smt and cpu), it is now a much more pressing problem.

    This patch implements non-strict affinity scope where the pod boundaries
    aren't enforced strictly. Going back to the previous example, the workqueue
    would still be mapped to two worker_pools; however, the affinity enforcement
    would be soft. The workers in both pools would have their cpus_allowed set
    to the whole machine thus allowing the scheduler to migrate them anywhere on
    the machine. However, whenever an idle worker is woken up, the workqueue
    code asks the scheduler to bring back the task within the pod if the worker
    is outside, i.e. work items start executing within their affinity scope but can
    be migrated outside as the scheduler sees fit. This removes the hard cap on
    utilization while maintaining the benefits of affinity scopes.

    After the earlier ->__pod_cpumask changes, the implementation is pretty
    simple. When non-strict which is the new default:

    * pool_allowed_cpus() returns @pool->attrs->cpumask instead of
      ->__pod_cpumask so that the workers are allowed to run on any CPU that
      the associated workqueues allow.

    * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
      the field to a CPU within the pod.

    This would be the first use of task_struct->wake_cpu outside scheduler
    proper, so it isn't clear whether this would be acceptable. However, other
    methods of migrating tasks are significantly more expensive and are likely
    prohibitively so if we want to do this on every work item. This needs
    discussion with scheduler folks.

    There is also a race window where setting ->wake_cpu wouldn't be effective
    as the target task is still on CPU. However, the window is pretty small and
    this being a best-effort optimization, it doesn't seem to warrant more
    complexity at the moment.
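
    A sketch of that nudge (simplified; field and helper names are taken from
    elsewhere in this series and may not match the final code exactly):

        /* in kick_pool(), before waking an idle worker */
        struct task_struct *p = worker->task;

        if (!pool->attrs->affn_strict &&
            !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask))
                p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);

        wake_up_process(p);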

    While the non-strict cache affinity scopes seem to be the best option, the
    performance picture interacts with the affinity scope and is a bit
    complicated to fully discuss in this patch, so the behavior is made easily
    selectable through wqattrs and sysfs and the next patch will add
    documentation to discuss performance implications.

    v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 4f5a422058 workqueue: Add workqueue_attrs->__pod_cpumask
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 9546b29e4a6ad6ed7924dd7980975c8e675740a3
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Add workqueue_attrs->__pod_cpumask

    workqueue_attrs has two uses:

    * to specify the required unbound workqueue properties by users

    * to match worker_pool's properties to workqueues by core code

    For example, if the user wants to restrict a workqueue to run only CPUs 0
    and 2, and the two CPUs are on different affinity scopes, the workqueue's
    attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
    associated with two worker_pools, one with attrs->cpumask containing just
    CPU 0 and the other CPU 2.

    Workqueue wants to support non-strict affinity scopes where work items are
    started in their matching affinity scopes but the scheduler is free to
    migrate them outside the starting scopes, which can enable utilizing the
    whole machine while maintaining most of the locality benefits from affinity
    scopes.

    To enable that, a worker_pool needs to distinguish the strict affinity that
    it has to follow (because that's the restriction coming from the user) from
    the soft affinity that it wants to apply when dispatching work items. Note that
    two worker_pools with different soft dispatching requirements have to be
    separate; otherwise, for example, we'd be ping-ponging worker threads across
    NUMA boundaries constantly.

    This patch adds workqueue_attrs->__pod_cpumask. The new field is double
    underscored as it's only used internally to distinguish worker_pools. A
    worker_pool's ->cpumask is now always the same as the online subset of
    allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
    subset of that ->cpumask. Going back to the example above, both worker_pools
    would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
    would contain 0 while the other's 2.
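
    In other words (a sketch of the two cpumasks a workqueue_attrs now carries;
    other fields omitted):

        struct workqueue_attrs {
                /* ... */
                cpumask_var_t   cpumask;        /* allowed CPUs, as requested for the workqueue */
                cpumask_var_t   __pod_cpumask;  /* internal: this pod's subset of ->cpumask */
                /* ... */
        };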

    * pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
      that the pool's workers must stay within. This is currently always
      ->__pod_cpumask as all boundaries are still strict.

    * As a workqueue_attrs can now track both the associated workqueues' cpumask
      and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
      out-argument. Drop @cpumask and instead store the result in
      ->__pod_cpumask.

    * The above also simplifies apply_wqattrs_prepare() as the same
      workqueue_attrs can be used to create all pods associated with a
      workqueue. tmp_attrs is dropped.

    * wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
      update is needed instead of only comparing ->cpumask so that
      ->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
      but the code is easier to understand and more robust this way.

    The only user-visible behavior change is that two workqueues with different
    cpumasks no longer can share worker_pools even when their pod subsets
    coincide. Going back to the example, let's say there's another workqueue
    with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
    to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
    the same cpumask as the first pod of the earlier example and would have
    shared the same worker_pool but that's no longer the case after this patch.
    The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
    wouldn't match.

    While this is necessary to support non-strict affinity scopes, there can be
    further optimizations to maintain sharing among strict affinity scopes.
    However, non-strict affinity scopes are going to be preferable for most use
    cases, and we don't see a very diverse mixture of unbound workqueue
    cpumasks anyway, so the additional overhead doesn't seem to justify the
    extra complexity.
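
    As a rough illustration of the split described above, the following
    userspace sketch models a worker_pool's two masks as plain bit masks. The
    struct, the field names and the pool_allowed_cpus()-style helper are
    invented for the example and are not the kernel code; pod_cpumask stands
    in for ->__pod_cpumask.

      #include <stdio.h>

      /* Simplified stand-in for a worker_pool's attrs: one word per cpumask. */
      struct pool_attrs {
              unsigned long cpumask;      /* online subset of the wq's allowed CPUs */
              unsigned long pod_cpumask;  /* this pod's subset of ->cpumask */
      };

      /* Strict mask the pool's workers must stay within (all boundaries strict here). */
      static unsigned long pool_allowed_cpus(const struct pool_attrs *attrs)
      {
              return attrs->pod_cpumask;
      }

      int main(void)
      {
              /* Workqueue restricted to CPUs 0 and 2, which sit in different pods. */
              struct pool_attrs pod0 = { .cpumask = 0x5, .pod_cpumask = 0x1 };
              struct pool_attrs pod1 = { .cpumask = 0x5, .pod_cpumask = 0x4 };

              printf("pod0 allowed: %#lx\n", pool_allowed_cpus(&pod0));
              printf("pod1 allowed: %#lx\n", pool_allowed_cpus(&pod1));
              return 0;
      }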

    v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
          to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
          using wqattrs_equal() for comparison instead.

        - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
          a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 976f7b4a57 workqueue: Factor out need_more_worker() check and worker wake-up
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 0219a3528d72143d8d2c4c793b61541d03518b59
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Factor out need_more_worker() check and worker wake-up

    Checking need_more_worker() and calling wake_up_worker() is a repeated
    pattern. Let's add kick_pool(), which checks need_more_worker() and
    open-codes wake_up_worker(), and replace the wake_up_worker() uses with it.
    The following conversions aren't one-to-one:

    * __queue_work() was using __need_more_work() because it knows that
      pool->worklist isn't empty. Switching to kick_pool() adds an extra
      list_empty() test.

    * create_worker() always needs to wake up the newly minted worker, whether
      there's more work to do or not, to avoid triggering the hung task check
      on the new task. Keep the current wake_up_process() and still add
      kick_pool(). This may lead to an extra wakeup which isn't harmful.

    * pwq_adjust_max_active() was explicitly checking whether it needs to wake
      up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
      a worker when necessary, this explicit check is no longer necessary and
      dropped.

    * unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
      a need_more_worker() test. This avoids spurious wakeups and shouldn't
      break anything.

    wake_up_worker() is dropped as kick_pool() replaces all its users. After
    this patch, all paths that wake up a non-rescuer worker to initiate work
    item execution use kick_pool(). This will enable future changes to improve
    locality.
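
    The resulting pattern can be modelled in userspace roughly as below. The
    pool structure and counters are a toy stand-in for illustration only; in
    particular the nr_idle/nr_running bookkeeping replaces the real wakeup of
    the first idle worker.

      #include <stdbool.h>
      #include <stdio.h>

      struct pool {
              int worklist_len;  /* pending work items */
              int nr_running;    /* workers currently executing */
              int nr_idle;       /* idle workers that could be woken */
      };

      /* More work pending and nobody running to pick it up. */
      static bool need_more_worker(const struct pool *pool)
      {
              return pool->worklist_len > 0 && pool->nr_running == 0;
      }

      /* Wake an idle worker only when actually needed; report whether we did. */
      static bool kick_pool(struct pool *pool)
      {
              if (!need_more_worker(pool) || pool->nr_idle == 0)
                      return false;
              pool->nr_idle--;
              pool->nr_running++;  /* stand-in for waking the first idle worker */
              return true;
      }

      int main(void)
      {
              struct pool pool = { .worklist_len = 1, .nr_running = 0, .nr_idle = 2 };

              printf("kicked: %d\n", kick_pool(&pool));  /* 1: work pending, nobody running */
              printf("kicked: %d\n", kick_pool(&pool));  /* 0: someone is already running */
              return 0;
      }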

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 332255a89e workqueue: Factor out work to worker assignment and collision handling
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 873eaca6eaf84b1d1ed5b7259308c6a4fca70fdc
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:25 -1000

    workqueue: Factor out work to worker assignment and collision handling

    The two work execution paths in worker_thread() and rescuer_thread() use
    move_linked_works() to claim work items from @pool->worklist. Once claimed,
    process_scheduled_works() is called, which invokes process_one_work() on
    each work item. process_one_work() then uses find_worker_executing_work()
    to detect and handle collisions - situations where the work item to be
    executed is still running on another worker.

    This works fine, but, to improve work execution locality, we want to
    establish the work to worker association earlier and know for sure that the
    worker is going to execute the work once assigned, which requires performing
    collision handling earlier while trying to assign the work item to the
    worker.

    This patch introduces assign_work() which assigns a work item to a worker
    using move_linked_works() and then performs collision handling. As
    collisions are handled earlier, process_one_work() no longer needs to worry
    about them.

    After this patch, collision checks for linked work items are skipped,
    which should be fine as they can't be queued multiple times concurrently.
    For work items running from rescuers, the timing of collision handling may
    change but the invariant that the work items go through collision handling
    before starting execution does not.

    This patch shouldn't cause noticeable behavior changes, especially given
    that worker_thread() behavior remains the same.
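
    The "assign first, handle collisions while assigning" idea can be sketched
    in userspace as follows. The worker array, the find_running() helper and
    the integer work IDs are invented for the example; this is not the kernel's
    assign_work().

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_WORKERS 2

      struct worker { int id; int current_work; /* -1 when idle */ };

      /* Collision check: is this work item already executing on some worker? */
      static struct worker *find_running(struct worker *ws, int n, int work)
      {
              for (int i = 0; i < n; i++)
                      if (ws[i].current_work == work)
                              return &ws[i];
              return NULL;
      }

      /* Assign @work to @w unless it is already running elsewhere. */
      static bool assign_work(struct worker *ws, int n, struct worker *w, int work)
      {
              struct worker *collision = find_running(ws, n, work);

              if (collision && collision != w)
                      return false;  /* let the worker already running it handle it */
              w->current_work = work;
              return true;
      }

      int main(void)
      {
              struct worker ws[NR_WORKERS] = { { 0, 7 }, { 1, -1 } };

              printf("assign 7: %d\n", assign_work(ws, NR_WORKERS, &ws[1], 7)); /* 0: collision */
              printf("assign 9: %d\n", assign_work(ws, NR_WORKERS, &ws[1], 9)); /* 1: assigned */
              return 0;
      }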

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 89997d1af2 workqueue: Add multiple affinity scopes and interface to select them
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 63c5484e74952f60f5810256bd69814d167b8d22
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Add multiple affinity scopes and interface to select them

    Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
    the default. The code changes to actually add the additional scopes are
    trivial.

    Also add module parameter "workqueue.default_affinity_scope" to override the
    default scope and "affinity_scope" sysfs file to configure it per workqueue.
    wq_dump.py and documentation are updated accordingly.

    This enables significant flexibility in configuring how unbound workqueues
    behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
    workqueue. On the other hand, "system" removes all locality boundaries.

    Many modern machines often have multiple L3 caches while being mostly
    uniform in terms of memory access. Thus, workqueue's previous behavior of
    spreading work items across each NUMA node had negative performance
    implications from unnecessarily crossing L3 boundaries between issue and
    execution. However, picking a finer grained affinity scope also has a
    downside in that an issuer in one group can't utilize CPUs in other groups.

    While dependent on the specifics of the workload, there's usually a
    noticeable penalty in crossing L3 boundaries, so let's default to CACHE.
    This issue will be further addressed and documented with examples in future
    patches.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 0b97408c78 workqueue: Modularize wq_pod_type initialization
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 025e16845877e80cb169274b330c236056ba553c
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Modularize wq_pod_type initialization

    While wq_pod_type[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
    init is hard coded into workqueue_init_topology(). This patch modularizes
    the init path by introducing init_pod_type() which takes as an argument a
    callback to determine whether two CPUs should share a pod.

    init_pod_type() first scans the CPU combinations testing for sharing to
    assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
    ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
    accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
    cpus_share_numa(), which tests whether two CPUs belong to the same NUMA node.

    This patch may change the pod ID assigned to each NUMA node but that
    shouldn't cause any behavior changes as the NUMA node to use for allocations
    is tracked separately in pod_type->pod_node[]. This makes adding new
    affinity types pretty easy.
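
    The grouping step can be modelled in userspace as below: scan CPU pairs
    with a "should these two CPUs share a pod?" callback and hand out
    consecutive pod IDs. The CPU-to-node table and the helper names are made
    up for the example; this is not the kernel's init_pod_type().

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS 8

      /* Example topology: CPU -> NUMA node. */
      static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

      static bool cpus_share_numa(int a, int b)
      {
              return cpu_node[a] == cpu_node[b];
      }

      /* A CPU joins the pod of the first earlier CPU it shares with. */
      static int init_pods(int cpu_pod[NR_CPUS], bool (*share)(int, int))
      {
              int nr_pods = 0;

              for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                      cpu_pod[cpu] = -1;
                      for (int prev = 0; prev < cpu; prev++) {
                              if (share(cpu, prev)) {
                                      cpu_pod[cpu] = cpu_pod[prev];
                                      break;
                              }
                      }
                      if (cpu_pod[cpu] < 0)
                              cpu_pod[cpu] = nr_pods++;
              }
              return nr_pods;
      }

      int main(void)
      {
              int cpu_pod[NR_CPUS];
              int n = init_pods(cpu_pod, cpus_share_numa);

              printf("%d pods:", n);
              for (int cpu = 0; cpu < NR_CPUS; cpu++)
                      printf(" cpu%d->pod%d", cpu, cpu_pod[cpu]);
              printf("\n");
              return 0;
      }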

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long cb62b603f4 workqueue: Generalize unbound CPU pods
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in the workqueue_apply_unbound_cpumask()
	   hunk due to the presence of a later upstream commit
	   ca10d851b9ad ("workqueue: Override implicit ordered attribute
	   in workqueue_apply_unbound_cpumask()").

commit 84193c07105c62d206fb230b2f29002226628989
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Generalize unbound CPU pods

    While renamed to pod, the code still assumes that the pods are defined by
    NUMA boundaries. Let's generalize it:

    * workqueue_attrs->affn_scope is added. Each enum represents the type of
      boundaries that define the pods. There are currently two scopes -
      WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
      - one pod per NUMA node. The latter defines one global pod across the
      whole system.

    * struct wq_pod_type is added which describes how pods are configured for
      each affinity scope. For each pod, it lists the member CPUs and the
      preferred NUMA node for memory allocations. The reverse mapping from CPU
      to pod is also available.

    * wq_pod_enabled is dropped. Pod is now always enabled. The previously
      disabled behavior is now implemented through WQ_AFFN_SYSTEM.

    * get_unbound_pool() wants to determine the NUMA node to allocate memory
      from for the new pool. The variables are renamed from node to pod but the
      logic still assumes they're one and the same. Clearly distinguish them -
      walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
      NUMA node.

    * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
      node. Take @cpu instead and determine the cpumask to use from the pod_type
      matching @attrs.

    * apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
      NULL so that it can indicate -EINVAL on invalid affinity scopes.

    This patch allows CPUs to be grouped into pods however desired per type.
    While this patch causes some internal behavior changes, nothing material
    should change for workqueue users.

    v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
        WQ_AFFN_NR_TYPES which indicates that the function is called with a
        worker_pool's attrs instead of a workqueue's.
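
    A rough data-layout sketch of the pod description above - per-pod member
    CPUs and preferred NUMA node plus the CPU-to-pod reverse map - with field
    names and sizes chosen for the example rather than copied from the kernel's
    struct wq_pod_type:

      #include <stdio.h>

      #define NR_CPUS 4

      /* Per-affinity-scope pod description: forward and reverse mappings. */
      struct pod_type {
              int nr_pods;                      /* number of pods in this scope */
              unsigned long pod_cpus[NR_CPUS];  /* per-pod member CPUs as a bit mask */
              int pod_node[NR_CPUS];            /* per-pod preferred NUMA node */
              int cpu_pod[NR_CPUS];             /* reverse map: CPU -> pod index */
      };

      int main(void)
      {
              /* A WQ_AFFN_SYSTEM-like scope: one pod spanning every CPU, node 0. */
              struct pod_type system_scope = {
                      .nr_pods  = 1,
                      .pod_cpus = { 0xf },
                      .pod_node = { 0 },
                      .cpu_pod  = { 0, 0, 0, 0 },
              };

              for (int cpu = 0; cpu < NR_CPUS; cpu++)
                      printf("cpu%d -> pod%d (cpus %#lx, node %d)\n", cpu,
                             system_scope.cpu_pod[cpu],
                             system_scope.pod_cpus[system_scope.cpu_pod[cpu]],
                             system_scope.pod_node[system_scope.cpu_pod[cpu]]);
              return 0;
      }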

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long d1e3435ca0 workqueue: Factor out clearing of workqueue-only attrs fields
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 5de7a03cac14765ba22934b6fb1476456ee36bf8
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Factor out clearing of workqueue-only attrs fields

    workqueue_attrs can be used for both workqueues and worker_pools. However,
    some fields, currently only ->ordered, only apply to workqueues and should
    be cleared to the default / invalid values.

    Currently, an unbound workqueue explicitly clears attrs->ordered in
    get_unbound_pool() after copying the source workqueue attrs, while per-cpu
    workqueues rely on the fact that zeroing on allocation gives us the desired
    default value for pool->attrs->ordered.

    This is fragile. Let's add wqattrs_clear_for_pool() which clears
    attrs->ordered and is called from both init_worker_pool() and
    get_unbound_pool(). This will ease adding more workqueue-only attrs fields.

    In get_unbound_pool(), pool->node initialization is moved upwards for
    readability. This shouldn't cause any behavior changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long af3d231314 workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 0f36ee24cd43c67be07166ddd09866dc7a47cb4c
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()

    For an unbound pool, multiple cpumasks are involved.

    U: The user-specified cpumask (may be filtered with cpu_possible_mask).

    A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
       leaves no CPU, wq_unbound_cpumask is used.

    P: Per-pod subsets of #A.

    wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
    wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.

    wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
    calculate the new #P for each workqueue, it needs to call
    wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
    wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
    wq->dfl_pwq->pool->attrs.

    This is rather fragile because we're calling wq_calc_pod_cpumask() with
    @attrs of a worker_pool rather than the workqueue's actual attrs when what
    we want to calculate is the workqueue's cpumask on the pod. While this works
    fine currently, future changes will add fields which are used differently
    between workqueues and worker_pools and this subtlety will bite us.

    This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
    into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
    wq->unbound_attrs and use the new helper to obtain #A freshly instead of
    abusing wq->dfl_pwq->pool->attrs.

    This shouldn't cause any behavior changes in the current code.
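
    The #U -> #A step boils down to a mask intersection with a fallback when
    the result is empty. A minimal userspace sketch, using plain words instead
    of struct cpumask and a made-up helper name:

      #include <stdio.h>

      /* #A = #U filtered by wq_unbound_cpumask; fall back if nothing is left. */
      static unsigned long actualize_cpumask(unsigned long user, unsigned long unbound)
      {
              unsigned long actual = user & unbound;

              return actual ? actual : unbound;
      }

      int main(void)
      {
              unsigned long wq_unbound_cpumask = 0x0f;  /* CPUs 0-3 usable */

              printf("%#lx\n", actualize_cpumask(0x06, wq_unbound_cpumask)); /* 0x6 */
              printf("%#lx\n", actualize_cpumask(0x30, wq_unbound_cpumask)); /* 0xf: fallback */
              return 0;
      }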

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 591106944f workqueue: Initialize unbound CPU pods later in the boot
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts:
 1) Minor context diff in init/main.c due to missing upstream commit
    7ec7096b8577 ("mm/page_ext: init page_ext early if there are no
    deferred struct pages") and commit de57807e6f26 ("init,mm: fold
    late call to page_ext_init() to page_alloc_init_late()").
 2) Minor context diff in kernel/workqueue.c due to the presence of
    a later upstream commit 4a6c5607d450 ("workqueue: Make sure that
    wq_unbound_cpumask is never empty").

commit 2930155b2e27232c033970f2e110aaac4187cb9e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Initialize unbound CPU pods later in the boot

    During boot, to initialize unbound CPU pods, wq_pod_init() was called from
    workqueue_init(). This is early enough for NUMA nodes to be set up but
    before SMP is brought up and CPU topology information is populated.

    Workqueue is in the process of improving CPU locality for unbound workqueues
    and will need access to topology information during pod init. This adds a
    new init function workqueue_init_topology() which is called after CPU
    topology information is available and replaces wq_pod_init().

    As unbound CPU pods are now initialized after workqueues are activated, we
    need to revisit the workqueues to apply the pod configuration. Workqueues
    which are created before workqueue_init_topology() are set up so that they
    always use the default worker pool. After pods are set up in
    workqueue_init_topology(), wq_update_pod() is called on all existing
    workqueues to update the pool associations accordingly.

    Note that wq_update_pod_attrs_buf allocation is moved to
    workqueue_init_early(). This isn't necessary right now but enables further
    generalization of pod handling in the future.

    This patch changes the initialization sequence but the end result should be
    the same.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:27 -04:00
Waiman Long 09bde2a4d3 workqueue: Move wq_pod_init() below workqueue_init()
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to the presence of a later upstream commit
	   4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
	   is never empty").

commit a86feae6195ac2148097b063f7fdad8ee1f6dad4
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:24 -1000

    workqueue: Move wq_pod_init() below workqueue_init()

    wq_pod_init() is called from workqueue_init() and responsible for
    initializing unbound CPU pods according to NUMA node. Workqueue is in the
    process of improving affinity awareness and wants to use other topology
    information to initialize unbound CPU pods; however, unlike NUMA nodes,
    other topology information isn't yet available in workqueue_init().

    The next patch will introduce a later stage init function for workqueue
    which will be responsible for initializing unbound CPU pods. Relocate
    wq_pod_init() below workqueue_init() where the new init function is going to
    be located so that the diff can show the content differences.

    Just a relocation. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 4b35d49bdf workqueue: Rename NUMA related names to use pod instead
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: context diffs in two of the hunks due to the presence of
	   later upstream commit 4a6c5607d450 ("workqueue: Make sure that
	   wq_unbound_cpumask is never empty") and commit 31c89007285d
	   ("workqueue.c: Increase workqueue name length").

commit fef59c9cab6ac5592da54f6c2b1195418f14e4d0
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Rename NUMA related names to use pod instead

    Workqueue is in the process of improving CPU affinity awareness. It will
    become more flexible and won't be tied to NUMA node boundaries. This patch
    renames all NUMA related names in workqueue.c to use "pod" instead.

    While "pod" isn't a very common term, it short and captures the grouping of
    CPUs well enough. These names are only going to be used within workqueue
    implementation proper, so the specific naming doesn't matter that much.

    * wq_numa_possible_cpumask -> wq_pod_cpus

    * wq_numa_enabled -> wq_pod_enabled

    * wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf

    * workqueue_select_cpu_near -> select_numa_node_cpu

      This rename is different from others. The function is only used by
      queue_work_node() and specifically tries to find a CPU in the specified
      NUMA node. As workqueue affinity will become more flexible and untied from
      NUMA, this function's name should specifically describe that it's for
      NUMA.

    * wq_calc_node_cpumask -> wq_calc_pod_cpumask

    * wq_update_unbound_numa -> wq_update_pod

    * wq_numa_init -> wq_pod_init

    * node -> pod in local variables

    Only renames. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 0dc9405a09 workqueue: Rename workqueue_attrs->no_numa to ->ordered
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit af73f5c9febe5095ee492ae43e9898fca65ced70
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Rename workqueue_attrs->no_numa to ->ordered

    With the recent removal of NUMA related module param and sysfs knob,
    workqueue_attrs->no_numa is now only used to implement ordered workqueues.
    Let's rename the field so that it's less confusing especially with the
    planned CPU affinity awareness improvements.

    Just a rename. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long de03f9c951 workqueue: Make unbound workqueues to use per-cpu pool_workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 636b927eba5bc633753f8eb80f35e1d5be806e51
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Make unbound workqueues to use per-cpu pool_workqueues

    A pwq (pool_workqueue) represents an association between a workqueue and a
    worker_pool. When a work item is queued, the workqueue selects the pwq to
    use, which in turn determines the pool, and queues the work item to the pool
    through the pwq. pwq is also what implements the maximum concurrency limit -
    @max_active.

    As a per-cpu workqueue should be associated with a different worker_pool on
    each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
    However, unbound workqueues were sharing a pwq within each NUMA node by
    default. The sharing has several downsides:

    * Because @max_active is per-pwq, the meaning of @max_active changes
      depending on the machine configuration and whether workqueue NUMA locality
      support is enabled.

    * Makes per-cpu and unbound code deviate.

    * Gets in the way of making workqueue CPU locality awareness more flexible.

    This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
    workqueues do by making the following changes:

    * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
      just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
      workqueues.

    * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
      the specified pwq to the target CPU's wq->cpu_pwq.

    * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
      unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
      This makes the return value of wq_calc_node_cpumask() unnecessary. It now
      returns void.

    * @max_active now means the same thing for both per-cpu and unbound
      workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
      documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
      used in workqueue implementation and will be removed later.

    * All unbound pwq operations which used to be per-numa-node are now per-cpu.

    For most unbound workqueue users, this shouldn't cause noticeable changes.
    Work item issue and completion will be a little faster, flush_workqueue()
    would become a bit more expensive, and the total concurrency limit would
    likely become higher. All @max_active==1 use cases are currently being
    audited for conversion into alloc_ordered_workqueue() and they shouldn't be
    affected once the audit and conversion are complete.

    One area where the behavior change may be more noticeable is
    workqueue_congested() as the reported congestion state is now per CPU
    instead of NUMA node. There are only two users of this interface -
    drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
    cc'd. Inputs on the behavior change would be very much appreciated.
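
    A rough userspace picture of the resulting layout: every workqueue gets a
    per-CPU slot of pool_workqueue pointers, and an ordered workqueue simply
    points every slot at its default pwq. Names and shapes are invented for
    the example; this is not the kernel's data structures.

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS 4

      struct pwq { int max_active; };

      struct wq {
              bool ordered;
              struct pwq dfl_pwq;           /* default pwq */
              struct pwq per_cpu[NR_CPUS];  /* backing storage for the sketch */
              struct pwq *cpu_pwq[NR_CPUS]; /* what queueing dereferences */
      };

      static void install_pwqs(struct wq *wq)
      {
              for (int cpu = 0; cpu < NR_CPUS; cpu++)
                      wq->cpu_pwq[cpu] = wq->ordered ? &wq->dfl_pwq : &wq->per_cpu[cpu];
      }

      int main(void)
      {
              struct wq unbound = { .ordered = false }, ordered = { .ordered = true };

              install_pwqs(&unbound);
              install_pwqs(&ordered);
              printf("unbound shares pwq across CPUs: %d\n",
                     unbound.cpu_pwq[0] == unbound.cpu_pwq[1]);  /* 0 */
              printf("ordered shares pwq across CPUs: %d\n",
                     ordered.cpu_pwq[0] == ordered.cpu_pwq[1]);  /* 1 */
              return 0;
      }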

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Leon Romanovsky <leon@kernel.org>
    Cc: Karsten Graul <kgraul@linux.ibm.com>
    Cc: Wenjia Zhang <wenjia@linux.ibm.com>
    Cc: Jan Karcher <jaka@linux.ibm.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 3bc7ad84fe workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4cbfd3de737b9d00544ff0f673cb75fc37bffb6a
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug

    When a CPU went online or offline, wq_update_unbound_numa() was called only
    on the CPU which was going up or down. This works fine because all CPUs on
    the same NUMA node share the same pool_workqueue slot - one CPU updating it
    updates it for everyone in the node.

    However, future changes will make each CPU use a separate pool_workqueue
    even when they're sharing the same worker_pool, which requires updating
    pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
    on hotplug.

    To accommodate the planned changes, this patch updates
    workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
    all CPUs sharing the same NUMA node as the CPU going up or down. In the
    current code, the second+ calls would be noops and there shouldn't be any
    behavior changes.

    * As wq_update_unbound_numa() is now called on multiple CPUs per each
      hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
      is added. The former indicates the CPU being hot[un]plugged and the latter
      the CPU whose pool_workqueue is being updated.

    * In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
      with the new @hotplug_cpu.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 431872309b workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 687a9aa56f811b381e63f7f8f9149428ac708a3b
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones

    Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
    through a per-cpu allocation and thus, unlike unbound workqueues, not
    reference counted. This difference in lifetime management between the two
    types is a bit confusing.

    Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
    isn't suitable for the planned CPU locality related improvements. The plan
    is to unify pwq handling across per-cpu and unbound workqueues so that
    they're always accessed through wq->cpu_pwq.

    In preparation, this patch makes per-cpu pwq's allocated, reference counted
    and released the same way as unbound pwq's. wq->cpu_pwq now holds pointers
    to pwq's instead of containing them directly.

    pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
    also used for per-cpu work items.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 1a30e9bd34 workqueue: Use a kthread_worker to release pool_workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 967b494e2fd143a9c1a3201422aceadb5fa9fbfc
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Use a kthread_worker to release pool_workqueues

    The pool_workqueue release path is currently bounced to system_wq; however,
    this is a bit tricky because this bouncing occurs while holding a pool lock
    and thus risks causing an A-A deadlock. This is currently addressed by the
    fact that only unbound workqueues use this bouncing path and system_wq is a
    per-cpu workqueue.

    While this works, it's brittle and requires a work-around like setting the
    lockdep subclass for the lock of unbound pools. Besides, future changes will
    use the bouncing path for per-cpu workqueues too, making the current
    approach unusable.

    Let's just use a dedicated kthread_worker to untangle the dependency. This
    is just one more kthread for all workqueues and makes the pwq release logic
    simpler and more robust.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 33f0fd9f2a workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A merge conflict in the wq_pool_ids_show() removal hunk due
	   to the presence of a later upstream commit 49277a5b7637
	   ("workqueue: Move workqueue_set_unbound_cpumask() and its
	   helpers inside CONFIG_SYSFS").

commit fcecfa8f271acdf130acbb30842e7848a138af0f
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa

    Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
    specific knobs won't make sense anymore. Remove them. Also, the pool_ids
    knob was used for debugging and not really meaningful given that there is no
    visibility into the pools associated with those IDs. Remove it too. A future
    patch will improve overall visibility.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long 30f4e1d335 workqueue: Relocate worker and work management functions
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 797e8345cbb0d2913300ee9838eb74cce19485cf
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Relocate worker and work management functions

    Collect first_idle_worker(), worker_enter/leave_idle(),
    find_worker_executing_work(), move_linked_works() and wake_up_worker() into
    one place. These functions will later be used to implement higher level
    worker management logic.

    No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long dcadc6099c workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit ee1ceef72754427e5167743108c52f826fa4ca5b
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:23 -1000

    workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq

    wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
    The field name being plural is unusual and confusing. Rename it to singular.

    This patch doesn't cause any functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long ad3fa632c4 workqueue: Not all work insertion needs to wake up a worker
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit fe089f87cccb066e8ad20f49ddf05e95adc1fa8d
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:22 -1000

    workqueue: Not all work insertion needs to wake up a worker

    insert_work() always tried to wake up a worker; however, the only time it
    needs to try to wake up a worker is when a new active work item is queued.
    When a work item goes on the inactive list or a flush work item is queued,
    there's no reason to try to wake up a worker.

    This patch moves the worker wakeup logic out of insert_work() and places it
    in the active new work item queueing path in __queue_work().

    While at it:

    * __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
      pool.

    * Every caller of insert_work() calls debug_work_activate(). Consolidate the
      invocations into insert_work().

    * In __queue_work() pool->watchdog_ts update is relocated slightly. This is
      to better accommodate future changes.

    This makes wakeups more precise and will help the planned change to assign
    work items to workers before waking them up. No behavior changes intended.

    v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
        suggested by Lai.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:26 -04:00
Waiman Long cd964710d4 workqueue: Cleanups around process_scheduled_works()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c0ab017d43f4c4147f7ecf3ca3cb872a416e17c7
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:22 -1000

    workqueue: Cleanups around process_scheduled_works()

    * Drop the trivial optimization in worker_thread() where it bypasses calling
      process_scheduled_works() if the first work item isn't linked. This is a
      mostly pointless micro optimization and gets in the way of improving the
      work processing path.

    * Consolidate pool->watchdog_ts updates in the two callers into
      process_scheduled_works().

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
Waiman Long 5aa79febcd workqueue: Drop the special locking rule for worker->flags and worker_pool->flags
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bc8b50c2dfac946c1beed782c1823e52cf55a352
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 7 Aug 2023 15:57:22 -1000

    workqueue: Drop the special locking rule for worker->flags and worker_pool->flags

    worker->flags used to be accessed from scheduler hooks without grabbing
    pool->lock for concurrency management. This is no longer true since
    6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
    lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
    All relevant users are accessing it under the pool lock.

    Let's drop the special "X" rule and use the "L" rule for these flag fields
    instead. While at it, replace the CONTEXT comment with
    lockdep_assert_held().

    This allows worker_set/clr_flags() to be used from a context which isn't the
    worker itself. This will be used later to implement assigning work items to
    workers before waking them up so that workqueue can have better control over
    which worker executes which work item on which CPU.

    The only actual changes are sanity checks. There shouldn't be any visible
    behavior changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
Waiman Long 4753d1c10a workqueue: use LIST_HEAD to initialize cull_list
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 9680540c0c56a1f75a2d6aab31bf38aa429aa9d9
Author: Yang Yingliang <yangyingliang@huawei.com>
Date:   Fri, 4 Aug 2023 11:22:15 +0800

    workqueue: use LIST_HEAD to initialize cull_list

    Use LIST_HEAD() to initialize cull_list instead of open-coding it.

    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
Waiman Long 9e8e8dfabf workqueue: Warn attempt to flush system-wide workqueues.
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 20bdedafd2f63e0ba70991127f9b5c0826ebdb32
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Fri, 30 Jun 2023 21:28:53 +0900

    workqueue: Warn attempt to flush system-wide workqueues.

    Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using
    a macro"), all in-tree users stopped flushing system-wide workqueues.
    Therefore, start emitting a runtime message so that all out-of-tree users
    will understand that they need to update their code.

    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
Waiman Long 5df3631c9c workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit aa6fde93f3a49e42c0fe0490d7f3711bac0d162e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 17 Jul 2023 12:50:02 -1000

    workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000

    wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
    Once detected, they're excluded from concurrency management to prevent them
    from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
    enabled, repeat offenders are also reported so that the code can be updated.

    The default threshold is 10ms, which is long enough to do a fair bit of work
    on modern CPUs while short enough to be usually not noticeable. This
    unfortunately leads to a lot of arguably spurious detections on very slow
    CPUs. Using the same threshold across CPUs whose performance levels may be
    orders of magnitude apart doesn't make a whole lot of sense.

    This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
    is below 4000. This is obviously very inaccurate but it doesn't have to be
    accurate to be useful. The mechanism is still useful when the threshold is
    fully scaled up, and the benefits of reports are usually shared with
    everyone regardless of who's reporting, so as long as there are a sufficient
    number of fast machines reporting, we don't lose much.

    Some (or is it all?) ARM CPUs systematically report significantly lower
    BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
    are, it's at least a missed opportunity and it probably would be a good idea
    to teach workqueue about it.
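
    One plausible way to express such scaling, as a userspace sketch (the exact
    formula below - scale the 10ms default inversely with BogoMIPS under 4000
    and cap at 1 second - is an assumption for illustration, not lifted from
    the patch):

      #include <stdio.h>

      #define USEC_PER_MSEC 1000UL
      #define USEC_PER_SEC  1000000UL

      static unsigned long scaled_thresh_us(unsigned long bogomips)
      {
              unsigned long thresh = 10 * USEC_PER_MSEC;  /* 10ms default */

              if (bogomips && bogomips < 4000)
                      thresh = thresh * 4000 / bogomips;  /* assumed scaling rule */
              if (thresh > USEC_PER_SEC)
                      thresh = USEC_PER_SEC;              /* cap at 1 second */
              return thresh;
      }

      int main(void)
      {
              printf("%lu us\n", scaled_thresh_us(4800)); /* fast box: 10000 */
              printf("%lu us\n", scaled_thresh_us(100));  /* slow box: 400000 */
              printf("%lu us\n", scaled_thresh_us(20));   /* very slow: capped at 1000000 */
              return 0;
      }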

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
Waiman Long c9a9cddde4 workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 18c8ae813156a6855f026de80fffb91e1a28ab3d
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Thu, 25 May 2023 12:00:38 +0800

    workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0

    If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
    for CPU-hogging per-cpu work items will keep triggering spuriously:

      workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
      workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
      workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND

    This commit therefore disables the CPU-hog detection mechanism when
    workqueue.cpu_intensive_thresh_us is set to 0.

    tj: Patch description updated and the condition check on
        cpu_intensive_thresh_us separated into a separate if statement for
        readability.

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long ebdb8e47b2 workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c8f6219be2e58d7f676935ae90b64abef5d0966a
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Wed, 24 May 2023 11:53:39 +0800

    workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()

    Currently, pool->nr_running can be modified from the timer tick, which means
    the timer tick can run nested inside a not-irq-protected section that's in
    the process of modifying nr_running. Consider the following scenario:

    CPU0
    kworker/0:2 (events)
       worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
       ->pool->nr_running++;  (1)

       process_one_work()
       ->worker->current_func(work);
         ->schedule()
           ->wq_worker_sleeping()
             ->worker->sleeping = 1;
             ->pool->nr_running--;  (0)
               ....
           ->wq_worker_running()
                   ....
                   CPU0 by interrupt:
                   wq_worker_tick()
                   ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
                     ->pool->nr_running--;  (-1)
                     ->worker->flags |= WORKER_CPU_INTENSIVE;
                   ....
             ->if (!(worker->flags & WORKER_NOT_RUNNING))
               ->pool->nr_running++;    (will not execute)
             ->worker->sleeping = 0;
             ....
        ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
          ->pool->nr_running++;  (0)
        ....
        worker_set_flags(worker, WORKER_PREP);
        ->pool->nr_running--;   (-1)
        ....
        worker_enter_idle()
        ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

    If nr_workers is equal to nr_idle while nr_running is not zero, the
    WARN_ON_ONCE() triggers.

    [    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
    [    2.462163] Modules linked in:
    [    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
    [    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [    2.465127] Workqueue:  0x0 (events)
    [    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
    ...
    [    2.472614] Call Trace:
    [    2.473152]  <TASK>
    [    2.474182]  worker_thread+0x71/0x430
    [    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
    [    2.475263]  kthread+0x103/0x120
    [    2.475493]  ? __pfx_worker_thread+0x10/0x10
    [    2.476355]  ? __pfx_kthread+0x10/0x10
    [    2.476635]  ret_from_fork+0x2c/0x50
    [    2.477051]  </TASK>

    This commit therefore adds a check of worker->sleeping in wq_worker_tick()
    and returns directly if it is not zero.

    tj: Updated comment and description.

    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Anders Roxell <anders.roxell@linaro.org>
    Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long de650632ad workqueue: Track and monitor per-workqueue CPU time usage
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:09 -1000

    workqueue: Track and monitor per-workqueue CPU time usage

    Now that wq_worker_tick() is there, we can easily track the rough CPU time
    consumption of each workqueue by charging the whole tick whenever a tick
    hits an active workqueue. While not super accurate, it provides reasonable
    visibility into the workqueues that consume a lot of CPU cycles.
    wq_monitor.py is updated to report the per-workqueue CPU times.

    v2: wq_monitor.py was using "cputime" as the key when outputting in json
        format. Use "cpu_time" instead for consistency with other fields.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long b2d36126d6 workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 6363845005202148b8409ec3082e80845c19d309
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism

    Workqueue now automatically marks per-cpu work items that hog CPU for too
    long as CPU_INTENSIVE, which excludes them from concurrency management and
    prevents stalling other concurrency-managed work items. If a work function
    keeps running over the threshold, it likely needs to be switched to use an
    unbound workqueue.

    This patch adds a debug mechanism which tracks the work functions which
    trigger the automatic CPU_INTENSIVE mechanism and report them using
    pr_warn() with exponential backoff.

    v3: Documentation update.

    v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
        introducing circular locking dependency through printk but not effective
        as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
        just print directly using printk_deferred().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long 1665f6ac9c workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

    If a per-cpu work item hogs the CPU, it can prevent other work items from
    starting through concurrency management. A per-cpu workqueue which intends
    to host such CPU-hogging work items can choose to not participate in
    concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
    error-prone and difficult to debug when missed.

    This patch adds an automatic CPU usage based detection. If a
    concurrency-managed work item consumes more CPU time than the threshold
    (10ms by default) continuously without intervening sleeps, wq_worker_tick()
    which is called from scheduler_tick() will detect the condition and
    automatically mark it CPU_INTENSIVE.

    The mechanism isn't foolproof:

    * Detection depends on tick hitting the work item. Getting preempted at the
      right timings may allow a violating work item to evade detection at least
      temporarily.

    * nohz_full CPUs may not be running ticks and thus can fail detection.

    * Even when detection is working, the 10ms detection delays can add up if
      many CPU-hogging work items are queued at the same time.

    However, in vast majority of cases, this should be able to detect violations
    reliably and provide reasonable protection with a small increase in code
    complexity.

    If some work items trigger this condition repeatedly, the bigger problem
    likely is the CPU being saturated with such per-cpu work items and the
    solution would be making them UNBOUND. The next patch will add a debug
    mechanism to help spot such cases.
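
    A toy userspace model of the tick-based detection: charge CPU time to the
    running work item on every "tick" and flag it once it crosses the threshold
    without sleeping. All numbers and names are made up; this is not
    wq_worker_tick().

      #include <stdbool.h>
      #include <stdio.h>

      #define THRESH_US 10000UL  /* mirrors the 10ms default threshold */
      #define TICK_US   1000UL   /* pretend tick period */

      struct work_state {
              unsigned long ran_us;  /* CPU time consumed since last sleep */
              bool cpu_intensive;    /* excluded from concurrency management */
      };

      /* Called once per tick while the work item runs on the CPU. */
      static void worker_tick(struct work_state *w)
      {
              w->ran_us += TICK_US;
              if (!w->cpu_intensive && w->ran_us >= THRESH_US) {
                      w->cpu_intensive = true;
                      printf("marked CPU_INTENSIVE after %lu us\n", w->ran_us);
              }
      }

      int main(void)
      {
              struct work_state w = { 0 };

              for (int tick = 0; tick < 12; tick++)
                      worker_tick(&w);  /* flips to CPU_INTENSIVE on the 10th tick */
              return 0;
      }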

    v4: Documentation for workqueue.cpu_intensive_thresh_us added to
        kernel-parameters.txt.

    v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
        suggested by Peter.

    v2: Lai pointed out that wq_worker_stopping() also needs to be called from
        preemption and rtlock paths and an earlier patch was updated
        accordingly. This patch adds a comment describing the risk of infinite
        recursions and how they're avoided.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long d067533aa7 workqueue: Improve locking rule description for worker fields
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bdf8b9bfc131864f0fcef268b34123acfb6a1b59
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Improve locking rule description for worker fields

    * Some worker fields are modified only by the worker itself while holding
      pool->lock, thus making them safe to read from self, from IRQ context if
      the CPU is running the worker, or while holding pool->lock. Add a 'K'
      locking rule for them.

    * worker->sleeping is currently marked "None" which isn't very descriptive.
      It's used only by the worker itself. Add 'S' locking rule for it.

    A future patch will depend on the 'K' rule to access worker->current_* from
    the scheduler ticks.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long bdad1a320c workqueue: Move worker_set/clr_flags() upwards
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c54d5046a06b90adb3d1188f0741a88692854354
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Move worker_set/clr_flags() upwards

    They are going to be used in wq_worker_stopping(). Move them upwards.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 20a387c381 workqueue: Add pwq->stats[] and a monitoring script
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Add pwq->stats[] and a monitoring script

    Currently, the only way to peer into workqueue operations is through
    tracing. While possible, it isn't easy or convenient to monitor
    per-workqueue behaviors over time this way. Let's add pwq->stats[] that
    track relevant events and a drgn monitoring script -
    tools/workqueue/wq_monitor.py.

    It's arguable whether this needs to be configurable. However, it currently
    only has several counters and the runtime overhead shouldn't be noticeable
    given that they're on pwq's which are per-cpu on per-cpu workqueues and
    per-numa-node on unbound ones. Let's keep it simple for the time being.

    v2: Patch reordered to earlier with fewer fields. Fields will be added back
        gradually. Help message improved.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 0d4b8874cf Further upgrade queue_work_on() comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 854f5cc5b7355ceebf2bdfed97ea8f3c5d47a0c3
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 28 Apr 2023 16:47:07 -0700

    Further upgrade queue_work_on() comment

    The current queue_work_on() docbook comment says that the caller must
    ensure that the specified CPU can't go away, and further says that the
    penalty for failing to nail down the specified CPU is that the workqueue
    handler might find itself executing on some other CPU.  This is true
    as far as it goes, but fails to note what happens if the specified CPU
    never was online.  Therefore, further expand this comment to say that
    specifying a CPU that was never online will result in a splat.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long e8df0001f6 workqueue: clean up WORK_* constant types, clarify masking
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit afa4bb778e48d79e4a642ed41e3b4e0de7489a6c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri, 23 Jun 2023 12:08:14 -0700

    workqueue: clean up WORK_* constant types, clarify masking

    Dave Airlie reports that gcc-13.1.1 has started complaining about some
    of the workqueue code in 32-bit arm builds:

      kernel/workqueue.c: In function ‘get_work_pwq’:
      kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
        713 |                 return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
            |                        ^
      [ ... a couple of other cases ... ]

    and while it's not immediately clear exactly why gcc started complaining
    about it now, I suspect some C23-induced enum type handling fixup in
    gcc-13 is the cause.

    Whatever the reason for starting to complain, the code and data types
    are indeed disgusting enough that the complaint is warranted.

    The wq code ends up creating various "helper constants" (like that
    WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
    confused.  The mask needs to be 'unsigned long', not some unspecified
    enum type.

    To make matters worse, the actual "mask and cast to a pointer" is
    repeated a couple of times, and the cast isn't even always done to the
    right pointer, but - as in the error case above - to a 'void *', with the
    compiler then finishing the job.

    That's not how we roll in the kernel.

    So create the masks using the proper types rather than some ambiguous
    enumeration, and use a nice helper that actually does the type
    conversion in one well-defined place.

    Incidentally, this magically makes clang generate better code.  That,
    admittedly, is really just a sign of clang having been seriously
    confused before, and cleaning up the typing unconfuses the compiler too.
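
    The shape of the cleanup - an unsigned long mask and a single helper doing
    the mask-and-cast - can be shown with a stand-alone sketch. The struct and
    the bit layout here are dummies for illustration, not the kernel's
    pool_workqueue or its actual flag bits.

      #include <stdalign.h>
      #include <stdio.h>

      #define WORK_STRUCT_FLAG_BITS    8
      #define WORK_STRUCT_WQ_DATA_MASK (~((1UL << WORK_STRUCT_FLAG_BITS) - 1))

      struct dummy_pwq { int id; };

      /* One well-defined place that masks the flag bits off and casts to the
       * pointer type, instead of repeating the cast all over. */
      static inline struct dummy_pwq *work_data_to_pwq(unsigned long data)
      {
              return (struct dummy_pwq *)(data & WORK_STRUCT_WQ_DATA_MASK);
      }

      int main(void)
      {
              /* The object must be aligned so the low flag bits are free. */
              static alignas(1 << WORK_STRUCT_FLAG_BITS) struct dummy_pwq pwq = { .id = 42 };
              unsigned long data = (unsigned long)&pwq | 0x3;  /* pointer plus two flags */

              printf("pwq id: %d\n", work_data_to_pwq(data)->id);
              return 0;
      }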

    Reported-by: Dave Airlie <airlied@gmail.com>
    Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 2a1c329725 workqueue: Introduce show_freezable_workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 704bc669e1dda3eb8f6d5cb462b21e85558a3912
Author: Jungseung Lee <js07.lee@samsung.com>
Date:   Mon, 20 Mar 2023 12:29:05 +0900

    workqueue: Introduce show_freezable_workqueues

    Currently show_all_workqueues is called if freezing the workqueues fails,
    which shows the status of all workqueues and of all worker pools. In this
    case we may only need to dump the state of the workqueues that are
    freezable and busy.

    This patch defines show_freezable_workqueues, which uses
    show_one_workqueue, a granular function that shows the state of an
    individual workqueue, so that only the state of freezable workqueues is
    dumped at that time.

    tj: Minor message adjustment.

    Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 058229c9f6 workqueue: Print backtraces from CPUs with hung CPU bound workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cd2440d66fec7d1bdb4f605b64c27c63c9141989
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:35 +0100

    workqueue: Print backtraces from CPUs with hung CPU bound workqueues

    The workqueue watchdog reports a lockup when there has not been any progress
    in the worker pool for a long time. Progress means that a pending
    work item starts being processed.

    Worker pools for unbound workqueues always wake up an idle worker and
    try to process the work immediately. The last idle worker has to create
    a new worker first. The stall might happen only when a new worker could
    not be created, in which case an error should get printed. Another problem
    might be too high load. In this case, workers are victims of a global
    system problem.

    Worker pools for CPU bound workqueues are designed for lightweight
    work items that do not need much CPU time. They are processed one by
    one by a single worker. A new worker is used only when a work item is
    sleeping. This creates one additional scenario. The stall might happen
    when the CPU-bound workqueue is used for CPU-intensive work.

    More precisely, the stall is detected when a CPU-bound worker is in
    the TASK_RUNNING state for too long. In this case, it might be useful
    to see the backtrace from the problematic worker.

    The information about how long a worker has been in the running state is not
    available. But the CPU-bound worker pools do not have many workers in the
    running state by definition. And only a few pools are typically blocked.

    It should be acceptable to print backtraces from all workers in
    TASK_RUNNING state in the stalled worker pools. The number of false
    positives should be very low.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 4f4620189e workqueue: Warn when a rescuer could not be created
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4c0736a76a186e5df2cd2afda3e7a04d2a427d1b
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:34 +0100

    workqueue: Warn when a rescuer could not be created

    Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
    It typically happens during the system boot.

    systemd switches the root filesystem from initrd to the booted system
    during boot. It kills processes that block the switch for too long.
    One of those processes might be modprobe trying to create a workqueue.

    These problems are hard to reproduce. Also, alloc_workqueue() does not
    propagate the error code. Make the debugging easier by printing an
    error, similar to create_worker().

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long e97b3edb25 workqueue: Interrupted create_worker() is not a repeated event
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 60f540389a5d2df25ddc7ad511b4fa2880dea521
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:33 +0100

    workqueue: Interrupted create_worker() is not a repeated event

    kthread_create_on_node() might get interrupted. It is rare but
    realistic. For example, when an unbound workqueue is allocated in a
    module_init() callback, it is done in the context of the "modprobe"
    process, and systemd might kill pending processes when switching root
    from initrd to the booted system.

    The interrupt is a one-off event and the race might be hard to reproduce.
    It is always worth printing.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 48949a02ba workqueue: Warn when a new worker could not be created
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 3f0ea0b864562c6bd1cee892026067eaea7be242
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:32 +0100

    workqueue: Warn when a new worker could not be created

    The workqueue watchdog reports a lockup when there has not been any
    progress in the worker pool for a long time. Progress means that a
    pending work item starts being processed.

    The progress is guaranteed by using idle workers or creating new workers
    for pending work items.

    There are several reasons why a new worker could not be created:

       + there is not enough memory

       + there is no free pool ID (IDR API)

       + the system reached PID limit

       + the process creating the new worker was interrupted

       + the last idle worker (manager) has not been scheduled for a long
         time. It was not able to even start creating the kthread.

    None of these failures is reported at the moment. The only clue is that
    show_one_worker_pool() prints that there is a manager. It is the last
    idle worker that is responsible for creating a new one. But it is not
    clear if create_worker() is failing and why.

    Make the debugging easier by printing errors in create_worker().

    The error code is important, especially from kthread_create_on_node().
    It helps to distinguish the various reasons. For example, reaching
    memory limit (-ENOMEM), other system limits (-EAGAIN), or process
    interrupted (-EINTR).

    Use pr_err_once() to avoid repeating the same error every
    CREATE_COOLDOWN period for each stuck worker pool.

    A ratelimited printk() might be better. It would help to know whether
    the problem persists and would make it clearer whether the
    create_worker() errors and workqueue stalls are related. Also, old
    messages might get lost when the internal log buffer is full. The
    problem is that printk() might touch the watchdog. For example, see
    touch_nmi_watchdog() in serial8250_console_write(). It would require
    synchronizing the beginning and length of the ratelimit interval with
    the workqueue watchdog. Otherwise, the error messages might break the
    watchdog. This does not look worth the complexity.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
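
A minimal sketch of the kind of error reporting the commit above adds to
create_worker(). The helper name, thread name and message text below are
illustrative only; the real code lives inside kernel/workqueue.c.

  ----------
  #include <linux/err.h>
  #include <linux/kthread.h>
  #include <linux/printk.h>
  #include <linux/sched.h>

  /* Hypothetical helper: spawn a worker thread and report failures once. */
  static int demo_spawn_worker(int (*fn)(void *), void *arg, int node)
  {
          struct task_struct *task;

          task = kthread_create_on_node(fn, arg, node, "kworker/demo");
          if (IS_ERR(task)) {
                  /* -ENOMEM, -EAGAIN (PID/resource limits) or -EINTR land
                   * here; %pe prints the symbolic error code. */
                  pr_err_once("demo: failed to create a worker thread: %pe\n", task);
                  return PTR_ERR(task);
          }
          wake_up_process(task);
          return 0;
  }
  ----------
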
Waiman Long c3296e8c7e workqueue: Fix hung time report of worker pools
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:31 +0100

    workqueue: Fix hung time report of worker pools

    The workqueue watchdog prints a warning when there is no progress in
    a worker pool, where progress means that the pool started processing
    a pending work item.

    Note that it is perfectly fine for a work item to take much longer to
    process. Progress should be guaranteed by waking up or creating idle
    workers.

    show_one_worker_pool() prints the state of a non-idle worker pool. It
    shows the delay since the last pool->watchdog_ts.

    The timestamp is updated when the first pending work item is queued in
    __queue_work(). It is also updated when a work item is dequeued for
    processing in worker_thread() and rescuer_thread().

    The delay is misleading when there is no pending work item. In this
    case it shows how long the last work item has been processed. Show
    zero instead. There is no stall if there is no pending work.

    Fixes: 82607adcf9 ("workqueue: implement lockup detector")
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long cf8b90187f workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: Context diff due to the presence of a later upstream commit
	   4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
	   is never empty").

commit a8ec5880bd82b834717770cba4596381ffd50545
Author: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Date:   Sun, 26 Feb 2023 23:53:20 +0700

    workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()

    Use pr_warn_once() to achieve the same thing. It's simpler.

    Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long a088b39f32 workqueue: Make show_pwq() use run-length encoding
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c76feb0d5dfdb90b70fa820bb3181142bb01e980
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 6 Jan 2023 16:10:24 -0800

    workqueue: Make show_pwq() use run-length encoding

    The show_pwq() function dumps out a pool_workqueue structure's activity,
    including the pending work-queue handlers:

     Showing busy workqueues and worker pools:
     workqueue events: flags=0x0
       pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
         in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
         pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1

    When large systems are facing certain types of hang conditions, it is not
    unusual for this "pending" list to contain runs of hundreds of identical
    function names.  This "wall of text" is difficult to read, and worse yet,
    it can be interleaved with other output such as stack traces.

    Therefore, make show_pwq() use run-length encoding so that the above
    printout instead looks like this:

     Showing busy workqueues and worker pools:
     workqueue events: flags=0x0
       pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
         in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
         pending: 2*test_work_func, 5*test_work_func1

    When no comma would be printed, including the WORK_STRUCT_LINKED case,
    a new run is started unconditionally.

    This output is more readable, places less stress on the hardware,
    firmware, and software on the console-log path, and reduces interference
    with other output.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Dave Jones <davej@codemonkey.org.uk>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
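
The encoding itself is straightforward; the user-space sketch below (harness and
names made up) shows the idea of collapsing consecutive identical names into
"N*name", producing the kind of output quoted above.

  ----------
  #include <stdio.h>
  #include <string.h>

  /* Collapse runs of identical names into "N*name", comma separated. */
  static void print_rle(const char *names[], int n)
  {
          int i = 0;

          while (i < n) {
                  int run = 1;

                  while (i + run < n && !strcmp(names[i], names[i + run]))
                          run++;
                  if (run > 1)
                          printf("%s%d*%s", i ? ", " : "", run, names[i]);
                  else
                          printf("%s%s", i ? ", " : "", names[i]);
                  i += run;
          }
          printf("\n");
  }

  int main(void)
  {
          const char *pending[] = {
                  "test_work_func", "test_work_func",
                  "test_work_func1", "test_work_func1", "test_work_func1",
                  "test_work_func1", "test_work_func1",
          };

          /* prints: 2*test_work_func, 5*test_work_func1 */
          print_rle(pending, 7);
          return 0;
  }
  ----------
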
Waiman Long 7de5240e80 workqueue: Add a new flag to spot the potential UAF error
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 33e3f0a3358b8f9bb54b2661b9c1d37a75664c79
Author: Richard Clark <richard.xnu.clark@gmail.com>
Date:   Tue, 13 Dec 2022 12:39:36 +0800

    workqueue: Add a new flag to spot the potential UAF error

    Currently, if the user unintentionally queues a new work item into a
    wq after destroy_workqueue(wq), the work can still be queued and
    scheduled without any noticeable kernel message before the end of an
    RCU grace period.

    As a debug-aid facility, this commit adds a new flag
    __WQ_DESTROYING to spot that issue by triggering a kernel WARN
    message.

    Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
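
The module sketch below (all names made up) shows the kind of misuse the new
flag is meant to catch: a work item queued onto a workqueue after
destroy_workqueue(). With __WQ_DESTROYING set, this now triggers a WARN instead
of queueing silently.

  ----------
  #include <linux/module.h>
  #include <linux/workqueue.h>

  static void demo_fn(struct work_struct *work) { }
  static DECLARE_WORK(demo_work, demo_fn);

  static int __init demo_init(void)
  {
          struct workqueue_struct *wq = alloc_workqueue("demo_wq", 0, 0);

          if (!wq)
                  return -ENOMEM;
          destroy_workqueue(wq);
          /* Buggy use-after-destroy: previously queued silently, now WARNs. */
          queue_work(wq, &demo_work);
          return -EINVAL;
  }
  module_init(demo_init);
  MODULE_LICENSE("GPL");
  ----------
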
Waiman Long 50f44cde6c workqueue: Make queue_rcu_work() use call_rcu_hurry()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit a7e30c0e9a5f95b7f74e6272d9c75fd65c897721
Author: Uladzislau Rezki <urezki@gmail.com>
Date:   Sun, 16 Oct 2022 16:23:03 +0000

    workqueue: Make queue_rcu_work() use call_rcu_hurry()

    Earlier commits in this series allow battery-powered systems to build
    their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
    This Kconfig option causes call_rcu() to delay its callbacks in order
    to batch them.  This means that a given RCU grace period covers more
    callbacks, thus reducing the number of grace periods, in turn reducing
    the amount of energy consumed, which increases battery lifetime which
    can be a very good thing.  This is not a subtle effect: In some important
    use cases, the battery lifetime is increased by more than 10%.

    This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
    callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
    parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.

    Delaying callbacks is normally not a problem because most callbacks do
    nothing but free memory.  If the system is short on memory, a shrinker
    will kick all currently queued lazy callbacks out of their laziness,
    thus freeing their memory in short order.  Similarly, the rcu_barrier()
    function, which blocks until all currently queued callbacks are invoked,
    will also kick lazy callbacks, thus enabling rcu_barrier() to complete
    in a timely manner.

    However, there are some cases where laziness is not a good option.
    For example, synchronize_rcu() invokes call_rcu(), and blocks until
    the newly queued callback is invoked.  It would not be good for
    synchronize_rcu() to block for ten seconds, even on an idle system.
    Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
    call_rcu().  The arrival of a non-lazy call_rcu_hurry() callback on a
    given CPU kicks any lazy callbacks that might be already queued on that
    CPU.  After all, if there is going to be a grace period, all callbacks
    might as well get full benefit from it.

    Yes, this could be done the other way around by creating a
    call_rcu_lazy(), but earlier experience with this approach and
    feedback at the 2022 Linux Plumbers Conference shifted the approach
    to call_rcu() being lazy with call_rcu_hurry() for the few places
    where laziness is inappropriate.

    And another call_rcu() instance that cannot be lazy is the one
    in queue_rcu_work(), given that callers to queue_rcu_work() are
    not necessarily OK with long delays.

    Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
    to the old behavior.

    [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

    Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
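
Caller-side usage stays the same; below is a minimal sketch (names made up) of
deferring work past an RCU grace period via queue_rcu_work(), which now arms
the grace period with call_rcu_hurry() internally so it is not subject to
CONFIG_RCU_LAZY batching delays.

  ----------
  #include <linux/workqueue.h>

  /* Runs from a workqueue after a full RCU grace period has elapsed. */
  static void demo_reclaim_fn(struct work_struct *work) { }

  static struct rcu_work demo_rwork;

  static void demo_free_deferred(void)
  {
          INIT_RCU_WORK(&demo_rwork, demo_reclaim_fn);
          queue_rcu_work(system_wq, &demo_rwork);
  }
  ----------
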
Waiman Long 50392f94c5 treewide: Drop WARN_ON_FUNCTION_MISMATCH
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4b24356312fbe1bace72f9905d529b14fc34c1c3
Author: Sami Tolvanen <samitolvanen@google.com>
Date:   Thu, 8 Sep 2022 14:54:56 -0700

    treewide: Drop WARN_ON_FUNCTION_MISMATCH

    CONFIG_CFI_CLANG no longer breaks cross-module function address
    equality, which makes WARN_ON_FUNCTION_MISMATCH unnecessary. Remove
    the definition and switch back to WARN_ON_ONCE.

    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kees Cook <keescook@chromium.org>
    Tested-by: Nathan Chancellor <nathan@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220908215504.3686827-15-samitolvanen@google.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long c023e9b1d1 workqueue: Convert the type of pool->nr_running to int
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bc35f7ef96284b8c963991357a9278a6beafca54
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:40 +0800

    workqueue: Convert the type of pool->nr_running to int

    It is only modified on the associated CPU, so it doesn't need to be atomic.

    tj: Comment updated.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long ec7c270fc1 workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cc5bff38463e0894dd596befa99f9d6860e15f5e
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:39 +0800

    workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code

    The wakeup code in wq_worker_sleeping() is the same as wake_up_worker().

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long 5ea3f1a8eb workqueue: Upgrade queue_work_on() comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 443378f0664a78756c3e3aeaab92750fe1e05735
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 30 Nov 2021 17:00:30 -0800

    workqueue: Upgrade queue_work_on() comment

    The current queue_work_on() docbook comment says that the caller must
    ensure that the specified CPU can't go away, but does not spell out the
    consequences, which turn out to be quite mild.  Therefore expand this
    comment to explicitly say that the penalty for failing to nail down the
    specified CPU is that the workqueue handler might find itself executing
    on some other CPU.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
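
An illustrative caller-side sketch (names made up): pin the current CPU across
queue_work_on() so it cannot go away. Per the expanded comment, skipping such
pinning is mostly harmless; if the chosen CPU goes offline, the handler may
simply run on some other CPU.

  ----------
  #include <linux/smp.h>
  #include <linux/workqueue.h>

  static void demo_fn(struct work_struct *work) { }
  static DECLARE_WORK(demo_work, demo_fn);

  static void demo_queue_on_local_cpu(void)
  {
          int cpu = get_cpu();    /* disables preemption, pins the CPU */

          queue_work_on(cpu, system_wq, &demo_work);
          put_cpu();
  }
  ----------
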
Audra Mitchell 3b21196ba3 workqueue: Shorten events_freezable_power_efficient name
JIRA: https://issues.redhat.com/browse/RHEL-3534

This patch is a backport of the following upstream commit:
commit 8318d6a6362f5903edb4c904a8dd447e59be4ad1
Author: Audra Mitchell <audra@redhat.com>
Date:   Thu Jan 25 14:05:32 2024 -0500

    workqueue: Shorten events_freezable_power_efficient name

    Since we have set WQ_NAME_LEN to 32, shorten the name of
    events_freezable_power_efficient so that it does not trip the name
    length warning when the workqueue is created.

    Signed-off-by: Audra Mitchell <audra@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-05-03 09:45:58 -04:00
Audra Mitchell f45c2f9160 workqueue.c: Increase workqueue name length
JIRA: https://issues.redhat.com/browse/RHEL-3534

This patch is a backport of the following upstream commit:
commit 31c89007285d365aa36f71d8fb0701581c770a27
Author: Audra Mitchell <audra@redhat.com>
Date:   Mon Jan 15 12:08:22 2024 -0500

    workqueue.c: Increase workqueue name length

    Currently we limit the size of the workqueue name to 24 characters due to
    commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len")
    Increase the size to 32 characters and print a warning in the event
    the requested name is larger than the limit of 32 characters.

    Signed-off-by: Audra Mitchell <audra@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-05-03 09:45:58 -04:00
Leonardo Bras 6f7f4ba4b1 workqueue: Avoid using isolated cpus' timers on queue_delayed_work
JIRA: https://issues.redhat.com/browse/RHEL-20254
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/

commit aae17ebb53cd3da37f5dfbde937acd091eb4340c
Author: Leonardo Bras <leobras@redhat.com>
Date:   Mon Jan 29 22:00:46 2024 -0300

    workqueue: Avoid using isolated cpus' timers on queue_delayed_work

    When __queue_delayed_work() is called, it chooses a cpu for handling
    the timer interrupt. As of today, it will pick either the cpu passed
    as a parameter or the last cpu used for this.

    This is not good if a system does use CPU isolation, because it can take
    away some valuable cpu time to:
    1 - deal with the timer interrupt,
    2 - schedule-out the desired task,
    3 - queue work on a random workqueue, and
    4 - schedule the desired task back to the cpu.

    So to fix this, during __queue_delayed_work(), if cpu isolation is in
    place, pick a random non-isolated cpu to handle the timer interrupt.

    As an optimization, if the current cpu is not isolated, use it instead
    of looking for another candidate.

    Signed-off-by: Leonardo Bras <leobras@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Leonardo Bras <leobras@redhat.com>
2024-02-22 16:47:15 -03:00
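
A hedged sketch of the selection policy described above: if the requested or
local CPU is isolated, arm the delayed-work timer on a housekeeping CPU
instead. The helper name is hypothetical (the mainline logic lives inside
__queue_delayed_work()) and HK_TYPE_TIMER is assumed here as the relevant
housekeeping category.

  ----------
  #include <linux/cpumask.h>
  #include <linux/sched/isolation.h>
  #include <linux/smp.h>
  #include <linux/workqueue.h>

  /* Hypothetical helper: pick a non-isolated CPU to run the timer on. */
  static int demo_pick_timer_cpu(int requested_cpu)
  {
          const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_TIMER);
          int cpu = (requested_cpu != WORK_CPU_UNBOUND) ?
                    requested_cpu : raw_smp_processor_id();

          if (housekeeping_enabled(HK_TYPE_TIMER) && !cpumask_test_cpu(cpu, hk))
                  cpu = cpumask_any_distribute(hk);
          return cpu;
  }
  ----------
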
Waiman Long 6524bc7b74 workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff due to missing upstream commit
	   fcecfa8f271a ("workqueue: Remove module param disable_numa
	   and sysfs knobs pool_ids and numa").

commit 49277a5b76373e630075ff7d32fc0f9f51294f24
Author: Waiman Long <longman@redhat.com>
Date:   Mon, 20 Nov 2023 21:18:40 -0500

    workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS

    Commit fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask()
    to exclude CPUs from wq_unbound_cpumask") makes
    workqueue_set_unbound_cpumask() static as it is not used elsewhere in
    the kernel. However, this triggers a kernel test robot warning about
    'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYSFS
    isn't defined. It happens that workqueue_set_unbound_cpumask() is only
    called when CONFIG_SYSFS is defined.

    Move workqueue_set_unbound_cpumask() and its helpers inside the
    CONFIG_SYSFS compilation block to avoid the warning. There is no
    functional change.

    Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:47 -05:00
Waiman Long 24be7e35b7 workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts:
 1) A merge conflict in the workqueue_unbound_exclude_cpumask() hunk
    of kernel/workqueue.c due to missing upstream commit 63c5484e7495
    ("workqueue: Add multiple affinity scopes and interface to select
    them").
 2) A merge conflict in the workqueue_init_early() hunk of
    kernel/workqueue.c due to upstream merge conflict resolved according
    to upstream merge commit 202595663905 ("Merge branch 'for-6.7-fixes'
    of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-6.8").

commit fe28f631fa941fba583d1c4f25895284b90af671
Author: Waiman Long <longman@redhat.com>
Date:   Wed, 25 Oct 2023 14:25:52 -0400

    workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask

    When the "isolcpus" boot command line option is used to add a set
    of isolated CPUs, those CPUs will be excluded automatically from
    wq_unbound_cpumask to avoid running work functions from unbound
    workqueues.

    Recently cpuset has been extended to allow the creation of partitions
    of isolated CPUs dynamically. To make it closer to the "isolcpus"
    in functionality, the CPUs in those isolated cpuset partitions should be
    excluded from wq_unbound_cpumask as well. This can be done currently by
    explicitly writing to the workqueue's cpumask sysfs file after creating
    the isolated partitions. However, this process can be error prone.

    Ideally, the cpuset code should be allowed to request the workqueue code
    to exclude those isolated CPUs from wq_unbound_cpumask so that this
    operation can be done automatically and the isolated CPUs will be
    returned to wq_unbound_cpumask after the destruction of the isolated
    cpuset partitions.

    This patch adds a new workqueue_unbound_exclude_cpumask() function to
    enable that. This new function will exclude the specified isolated
    CPUs from wq_unbound_cpumask. To be able to restore those isolated
    CPUs back after the destruction of isolated cpuset partitions, a new
    wq_requested_unbound_cpumask is added to store the user provided unbound
    cpumask either from the boot command line options or from writing to
    the cpumask sysfs file. This new cpumask provides the basis for CPU
    exclusion.

    To enable users to understand how the wq_unbound_cpumask is being
    modified internally, this patch also exposes the newly introduced
    wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to
    store the cpumask to be excluded from wq_unbound_cpumask as read-only
    sysfs files.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:47 -05:00
Waiman Long 1d28ea804a workqueue: Make sure that wq_unbound_cpumask is never empty
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A merge conflict due to missing upstream commit fef59c9cab6a
	   ("workqueue: Rename NUMA related names to use pod instead")
	   and two other subsequent workqueue commits.

commit 4a6c5607d4502ccd1b15b57d57f17d12b6f257a7
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 21 Nov 2023 11:39:36 -1000

    workqueue: Make sure that wq_unbound_cpumask is never empty

    During boot, depending on how the housekeeping and workqueue.unbound_cpus
    masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9b1
    ("workqueue: Implement non-strict affinity scope for unbound workqueues"),
    this may end up feeding -1 as a CPU number into the scheduler, leading to oopses.

      BUG: unable to handle page fault for address: ffffffff8305e9c0
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      ...
      Call Trace:
       <TASK>
       select_idle_sibling+0x79/0xaf0
       select_task_rq_fair+0x1cb/0x7b0
       try_to_wake_up+0x29c/0x5c0
       wake_up_process+0x19/0x20
       kick_pool+0x5e/0xb0
       __queue_work+0x119/0x430
       queue_work_on+0x29/0x30
      ...

    An empty wq_unbound_cpumask is a clear misconfiguration and already
    disallowed once the system is booted up. Let's warn on and ignore
    unbound_cpumask restrictions which lead to no unbound cpus. While at it,
    also remove the now unnecessary empty check on wq_unbound_cpumask in
    wq_select_unbound_cpu().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
    Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
    Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
    Cc: stable@vger.kernel.org # v6.6+
    Reviewed-by: Waiman Long <longman@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:46 -05:00
Waiman Long bed8f3efe3 workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
JIRA: https://issues.redhat.com/browse/RHEL-21798

commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 10 Oct 2023 22:48:42 -0400

    workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()

    Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
    to be ordered") enabled implicit ordered attribute to be added to
    WQ_UNBOUND workqueues with max_active of 1. This prevented the attributes
    of these workqueues from being changed, leading to fix commit 0a94efb5ac
    ("workqueue: implicit ordered attribute should be overridable").

    However, workqueue_apply_unbound_cpumask() was not updated at that time.
    So sysfs changes to wq_unbound_cpumask has no effect on WQ_UNBOUND
    workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
    workqueues are visible on sysfs, we are not able to make all the
    necessary cpumask changes even if we iterate over all the workqueue
    cpumasks in sysfs and change them one by one.

    Fix this problem by applying the corresponding change made
    to apply_workqueue_attrs_locked() in the fix commit to
    workqueue_apply_unbound_cpumask().

    Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:46 -05:00
Waiman Long 4356616088 workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff in kernel/workqueue.c due to missing
	   upstream commit 20bdedafd2f6 ("workqueue: Warn attempt to
	   flush system-wide workqueues.").

commit ace3c5499e61ef7c0433a7a297227a9bdde54a55
Author: tiozhang <tiozhang@didiglobal.com>
Date:   Thu, 29 Jun 2023 11:50:50 +0800

    workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time

    The motivation for doing this is to improve boot times for devices when
    we want to prevent our workqueue work items from running on some specific
    CPUs, e.g. CPUs that are busy with interrupts.

    Signed-off-by: tiozhang <tiozhang@didiglobal.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:45 -05:00
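
The parameter takes a standard cpulist. For example, booting with the line
below (CPU range chosen purely for illustration) further constrains unbound
workqueue work to CPUs 0-3:

  ----------
  workqueue.unbound_cpus=0-3
  ----------
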
Mark Langsdorf d4e81a63a3 workqueue: move to use bus_get_dev_root()
JIRA: https://issues.redhat.com/browse/RHEL-1023

commit 686f669780276da534e93ba769e02bdcf1f89f8d
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Mon Mar 13 19:28:50 2023 +0100

    workqueue: move to use bus_get_dev_root()

    Direct access to the struct bus_type dev_root pointer is going away soon
    so replace that with a call to bus_get_dev_root() instead, which is what
    it is there for.

    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2023-11-01 11:12:31 -05:00
Waiman Long 8cbdd24861 workqueue: Fold rebind_worker() within rebind_workers()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit c63a2e52d5e08f01140d7b76c08a78e15e801f03
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri, 13 Jan 2023 17:40:40 +0000

    workqueue: Fold rebind_worker() within rebind_workers()

    !CONFIG_SMP builds complain about rebind_worker() being unused. Its only
    user, rebind_workers(), is indeed only defined for CONFIG_SMP, so just fold
    the two lines back up there.

    Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:31 -04:00
Waiman Long 107339e408 workqueue: Unbind kworkers before sending them to exit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit e02b93124855cd34b78e61ae44846c8cb5fddfc3
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:31 +0000

    workqueue: Unbind kworkers before sending them to exit()

    It has been reported that isolated CPUs can suffer from interference due to
    per-CPU kworkers waking up just to die.

    A surge of workqueue activity during initial setup of a latency-sensitive
    application (refresh_vm_stats() being one of the culprits) can cause extra
    per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
    running merrily on an isolated CPU only to be interrupted sometime later by
    a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
    kworker activity).

    Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
    contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
    them with WORKER_DIE.

    Changing the affinity does require a sleepable context, leverage the newly
    introduced pool->idle_cull_work to get that.

    Remove dying workers from pool->workers and keep track of them in a
    separate list. This intentionally prevents for_each_loop_worker() from
    iterating over workers that are marked for death.

    Rename destroy_worker() to set_worker_dying() to better reflect its
    effects and relationship with wake_dying_workers().

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:31 -04:00
Waiman Long 813b945165 workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 9ab03be42b8f9136dcc01a90ecc9ac71bc6149ef
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:30 +0000

    workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE

    put_unbound_pool() currently passes wq_manager_inactive() as exit condition
    to rcuwait_wait_event(), which grabs pool->lock to check for

      pool->flags & POOL_MANAGER_ACTIVE

    A later patch will require destroy_worker() to be invoked with
    wq_pool_attach_mutex held, which needs to be acquired before
    pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
    it could clobber the task state set by rcuwait_wait_event().

    Instead, restructure the waiting logic to acquire any necessary lock
    outside of rcuwait_wait_event().

    Since further work cannot be inserted into unbound pwqs that have reached
    ->refcnt==0, this is bound to make forward progress as eventually the
    worklist will be drained and need_more_worker(pool) will remain false,
    preventing any worker from stealing the manager position from us.

    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:30 -04:00
Waiman Long 7ea6709544 workqueue: Convert the idle_timer to a timer + work_struct
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 3f959aa3b33829acfcd460c6c656d54dfebe8d1e
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:29 +0000

    workqueue: Convert the idle_timer to a timer + work_struct

    A later patch will require a sleepable context in the idle worker timeout
    function. Converting worker_pool.idle_timer to a delayed_work gives us just
    that; however, this would imply turning all idle_timer expiries into
    scheduler events (waking up a worker to handle the dwork).

    Instead, implement a "custom dwork" where the timer callback does some
    extra checks before queuing the associated work.

    No change in functionality intended.

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:30 -04:00
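
A minimal sketch of the "custom dwork" pattern described above: a plain timer
whose callback does a cheap check before queueing the real, sleepable work.
All names and the placeholder condition are illustrative, not the actual
workqueue.c symbols.

  ----------
  #include <linux/jiffies.h>
  #include <linux/timer.h>
  #include <linux/workqueue.h>

  struct demo_culler {
          struct timer_list timer;
          struct work_struct cull_work;
          bool have_excess_idle_workers;  /* stand-in for the real check */
  };

  static void demo_cull_work_fn(struct work_struct *work)
  {
          /* Sleepable context: may take mutexes, change affinities, ... */
  }

  static void demo_idle_timer_fn(struct timer_list *t)
  {
          struct demo_culler *dc = from_timer(dc, t, timer);

          /* Only turn the expiry into a scheduler event when there is
           * actually something to do; otherwise just re-arm the timer. */
          if (dc->have_excess_idle_workers)
                  queue_work(system_unbound_wq, &dc->cull_work);
          else
                  mod_timer(&dc->timer, jiffies + 5 * 60 * HZ);
  }

  static void demo_culler_init(struct demo_culler *dc)
  {
          INIT_WORK(&dc->cull_work, demo_cull_work_fn);
          timer_setup(&dc->timer, demo_idle_timer_fn, 0);
          mod_timer(&dc->timer, jiffies + 5 * 60 * HZ);
  }
  ----------
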
Waiman Long 2535806d83 workqueue: Factorize unbind/rebind_workers() logic
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 793777bc193b658f01924fd09b388eead26d741f
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:28 +0000

    workqueue: Factorize unbind/rebind_workers() logic

    Later patches will reuse this code, move it into reusable functions.

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:29 -04:00
Waiman Long d653c805fc workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 99c621ef243bda726fb8d982a274ded96570b410
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Thu, 12 Jan 2023 16:14:27 +0000

    workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

    When unbind_workers() reads wq_unbound_cpumask to set the affinity of
    freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
    sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

    Make wq_unbound_cpumask protected by wq_pool_attach_mutex as well, and
    remove the need for the temporary saved_cpumask.

    Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
    Reported-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:29 -04:00
Waiman Long 4e109dbd6a workqueue: don't skip lockdep work dependency in cancel_work_sync()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit c0feea594e058223973db94c1c32a830c9807c86
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Fri, 29 Jul 2022 13:30:23 +0900

    workqueue: don't skip lockdep work dependency in cancel_work_sync()

    Like Hillf Danton mentioned

      syzbot should have been able to catch cancel_work_sync() in work context
      by checking lockdep_map in __flush_work() for both flush and cancel.

    in [1], being unable to report the obvious deadlock scenario shown below
    means something is broken. From a locking dependency perspective, the sync
    version of a cancel request should behave like a flush request, for it
    waits for completion of the work if that work has already started
    execution.

      ----------
      #include <linux/module.h>
      #include <linux/sched.h>
      static DEFINE_MUTEX(mutex);
      static void work_fn(struct work_struct *work)
      {
        schedule_timeout_uninterruptible(HZ / 5);
        mutex_lock(&mutex);
        mutex_unlock(&mutex);
      }
      static DECLARE_WORK(work, work_fn);
      static int __init test_init(void)
      {
        schedule_work(&work);
        schedule_timeout_uninterruptible(HZ / 10);
        mutex_lock(&mutex);
        cancel_work_sync(&work);
        mutex_unlock(&mutex);
        return -EINVAL;
      }
      module_init(test_init);
      MODULE_LICENSE("GPL");
      ----------

    The check this patch restores was added by commit 0976dfc1d0
    ("workqueue: Catch more locking problems with flush_work()").

    Then, lockdep's crossrelease feature was added by commit b09be676e0
    ("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
    this check was once removed by commit fd1a5b04df ("workqueue: Remove
    now redundant lock acquisitions wrt. workqueue flushes").

    But lockdep's crossrelease feature was removed by commit e966eaeeb6
    ("locking/lockdep: Remove the cross-release locking checks"). At this
    point, this check should have been restored.

    Then, commit d6e89786be ("workqueue: skip lockdep wq dependency in
    cancel_work_sync()") introduced a boolean flag in order to distinguish
    flush_work() and cancel_work_sync(), for checking "struct workqueue_struct"
    dependency when called from cancel_work_sync() was causing false positives.

    Then, commit 87915adc3f ("workqueue: re-add lockdep dependencies for
    flushing") tried to restore "struct work_struct" dependency check, but by
    error checked this boolean flag. Like an example shown above indicates,
    "struct work_struct" dependency needs to be checked for both flush_work()
    and cancel_work_sync().

    Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
    Reported-by: Hillf Danton <hdanton@sina.com>
    Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Fixes: 87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
    Cc: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:28 -04:00
Waiman Long 867850e9d0 workqueue: Change the comments of the synchronization about the idle_list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 2c1f1a9180bfacbc3c8e5b10075640cc810cf9c0
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:38 +0800

    workqueue: Change the comments of the synchronization about the idle_list

    The access to idle_list in wq_worker_sleeping() is changed to be
    protected by pool->lock, so the comments above idle_list can be changed
    to "L:" which is the meaning of "access with pool->lock held".

    And the outdated comment in wq_worker_sleeping() is removed since
    the function is no longer called with the rq lock held; idle_list is
    now dereferenced with the pool lock held.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:28 -04:00
Waiman Long 94ebfcf09b workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 21b195c05cf6a6cc49777d6992772bcf01502186
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:37 +0800

    workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()

    In wq_worker_sleeping(), the access to worklist is protected by the
    pool->lock, so the memory barrier is unneeded.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:27 -04:00
Waiman Long f710816729 workqueue: Remove the cacheline_aligned for nr_running
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 84f91c62d675480ffd3d870ee44c07965cbd8b21
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:42 +0800

    workqueue: Remove the cacheline_aligned for nr_running

    nr_running is never modified remotely now that the schedule callback in
    the wakeup path has been removed.

    Rather, nr_running is often accessed together with other fields in the
    pool, so the cacheline_aligned for nr_running isn't needed.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:27 -04:00
Waiman Long 0df1d79e38 workqueue: Move the code of waking a worker up in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
Conflicts: A merge conflict requiring manual merge due to the presence
	   of a later upstream commit 46a4d679ef88 ("workqueue: Avoid
	   a false warning in unbind_workers()").

commit 989442d73757868118a73b92732b549a73c9ce35
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:41 +0800

    workqueue: Move the code of waking a worker up in unbind_workers()

    In unbind_workers(), there are two pool->lock held sections separated
    by the code that zaps nr_running.  wake_up_worker() needs to be in a
    pool->lock held section and after zapping nr_running.  And zapping
    nr_running had to happen after schedule() when the local wake up
    functionality was in use.  Now that the call to schedule() has been
    removed along with the local wake up functionality, the code can be
    merged into the same pool->lock held section.

    The diffstat shows other code as moved down only because the diff
    tools cannot express merging lock sections by swapping two code
    blocks.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Waiman Long 41d61eff9a workqueue: Remove the outdated comment before wq_worker_sleeping()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit ccf45156fd167a234baf038c11c1f367c7ccabd4
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:37 +0800

    workqueue: Remove the outdated comment before wq_worker_sleeping()

    It isn't called with preempt disabled now.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Waiman Long b12ee57248 workqueue: Fix unbind_workers() VS wq_worker_sleeping() race
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 45c753f5f24d2d4717acb38ce35e604ff9abcb50
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 1 Dec 2021 16:19:45 +0100

    workqueue: Fix unbind_workers() VS wq_worker_sleeping() race

    At CPU-hotplug time, unbind_workers() may preempt a worker while it is
    going to sleep. In that case the following scenario can happen:

        unbind_workers()                     wq_worker_sleeping()
        --------------                      -------------------
                                          if (worker->flags & WORKER_NOT_RUNNING)
                                              return;
                                          //PREEMPTED by unbind_workers
        worker->flags |= WORKER_UNBOUND;
        [...]
        atomic_set(&pool->nr_running, 0);
        //resume to worker
                                           atomic_dec_and_test(&pool->nr_running);

    After unbind_workers() resets pool->nr_running, the value is expected to
    remain 0 until the pool ever gets rebound in case cpu_up() is called on
    the target CPU in the future. But here the race leaves pool->nr_running
    with a value of -1, triggering the following warning when the worker goes
    idle:

            WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
            Modules linked in:
            CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
            Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
            Workqueue:  0x0 (rcu_par_gp)
            RIP: 0010:worker_enter_idle+0x95/0xc0
            Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93  0
            RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
            RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
            RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
            RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
            R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
            R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
            FS:  0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            Call Trace:
              <TASK>
              worker_thread+0x89/0x3d0
              ? process_one_work+0x400/0x400
              kthread+0x162/0x190
              ? set_kthread_struct+0x40/0x40
              ret_from_fork+0x22/0x30
              </TASK>

    Also, due to this incorrect "nr_running == -1", all sorts of hazards can
    happen, starting with queued works being ignored because no workers are
    woken at insert_work() time.

    Fix this with checking again the worker flags while pool->lock is locked.

    Fixes: b945efcdd07d ("sched: Remove pointless preemption disable in sched_submit_work()")
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Phil Auld 1b770a00ec workqueue: Avoid a false warning in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/2115520

commit 46a4d679ef88285ea17c3e1e4fed330be2044f21
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Fri Jul 29 17:44:38 2022 +0800

    workqueue: Avoid a false warning in unbind_workers()

    Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can possibly fail
    and trigger a false warning.

    Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.

    It is very easy to trigger the warning:
      Set wq_unbound_cpumask to a small set of CPUs.
      Offline all the CPUs of wq_unbound_cpumask.
      Offline an extra CPU and trigger the warning.

    Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:41 -04:00
Phil Auld 51a20c6ae4 workqueue: Wrap flush_workqueue() using a macro
Bugzilla: https://bugzilla.redhat.com/2115520

commit c4f135d643823a869becfa87539f7820ef9d5bfa
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Wed Jun 1 16:32:47 2022 +0900

    workqueue: Wrap flush_workqueue() using a macro

    Since flush operation synchronously waits for completion, flushing
    system-wide WQs (e.g. system_wq) might introduce possibility of deadlock
    due to unexpected locking dependency. Tejun Heo commented at [1] that it
    makes no sense at all to call flush_workqueue() on the shared WQs as the
    caller has no idea what it's gonna end up waiting for.

    Although there is flush_scheduled_work() which flushes system_wq WQ with
    "Think twice before calling this function! It's very easy to get into
    trouble if you don't take great care." warning message, syzbot found a
    circular locking dependency caused by flushing system_wq WQ [2].

    Therefore, let's change the direction: developers had better use their
    own local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq)
    would otherwise be inevitable.

    Steps for converting system-wide WQs into local WQs are explained at [3],
    and a conversion to stop flushing system-wide WQs is in progress. Now we
    want some mechanism for preventing developers who are not aware of this
    conversion from starting to flush system-wide WQs again.

    Since I found that WARN_ON() is a complete but awkward approach for
    teaching developers about this problem, let's use __compiletime_warning()
    as an incomplete but handy approach. For completeness, we will also insert
    WARN_ON() into __flush_workqueue() after all in-tree users stopped calling
    flush_scheduled_work().

    Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
    Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
    Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
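
A simplified sketch of the technique (the mainline macro checks every
system-wide workqueue, not just system_wq, and the real definitions live in
include/linux/workqueue.h): a dummy declaration carrying
__compiletime_warning() is referenced from the wrapper macro when a constant
"bad" argument is detected, so misuse is flagged at build time without failing
the build.

  ----------
  extern void __warn_flushing_systemwide_wq(void)
          __compiletime_warning("Please avoid flushing system-wide workqueues.");

  #define flush_workqueue(wq)                                                \
  ({                                                                         \
          struct workqueue_struct *_wq = (wq);                               \
                                                                             \
          if (__builtin_constant_p(_wq == system_wq) && _wq == system_wq)    \
                  __warn_flushing_systemwide_wq();                           \
          __flush_workqueue(_wq);                                            \
  })
  ----------
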
Phil Auld 5eeb631add workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs
Bugzilla: https://bugzilla.redhat.com/2115520

commit 10a5a651e3afc9b0b381f47e8930972e4e918397
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Thu Mar 31 13:57:17 2022 +0800

    workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs

    When a CPU is going offline, all workers on the CPU's pool will have their
    cpus_allowed cleared to cpu_possible_mask and can run on any CPUs including
    the isolated ones. Instead, set cpus_allowed to wq_unbound_cpumask so that
    they can avoid isolated CPUs.

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00