JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 15930da42f8981dc42c19038042947b475b19f47
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 30 Jan 2024 18:55:55 -1000
workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:
Unable to handle kernel paging request at virtual address ffff0002100296e0
..
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
...
Call trace:
wq_update_node_max_active+0x50/0x1fc
apply_wqattrs_commit+0xf0/0x114
apply_workqueue_attrs_locked+0x58/0xa0
alloc_workqueue+0x5ac/0x774
workqueue_init_early+0x460/0x540
start_kernel+0x258/0x684
__primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Waiman Long <longman@redhat.com>
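For reference, a minimal sketch of the guard this fix adds, assuming an early
bail-out in wq_update_node_max_active(); the mask name "effective" stands in
for whatever per-workqueue cpumask the function actually consults:

  /*
   * Sketch only: never feed -1 into cpumask_test_cpu(); treat it as
   * "no CPU is going down" and leave @off_cpu alone in that case.
   */
  if (off_cpu >= 0 && !cpumask_test_cpu(off_cpu, effective))
          off_cpu = -1;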
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5797b1c18919cd9c289ded7954383e499f729ce0
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:25 -1000
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU
for per-cpu workqueues and one per NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound workqueues by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations, which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node (see the sketch below). e.g. A
node with twice as many online effective CPUs gets twice the portion of
max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When one of its pwqs wants to run a work item, it has to
obtain the matching node's nr_active count. If it is over the node's
max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work
items finish, the completion path round-robins the pending pwqs, activating
the first inactive work item of each, which involves some pool lock dancing
and kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
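To illustrate the distribution rule above, a hedged sketch of splitting
max_active per node in proportion to the workqueue's online effective CPUs,
clamped by min_active; the helper name and the exact rounding are assumptions
based on this description, not the verbatim implementation:

  static void wq_node_max_active_sketch(struct workqueue_struct *wq, int node,
                                        int node_cpus, int total_cpus)
  {
          struct wq_node_nr_active *nna = wq->node_nr_active[node];
          int max = READ_ONCE(wq->max_active);
          int min = READ_ONCE(wq->min_active);

          /* proportional share of max_active, but never below min_active */
          nna->max = clamp(DIV_ROUND_UP(max * node_cpus, total_cpus), min, max);
  }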
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 91ccc6e7233bb10a9c176aa4cc70d6f432a441a5
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create an undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structure allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have an
associated NUMA node, such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
- Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
pwq->max_active check. Restored. As the next patch replaces the
max_active enforcement mechanism, this doesn't change the end result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
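A hedged sketch of the data structure at this point in the series, based
solely on the description above (later patches extend it with a per-node max
and a list of pending pwqs):

  /* one instance per (workqueue, NUMA node) pair, plus one for the invalid node */
  struct wq_node_nr_active {
          atomic_t        nr;     /* number of active work items on the node */
  };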
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit dd6c3c5441263723305a9c52c5ccc899a4653000
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need to kick off an inactive work item in another pool.
This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().
This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9f66cff212bb3c1cd25996aaa0dfd0c9e9d8baab
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq,
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.
- Make wq->dfl_pwq RCU protected.
- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
and ->cpu_pwq. The former returns the double pointer that can be used to
access and update the pwqs. The latter performs a locking check and
dereferences the double pointer.
- pwq accesses and updates are converted to use unbound_pwq[_slot]().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
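A hedged sketch of the two accessors described above; the convention of
passing -1 to select the default pwq slot and the exact lockdep condition are
assumptions:

  static struct pool_workqueue __rcu **
  unbound_pwq_slot(struct workqueue_struct *wq, int cpu)
  {
          if (cpu >= 0)
                  return per_cpu_ptr(wq->cpu_pwq, cpu);
          else
                  return &wq->dfl_pwq;
  }

  static struct pool_workqueue *unbound_pwq(struct workqueue_struct *wq, int cpu)
  {
          return rcu_dereference_check(*unbound_pwq_slot(wq, cpu),
                                       lockdep_is_held(&wq->mutex));
  }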
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c5404d4e6df6faba1007544b5f4e62c7c14416dd
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once, activating all
work items that could be activated. While this makes sense with per-pwq
nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 1c270b79ce0b8290f146255ea9057243f6dd3c17
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added (see the sketch below). It increments
nr_active if under the max_active limit and returns a boolean indicating
whether the increment was successful. Note that the function is structured
to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
now we're calling the function unconditionally from
pwq_activate_first_inactive().
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
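A hedged sketch of pwq_tryinc_nr_active() before the shared unbound nr_active
handling lands, following the description above:

  static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
  {
          struct workqueue_struct *wq = pwq->wq;

          lockdep_assert_held(&pwq->pool->lock);

          /* only take an nr_active slot if we're still under the limit */
          if (pwq->nr_active >= READ_ONCE(wq->max_active))
                  return false;

          pwq->nr_active++;
          return true;
  }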
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4c6380305d21e36581b451f7337a36c93b64e050
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
test. This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit afa87ce85379e2d93863fce595afdb5771a84004
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
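Given the quoted condition, the factored-out helper presumably looks like the
following sketch:

  static bool pwq_is_empty(struct pool_workqueue *pwq)
  {
          return !pwq->nr_active && list_empty(&pwq->inactive_works);
  }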
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a045a272d887575da17ad86d6573e82871b50c27
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes the max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patch moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
As wq->max_active is updated while holding wq->mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking rule, WO, is
added for it.
v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
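A hedged sketch of the WO access pattern described above; the function names
are illustrative:

  static void wq_set_max_active_sketch(struct workqueue_struct *wq, int max)
  {
          mutex_lock(&wq->mutex);
          WRITE_ONCE(wq->max_active, max);        /* written under wq->mutex */
          mutex_unlock(&wq->mutex);
  }

  static bool pwq_under_max_active_sketch(struct pool_workqueue *pwq)
  {
          /* hot-path reader, no wq->mutex held */
          return pwq->nr_active < READ_ONCE(pwq->wq->max_active);
  }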
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit e563d0a7cdc1890ff36bb177b5c8c2854d881e4d
Author: Tejun Heo <tj@kernel.org>
Date: Fri, 26 Jan 2024 11:55:50 -1000
workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 6a229b0e2ff6143b65ba4ef42bd71e29ffc2c16d
Author: Tejun Heo <tj@kernel.org>
Date: Fri, 26 Jan 2024 11:55:46 -1000
workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 1a65a6d17cbc58e1aeffb2be962acce49efbef9c
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Wed, 10 Jan 2024 11:27:24 +0800
workqueue: Add rcu lock check at the end of work item execution
Currently the workqueue just checks the atomic and locking states after work
execution ends. However, sometimes a work item may fail to release the RCU
read lock after calling rcu_read_lock(). As a result, it causes an RCU
stall, but the RCU stall warning cannot dump the work function because the
work has already finished.
In order to quickly discover those work items that do not call
rcu_read_unlock() after rcu_read_lock(), add an RCU lock check.
Use rcu_preempt_depth() to check the work's RCU status. Normally, this value
is 0. If this value is bigger than 0, it means the work is still holding the
RCU read lock. If so, print error info and the work function.
tj: Reworded the description for clarity. Minor formatting tweak.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
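A hedged sketch of the added check at the end of work item execution; the
message text and surrounding context are illustrative:

  static void check_leaked_rcu_lock_sketch(struct worker *worker)
  {
          /* depth > 0 means the work returned with the RCU read lock held */
          if (unlikely(rcu_preempt_depth() > 0))
                  pr_err("BUG: workqueue leaked rcu_read_lock(): last function: %ps\n",
                         worker->current_func);
  }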
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 85f0ab43f9de62a4b9c1b503b07f1c33e5a6d2ab
Author: Juri Lelli <juri.lelli@redhat.com>
Date: Tue, 16 Jan 2024 17:19:27 +0100
kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND
At the time they are created, rescuers of unbound workqueues currently use
cpu_possible_mask as their affinity, but this can be too wide in case a
workqueue's unbound cpumask has been set as a subset of cpu_possible_mask.
Make new rescuers use their associated workqueue's unbound cpumask from
the start.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
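A hedged sketch of the change in init_rescuer(); whether the mask comes from
wq->unbound_attrs->cpumask or another helper is an assumption:

  /* bind the new rescuer to the workqueue's unbound cpumask when applicable */
  if (wq->flags & WQ_UNBOUND)
          kthread_bind_mask(rescuer->task, wq->unbound_attrs->cpumask);
  else
          kthread_bind_mask(rescuer->task, cpu_possible_mask);
  wake_up_process(rescuer->task);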
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit aac8a59537dfc704ff344f1aacfd143c089ee20f
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 5 Feb 2024 15:43:41 -1000
Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
This reverts commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7.
The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now-removed implicitly ordered workqueues. This was incorrect in that a
system-wide config change shouldn't break the ordering properties of all
workqueues. The reason the apply_workqueue_attrs() path was allowed to do so
is that it was targeting a specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.
The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke the
build. Let's revert the commit, which was incorrect in the first place; this
also fixes the build.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9ad ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Sun, 24 Sep 2023 17:07:02 +0200
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result, the workqueue-related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B, such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
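A hedged sketch of the per-callsite key pattern; the work_on_cpu_key() helper
name is an assumption based on the description above:

  /* each expansion gets its own static lock_class_key, i.e. one per call site */
  #define work_on_cpu(cpu, fn, arg)                               \
  ({                                                              \
          static struct lock_class_key __key;                     \
          work_on_cpu_key((cpu), (fn), (arg), &__key);            \
  })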
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5d9c7a1e3e8e18db8e10c546de648cda2a57be52
Author: Lucy Mielke <lucymielke@icloud.com>
Date: Mon, 9 Oct 2023 19:09:46 +0200
workqueue: fix -Wformat-truncation in create_worker
Compiling with W=1 emitted the following warning
(Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
"Treat warnings as errors" turned off):
kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
truncated writing between 1 and 10 bytes into a region of size
between 5 and 14 [-Wformat-truncation=]
kernel/workqueue.c:2188:50: note: directive argument in the range
[0, 2147483647]
kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
into a destination of size 16
setting "id_buf" to size 23 will silence the warning, since GCC
determines snprintf's output to be max. 23 bytes in line 2188.
Please let me know if there are any mistakes in my patch!
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 7b42f401fc6571b6604441789d892d440829e33c
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Wed, 11 Oct 2023 16:27:59 +0800
workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
Currently, kfree() is used to free pwq objects allocated with
kmem_cache_alloc() in alloc_and_link_pwqs(). This isn't wrong, but
trace_kmem_cache_alloc/trace_kmem_cache_free are usually used to track
such memory allocations and frees. This commit therefore uses
kmem_cache_free() instead of kfree() in alloc_and_link_pwqs(), which is also
consistent with how the pwq is released in rcu_free_pwq().
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit dd64c873ed11cdae340be06dcd2364870fd3e4fc
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Mon, 11 Sep 2023 16:27:22 +0800
workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()
Currently, if wq_cpu_intensive_thresh_us is set to a specific value,
wq_cpu_intensive_thresh_init() exits early and the creation of
pwq_release_worker is missed. This commit therefore creates the
pwq_release_worker in advance, before checking wq_cpu_intensive_thresh_us.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues")
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a6828214480e2f00a8a7e64c7a55fc42b0f54e1c
Author: Steven Rostedt (Google) <rostedt@goodmis.org>
Date: Tue, 5 Sep 2023 17:49:35 -0400
workqueue: Removed double allocation of wq_update_pod_attrs_buf
First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in
the boot") added the initialization of wq_update_pod_attrs_buf to
workqueue_init_early(), and then latter on, commit 84193c07105c6
("workqueue: Generalize unbound CPU pods") added it as well. This appeared
in a kmemleak run where the second allocation made the first allocation
leak.
Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit b6a46f7263bd8ba0e545d79bd034c412f32b5875
Author: Aaron Tomlin <atomlin@atomlin.com>
Date: Tue, 8 Aug 2023 13:03:29 +0100
workqueue: Rename rescuer kworker
Each CPU-specific and unbound kworker kthread conforms to a particular
naming scheme. However, this does not extend to the rescuer kworker.
At present, a rescuer kworker is simply named according to its
workqueue's name. This can be cryptic.
This patch modifies a rescuer to follow the kworker naming scheme.
The "R" is indicative of a rescuer and after "-" is its workqueue's
name e.g. "kworker/R-ext4-rsv-conver".
tj: Use "R" instead of "r" as the prefix to make it more distinctive and
consistent with how highpri pools are marked.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 523a301e66afd1ea9856660bcf3cee3a7c84c6dd
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Make default affinity_scope dynamically updatable
While workqueue.default_affinity_scope is writable, it only affects
workqueues which are created afterwards and isn't very useful. Instead,
let's introduce explicit "default" scope and update the effective scope
dynamically when workqueue.default_affinity_scope is changed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 8639ecebc9b1796d7074751a350462f5e1c61cd4
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
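A hedged sketch of the wake-up nudge inside kick_pool() for non-strict scopes,
based on the description above; the helper used to pick a CPU within the pod
is an assumption:

  /* inside kick_pool(), once an idle worker has been picked (sketch) */
  struct task_struct *p = worker->task;

  if (!pool->attrs->affn_strict &&
      !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask))
          p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
  wake_up_process(p);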
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9546b29e4a6ad6ed7924dd7980975c8e675740a3
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish between the strict affinity
that they have to follow (because that's the restriction coming from the
user) and the soft affinity that they want to apply when dispatching work
items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumasks
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
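A hedged sketch of pool_allowed_cpus() at this point in the series, where all
boundaries are still strict (the following non-strict affinity patch makes
the choice conditional on affn_strict):

  static const struct cpumask *pool_allowed_cpus(struct worker_pool *pool)
  {
          return pool->attrs->__pod_cpumask;
  }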
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 0219a3528d72143d8d2c4c793b61541d03518b59
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Factor out need_more_worker() check and worker wake-up
Checking need_more_worker() and calling wake_up_worker() is a repeated
pattern. Let's add kick_pool(), which checks need_more_worker() and
open-codes wake_up_worker(), and replace wake_up_worker() uses. The following
conversions aren't one-to-one:
* __queue_work() was using __need_more_work() because it knows that
pool->worklist isn't empty. Switching to kick_pool() adds an extra
list_empty() test.
* create_worker() always needs to wake up the newly minted worker whether
there's more work to do or not to avoid triggering hung task check on the
new task. Keep the current wake_up_process() and still add kick_pool().
This may lead to an extra wakeup which isn't harmful.
* pwq_adjust_max_active() was explicitly checking whether it needs to wake
up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
a worker when necessary, this explicit check is no longer necessary and
dropped.
* unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
a need_more_worker() test. This avoids spurious wakeups and shouldn't
break anything.
wake_up_worker() is dropped as kick_pool() replaces all its users. After
this patch, all paths that wakes up a non-rescuer worker to initiate work
item execution use kick_pool(). This will enable future changes to improve
locality.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
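A hedged sketch of kick_pool() as described above; first_idle_worker() is the
pre-existing helper for picking a worker to wake:

  static bool kick_pool(struct worker_pool *pool)
  {
          struct worker *worker = first_idle_worker(pool);

          lockdep_assert_held(&pool->lock);

          if (!need_more_worker(pool) || !worker)
                  return false;

          wake_up_process(worker->task);
          return true;
  }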
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 873eaca6eaf84b1d1ed5b7259308c6a4fca70fdc
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Factor out work to worker assignment and collision handling
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_scheduled_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.
This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to execute the work once assigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.
This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collision
handling is handled earlier, process_one_work() no longer needs to worry
about them.
After this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.
This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
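A hedged sketch of assign_work() following the description above; details
such as the @nextp argument mirror move_linked_works() usage and are
assumptions:

  static bool assign_work(struct work_struct *work, struct worker *worker,
                          struct work_struct **nextp)
  {
          struct worker_pool *pool = worker->pool;
          struct worker *collision;

          lockdep_assert_held(&pool->lock);

          /* already running elsewhere? chain it behind that worker instead */
          collision = find_worker_executing_work(pool, work);
          if (unlikely(collision)) {
                  move_linked_works(work, &collision->scheduled, nextp);
                  return false;
          }

          move_linked_works(work, &worker->scheduled, nextp);
          return true;
  }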
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 63c5484e74952f60f5810256bd69814d167b8d22
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Add multiple affinity scopes and interface to select them
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.
Also add module parameter "workqueue.default_affinity_scope" to override the
default scope and "affinity_scope" sysfs file to configure it per workqueue.
wq_dump.py and documentations are updated accordingly.
This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.
Many modern machines often have multiple L3 caches while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unnecessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.
While dependent on the specifics of workload, there's usually a noticeable
penalty in crossing L3 boundaries, so let's default to CACHE. This issue
will be further addressed and documented with examples in future patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 025e16845877e80cb169274b330c236056ba553c
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Modularize wq_pod_type initialization
While wq_pod_type[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type() which takes a callback to
determine whether two CPUs should share a pod as an argument.
init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa() which tests whether two CPUs belong to the same NUMA node.
This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
is tracked separately in pod_type->pod_node[]. This makes adding new
affinity types pretty easy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
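A hedged sketch of the callback-driven init described above; the call in the
comment assumes the init_pod_type() signature implied by the text:

  /* two CPUs share a pod iff they're on the same NUMA node */
  static bool cpus_share_numa(int cpu0, int cpu1)
  {
          return cpu_to_node(cpu0) == cpu_to_node(cpu1);
  }

  /* e.g. init_pod_type(&wq_pod_type[WQ_AFFN_NUMA], cpus_share_numa); */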
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in the workqueue_apply_unbound_cpumask()
hunk due to the presence of a later upstream commit
ca10d851b9ad ("workqueue: Override implicit ordered attribute
in workqueue_apply_unbound_cpumask()").
commit 84193c07105c62d206fb230b2f29002226628989
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Generalize unbound CPU pods
While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:
* workqueue_attrs->affn_scope is added. Each enum represents the type of
boundaries that define the pods. There are currently two scopes -
WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
- one pod per NUMA node. The latter defines one global pod across the
whole system.
* struct wq_pod_type is added which describes how pods are configured for
each affinity scope. For each pod, it lists the member CPUs and the
preferred NUMA node for memory allocations. The reverse mapping from CPU
to pod is also available.
* wq_pod_enabled is dropped. Pod is now always enabled. The previously
disabled behavior is now implemented through WQ_AFFN_SYSTEM.
* get_unbound_pool() wants to determine the NUMA node to allocate memory
from for the new pool. The variables are renamed from node to pod but the
logic still assumes they're one and the same. Clearly distinguish them -
walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
NUMA node.
* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
node. Take @cpu instead and determine the cpumask to use from the pod_type
matching @attrs.
* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
NULL so that it can indicate -EINVAL on invalid affinity scopes.
This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.
v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
WQ_AFFN_NR_TYPES which indicates that the function is called with a
worker_pool's attrs instead of a workqueue's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
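A hedged sketch of the structures described above; the field names follow the
description and may not match the code exactly:

  enum wq_affn_scope {
          WQ_AFFN_NUMA,           /* one pod per NUMA node */
          WQ_AFFN_SYSTEM,         /* one pod across the whole system */
          WQ_AFFN_NR_TYPES,
  };

  struct wq_pod_type {
          int             nr_pods;        /* number of pods */
          cpumask_var_t   *pod_cpus;      /* pod -> member CPUs */
          int             *pod_node;      /* pod -> preferred NUMA node */
          int             *cpu_pod;       /* CPU -> pod (reverse mapping) */
  };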
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5de7a03cac14765ba22934b6fb1476456ee36bf8
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Factor out clearing of workqueue-only attrs fields
workqueue_attrs can be used for both workqueues and worker_pools. However,
some fields, currently only ->ordered, only apply to workqueues and should
be cleared to the default / invalid values.
Currently, an unbound workqueue explicitly clears attrs->ordered in
get_unbound_pool() after copying the source workqueue attrs, while per-cpu
workqueues rely on the fact that zeroing on allocation gives us the desired
default value for pool->attrs->ordered.
This is fragile. Let's add wqattrs_clear_for_pool() which clears
attrs->ordered and is called from both init_worker_pool() and
get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
In get_unbound_pool(), pool->node initialization is moved upwards for
readability. This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 0f36ee24cd43c67be07166ddd09866dc7a47cb4c
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
For an unbound pool, multiple cpumasks are involved.
U: The user-specified cpumask (may be filtered with cpu_possible_mask).
A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
leaves no CPU, wq_unbound_cpumask is used.
P: Per-pod subsets of #A.
wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.
This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.
This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool->attrs.
This shouldn't cause any behavior changes in the current code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
Signed-off-by: Waiman Long <longman@redhat.com>
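A hedged sketch of the #U -> #A step described above: filter the user cpumask
by wq_unbound_cpumask, falling back to the latter if the intersection is
empty.

  static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
                                        const cpumask_t *unbound_cpumask)
  {
          cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask);
          if (unlikely(cpumask_empty(attrs->cpumask)))
                  cpumask_copy(attrs->cpumask, unbound_cpumask);
  }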
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts:
1) Minor context diff in init/main.c due to missing upstream commit
7ec7096b8577 ("mm/page_ext: init page_ext early if there are no
deferred struct pages") and commit de57807e6f26 ("init,mm: fold
late call to page_ext_init() to page_alloc_init_late()").
2) Minor context diff in kernel/workqueue.c due to the presence of
a later upstream commit 4a6c5607d450 ("workqueue: Make sure that
wq_unbound_cpumask is never empty").
commit 2930155b2e27232c033970f2e110aaac4187cb9e
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Initialize unbound CPU pods later in the boot
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.
Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().
As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.
This patch changes the initialization sequence but the end result should be
the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to the presence of a later upstream commit
4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
is never empty").
commit a86feae6195ac2148097b063f7fdad8ee1f6dad4
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Move wq_pod_init() below workqueue_init()
wq_pod_init() is called from workqueue_init() and responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().
The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.
Just a relocation. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: context diffs in two of the hunks due to the presence of
later upstream commit 4a6c5607d450 ("workqueue: Make sure that
wq_unbound_cpumask is never empty") and commit 31c89007285d
("workqueue.c: Increase workqueue name length").
commit fef59c9cab6ac5592da54f6c2b1195418f14e4d0
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename NUMA related names to use pod instead
Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.
While "pod" isn't a very common term, it short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.
* wq_numa_possible_cpumask -> wq_pod_cpus
* wq_numa_enabled -> wq_pod_enabled
* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
* workqueue_select_cpu_near -> select_numa_node_cpu
This rename is different from others. The function is only used by
queue_work_node() and specifically tries to find a CPU in the specified
NUMA node. As workqueue affinity will become more flexible and untied from
NUMA, this function's name should specifically describe that it's for
NUMA.
* wq_calc_node_cpumask -> wq_calc_pod_cpumask
* wq_update_unbound_numa -> wq_update_pod
* wq_numa_init -> wq_pod_init
* node -> pod in local variables
Only renames. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit af73f5c9febe5095ee492ae43e9898fca65ced70
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename workqueue_attrs->no_numa to ->ordered
With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.
Just a rename. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 636b927eba5bc633753f8eb80f35e1d5be806e51
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
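As an aside, the conversion that the audit targets is mechanical. A minimal
sketch, with a made-up workqueue name (not taken from this patch):
        /* Before: ordering relied on max_active==1 on an unbound workqueue. */
        wq = alloc_workqueue("example_wq", WQ_UNBOUND, 1);
        /* After: explicitly ordered, independent of how max_active is enforced. */
        wq = alloc_ordered_workqueue("example_wq", 0);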
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4cbfd3de737b9d00544ff0f673cb75fc37bffb6a
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
When a CPU went online or offline, wq_update_unbound_numa() was called only
on the CPU which was going up or down. This works fine because all CPUs on
the same NUMA node share the same pool_workqueue slot - one CPU updating it
updates it for everyone in the node.
However, future changes will make each CPU use a separate pool_workqueue
even when they're sharing the same worker_pool, which requires updating
pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
on hotplug.
To accommodate the planned changes, this patch updates
workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
all CPUs sharing the same NUMA node as the CPU going up or down. In the
current code, the second+ calls would be noops and there shouldn't be any
behavior changes.
* As wq_update_unbound_numa() is now called on multiple CPUs per each
hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
is added. The former indicates the CPU being hot[un]plugged and the latter
the CPU whose pool_workqueue is being updated.
* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
with the new @hotplug_cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 687a9aa56f811b381e63f7f8f9149428ac708a3b
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.
Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.
In preparation, this patch makes per-cpu pwq's allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.
pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu work items.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 967b494e2fd143a9c1a3201422aceadb5fa9fbfc
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Use a kthread_worker to release pool_workqueues
pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus risks causing an A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.
While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too making the current approach
unusable.
Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.
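For illustration, a dedicated kthread_worker is created and used roughly as
below; this is a hedged sketch with made-up names, not the patch itself:
        static struct kthread_worker *pwq_release_worker;
        static struct kthread_work release_work;

        static void release_fn(struct kthread_work *work)
        {
                /* free the pwq here, outside of any worker_pool lock */
        }

        /* boot-time setup; kthread_create_worker() returns ERR_PTR on failure */
        pwq_release_worker = kthread_create_worker(0, "pool_wq_release");
        kthread_init_work(&release_work, release_fn);

        /* later, from the release path */
        kthread_queue_work(pwq_release_worker, &release_work);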
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A merge conflict in the wq_pool_ids_show() removal hunk due
to the presence of a later upstream commit 49277a5b7637
("workqueue: Move workqueue_set_unbound_cpumask() and its
helpers inside CONFIG_SYSFS").
commit fcecfa8f271acdf130acbb30842e7848a138af0f
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 797e8345cbb0d2913300ee9838eb74cce19485cf
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Relocate worker and work management functions
Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit ee1ceef72754427e5167743108c52f826fa4ca5b
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq
wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit fe089f87cccb066e8ad20f49ddf05e95adc1fa8d
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Not all work insertion needs to wake up a worker
insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or a flush work item is queued,
there's no reason to try to wake up a worker.
This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().
While at it:
* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
pool.
* Every caller of insert_work() calls debug_work_activate(). Consolidate the
invocations into insert_work().
* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
to better accommodate future changes.
This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.
v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c0ab017d43f4c4147f7ecf3ca3cb872a416e17c7
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Cleanups around process_scheduled_works()
* Drop the trivial optimization in worker_thread() where it bypasses calling
process_scheduled_works() if the first work item isn't linked. This is a
mostly pointless micro optimization and gets in the way of improving the
work processing path.
* Consolidate pool->watchdog_ts updates in the two callers into
process_scheduled_works().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bc8b50c2dfac946c1beed782c1823e52cf55a352
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Drop the special locking rule for worker->flags and worker_pool->flags
worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.
Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().
This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assigning work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.
The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9680540c0c56a1f75a2d6aab31bf38aa429aa9d9
Author: Yang Yingliang <yangyingliang@huawei.com>
Date: Fri, 4 Aug 2023 11:22:15 +0800
workqueue: use LIST_HEAD to initialize cull_list
Use LIST_HEAD() to initialize cull_list instead of open-coding it.
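The two forms are equivalent; an illustrative snippet, not the patch hunk:
        /* open-coded initialization */
        struct list_head cull_list;
        INIT_LIST_HEAD(&cull_list);

        /* the same thing with the LIST_HEAD() helper */
        LIST_HEAD(cull_list);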
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 20bdedafd2f63e0ba70991127f9b5c0826ebdb32
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 30 Jun 2023 21:28:53 +0900
workqueue: Warn attempt to flush system-wide workqueues.
Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using
a macro"), all in-tree users stopped flushing system-wide workqueues.
Therefore, start emitting a runtime message so that all out-of-tree users
will understand that they need to update their code.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit aa6fde93f3a49e42c0fe0490d7f3711bac0d162e
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 17 Jul 2023 12:50:02 -1000
workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.
The default threshold is 10ms which is long enough to do a fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguably spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
multiple orders of magnitude apart doesn't make a whole lot of sense.
This patch scales up wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are a sufficient number of
fast machines reporting, we don't lose much.
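The scaling itself is simple. A rough sketch of the idea, assuming a non-zero
bogomips value from boot-time calibration (not the exact upstream code):
        unsigned long thresh = 10 * USEC_PER_MSEC;      /* 10ms default */

        if (bogomips < 4000)
                /* scale up inversely with speed, capped at one second */
                thresh = min(thresh * 4000 / bogomips,
                             (unsigned long)USEC_PER_SEC);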
Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 18c8ae813156a6855f026de80fffb91e1a28ab3d
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Thu, 25 May 2023 12:00:38 +0800
workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
for CPU-hogging per-cpu work items will keep triggering spuriously:
workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
This commit therefore disables the CPU-hog detection mechanism when
workqueue.cpu_intensive_thresh_us is set to 0.
tj: Patch description updated and the condition check on
cpu_intensive_thresh_us separated into a separate if statement for
readability.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:09 -1000
workqueue: Track and monitor per-workqueue CPU time usage
Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
v2: wq_monitor.py was using "cputime" as the key when outputting in json
format. Use "cpu_time" instead for consistency with other fields.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 6363845005202148b8409ec3082e80845c19d309
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the threshold, it likely needs to be switched to use an
unbound workqueue.
This patch adds a debug mechanism which tracks the work functions which
trigger the automatic CPU_INTENSIVE mechanism and report them using
pr_warn() with exponential backoff.
v3: Documentation update.
v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
introducing circular locking dependency through printk but not effective
as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
just print directly using printk_deferred().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.
This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick()
which is called from scheduler_tick() will detect the condition and
automatically mark it CPU_INTENSIVE.
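Conceptually, the check compares the CPU time the current work item has
consumed without sleeping against the threshold. A simplified sketch with
illustrative names, not the actual wq_worker_tick() code:
        struct tick_sample {
                u64 work_start_ns;      /* CPU clock when the work item started */
                u64 now_ns;             /* CPU clock at this tick */
                bool slept;             /* did the work item sleep in between? */
        };

        static bool looks_like_cpu_hog(const struct tick_sample *s, u64 thresh_us)
        {
                if (s->slept)
                        return false;
                return s->now_ns - s->work_start_ns > thresh_us * NSEC_PER_USEC;
        }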
The mechanism isn't foolproof:
* Detection depends on tick hitting the work item. Getting preempted at the
right timings may allow a violating work item to evade detection at least
temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if
many CPU-hogging work items are queued at the same time.
However, in the vast majority of cases, this should be able to detect violations
reliably and provide reasonable protection with a small increase in code
complexity.
If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
v4: Documentation for workqueue.cpu_intensive_thresh_us added to
kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from
preemption and rtlock paths and an earlier patch was updated
accordingly. This patch adds a comment describing the risk of infinite
recursions and how they're avoided.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bdf8b9bfc131864f0fcef268b34123acfb6a1b59
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Improve locking rule description for worker fields
* Some worker fields are modified only by the worker itself while holding
pool->lock thus making them safe to read from self, IRQ context if the CPU
is running the worker or while holding pool->lock. Add 'K' locking rule
for them.
* worker->sleeping is currently marked "None" which isn't very descriptive.
It's used only by the worker itself. Add 'S' locking rule for it.
A future patch will depend on the 'K' rule to access worker->current_* from
the scheduler ticks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c54d5046a06b90adb3d1188f0741a88692854354
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Move worker_set/clr_flags() upwards
They are going to be used in wq_worker_stopping(). Move them upwards.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Add pwq->stats[] and a monitoring script
Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.
It's arguable whether this needs to be configurable. However, it currently
only has several counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.
v2: Patch reordered to earlier with fewer fields. Field will be added back
gradually. Help message improved.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 854f5cc5b7355ceebf2bdfed97ea8f3c5d47a0c3
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Fri, 28 Apr 2023 16:47:07 -0700
Further upgrade queue_work_on() comment
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, and further says that the
penalty for failing to nail down the specified CPU is that the workqueue
handler might find itself executing on some other CPU. This is true
as far as it goes, but fails to note what happens if the specified CPU
never was online. Therefore, further expand this comment to say that
specifying a CPU that was never online will result in a splat.
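For callers, the safe pattern looks roughly like this; a usage sketch with
made-up workqueue and work item names, not taken from the patch:
        int cpu = 3;

        cpus_read_lock();               /* keep the CPU from going away */
        if (cpu_online(cpu))
                queue_work_on(cpu, example_wq, &example_work);
        else
                queue_work(example_wq, &example_work);
        cpus_read_unlock();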
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit afa4bb778e48d79e4a642ed41e3b4e0de7489a6c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 23 Jun 2023 12:08:14 -0700
workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect some C23-induced enum type handling fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's not how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
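An illustrative sketch of the direction of the cleanup, using example names
rather than the exact upstream helpers: give the mask an explicit unsigned
long type and do the cast in exactly one place:
        #define EXAMPLE_WQ_DATA_MASK    (~0UL << WORK_STRUCT_FLAG_BITS)

        static inline struct pool_workqueue *example_data_to_pwq(unsigned long data)
        {
                return (struct pool_workqueue *)(data & EXAMPLE_WQ_DATA_MASK);
        }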
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 704bc669e1dda3eb8f6d5cb462b21e85558a3912
Author: Jungseung Lee <js07.lee@samsung.com>
Date: Mon, 20 Mar 2023 12:29:05 +0900
workqueue: Introduce show_freezable_workqueues
Currently show_all_workqueues() is called when freezing the workqueues fails;
it shows the status of all workqueues and of all worker pools. In this case
we may only need to dump the state of the workqueues that are freezable
and busy.
This patch defines show_freezable_workqueues(), which uses
show_one_workqueue(), a granular function that shows the state of individual
workqueues, so that only the state of freezable workqueues is dumped
at that time.
tj: Minor message adjustment.
Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit cd2440d66fec7d1bdb4f605b64c27c63c9141989
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:35 +0100
workqueue: Print backtraces from CPUs with hung CPU bound workqueues
The workqueue watchdog reports a lockup when there has not been any progress
in the worker pool for a long time. Progress means that a pending
work item starts being processed.
Worker pools for unbound workqueues always wake up an idle worker and
try to process the work immediately. The last idle worker has to create
new worker first. The stall might happen only when a new worker could
not be created in which case an error should get printed. Another problem
might be too high load. In this case, workers are victims of a global
system problem.
Worker pools for CPU bound workqueues are designed for lightweight
work items that do not need much CPU time. They are processed one by
one by a single worker. A new worker is used only when a work item sleeps.
It creates one additional scenario. The stall might happen when
the CPU-bound workqueue is used for CPU-intensive work.
More precisely, the stall is detected when a CPU-bound worker is in
the TASK_RUNNING state for too long. In this case, it might be useful
to see the backtrace from the problematic worker.
The information about how long a worker has been in the running state is not
available. But the CPU-bound worker pools do not have many workers in the
running state by definition. And only a few pools are typically blocked.
It should be acceptable to print backtraces from all workers in
TASK_RUNNING state in the stalled worker pools. The number of false
positives should be very low.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4c0736a76a186e5df2cd2afda3e7a04d2a427d1b
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:34 +0100
workqueue: Warn when a rescuer could not be created
Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
It typically happens during the system boot.
systemd switches the root filesystem from initrd to the booted system
during boot. It kills processes that block the switch for too long.
One of the processes might be modprobe, which tries to create a workqueue.
These problems are hard to reproduce. Also alloc_workqueue() does not
pass the error code. Make the debugging easier by printing an error,
similar to create_worker().
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 60f540389a5d2df25ddc7ad511b4fa2880dea521
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:33 +0100
workqueue: Interrupted create_worker() is not a repeated event
kthread_create_on_node() might get interrupted. It is rare but realistic.
For example, when an unbound workqueue is allocated in module_init()
callback. It is done in the context of the "modprobe" process. And,
for example, systemd might kill pending processes when switching root
from initrd to the booted system.
The interrupt is a one-off event and the race might be hard to reproduce.
It is always worth printing.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 3f0ea0b864562c6bd1cee892026067eaea7be242
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:32 +0100
workqueue: Warn when a new worker could not be created
The workqueue watchdog reports a lockup when there has not been any progress
in the worker pool for a long time. Progress means that a pending
work item starts being processed.
The progress is guaranteed by using idle workers or creating new workers
for pending work items.
There are several reasons why a new worker could not be created:
+ there is not enough memory
+ there is no free pool ID (IDR API)
+ the system reached PID limit
+ the process creating the new worker was interrupted
+ the last idle worker (manager) has not been scheduled for a long
time. It was not able to even start creating the kthread.
None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager. It is the last
idle worker that is responsible for creating a new one. But it is not
clear if create_worker() is failing and why.
Make the debugging easier by printing errors in create_worker().
The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons. For example, reaching
memory limit (-ENOMEM), other system limits (-EAGAIN), or process
interrupted (-EINTR).
Use pr_once() to avoid repeating the same error every CREATE_COOLDOWN
for each stuck worker pool.
Ratelimited printk() might be better. It would help to know if the problem
remains. It would be more clear if the create_worker() errors and workqueue
stalls are related. Also old messages might get lost when the internal log
buffer is full. The problem is that printk() might touch the watchdog.
For example, see touch_nmi_watchdog() in serial8250_console_write().
It would require synchronization of the begin and length of the ratelimit
interval with the workqueue watchdog. Otherwise, the error messages
might break the watchdog. This does not look worth the complexity.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:31 +0100
workqueue: Fix hung time report of worker pools
The workqueue watchdog prints a warning when there is no progress in
a worker pool, where progress means that the pool has started processing
a pending work item.
Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.
show_one_worker_pool() prints the state of a non-idle worker pool. It shows
the delay since the last pool->watchdog_ts update.
The timestamp is updated when the first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().
The delay is misleading when there is no pending work item. In this
case it shows how long the last work item has been running. Show
zero instead. There is no stall if there is no pending work.
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: Context diff due to the presence of a later upstream commit
4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
is never empty").
commit a8ec5880bd82b834717770cba4596381ffd50545
Author: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Date: Sun, 26 Feb 2023 23:53:20 +0700
workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
Use pr_warn_once() to achieve the same thing. It's simpler.
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c76feb0d5dfdb90b70fa820bb3181142bb01e980
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Fri, 6 Jan 2023 16:10:24 -0800
workqueue: Make show_pwq() use run-length encoding
The show_pwq() function dumps out a pool_workqueue structure's activity,
including the pending work-queue handlers:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: 2*test_work_func, 5*test_work_func1
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 33e3f0a3358b8f9bb54b2661b9c1d37a75664c79
Author: Richard Clark <richard.xnu.clark@gmail.com>
Date: Tue, 13 Dec 2022 12:39:36 +0800
workqueue: Add a new flag to spot the potential UAF error
Currently if the user queues a new work item unintentionally
into a wq after destroy_workqueue(wq), the work can still
be queued and scheduled without any noticeable kernel message
before the end of an RCU grace period.
As a debug-aid facility, this commit adds a new flag
__WQ_DESTROYING to spot that issue by triggering a kernel WARN
message.
Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a7e30c0e9a5f95b7f74e6272d9c75fd65c897721
Author: Uladzislau Rezki <urezki@gmail.com>
Date: Sun, 16 Oct 2022 16:23:03 +0000
workqueue: Make queue_rcu_work() use call_rcu_hurry()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
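For context, queue_rcu_work() is used roughly as follows; a usage sketch with
made-up names, unrelated to the patch internals:
        static void reclaim_fn(struct work_struct *work)
        {
                /* runs after a grace period; safe to free RCU-protected data */
        }

        static struct rcu_work example_rwork;

        /* from some teardown path */
        INIT_RCU_WORK(&example_rwork, reclaim_fn);
        queue_rcu_work(system_wq, &example_rwork);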
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bc35f7ef96284b8c963991357a9278a6beafca54
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:40 +0800
workqueue: Convert the type of pool->nr_running to int
It is only modified on the associated CPU, so it doesn't need to be atomic.
tj: Comment updated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit cc5bff38463e0894dd596befa99f9d6860e15f5e
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:39 +0800
workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
The wakeup code in wq_worker_sleeping() is the same as wake_up_worker().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 443378f0664a78756c3e3aeaab92750fe1e05735
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue, 30 Nov 2021 17:00:30 -0800
workqueue: Upgrade queue_work_on() comment
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, but does not spell out the
consequences, which turn out to be quite mild. Therefore expand this
comment to explicitly say that the penalty for failing to nail down the
specified CPU is that the workqueue handler might find itself executing
on some other CPU.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3534
This patch is a backport of the following upstream commit:
commit 8318d6a6362f5903edb4c904a8dd447e59be4ad1
Author: Audra Mitchell <audra@redhat.com>
Date: Thu Jan 25 14:05:32 2024 -0500
workqueue: Shorten events_freezable_power_efficient name
Since we have set WQ_NAME_LEN to 32, shorten the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Audra Mitchell <audra@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3534
This patch is a backport of the following upstream commit:
commit 31c89007285d365aa36f71d8fb0701581c770a27
Author: Audra Mitchell <audra@redhat.com>
Date: Mon Jan 15 12:08:22 2024 -0500
workqueue.c: Increase workqueue name length
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len")
Increase the size to 32 characters and print a warning in the event
the requested name is larger than the limit of 32 characters.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Audra Mitchell <audra@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-20254
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/
commit aae17ebb53cd3da37f5dfbde937acd091eb4340c
Author: Leonardo Bras <leobras@redhat.com>
Date: Mon Jan 29 22:00:46 2024 -0300
workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.
This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.
So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.
As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
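A rough sketch of the selection logic, using the existing housekeeping
helpers; this is not the exact __queue_delayed_work() hunk:
        int timer_cpu = raw_smp_processor_id();

        if (housekeeping_enabled(HK_TYPE_TIMER) &&
            !housekeeping_test_cpu(timer_cpu, HK_TYPE_TIMER))
                timer_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);

        add_timer_on(&dwork->timer, timer_cpu);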
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff due to missing upstream commit
fcecfa8f271a ("workqueue: Remove module param disable_numa
and sysfs knobs pool_ids and numa").
commit 49277a5b76373e630075ff7d32fc0f9f51294f24
Author: Waiman Long <longman@redhat.com>
Date: Mon, 20 Nov 2023 21:18:40 -0500
workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
Commit fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask()
to exclude CPUs from wq_unbound_cpumask") makes
workqueue_set_unbound_cpumask() static as it is not used elsewhere in
the kernel. However, this triggers a kernel test robot warning about
'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYSFS
isn't defined. It happens that workqueue_set_unbound_cpumask() is only
called when CONFIG_SYSFS is defined.
Move workqueue_set_unbound_cpumask() and its helpers inside the
CONFIG_SYSFS compilation block to avoid the warning. There is no
functional change.
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts:
1) A merge conflict in the workqueue_unbound_exclude_cpumask() hunk
of kernel/workqueue.c due to missing upstream commit 63c5484e7495
("workqueue: Add multiple affinity scopes and interface to select
them").
2) A merge conflict in the workqueue_init_early() hunk of
kernel/workqueue.c due to upstream merge conflict resolved according
to upstream merge commit 202595663905 ("Merge branch 'for-6.7-fixes'
of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-6.8").
commit fe28f631fa941fba583d1c4f25895284b90af671
Author: Waiman Long <longman@redhat.com>
Date: Wed, 25 Oct 2023 14:25:52 -0400
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
When the "isolcpus" boot command line option is used to add a set
of isolated CPUs, those CPUs will be excluded automatically from
wq_unbound_cpumask to avoid running work functions from unbound
workqueues.
Recently cpuset has been extended to allow the creation of partitions
of isolated CPUs dynamically. To make it closer to the "isolcpus"
in functionality, the CPUs in those isolated cpuset partitions should be
excluded from wq_unbound_cpumask as well. This can be done currently by
explicitly writing to the workqueue's cpumask sysfs file after creating
the isolated partitions. However, this process can be error prone.
Ideally, the cpuset code should be allowed to request the workqueue code
to exclude those isolated CPUs from wq_unbound_cpumask so that this
operation can be done automatically and the isolated CPUs will be returned
to wq_unbound_cpumask after the destruction of the isolated
cpuset partitions.
This patch adds a new workqueue_unbound_exclude_cpumask() function to
enable that. This new function will exclude the specified isolated
CPUs from wq_unbound_cpumask. To be able to restore those isolated
CPUs back after the destruction of isolated cpuset partitions, a new
wq_requested_unbound_cpumask is added to store the user provided unbound
cpumask either from the boot command line options or from writing to
the cpumask sysfs file. This new cpumask provides the basis for CPU
exclusion.
To enable users to understand how the wq_unbound_cpumask is being
modified internally, this patch also exposes the newly introduced
wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to
store the cpumask to be excluded from wq_unbound_cpumask as read-only
sysfs files.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A merge conflict due to missing upstream commit fef59c9cab6a
("workqueue: Rename NUMA related names to use pod instead")
and two other subsequent workqueue commits.
commit 4a6c5607d4502ccd1b15b57d57f17d12b6f257a7
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 21 Nov 2023 11:39:36 -1000
workqueue: Make sure that wq_unbound_cpumask is never empty
During boot, depending on how the housekeeping and workqueue.unbound_cpus
masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9b1
("workqueue: Implement non-strict affinity scope for unbound workqueues"),
this may end up feeding -1 as a CPU number into the scheduler, leading to oopses.
BUG: unable to handle page fault for address: ffffffff8305e9c0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
Call Trace:
<TASK>
select_idle_sibling+0x79/0xaf0
select_task_rq_fair+0x1cb/0x7b0
try_to_wake_up+0x29c/0x5c0
wake_up_process+0x19/0x20
kick_pool+0x5e/0xb0
__queue_work+0x119/0x430
queue_work_on+0x29/0x30
...
An empty wq_unbound_cpumask is a clear misconfiguration and already
disallowed once the system is booted up. Let's warn on and ignore
unbound_cpumask restrictions which lead to no unbound cpus. While at it,
also remove the now unnecessary empty check on wq_unbound_cpumask in
wq_select_unbound_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7
Author: Waiman Long <longman@redhat.com>
Date: Tue, 10 Oct 2023 22:48:42 -0400
workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
to be ordered") enabled implicit ordered attribute to be added to
WQ_UNBOUND workqueues with max_active of 1. This prevented changing
the attributes of these workqueues, leading to fix commit 0a94efb5ac
("workqueue: implicit ordered attribute should be overridable").
However, workqueue_apply_unbound_cpumask() was not updated at that time.
So sysfs changes to wq_unbound_cpumask have no effect on WQ_UNBOUND
workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
workqueues are visible on sysfs, we are not able to make all the
necessary cpumask changes even if we iterate over all the workqueue cpumasks
in sysfs and change them one by one.
Fix this problem by applying the corresponding change made
to apply_workqueue_attrs_locked() in the fix commit to
workqueue_apply_unbound_cpumask().
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff in kernel/workqueue.c due to missing
upstream commit 20bdedafd2f6 ("workqueue: Warn attempt to
flush system-wide workqueues.").
commit ace3c5499e61ef7c0433a7a297227a9bdde54a55
Author: tiozhang <tiozhang@didiglobal.com>
Date: Thu, 29 Jun 2023 11:50:50 +0800
workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time
The motivation for this is to improve boot times for devices when
we want to prevent our workqueue work items from running on some specific CPUs,
e.g. CPUs that are busy with interrupts.
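For example, booting with the following restricts unbound workqueue work
items to CPUs 0-3 (the CPU list here is only an illustration):
        workqueue.unbound_cpus=0-3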
Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1023
commit 686f669780276da534e93ba769e02bdcf1f89f8d
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Mon Mar 13 19:28:50 2023 +0100
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit c63a2e52d5e08f01140d7b76c08a78e15e801f03
Author: Valentin Schneider <vschneid@redhat.com>
Date: Fri, 13 Jan 2023 17:40:40 +0000
workqueue: Fold rebind_worker() within rebind_workers()
!CONFIG_SMP builds complain about rebind_worker() being unused. Its only
user, rebind_workers() is indeed only defined for CONFIG_SMP, so just fold
the two lines back up there.
Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit e02b93124855cd34b78e61ae44846c8cb5fddfc3
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:31 +0000
workqueue: Unbind kworkers before sending them to exit()
It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.
A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).
Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.
Changing the affinity does require a sleepable context, leverage the newly
introduced pool->idle_cull_work to get that.
Remove dying workers from pool->workers and keep track of them in a
separate list. This intentionally prevents for_each_loop_worker() from
iterating over workers that are marked for death.
Rename destroy_worker() to set_worker_dying() to better reflect its
effects and relationship with wake_dying_workers().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 9ab03be42b8f9136dcc01a90ecc9ac71bc6149ef
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:30 +0000
workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
put_unbound_pool() currently passes wq_manager_inactive() as exit condition
to rcuwait_wait_event(), which grabs pool->lock to check for
pool->flags & POOL_MANAGER_ACTIVE
A later patch will require destroy_worker() to be invoked with
wq_pool_attach_mutex held, which needs to be acquired before
pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
it could clobber the task state set by rcuwait_wait_event().
Instead, restructure the waiting logic to acquire any necessary lock
outside of rcuwait_wait_event().
Since further work cannot be inserted into unbound pwqs that have reached
->refcnt==0, this is bound to make forward progress as eventually the
worklist will be drained and need_more_worker(pool) will remain false,
preventing any worker from stealing the manager position from us.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 3f959aa3b33829acfcd460c6c656d54dfebe8d1e
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:29 +0000
workqueue: Convert the idle_timer to a timer + work_struct
A later patch will require a sleepable context in the idle worker timeout
function. Converting worker_pool.idle_timer to a delayed_work gives us just
that, however this would imply turning all idle_timer expiries into
scheduler events (waking up a worker to handle the dwork).
Instead, implement a "custom dwork" where the timer callback does some
extra checks before queuing the associated work.
No change in functionality intended.
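Schematically, the "custom dwork" pattern looks like this; a conceptual
sketch with made-up helper names, not the worker_pool code:
        static struct timer_list idle_timer;
        static struct work_struct idle_cull_work;

        static void idle_timer_fn(struct timer_list *t)
        {
                if (should_cull_idle_workers())         /* illustrative check */
                        queue_work(system_unbound_wq, &idle_cull_work);
                else
                        mod_timer(&idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
        }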
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 99c621ef243bda726fb8d982a274ded96570b410
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Thu, 12 Jan 2023 16:14:27 +0000
workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit c0feea594e058223973db94c1c32a830c9807c86
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 29 Jul 2022 13:30:23 +0900
workqueue: don't skip lockdep work dependency in cancel_work_sync()
Like Hillf Danton mentioned
syzbot should have been able to catch cancel_work_sync() in work context
by checking lockdep_map in __flush_work() for both flush and cancel.
in [1], being unable to report the obvious deadlock scenario shown below is
broken. From a locking dependency perspective, the sync version of a cancel
request should behave like a flush request, since it waits for completion of
the work if that work has already started execution.
----------
#include <linux/module.h>
#include <linux/sched.h>

static DEFINE_MUTEX(mutex);

static void work_fn(struct work_struct *work)
{
        schedule_timeout_uninterruptible(HZ / 5);
        mutex_lock(&mutex);
        mutex_unlock(&mutex);
}

static DECLARE_WORK(work, work_fn);

static int __init test_init(void)
{
        schedule_work(&work);
        schedule_timeout_uninterruptible(HZ / 10);
        mutex_lock(&mutex);
        cancel_work_sync(&work);
        mutex_unlock(&mutex);
        return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------
The check this patch restores was added by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()").
Then, lockdep's crossrelease feature was added by commit b09be676e0
("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
this check was once removed by commit fd1a5b04df ("workqueue: Remove
now redundant lock acquisitions wrt. workqueue flushes").
But lockdep's crossrelease feature was removed by commit e966eaeeb6
("locking/lockdep: Remove the cross-release locking checks"). At this
point, this check should have been restored.
Then, commit d6e89786be ("workqueue: skip lockdep wq dependency in
cancel_work_sync()") introduced a boolean flag to distinguish flush_work()
from cancel_work_sync(), because checking the "struct workqueue_struct"
dependency when called from cancel_work_sync() was causing false positives.
Then, commit 87915adc3f ("workqueue: re-add lockdep dependencies for
flushing") tried to restore the "struct work_struct" dependency check, but
erroneously tested this boolean flag. As the example above shows, the
"struct work_struct" dependency needs to be checked for both flush_work()
and cancel_work_sync().
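Concretely, the restored annotation in __flush_work() is expected to look
roughly like this (a sketch, not a verbatim quote of the patch): acquire and
release the work item's lockdep_map unconditionally, i.e. without the
from_cancel test, so the dependency is recorded for flush_work() and
cancel_work_sync() alike.
----------
/* record the "struct work_struct" dependency for flush and cancel alike */
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
----------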
Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
Reported-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 2c1f1a9180bfacbc3c8e5b10075640cc810cf9c0
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:38 +0800
workqueue: Change the comments of the synchronization about the idle_list
The access to idle_list in wq_worker_sleeping() is now protected by
pool->lock, so the comment above idle_list can be changed to "L:", which
means "access with pool->lock held".
The outdated comment in wq_worker_sleeping() is also removed, since the
function is no longer called with the rq lock held; idle_list is now
dereferenced with pool->lock held.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 21b195c05cf6a6cc49777d6992772bcf01502186
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:37 +0800
workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
In wq_worker_sleeping(), the access to worklist is protected by the
pool->lock, so the memory barrier is unneeded.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 84f91c62d675480ffd3d870ee44c07965cbd8b21
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:42 +0800
workqueue: Remove the cacheline_aligned for nr_running
nr_running is no longer modified remotely now that the schedule callback in
the wakeup path has been removed. Moreover, nr_running is usually accessed
together with other fields of the pool, so the cacheline alignment for
nr_running is no longer needed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
Conflicts: A merge conflict requiring manual merge due to the presence
of a later upstream commit 46a4d679ef88 ("workqueue: Avoid
a false warning in unbind_workers()").
commit 989442d73757868118a73b92732b549a73c9ce35
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:41 +0800
workqueue: Move the code of waking a worker up in unbind_workers()
In unbind_workers(), there are two pool->lock held sections separated by the
code that zaps nr_running. wake_up_worker() needs to be in a pool->lock held
section and after nr_running has been zapped. Zapping nr_running also had to
come after schedule() while the local wake-up functionality was in use. Now
that the call to schedule() has been removed along with the local wake-up
functionality, the code can be merged into a single pool->lock held section.
The diffstat shows other code being moved down only because the diff tools
cannot recognize that the two code blocks were swapped in order to merge the
lock-held sections.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit ccf45156fd167a234baf038c11c1f367c7ccabd4
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:37 +0800
workqueue: Remove the outdated comment before wq_worker_sleeping()
It is no longer called with preemption disabled.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 45c753f5f24d2d4717acb38ce35e604ff9abcb50
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Wed, 1 Dec 2021 16:19:45 +0100
workqueue: Fix unbind_workers() VS wq_worker_sleeping() race
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
going to sleep. In that case the following scenario can happen:
unbind_workers()                      wq_worker_sleeping()
----------------                      --------------------
                                      if (worker->flags & WORKER_NOT_RUNNING)
                                          return;
                                      //PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
                                      //resume to worker
                                      atomic_dec_and_test(&pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool is eventually rebound, should cpu_up() be called on
the target CPU in the future. But here the race leaves pool->nr_running at
-1, triggering the following warning when the worker goes idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also, due to this incorrect "nr_running == -1", all sorts of hazards can
follow, starting with queued work being ignored because no worker is woken
up at insert_work() time.
Fix this by re-checking the worker flags while pool->lock is held.
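A sketch of that fix in wq_worker_sleeping() (simplified; the elided part is
the existing nr_running/wakeup handling): re-check the flags once pool->lock
is held, and bail out if the worker was unbound while it was preempted.
----------
raw_spin_lock_irq(&pool->lock);

/*
 * Re-check in case unbind_workers() preempted us after the lockless
 * WORKER_NOT_RUNNING test above: nr_running must not be decremented
 * once it has been reset to 0 for an unbound worker.
 */
if (worker->flags & WORKER_NOT_RUNNING) {
        raw_spin_unlock_irq(&pool->lock);
        return;
}

/* ... existing nr_running decrement and wakeup handling ... */
raw_spin_unlock_irq(&pool->lock);
----------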
Fixes: b945efcdd07d ("sched: Remove pointless preemption disable in sched_submit_work()")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit 46a4d679ef88285ea17c3e1e4fed330be2044f21
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Fri Jul 29 17:44:38 2022 +0800
workqueue: Avoid a false warning in unbind_workers()
Calling set_cpus_allowed_ptr() with wq_unbound_cpumask can fail and trigger
a false warning.
Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.
It is very easy to trigger the warning:
Set wq_unbound_cpumask to a small set of CPUs.
Offline all the CPUs of wq_unbound_cpumask.
Offline an extra CPU and trigger the warning.
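A sketch of the resulting logic in unbind_workers() (approximate, not the
verbatim diff): only use wq_unbound_cpumask when it still intersects the
active CPUs, otherwise fall back to cpu_possible_mask so that
set_cpus_allowed_ptr() cannot fail and trip the warning.
----------
for_each_pool_worker(worker, pool) {
        kthread_set_per_cpu(worker->task, -1);
        if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                  wq_unbound_cpumask) < 0);
        else
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                  cpu_possible_mask) < 0);
}
----------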
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit c4f135d643823a869becfa87539f7820ef9d5bfa
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed Jun 1 16:32:47 2022 +0900
workqueue: Wrap flush_workqueue() using a macro
Since a flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) may introduce the possibility of deadlock
due to an unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs, as the
caller has no idea what it will end up waiting for.
Although there is flush_scheduled_work(), which flushes the system_wq WQ
with a "Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing the system_wq WQ [2].
Therefore, let's change direction so that developers use their own local WQs
whenever flush_scheduled_work()/flush_workqueue(system_*_wq) would otherwise
be unavoidable.
Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism to prevent developers who are unaware of this conversion
from starting to flush system-wide WQs again.
Since I found that WARN_ON() is a complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() as an
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() once all in-tree users have stopped
calling flush_scheduled_work().
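For illustration of the mechanism only (a hedged sketch with assumed helper
names such as __warn_flushing_systemwide_wq() and __flush_workqueue(); the
real header change may differ in detail): the warning text is attached to a
function declaration via __compiletime_warning(), and a macro wrapper routes
offending callers through that declaration while behavior stays the same.
----------
/*
 * Any call site of this declaration that survives optimization makes the
 * compiler emit the message below at build time.
 */
extern void __warn_flushing_systemwide_wq(void)
        __compiletime_warning("Please avoid flushing system-wide workqueues.");

/*
 * Sketch of the wrapper: flush_scheduled_work() always targets system_wq,
 * so every caller is warned while the flush itself still happens.
 */
#define flush_scheduled_work()                          \
({                                                      \
        __warn_flushing_systemwide_wq();                \
        __flush_workqueue(system_wq);                   \
})
----------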
Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit 10a5a651e3afc9b0b381f47e8930972e4e918397
Author: Zqiang <qiang1.zhang@intel.com>
Date: Thu Mar 31 13:57:17 2022 +0800
workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs
When a CPU goes offline, all workers on the CPU's pool have their
cpus_allowed cleared to cpu_possible_mask and can then run on any CPU,
including isolated ones. Instead, set cpus_allowed to wq_unbound_cpumask so
that they avoid isolated CPUs.
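The essence of the change is a one-line substitution in unbind_workers()
(sketched here from the description above; the later fix 46a4d679ef88
additionally guards against wq_unbound_cpumask having no active CPUs):
----------
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
----------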
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>