JIRA: https://issues.redhat.com/browse/RHEL-81472
CVE: CVE-2025-21786
commit e76946110137703c16423baf6ee177b751a34b7e
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Thu Jan 23 16:25:35 2025 +0800
workqueue: Put the pwq after detaching the rescuer from the pool
The commit 68f83057b913 ("workqueue: Reap workers via kthread_stop() and
remove detach_completion") adds code to reap the normal workers but
mistakenly does not handle the rescuer, and it also removes the code
waiting for the rescuer in put_unbound_pool(), which caused a
use-after-free bug reported by Cheung Wall.
To avoid the use-after-free bug, the pool’s reference must be held until
the detachment is complete. Therefore, move the code that puts the pwq
after detaching the rescuer from the pool.
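For illustration, a hedged sketch of the ordering constraint (not the
literal diff; the detach helper name below is approximate, while
put_pwq_unlocked() is a real kernel/workqueue.c helper):
    /* the rescuer's pwq reference is what pins the pool, so the
     * detachment must fully complete before that reference is dropped */
    detach_rescuer_from_pool(rescuer);  /* approximate helper name */
    put_pwq_unlocked(pwq);              /* only now safe to put the pwq */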
Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Cc: cheung wall <zzqq0103.hey@gmail.com>
Link: https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/
Fixes: 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-74107
CVE: CVE-2024-57888
commit de35994ecd2dd6148ab5a6c5050a1670a04dec77
Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Date: Thu, 19 Dec 2024 09:30:30 +0000
workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker
After commit
746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
amdgpu started seeing the following warning:
[ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
...
[ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched]
...
[ ] Call Trace:
[ ] <TASK>
...
[ ] ? check_flush_dependency+0xf5/0x110
...
[ ] cancel_delayed_work_sync+0x6e/0x80
[ ] amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
[ ] amdgpu_ring_alloc+0x40/0x50 [amdgpu]
[ ] amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
[ ] ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
[ ] amdgpu_job_run+0xaa/0x1f0 [amdgpu]
[ ] drm_sched_run_job_work+0x257/0x430 [gpu_sched]
[ ] process_one_work+0x217/0x720
...
[ ] </TASK>
The intent of the verification done in check_flush_dependency() is to
ensure forward progress during memory reclaim, by flagging cases when
either a memory reclaim process, or a memory reclaim work item, is
flushed from a context not marked as memory reclaim safe.
This is correct when flushing, but when called from the
cancel(_delayed)_work_sync() paths it is a false positive because the work
is either already running, or will not be running at all. Therefore
cancelling it is safe, and we can relax the warning criteria by letting the
helper know of the calling context.
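For illustration, a hedged sketch of the now-permitted pattern (all names
below are made up for the example):
    static void gfx_off_fn(struct work_struct *work) { /* ... */ }
    /* lives on system_wq, which is !WQ_MEM_RECLAIM */
    static DECLARE_DELAYED_WORK(gfx_off_work, gfx_off_fn);
    /* executes on a WQ_MEM_RECLAIM workqueue, e.g. a drm_sched submit wq */
    static void run_job_fn(struct work_struct *work)
    {
        /* flushing gfx_off_work from here would be a genuine reclaim
         * dependency; cancelling is safe because the work is either
         * already running or will not run at all */
        cancel_delayed_work_sync(&gfx_off_work);
    }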
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: fca839c00a ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v4.5+
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5437
# Merge Request Required Information
## Summary of Changes
Depends: !5592
Depends: !5692
JIRA: https://issues.redhat.com/browse/RHEL-53569
Omitted-fix: 48ffe2074c2864ab64ee2004e7ebf3d6a6730fbf
Omitted-fix: 06e7139a034f26804904368fe4af2ceb70724756
Omitted-fix: 5278ca048d93eac74e9a81b3e672da2b2264bce4
Omitted-fix: 8dffaec34dd55473adcbc924a4c9b04aaa0d4278
Signed-off-by: Robert Foss <rfoss@redhat.com>
## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1
commit 9b59a85a84dc37ca4f2c54df5e06aff4c1eae5d3
Author: Matthew Brost <matthew.brost@intel.com>
AuthorDate: Tue Aug 20 12:38:08 2024 -0700
Commit: Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 20 09:38:39 2024 -1000
Calling va_start / va_end multiple times is undefined and causes
problems with certain compilers / platforms.
Change alloc_ordered_workqueue_lockdep_map() to a macro and update
__alloc_workqueue() to take a va_list argument.
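A minimal C sketch of the underlying rule (not from the commit): a va_list
may only be consumed once per va_start(), so a helper should accept a
va_list from its caller rather than restarting varargs itself:
    #include <stdarg.h>
    #include <stdio.h>
    /* consumes the va_list; the caller does va_start/va_end exactly once */
    static void vlog(const char *fmt, va_list args)
    {
        vprintf(fmt, args);
    }
    static void log_msg(const char *fmt, ...)
    {
        va_list args;
        va_start(args, fmt);    /* started once ... */
        vlog(fmt, args);
        va_end(args);           /* ... and ended once */
    }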
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Robert Foss <rfoss@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1
commit ec0a7d44b358afaaf52856d03c72e20587bc888b
Author: Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug 9 15:28:25 2024 -0700
Commit: Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:51 2024 -1000
Add an interface for a user-defined workqueue lockdep map, which is
helpful when multiple workqueues are created for the same purpose. This
also helps avoid leaking lockdep maps on each workqueue creation.
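A hedged usage sketch of the added interface (argument order per the
upstream declarations; verify against include/linux/workqueue.h, and the
"submit" names are made up):
    static struct lockdep_map submit_lockdep_map = {
        .name = "submit_wq_lockdep_map",
    };
    /* several workqueues created for the same purpose share one map */
    wq = alloc_ordered_workqueue_lockdep_map("submit_wq-%d", WQ_MEM_RECLAIM,
                                             &submit_lockdep_map, id);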
v2:
- Add alloc_workqueue_lockdep_map (Tejun)
v3:
- Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
- static inline alloc_ordered_workqueue_lockdep_map (Tejun)
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Robert Foss <rfoss@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1
Conflicts: Does not apply cleanly due to many unrelated changes being introduced
in these functions, but the changes introduced in this patch are
simple and well contained.
Relevant prior changes to this commit can be found with git blame
but will not be included here due to being wide-reaching changes:
git blame linux/master kernel/workqueue.c
commit b188c57af2b5c17a1e8f71a0358f330446a4f788
Author: Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug 9 15:28:23 2024 -0700
Commit: Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:28 2024 -1000
Will help enable user-defined lockdep maps for workqueues.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Robert Foss <rfoss@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61734
commit 86898fa6b8cd942505860556f3a0bf52eae57fe8
Author: Tejun Heo <tj@kernel.org>
Date: Mon Mar 25 07:21:03 2024 -1000
workqueue: Implement disable/enable for (delayed) work items
While (delayed) work items could be flushed and canceled, there was no way
to prevent them from being queued in the future. While this didn't lead to
functional deficiencies, it sometimes required a bit more effort from the
workqueue users to e.g. sequence shutdown steps with more care.
Workqueue is currently in the process of replacing tasklet, which does
support disabling and enabling. The feature is used relatively widely to,
for example, temporarily suppress the main path while a control-plane
operation (reset or config change) is in progress.
To enable easy conversion of tasklet users, and as it seems like an
inherently useful feature, this patch implements disabling and enabling of
work items.
- A work item carries a 16-bit disable count in work->data while not
  queued. Access to the count is synchronized by the PENDING bit like all
  other parts of work->data.
- If the count is non-zero, the work item cannot be queued. Any attempt to
queue the work item fails and returns %false.
- disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and
enable_delayed_work() are added.
v3: enable_work() was using local_irq_enable() instead of
local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is
awkward now and will become incorrect as enable_work() will later be
used from IRQ context too. (Lai)
v2: Lai noticed that queue_work_node() wasn't checking the disable count.
Fixed. queue_rcu_work() is updated to trigger warning if the inner work
item is disabled.
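A hedged usage sketch of the new interface (dev->event_work is a made-up
example):
    /* quiesce: no new executions can start once this returns */
    disable_work_sync(&dev->event_work);
    /* reconfigure; queue_work() on a disabled item fails, returning false */
    /* drop the disable count; queueing works again once it hits zero */
    enable_work(&dev->event_work);
    queue_work(system_wq, &dev->event_work);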
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Bastien Nocera <bnocera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61734
commit 1211f3b21c2aa0d22d8d7f050e3a5930a91cd0e4
Author: Tejun Heo <tj@kernel.org>
Date: Mon Mar 25 07:21:02 2024 -1000
workqueue: Preserve OFFQ bits in cancel[_sync] paths
The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and
manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit
values except for the pool ID are statically known and don't preserve them,
which is not wrong in the current code as the pool ID and CANCELING are the
only information carried. However, the planned disable/enable support will
add more fields and need them to be preserved.
This patch updates work data handling so that only the bits which need
updating are updated.
- struct work_offq_data is added along with work_offqd_unpack() and
work_offqd_pack_flags() to help manipulating multiple fields contained in
work->data. Note that the helpers look a bit silly right now as there
isn't that much to pack. The next patch will add more.
- mark_work_canceling() which is used only by __cancel_work_sync() is
replaced by open-coded usage of work_offq_data and
set_work_pool_and_keep_pending() in __cancel_work_sync().
- __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits
when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end.
- This removes all users of get_work_pool_id() which is dropped. Note that
get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ
cases; however, it was only being called after try_to_grab_pending()
succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe
to use work_offqd_unpack() instead.
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Bastien Nocera <bnocera@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839
commit 98f887f820c993e05a12e8aa816c80b8661d4c87
Author: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 25 Jun 2024 21:42:45 +1000
workqueue: Improve scalability of workqueue watchdog touch
On a ~2000 CPU powerpc system, hard lockups have been observed in the
workqueue code when stop_machine runs (in this case due to CPU hotplug).
This is due to lots of CPUs spinning in multi_cpu_stop, calling
touch_nmi_watchdog() which ends up calling wq_watchdog_touch().
wq_watchdog_touch() writes to the global variable wq_watchdog_touched,
and that can find itself in the same cacheline as other important
workqueue data, which slows down operations to the point of lockups.
In the case of the following abridged trace, worker_pool_idr was in
the hot cacheline, causing the lockups to always appear at idr_find.
watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find
Call Trace:
get_work_pool
__queue_work
call_timer_fn
run_timer_softirq
__do_softirq
do_softirq_own_stack
irq_exit
timer_interrupt
decrementer_common_virt
* interrupt: 900 (timer) at multi_cpu_stop
multi_cpu_stop
cpu_stopper_thread
smpboot_thread_fn
kthread
Fix this by having wq_watchdog_touch() only write to the shared cacheline
if the time since the last recorded touch exceeds 1/4 of the watchdog
threshold.
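A hedged sketch of the resulting wq_watchdog_touch() logic (abridged from
upstream; details may differ per tree):
    void wq_watchdog_touch(int cpu)
    {
        unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
        unsigned long touch_ts = READ_ONCE(wq_watchdog_touched);
        unsigned long now = jiffies;
        if (cpu >= 0)
            per_cpu(wq_watchdog_touched_cpu, cpu) = now;
        /* don't unnecessarily store to the contended global cacheline */
        if (time_after(now, touch_ts + thresh / 4))
            WRITE_ONCE(wq_watchdog_touched, now);
    }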
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839
commit 18e24deb1cc92f2068ce7434a94233741fbd7771
Author: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 25 Jun 2024 21:42:44 +1000
workqueue: wq_watchdog_touch is always called with valid CPU
Warn in the case it is called with cpu == -1. This does not appear
to happen anywhere.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts: Dropped CN documentation since not in RHEL, context
diffs in sched-domains.rst. Skipped hunk in func_set_ftrace_file.tc
due to not having 6fec1ab67f8 ("selftests/ftrace: Do not
trace do_softirq because of PREEMPT_RT") in tree.
commit 86dd6c04ef9f213e14d60c9f64bce1cc019f816e
Author: Ingo Molnar <mingo@kernel.org>
Date: Fri Mar 8 12:18:08 2024 +0100
sched/balancing: Rename scheduler_tick() => sched_tick()
- Standardize on prefixing scheduler-internal functions defined
in <linux/sched.h> with sched_*() prefix. scheduler_tick() was
the only function using the scheduler_ prefix. Harmonize it.
- The other reason to rename it is the NOHZ scheduler tick
handling functions are already named sched_tick_*().
Make the 'git grep sched_tick' more meaningful.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org
Signed-off-by: Phil Auld <pauld@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-49500
commit 58629d4871e8eb2c385b16a73a8451669db59f39
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Wed, 3 Jul 2024 17:27:41 +0800
workqueue: Always queue work items to the newest PWQ for ordered workqueues
To ensure non-reentrancy, __queue_work() attempts to enqueue a work
item to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently suggests
non-reentrancy, but it could also disrupt the sequence if the item is
not enqueued on the newest PWQ.
Just queue it to the newest PWQ and let order management guarantee
non-reentrancy.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org # v6.9+
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-49500
commit 841658832335a32dd86f4e4d3aab7d14188b268b
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Tue, 2 Jul 2024 12:14:55 +0800
workqueue: Update cpumasks only after applying them successfully
Make workqueue_unbound_exclude_cpumask() and workqueue_set_unbound_cpumask()
only update wq_isolated_cpumask and wq_requested_unbound_cpumask when
workqueue_apply_unbound_cpumask() returns successfully.
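A hedged sketch of the resulting flow (identifiers taken from the commit
text; surrounding code abridged):
    ret = workqueue_apply_unbound_cpumask(unbound_cpumask);
    if (!ret) {
        /* commit the requested/isolated masks only on success */
        cpumask_copy(wq_requested_unbound_cpumask, cpumask);
    }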
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-49500
commit 79202591a55a365251496162ced3004a0a1fa1cf
Author: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 7 Mar 2024 21:39:32 -0800
workqueue: Cleanup subsys attribute registration
While reviewing users of subsys_virtual_register() I noticed that
wq_sysfs_init() ignores the @groups argument. This looks like a
historical artifact as the original wq_subsys only had one attribute to
register.
On the way to building up an @groups argument to pass to
subsys_virtual_register() a few more cleanups fell out:
* Use DEVICE_ATTR_RO() and DEVICE_ATTR_RW() for
cpumask_{isolated,requested} and cpumask respectively. Rename the
@show and @store methods accordingly.
* Co-locate the attribute definition with the methods. This required
moving wq_unbound_cpumask_show down next to wq_unbound_cpumask_store
(renamed to cpumask_show() and cpumask_store())
* Use ATTRIBUTE_GROUPS() to skip some boilerplate declarations
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-49500
commit d40f92020c7a225b77e68599e4b099a4a0823408
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 22 Apr 2024 14:43:48 -1000
workqueue: The default node_nr_active should have its max set to max_active
The default nna (node_nr_active) is used when the pool isn't tied to a
specific NUMA node. This can happen in the following cases:
1. On NUMA, if per-node pwq init failure and the fallback pwq is used.
2. On NUMA, if a pool is configured to span multiple nodes.
3. On single node setups.
5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues") set the default nna->max to min_active because only #1
was being considered. For #2 and #3, using min_active means that the max
concurrency in normal operation is pushed down to min_active which is
currently 8, which can obviously lead to performance issues.
The exact value nna->max is set to doesn't really matter. #2 can only
happen if the workqueue is intentionally configured to ignore NUMA
boundaries and there's no good way to distribute max_active in this case.
#3 is the default behavior on single node machines.
Let's set the default nna->max to max_active. This fixes the artificially
lowered concurrency problem on single node machines and shouldn't hurt
anything for other cases.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-49500
commit 57a01eafdcf78f6da34fad9ff075ed5dfdd9f420
Author: Sven Schnelle <svens@linux.ibm.com>
Date: Tue, 23 Apr 2024 08:19:05 +0200
workqueue: Fix selection of wake_cpu in kick_pool()
With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
kernel oops was observed:
smp: Bringing up secondary CPUs ...
smp: Brought up 1 node, 8 CPUs
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000803
[..]
Call Trace:
arch_vcpu_is_preempted+0x12/0x80
select_idle_sibling+0x42/0x560
select_task_rq_fair+0x29a/0x3b0
try_to_wake_up+0x38e/0x6e0
kick_pool+0xa4/0x198
__queue_work.part.0+0x2bc/0x3a8
call_timer_fn+0x36/0x160
__run_timers+0x1e2/0x328
__run_timer_base+0x5a/0x88
run_timer_softirq+0x40/0x78
__do_softirq+0x118/0x388
irq_exit_rcu+0xc0/0xd8
do_ext_irq+0xae/0x168
ext_int_handler+0xbe/0xf0
psw_idle_exit+0x0/0xc
default_idle_call+0x3c/0x110
do_idle+0xd4/0x158
cpu_startup_entry+0x40/0x48
rest_init+0xc6/0xc8
start_kernel+0x3c4/0x5e0
startup_continue+0x3c/0x50
The crash is caused by calling arch_vcpu_is_preempted() for an offline
CPU. To avoid this, select the cpu with cpumask_any_and_distribute()
to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in
the pool, skip the assignment.
tj: This doesn't fully fix the bug as CPUs can still go down between picking
the target CPU and the wake call. Fixing that likely requires adding
cpu_online() test to either the sched or s390 arch code. However, regardless
of how that is fixed, workqueue shouldn't be picking a CPU which isn't
online as that would result in unpredictable and worse behavior.
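A hedged sketch of the fix in kick_pool() (abridged; p is the worker task):
    int wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
                                              cpu_online_mask);
    /* if no CPU of the pod is online, leave p->wake_cpu untouched */
    if (wake_cpu < nr_cpu_ids)
        p->wake_cpu = wake_cpu;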
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 1acd92d95fa24edca8f0292b21870025da93e24f
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 26 Feb 2024 15:38:55 -1000
workqueue: Drain BH work items on hot-unplugged CPUs
Boqun pointed out that workqueues aren't handling BH work items on offlined
CPUs. Unlike tasklet, which transfers out the pending tasks from
CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending, which is
problematic. Note that this behavior is specific to BH workqueues, as the
non-BH per-CPU workers just become unbound when the CPU goes offline.
This patch fixes the issue by draining the pending BH work items from an
offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
it's not as easy to transfer the pending work items from one pool to
another. Instead, run the offlined pools' pending BH work items on an
online CPU.
Note that this assumes that no further BH work items will be queued on the
offlined CPUs. This assumption is shared with tasklet and should be fine for
conversions. However, this issue also exists for per-CPU workqueues which
will just keep executing work items queued after CPU offline on unbound
workers and workqueue should reject per-CPU and BH work items queued on
offline CPUs. This will be addressed separately later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit ccdec92198df0c91f45a68f971771b6b0c1ba02d
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Thu, 22 Feb 2024 15:28:08 +0800
workqueue: Control intensive warning threshold through cmdline
When CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel will report
work functions which repeatedly violate intensive_threshold_us.
Currently, the warning is only triggered once the violation count
exceeds 4 and is a power of 2.
However, even a single long work execution can delay other work for a
long time and cause problems.
In order to control the warning threshold freely, add a boot argument
so that the user can choose the threshold at which the warning is
printed. At the same time, keep the exponential backoff to prevent
reporting too much. By default, the warning threshold is 4.
tj: Updated kernel-parameters.txt description.
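For example (parameter name per the kernel-parameters.txt addition; verify
against your kernel version), to warn on the very first violation:
    workqueue.cpu_intensive_warning_thresh=1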
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bccdc1faafaf32e00d6e4dddca1ded64e3272189
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:15 -1000
workqueue: Make @flags handling consistent across set_work_data() and friends
- set_work_data() takes a separate @flags argument but just ORs it to @data.
This is more confusing than helpful. Just take @data.
- Use the name @flags consistently and add the parameter to
set_work_pool_and_{keep|clear}_pending(). This will be used by the planned
disable/enable support.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit afe928c1dc611bec155d834020e0631e026aeb8a
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Remove clear_work_data()
clear_work_data() is only used in one place and immediately followed by
smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/
WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 978b8409eab15aa733ae3a79c9b5158d34cd3fb7
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Factor out work_grab_pending() from __cancel_work_sync()
The planned disable/enable support will need the same logic. Let's factor it
out. No functional changes.
v2: Update function comment to include @irq_flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff in include/linux/workqueue.h due to missing
upstream commit b2fa8443db32 ("workqueue: Split out
workqueue_types.h").
commit e9a8e01f9b133c145dd125021ec47c006d108af4
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Clean up enum work_bits and related constants
The bits of work->data are used for a few different purposes. How the bits
are used is determined by enum work_bits. The planned disable/enable support
will add another use, so let's clean it up a bit in preparation.
- Let WORK_STRUCT_*_BIT's values be determined by enum definition order.
- Delimit the different bit sections the same way using SHIFT and BITS
  values.
- Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.
- Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and
WORK_STRUCT_WQ_DATA_MASK with WORK_STRUCT_PWQ_MASK for clarity.
- Improve documentation.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c5f5b9422a49e9bc1c2f992135592ed921ac18e5
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Introduce work_cancel_flags
The cancel path used bool @is_dwork to distinguish canceling a regular work
and a delayed one. The planned disable/enable support will need passing
around another flag in the code path. As passing them around with bools will
be confusing, let's introduce named flags to pass around in the cancel path.
WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes.
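A hedged sketch of the named-flags pattern (the enum closely follows
upstream kernel/workqueue.c; the call site is illustrative):
    enum work_cancel_flags {
        WORK_CANCEL_DELAYED = 1 << 0,   /* canceling a delayed_work */
    };
    /* callers now pass self-documenting flags instead of a bare bool */
    __cancel_work_sync(&dwork->work, WORK_CANCEL_DELAYED);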
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c26e2f2e2fcfb73996fa025a0d3b5695017d65b5
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Use variable name irq_flags for saving local irq flags
Using the generic term `flags` for irq flags is conventional but can be
confusing as there's quite a bit of code dealing with work flags which
involves some subtleties. Let's use a more explicit name `irq_flags` for
local irq flags. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit cdc6e4b329bc82676886a758a940b2b6987c2109
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Reorganize flush and cancel[_sync] functions
They are currently a bit disorganized with flush and cancel functions mixed.
Reorganize them so that flush functions come first, cancel next and
cancel_sync last. This way, we won't have to add prototypes for internal
functions for the planned disable/enable support.
This is pure code reorganization. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c5140688d19a4579f7b01e6ca4b6e5f5d23d3d4d
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:14 -1000
workqueue: Rename __cancel_work_timer() to __cancel_work_sync()
__cancel_work_timer() is used to implement cancel_work_sync() and
cancel_delayed_work_sync(), similarly to how __cancel_work() is used to
implement cancel_work() and cancel_delayed_work(). I.e., the _timer part of
the name is a complete misnomer. The difference from __cancel_work() is
that it syncs against work item execution, not whether it handles timers.
Let's rename it to less confusing __cancel_work_sync(). No functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit d355001fa9370df8fdd6fca0e9ed77063615c7da
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:13 -1000
workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
The different flavors of RCU read critical sections have been unified. Let's
update the locking assertion macros accordingly to avoid requiring
unnecessary explicit rcu_read_[un]lock() calls.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c7a40c49af920fbad2ab6795b6587308ad69de9f
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 20 Feb 2024 19:36:13 -1000
workqueue: Cosmetic changes
Reorder some global declarations and adjust comments and whitespaces for
clarity and consistency. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit fd0a68a2337b79a7bd4dad5e7d9dc726828527af
Author: Tejun Heo <tj@kernel.org>
Date: Thu, 15 Feb 2024 19:10:01 -1000
workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues") added
irq_work usage to workqueue; however, it turns out irq_work is actually
optional and the change breaks build on configuration which doesn't have
CONFIG_IRQ_WORK enabled.
Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
be better to just always enable it. However, this still saves a small bit
of memory for tiny UP configs and is also the smallest change, so, for now,
let's keep it conditional.
Verified to do the right thing for x86_64 allnoconfig and defconfig, and
aarch64 allnoconfig, allnoconfig + printk disable (SMP but nothing selects
IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
IRQ_WORK.
v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: 2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues")
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 2f34d7337d98f3eae7bd3d1270efaf9d8a17cfc6
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Feb 2024 08:33:55 -1000
workqueue: Fix queue_work_on() with BH workqueues
When queue_work_on() is used to queue a BH work item on a remote CPU, the
work item is queued on that CPU but kick_pool() raises softirq on the local
CPU. This leads to stalls as the work item won't be executed until something
else on the remote CPU schedules a BH work item or tasklet locally.
Fix it by bouncing the softirq raise to the target CPU using per-cpu irq_work.
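A hedged sketch of the bounce (abridged; bh_pool_irq_work() stands in for
the per-cpu irq_work lookup):
    static void kick_bh_pool(struct worker_pool *pool)
    {
        /* queueing from a remote CPU: bounce via per-cpu irq_work */
        if (unlikely(pool->cpu != smp_processor_id())) {
            irq_work_queue_on(bh_pool_irq_work(pool), pool->cpu);
            return;
        }
        /* local CPU: raise the matching softirq directly */
        if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
            raise_softirq_irqoff(HI_SOFTIRQ);
        else
            raise_softirq_irqoff(TASKLET_SOFTIRQ);
    }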
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 8f172181f24bb5df7675225d9b5b66d059613f50
Author: Tejun Heo <tj@kernel.org>
Date: Thu, 8 Feb 2024 14:11:56 -1000
workqueue: Implement workqueue_set_min_active()
Since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
for unbound workqueues"), unbound workqueues have separate min_active which
sets the number of interdependent work items that can be handled. This value
is currently initialized to WQ_DFL_MIN_ACTIVE, which is 8. This isn't high
enough for some users; let's add an interface to adjust the setting.
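A hedged usage sketch (signature per the upstream addition; the workqueue
name is made up):
    struct workqueue_struct *wq;
    wq = alloc_workqueue("my_unbound_wq", WQ_UNBOUND, 0);
    if (wq)
        workqueue_set_min_active(wq, 64); /* raise from WQ_DFL_MIN_ACTIVE (8) */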
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 516d3dc99f4f2ab856d879696cd3a5d7f6db7796
Author: Waiman Long <longman@redhat.com>
Date: Fri, 9 Feb 2024 12:06:11 -0500
workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
proper processing and formatting of the embedded ASCII diagram.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 49584bb8ddbe8bcfc276c2d7dd3c8890f45f5970
Author: Waiman Long <longman@redhat.com>
Date: Thu, 8 Feb 2024 11:10:14 -0500
workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
Commit 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound
cpumask for WQ_UNBOUND") modified init_rescuer() to bind the rescuer of
an unbound workqueue to the cpumask in wq->unbound_attrs. However,
unbound_attrs->cpumask of every workqueue is initialized to
cpu_possible_mask and is only changed if the workqueue has the WQ_SYSFS
flag, which exposes a cpumask sysfs file that users can write. So that
patch doesn't achieve what it is intended to do.
If an unbound workqueue is created after wq_unbound_cpumask is modified
and there is no more unbound cpumask update after that, the unbound
rescuer will be bound to all CPUs unless the workqueue is created
with the WQ_SYSFS flag and a user explicitly modified its cpumask
sysfs file. Fix this problem by binding directly to wq_unbound_cpumask
in init_rescuer().
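A hedged sketch of the resulting binding in init_rescuer() (abridged):
    if (wq->flags & WQ_UNBOUND)
        kthread_bind_mask(rescuer->task, wq_unbound_cpumask);
    else
        kthread_bind_mask(rescuer->task, cpu_possible_mask);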
Fixes: 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit d64f2fa064f8866802e23c8ec95d9d1f601480ee
Author: Juri Lelli <juri.lelli@redhat.com>
Date: Thu, 8 Feb 2024 11:10:13 -0500
kernel/workqueue: Let rescuers follow unbound wq cpumask changes
When workqueue cpumask changes are committed the associated rescuer (if
one exists) affinity is not touched and this might be a problem down the
line for isolated setups.
Make sure rescuers affinity is updated every time a workqueue cpumask
changes, so that rescuers can't break isolation.
[longman: set_cpus_allowed_ptr() will block until the designated task
is enqueued on an allowed CPU, no wake_up_process() needed. Also use
the unbound_effective_cpumask() helper as suggested by Tejun.]
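A hedged sketch of the update applied when new attrs are committed (helper
names from the commit text):
    /* keep the rescuer inside the workqueue's effective unbound cpumask */
    if (wq->rescuer)
        set_cpus_allowed_ptr(wq->rescuer->task,
                             unbound_effective_cpumask(wq));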
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4c065dbce1e8639546ef3612acffb062dd084cfe
Author: Waiman Long <longman@redhat.com>
Date: Thu, 8 Feb 2024 14:12:20 -0500
workqueue: Enable unbound cpumask update on ordered workqueues
Ordered workqueues do not currently follow changes made to the
global unbound cpumask because per-pool workqueue changes may break
the ordering guarantee. IOW, a work function in an ordered workqueue
may run on an isolated CPU.
This patch enables ordered workqueues to follow changes made to the
global unbound cpumask by temporarily plugging, or suspending, the
newly allocated pool_workqueue so that it does not execute newly
queued work items until the old pwq has been properly drained. For
ordered workqueues, there should only be one pwq that is unplugged;
the rest should be plugged.
This enables ordered workqueues to follow the unbound cpumask changes
like other unbound workqueues, at the expense of some delay in the
execution of work functions during the transition period.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 26fb7e3dda4c16e2cfe2164a1e7315a9386602db
Author: Waiman Long <longman@redhat.com>
Date: Thu, 8 Feb 2024 11:10:11 -0500
workqueue: Link pwq's into wq->pwqs from oldest to newest
Add a new pwq into the tail of wq->pwqs so that pwq iteration will
start from the oldest pwq to the newest. This ordering will facilitate
the inclusion of ordered workqueues in a wq_unbound_cpumask update.
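A hedged sketch of the change in link_pwq() (abridged):
    /* append so that iteration visits pwqs oldest to newest */
    list_add_tail_rcu(&pwq->pwqs_node, &wq->pwqs);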
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 3bc1e711c26bff01d41ad71145ecb8dcb4412576
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 5 Feb 2024 14:19:10 -1000
workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke it. These
problems can be subtle and the fact that they can only trigger on NUMA
machines made them even more difficult to debug.
However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered and users that legitimately want a min concurrency level wq
unexpectedly gets an ordered one instead. With planned UNBOUND workqueue
updates to improve execution locality and more prevalence of chiplet designs
which can benefit from such improvements, this isn't a state we want to be in
forever.
There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
apply_workqueue_attrs_locked() which spuriously triggers WARNING and
fails workqueue creation. Fix it.
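A hedged illustration of the caller-side consequence ("foo" is a made-up
name):
    /* before: implicitly promoted to an ordered workqueue */
    wq = alloc_workqueue("foo", WQ_UNBOUND, 1);
    /* after: ordering must be requested explicitly */
    wq = alloc_ordered_workqueue("foo", 0);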
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to upstream conflict as shown in merge
commit 40911d4457f2 ("Merge branch 'for-6.8-fixes' into
for-6.9").
commit 8eb17dc1a6b5db7e89681f59285242af8d182f95
Author: Waiman Long <longman@redhat.com>
Date: Sat, 3 Feb 2024 10:43:30 -0500
workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 96068b6030391082bf0cd97af525d731afa5ad63
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date: Mon, 5 Feb 2024 08:31:52 +0800
workqueue: fix a typo in comment
There should be three, fix it.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4f19b8e01e2fb6c97d4307abb7bde4d34a1e601e
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 5 Feb 2024 07:18:08 -1000
Revert "workqueue: make wq_subsys const"
This reverts commit d412ace11144aa2bf692c7cf9778351efc15c827. This leads to
build failures as it depends on a driver-core commit 32f78abe59c7 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
and route it through driver-core tree.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in kernel/workqueue.c due to missing
upstream commit 68279f9c9f59 ("treewide: mark stuff as
__ro_after_init").
commit 4cb1ef64609f9b0254184b2947824f4b46ccab22
Author: Tejun Heo <tj@kernel.org>
Date: Sun, 4 Feb 2024 11:28:06 -1000
workqueue: Implement BH workqueues to eventually replace tasklets
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.
This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.
Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.
Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.
system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
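A hedged usage sketch of the system BH workqueue (the work function is
illustrative):
    static void my_bh_fn(struct work_struct *work)
    {
        /* executes in softirq context: must not sleep */
    }
    static DECLARE_WORK(my_bh_work, my_bh_fn);
    /* roughly the BH-workqueue analogue of tasklet_schedule() */
    queue_work(system_bh_wq, &my_bh_work);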
v3: - Add missing interrupt.h include.
v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.
- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 2fcdb1b44491e08f5334a92c50e8f362e0d46f91
Author: Tejun Heo <tj@kernel.org>
Date: Sun, 4 Feb 2024 11:28:06 -1000
workqueue: Factor out init_cpu_worker_pool()
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c35aea39d1e106f61fd2130f0d32a3bac8bd4570
Author: Tejun Heo <tj@kernel.org>
Date: Sun, 4 Feb 2024 11:28:06 -1000
workqueue: Update lock debugging code
These changes are in preparation of BH workqueue which will execute work
items from BH context.
- Update lock and RCU depth checks in process_one_work() so that it
remembers and checks against the starting depths and prints out the depth
changes.
- Factor out lockdep annotations in the flush paths into
touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
from __flush_work() to its callee - start_flush_work(). This brings it
closer to the wq counterpart and will allow testing the associated wq's
flags which will be needed to support BH workqueues. This is not expected
to cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit d412ace11144aa2bf692c7cf9778351efc15c827
Author: Ricardo B. Marliere <ricardo@marliere.net>
Date: Sun, 4 Feb 2024 10:47:05 -0300
workqueue: make wq_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c70e1779b73a39f7648b26bdc835304c60100ce3
Author: Tejun Heo <tj@kernel.org>
Date: Sun, 4 Feb 2024 11:14:21 -1000
workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.
This is easily triggered by running xfstests generic/001 test on xfs:
INFO: task umount:6305 blocked for more than 122 seconds.
...
task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2f6/0xa20
schedule+0x36/0xb0
schedule_timeout+0x20b/0x280
wait_for_completion+0x8a/0x140
__flush_workqueue+0x11a/0x3b0
xfs_inodegc_flush+0x24/0xf0
xfs_unmountfs+0x14/0x180
xfs_fs_put_super+0x3d/0x90
generic_shutdown_super+0x7c/0x160
kill_block_super+0x1b/0x40
xfs_kill_sb+0x12/0x30
deactivate_locked_super+0x35/0x90
deactivate_super+0x42/0x50
cleanup_mnt+0x109/0x170
__cleanup_mnt+0x12/0x20
task_work_run+0x60/0x90
syscall_exit_to_user_mode+0x146/0x150
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().
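A hedged sketch of the fix (abridged):
    unsigned long work_data = *work_data_bits(work);    /* stash first */
    /* this clobbers the flush color carried in work->data ... */
    set_work_pool_and_keep_pending(work, pool->id);
    /* ... so account the in-flight state from the stashed copy */
    pwq_dec_nr_in_flight(pwq, work_data);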
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c5f8cd6c62ce02205ced15e9a998103f21ec5455
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 30 Jan 2024 19:06:43 -1000
workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>