Commit Graph

875 Commits

Author SHA1 Message Date
CKI Backport Bot 80162867aa workqueue: Put the pwq after detaching the rescuer from the pool
JIRA: https://issues.redhat.com/browse/RHEL-81472
CVE: CVE-2025-21786

commit e76946110137703c16423baf6ee177b751a34b7e
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Thu Jan 23 16:25:35 2025 +0800

    workqueue: Put the pwq after detaching the rescuer from the pool

    The commit 68f83057b913 ("workqueue: Reap workers via kthread_stop() and
    remove detach_completion") adds code to reap the normal workers but
    mistakenly does not handle the rescuer and also removes the code waiting
    for the rescuer in put_unbound_pool(), which caused a use-after-free bug
    reported by Cheung Wall.

    To avoid the use-after-free bug, the pool’s reference must be held until
    the detachment is complete. Therefore, move the code that puts the pwq
    after detaching the rescuer from the pool.

    Reported-by: cheung wall <zzqq0103.hey@gmail.com>
    Cc: cheung wall <zzqq0103.hey@gmail.com>
    Link: https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/
    Fixes: 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion")
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-02-27 22:46:55 +00:00
Waiman Long ec40aa3ed8 workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker
JIRA: https://issues.redhat.com/browse/RHEL-74107
CVE: CVE-2024-57888

commit de35994ecd2dd6148ab5a6c5050a1670a04dec77
Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Date:   Thu, 19 Dec 2024 09:30:30 +0000

    workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker

    After commit
    746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
    amdgpu started seeing the following warning:

     [ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
    ...
     [ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched]
    ...
     [ ] Call Trace:
     [ ]  <TASK>
    ...
     [ ]  ? check_flush_dependency+0xf5/0x110
    ...
     [ ]  cancel_delayed_work_sync+0x6e/0x80
     [ ]  amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
     [ ]  amdgpu_ring_alloc+0x40/0x50 [amdgpu]
     [ ]  amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
     [ ]  ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
     [ ]  amdgpu_job_run+0xaa/0x1f0 [amdgpu]
     [ ]  drm_sched_run_job_work+0x257/0x430 [gpu_sched]
     [ ]  process_one_work+0x217/0x720
    ...
     [ ]  </TASK>

    The intent of the verification done in check_flush_dependency() is to ensure
    forward progress during memory reclaim, by flagging cases when either a
    memory reclaim process, or a memory reclaim work item is flushed from a
    context not marked as memory reclaim safe.

    This is correct when flushing, but when called from the
    cancel(_delayed)_work_sync() paths it is a false positive because work is
    either already running, or will not be running at all. Therefore
    cancelling it is safe and we can relax the warning criteria by letting the
    helper know of the calling context.

    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
    Fixes: fca839c00a ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
    References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Matthew Brost <matthew.brost@intel.com>
    Cc: <stable@vger.kernel.org> # v4.5+
    Signed-off-by: Tejun Heo <tj@kernel.org>
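
The relaxed check can be sketched in a small userspace model. The predicate name and its boolean arguments below are illustrative; upstream, check_flush_dependency() gains knowledge of whether it was reached from a cancel path:

```c
#include <stdbool.h>

/* Simplified model of the dependency check: a warning is only justified
 * when a WQ_MEM_RECLAIM context *flushes* (i.e. waits on) work owned by
 * a !WQ_MEM_RECLAIM workqueue, since that can stall memory reclaim.
 * From the cancel paths the work is either already running or will never
 * run, so no forward-progress dependency exists and we stay quiet. */
static bool flush_dependency_warns(bool caller_mem_reclaim,
                                   bool target_mem_reclaim,
                                   bool from_cancel)
{
    if (from_cancel)
        return false;   /* cancelling is always safe */
    return caller_mem_reclaim && !target_mem_reclaim;
}
```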

Signed-off-by: Waiman Long <longman@redhat.com>
2025-01-17 09:22:12 -05:00
Rado Vrbovsky 780560f78a Merge: RHEL9.6 drm backport dependencies
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5437

# Merge Request Required Information

## Summary of Changes

Depends: !5592

Depends: !5692

JIRA: https://issues.redhat.com/browse/RHEL-53569

Omitted-fix: 48ffe2074c2864ab64ee2004e7ebf3d6a6730fbf

Omitted-fix: 06e7139a034f26804904368fe4af2ceb70724756

Omitted-fix: 5278ca048d93eac74e9a81b3e672da2b2264bce4

Omitted-fix: 8dffaec34dd55473adcbc924a4c9b04aaa0d4278

Signed-off-by: Robert Foss <rfoss@redhat.com>

## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.

Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2025-01-06 08:26:14 +00:00
Robert Foss 85c4f6b0ed workqueue: Don't call va_start / va_end twice
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit 9b59a85a84dc37ca4f2c54df5e06aff4c1eae5d3
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Tue Aug 20 12:38:08 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 20 09:38:39 2024 -1000

    Calling va_start / va_end multiple times is undefined and causes
    problems with certain compiler / platforms.

    Change alloc_ordered_workqueue_lockdep_map to a macro and update
    __alloc_workqueue to take a va_list argument.

    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
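
The safe pattern the patch adopts is standard C: call va_start/va_end exactly once in the variadic entry point and pass the started va_list down to the internal function. A minimal userspace sketch (function names are illustrative):

```c
#include <stdarg.h>
#include <stdio.h>

/* Internal worker takes an already-started va_list; it never calls
 * va_start/va_end itself, so the list is started and ended exactly
 * once, in the variadic wrapper below. */
static int format_name(char *buf, size_t len, const char *fmt, va_list args)
{
    return vsnprintf(buf, len, fmt, args);
}

/* Variadic entry point: the single va_start/va_end pair lives here.
 * Calling va_start twice on the same list without an intervening
 * va_end is undefined behavior. */
static int make_name(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = format_name(buf, len, fmt, args);
    va_end(args);
    return n;
}
```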

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:24 +01:00
Robert Foss e29258cd10 workqueue: Add interface for user-defined workqueue lockdep map
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit ec0a7d44b358afaaf52856d03c72e20587bc888b
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:25 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:51 2024 -1000

    Add an interface for a user-defined workqueue lockdep map, which is
    helpful when multiple workqueues are created for the same purpose. This
    also helps avoid leaking lockdep maps on each workqueue creation.

    v2:
     - Add alloc_workqueue_lockdep_map (Tejun)
    v3:
     - Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
     - static inline alloc_ordered_workqueue_lockdep_map (Tejun)

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:24 +01:00
Robert Foss c2a6c84a1c workqueue: Change workqueue lockdep map to pointer
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit 4f022f430e21e456893283036bc2ea78ac6bd2a1
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:24 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:40 2024 -1000

    Will help enable user-defined lockdep maps for workqueues.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:23 +01:00
Robert Foss 461cd6afdd workqueue: Split alloc_workqueue into internal function and lockdep init
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

Conflicts: Does not apply cleanly due to many unrelated changes being introduced
	   in these functions, but the changes introduced in this patch are
	   simple and well contained.

	   Relevant prior changes to this commit can be found with git blame
	   but will not be included here due to being wide-reaching changes.

	   git blame linux/master kernel/workqueue.c

        kernel/workqueue.c

commit b188c57af2b5c17a1e8f71a0358f330446a4f788
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:23 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:28 2024 -1000

    Will help enable user-defined lockdep maps for workqueues.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:23 +01:00
Bastien Nocera 562729d980 workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 38f7e14519d39cf524ddc02d4caee9b337dad703
Author: Will Deacon <will@kernel.org>
Date:   Tue Jul 30 12:44:31 2024 +0100

    workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()

    UBSAN reports the following 'subtraction overflow' error when booting
    in a virtual machine on Android:

     | Internal error: UBSAN: integer subtraction overflow: 00000000f2005515 [#1] PREEMPT SMP
     | Modules linked in:
     | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-00006-g3cbe9e5abd46-dirty #4
     | Hardware name: linux,dummy-virt (DT)
     | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
     | pc : cancel_delayed_work+0x34/0x44
     | lr : cancel_delayed_work+0x2c/0x44
     | sp : ffff80008002ba60
     | x29: ffff80008002ba60 x28: 0000000000000000 x27: 0000000000000000
     | x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
     | x23: 0000000000000000 x22: 0000000000000000 x21: ffff1f65014cd3c0
     | x20: ffffc0e84c9d0da0 x19: ffffc0e84cab3558 x18: ffff800080009058
     | x17: 00000000247ee1f8 x16: 00000000247ee1f8 x15: 00000000bdcb279d
     | x14: 0000000000000001 x13: 0000000000000075 x12: 00000a0000000000
     | x11: ffff1f6501499018 x10: 00984901651fffff x9 : ffff5e7cc35af000
     | x8 : 0000000000000001 x7 : 3d4d455453595342 x6 : 000000004e514553
     | x5 : ffff1f6501499265 x4 : ffff1f650ff60b10 x3 : 0000000000000620
     | x2 : ffff80008002ba78 x1 : 0000000000000000 x0 : 0000000000000000
     | Call trace:
     |  cancel_delayed_work+0x34/0x44
     |  deferred_probe_extend_timeout+0x20/0x70
     |  driver_register+0xa8/0x110
     |  __platform_driver_register+0x28/0x3c
     |  syscon_init+0x24/0x38
     |  do_one_initcall+0xe4/0x338
     |  do_initcall_level+0xac/0x178
     |  do_initcalls+0x5c/0xa0
     |  do_basic_setup+0x20/0x30
     |  kernel_init_freeable+0x8c/0xf8
     |  kernel_init+0x28/0x1b4
     |  ret_from_fork+0x10/0x20
     | Code: f9000fbf 97fffa2f 39400268 37100048 (d42aa2a0)
     | ---[ end trace 0000000000000000 ]---
     | Kernel panic - not syncing: UBSAN: integer subtraction overflow: Fatal exception

    This is due to shift_and_mask() using a signed immediate to construct
    the mask and being called with a shift of 31 (WORK_OFFQ_POOL_SHIFT) so
    that it ends up decrementing from INT_MIN.

    Use an unsigned constant '1U' to generate the mask in shift_and_mask().

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Fixes: 1211f3b21c2a ("workqueue: Preserve OFFQ bits in cancel[_sync] paths")
    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>
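
The fixed helper can be reconstructed roughly as follows (signature approximated from the commit text). With a signed `1`, `(1 << 31) - 1` overflows a signed int from INT_MIN; with `1U` the subtraction wraps in well-defined unsigned arithmetic:

```c
/* Reconstruction of the fixed mask helper: the constant is unsigned,
 * so ((1U << 31) - 1) evaluates to 0x7fffffff instead of tripping
 * UBSAN on signed integer overflow. */
static unsigned long shift_and_mask(unsigned long v, unsigned int shift,
                                    unsigned int bits)
{
    return (v >> shift) & ((1U << bits) - 1);
}
```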

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:24 +01:00
Bastien Nocera 4a8bc1a1a6 workqueue: Implement disable/enable for (delayed) work items
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 86898fa6b8cd942505860556f3a0bf52eae57fe8
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 25 07:21:03 2024 -1000

    workqueue: Implement disable/enable for (delayed) work items

    While (delayed) work items could be flushed and canceled, there was no way
    to prevent them from being queued in the future. While this didn't lead to
    functional deficiencies, it sometimes required a bit more effort from the
    workqueue users to e.g. sequence shutdown steps with more care.

    Workqueue is currently in the process of replacing tasklet which does
    support disabling and enabling. The feature is used relatively widely to,
    for example, temporarily suppress main path while a control plane operation
    (reset or config change) is in progress.

    To enable easy conversion of tasklet users and as it seems like an
    inherently useful feature, this patch implements disabling and enabling of
    work items.

    - A work item carries 16bit disable count in work->data while not queued.
      The access to the count is synchronized by the PENDING bit like all other
      parts of work->data.

    - If the count is non-zero, the work item cannot be queued. Any attempt to
      queue the work item fails and returns %false.

    - disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and
      enable_delayed_work() are added.

    v3: enable_work() was using local_irq_enable() instead of
        local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is
        awkward now and will become incorrect as enable_work() will later be
        used from IRQ context too. (Lai)

    v2: Lai noticed that queue_work_node() wasn't checking the disable count.
        Fixed. queue_rcu_work() is updated to trigger warning if the inner work
        item is disabled.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
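
A userspace model of the disable-count semantics described above (the struct and function names are illustrative, not the kernel API; in the kernel the 16-bit count lives in work->data under the PENDING bit):

```c
#include <stdbool.h>
#include <assert.h>

/* Toy work item: a disable count gates queueing. */
struct work_model {
    unsigned short disable_count;   /* 16-bit count, as in the commit */
    bool queued;
};

/* Queueing fails and returns false while the count is non-zero. */
static bool queue_work_model(struct work_model *w)
{
    if (w->disable_count)
        return false;
    w->queued = true;
    return true;
}

/* Disabling bumps the count; a pending item is modeled as cancelled. */
static void disable_work_model(struct work_model *w)
{
    w->disable_count++;
    w->queued = false;
}

static void enable_work_model(struct work_model *w)
{
    assert(w->disable_count > 0);
    w->disable_count--;
}
```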

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:24 +01:00
Bastien Nocera 85215bab77 workqueue: Preserve OFFQ bits in cancel[_sync] paths
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 1211f3b21c2aa0d22d8d7f050e3a5930a91cd0e4
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 25 07:21:02 2024 -1000

    workqueue: Preserve OFFQ bits in cancel[_sync] paths

    The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and
    manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit
    values except for the pool ID are statically known and don't preserve them,
    which is not wrong in the current code as the pool ID and CANCELING are the
    only information carried. However, the planned disable/enable support will
    add more fields and need them to be preserved.

    This patch updates work data handling so that only the bits which need
    updating are updated.

    - struct work_offq_data is added along with work_offqd_unpack() and
      work_offqd_pack_flags() to help manipulating multiple fields contained in
      work->data. Note that the helpers look a bit silly right now as there
      isn't that much to pack. The next patch will add more.

    - mark_work_canceling() which is used only by __cancel_work_sync() is
      replaced by open-coded usage of work_offq_data and
      set_work_pool_and_keep_pending() in __cancel_work_sync().

    - __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits
      when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end.

    - This removes all users of get_work_pool_id() which is dropped. Note that
      get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ
      cases; however, it was only being called after try_to_grab_pending()
      succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe
      to use work_offqd_unpack() instead.

    No behavior changes intended.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
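
The unpack/pack helpers can be modeled in userspace. The field widths below are illustrative, not the kernel's actual WORK_OFFQ_* layout; the point is that modifying the flags and repacking preserves the pool ID rather than assuming the rest of the word is statically known:

```c
/* Toy off-queue word: pool ID in the high bits, flag bits below. */
#define OFFQ_FLAG_BITS 4UL
#define OFFQ_FLAG_MASK ((1UL << OFFQ_FLAG_BITS) - 1)

struct work_offq_data {
    unsigned long pool_id;
    unsigned long flags;
};

static void work_offqd_unpack(struct work_offq_data *offqd, unsigned long data)
{
    offqd->flags = data & OFFQ_FLAG_MASK;
    offqd->pool_id = data >> OFFQ_FLAG_BITS;
}

static unsigned long work_offqd_pack(const struct work_offq_data *offqd)
{
    return (offqd->pool_id << OFFQ_FLAG_BITS) | (offqd->flags & OFFQ_FLAG_MASK);
}
```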

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:23 +01:00
Rado Vrbovsky 5cb9527389 Merge: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5346

JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5346

The 2nd patch fixes the CVE. The first patch is included as it is
related to the 2nd patch.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Audra Mitchell <aubaker@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-12-09 08:21:12 +00:00
Waiman Long 8ec893ae89 workqueue: Improve scalability of workqueue watchdog touch
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839

commit 98f887f820c993e05a12e8aa816c80b8661d4c87
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Tue, 25 Jun 2024 21:42:45 +1000

    workqueue: Improve scalability of workqueue watchdog touch

    On a ~2000 CPU powerpc system, hard lockups have been observed in the
    workqueue code when stop_machine runs (in this case due to CPU hotplug).
    This is due to lots of CPUs spinning in multi_cpu_stop, calling
    touch_nmi_watchdog() which ends up calling wq_watchdog_touch().
    wq_watchdog_touch() writes to the global variable wq_watchdog_touched,
    and that can find itself in the same cacheline as other important
    workqueue data, which slows down operations to the point of lockups.

    In the case of the following abridged trace, worker_pool_idr was in
    the hot line, causing the lockups to always appear at idr_find.

      watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find
      Call Trace:
      get_work_pool
      __queue_work
      call_timer_fn
      run_timer_softirq
      __do_softirq
      do_softirq_own_stack
      irq_exit
      timer_interrupt
      decrementer_common_virt
      * interrupt: 900 (timer) at multi_cpu_stop
      multi_cpu_stop
      cpu_stopper_thread
      smpboot_thread_fn
      kthread

    Fix this by having wq_watchdog_touch() only write to the line if the
    time since the last recorded touch exceeds 1/4 of the watchdog threshold.

    Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>
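
The fix turns most touches into reads of the shared cacheline instead of writes. A minimal model of the throttling (time and threshold passed explicitly for illustration; upstream works on jiffies and wq_watchdog_thresh):

```c
/* Last recorded touch, in ticks. Stands in for wq_watchdog_touched,
 * which shares a hot cacheline with other workqueue data. */
static unsigned long wq_watchdog_touched;

/* Only dirty the line when the recorded stamp has aged past a quarter
 * of the watchdog threshold; otherwise the stamp is fresh enough. */
static void wq_watchdog_touch_model(unsigned long now, unsigned long thresh)
{
    if (now - wq_watchdog_touched >= thresh / 4)
        wq_watchdog_touched = now;
}
```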

Signed-off-by: Waiman Long <longman@redhat.com>
2024-10-17 08:58:24 -04:00
Waiman Long b0d5b181c0 workqueue: wq_watchdog_touch is always called with valid CPU
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839

commit 18e24deb1cc92f2068ce7434a94233741fbd7771
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Tue, 25 Jun 2024 21:42:44 +1000

    workqueue: wq_watchdog_touch is always called with valid CPU

    Warn in the case it is called with cpu == -1. This does not appear
    to happen anywhere.

    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-10-17 08:58:23 -04:00
Phil Auld 88d1c5d2ed sched/balancing: Rename scheduler_tick() => sched_tick()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL, context
diffs in sched-domains.rst. Skipped hunk in func_set_ftrace_file.tc
due to not having 6fec1ab67f8 ("selftests/ftrace: Do not
trace do_softirq because of PREEMPT_RT") in tree.

commit 86dd6c04ef9f213e14d60c9f64bce1cc019f816e
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:08 2024 +0100

    sched/balancing: Rename scheduler_tick() => sched_tick()

    - Standardize on prefixing scheduler-internal functions defined
      in <linux/sched.h> with sched_*() prefix. scheduler_tick() was
      the only function using the scheduler_ prefix. Harmonize it.

    - The other reason to rename it is the NOHZ scheduler tick
      handling functions are already named sched_tick_*().
      Make the 'git grep sched_tick' more meaningful.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Waiman Long baa0a3b48c workqueue: Always queue work items to the newest PWQ for order workqueues
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 58629d4871e8eb2c385b16a73a8451669db59f39
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Wed, 3 Jul 2024 17:27:41 +0800

    workqueue: Always queue work items to the newest PWQ for order workqueues

    To ensure non-reentrancy, __queue_work() attempts to enqueue a work
    item to the pool of the currently executing worker. This is not only
    unnecessary for an ordered workqueue, where order inherently suggests
    non-reentrancy, but it could also disrupt the sequence if the item is
    not enqueued on the newest PWQ.

    Just queue it to the newest PWQ and let order management guarantee
    non-reentrancy.

    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
    Cc: stable@vger.kernel.org # v6.9+
    Signed-off-by: Tejun Heo <tj@kernel.org>
    (cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:18 -04:00
Waiman Long 3adbdc1e6a workqueue: Update cpumasks after only applying it successfully
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 841658832335a32dd86f4e4d3aab7d14188b268b
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Tue, 2 Jul 2024 12:14:55 +0800

    workqueue: Update cpumasks after only applying it successfully

    Make workqueue_unbound_exclude_cpumask() and workqueue_set_unbound_cpumask()
    only update wq_isolated_cpumask and wq_requested_unbound_cpumask when
    workqueue_apply_unbound_cpumask() returns successfully.

    Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
    Cc: Waiman Long <longman@redhat.com>
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
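
The commit-on-success pattern the fix applies can be sketched as follows (cpumasks modeled as plain words, and all names illustrative): the cached masks are only updated after the apply step reports success, so a failed apply leaves them consistent with the configuration actually in effect.

```c
/* Cached masks; stand-ins for wq_requested_unbound_cpumask and
 * wq_isolated_cpumask. */
static unsigned long requested_mask, isolated_mask;

/* Stand-in for the apply step; fails when the effective mask is empty. */
static int apply_mask(unsigned long effective)
{
    return effective ? 0 : -1;
}

static int set_unbound_mask(unsigned long req, unsigned long isol)
{
    int ret = apply_mask(req & ~isol);

    if (ret == 0) {         /* commit the caches only on success */
        requested_mask = req;
        isolated_mask = isol;
    }
    return ret;
}
```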

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:18 -04:00
Waiman Long beb7c33dd4 workqueue: Cleanup subsys attribute registration
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 79202591a55a365251496162ced3004a0a1fa1cf
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Thu, 7 Mar 2024 21:39:32 -0800

    workqueue: Cleanup subsys attribute registration

    While reviewing users of subsys_virtual_register() I noticed that
    wq_sysfs_init() ignores the @groups argument. This looks like a
    historical artifact as the original wq_subsys only had one attribute to
    register.

    On the way to building up an @groups argument to pass to
    subsys_virtual_register() a few more cleanups fell out:

    * Use DEVICE_ATTR_RO() and DEVICE_ATTR_RW() for
      cpumask_{isolated,requested} and cpumask respectively. Rename the
      @show and @store methods accordingly.

    * Co-locate the attribute definition with the methods. This required
      moving wq_unbound_cpumask_show down next to wq_unbound_cpumask_store
      (renamed to cpumask_show() and cpumask_store())

    * Use ATTRIBUTE_GROUPS() to skip some boilerplate declarations

    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:17 -04:00
Waiman Long 8dfa13be90 workqueue: Fix divide error in wq_update_node_max_active()
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 91f098704c25106d88706fc9f8bcfce01fdb97df
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Wed, 24 Apr 2024 21:51:54 +0800

    workqueue: Fix divide error in wq_update_node_max_active()

    Yue Sun and xingwei lee reported a divide error bug in
    wq_update_node_max_active():

    divide error: 0000 [#1] PREEMPT SMP KASAN PTI
    CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
    RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605
    Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41
    81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7
    fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89
    RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293
    RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000
    RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72
    R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000
    R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff
    FS:  0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
     <TASK>
     workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525
     cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194
     cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092
     smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164
     kthread+0x2ed/0x390 kernel/kthread.c:388
     ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244
     </TASK>
    Modules linked in:
    ---[ end trace 0000000000000000 ]---

    After analysis, it happens when all of the CPUs in a workqueue's affinity
    get offline.

    The problem can be easily reproduced by:

     # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask
     # echo 0 > /sys/devices/system/cpu/cpu3/online

    To fix the problem, use the default max_active for nodes when all of the
    CPUs in the workqueue's affinity get offline.

    Reported-by: Yue Sun <samsun1006219@gmail.com>
    Reported-by: xingwei lee <xrivendell7@gmail.com>
    Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/
    Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
    Cc: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
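
The guard can be sketched as below. The real computation in wq_update_node_max_active() is more involved; the names and the proportional formula here are illustrative. The point is that the divisor, the number of the workqueue's affinity CPUs still online, can reach zero under hotplug:

```c
/* Compute a node's share of max_active; fall back to a default when
 * every CPU in the workqueue's affinity has gone offline, instead of
 * dividing by zero. */
static int node_max_active(int max_active, int node_cpus,
                           int online_affinity_cpus, int default_max)
{
    if (online_affinity_cpus == 0)
        return default_max;
    return max_active * node_cpus / online_affinity_cpus;
}
```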

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:17 -04:00
Waiman Long 1d6310526e workqueue: The default node_nr_active should have its max set to max_active
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit d40f92020c7a225b77e68599e4b099a4a0823408
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 22 Apr 2024 14:43:48 -1000

    workqueue: The default node_nr_active should have its max set to max_active

    The default nna (node_nr_active) is used when the pool isn't tied to a
    specific NUMA node. This can happen in the following cases:

     1. On NUMA, if per-node pwq init failure and the fallback pwq is used.
     2. On NUMA, if a pool is configured to span multiple nodes.
     3. On single node setups.

    5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for
    unbound workqueues") set the default nna->max to min_active because only #1
    was being considered. For #2 and #3, using min_active means that the max
    concurrency in normal operation is pushed down to min_active which is
    currently 8, which can obviously lead to performance issues.

    The exact value nna->max is set to doesn't really matter. #2 can only happen if
    the workqueue is intentionally configured to ignore NUMA boundaries and
    there's no good way to distribute max_active in this case. #3 is the default
    behavior on single node machines.

    Let's set the default nna->max to max_active. This fixes the artificially
    lowered concurrency problem on single node machines and shouldn't hurt
    anything for other cases.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
    Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:16 -04:00
Waiman Long 91dac9de45 workqueue: Fix selection of wake_cpu in kick_pool()
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 57a01eafdcf78f6da34fad9ff075ed5dfdd9f420
Author: Sven Schnelle <svens@linux.ibm.com>
Date:   Tue, 23 Apr 2024 08:19:05 +0200

    workqueue: Fix selection of wake_cpu in kick_pool()

    With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
    kernel oops was observed:

    smp: Bringing up secondary CPUs ...
    smp: Brought up 1 node, 8 CPUs
    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000803
    [..]
     Call Trace:
    arch_vcpu_is_preempted+0x12/0x80
    select_idle_sibling+0x42/0x560
    select_task_rq_fair+0x29a/0x3b0
    try_to_wake_up+0x38e/0x6e0
    kick_pool+0xa4/0x198
    __queue_work.part.0+0x2bc/0x3a8
    call_timer_fn+0x36/0x160
    __run_timers+0x1e2/0x328
    __run_timer_base+0x5a/0x88
    run_timer_softirq+0x40/0x78
    __do_softirq+0x118/0x388
    irq_exit_rcu+0xc0/0xd8
    do_ext_irq+0xae/0x168
    ext_int_handler+0xbe/0xf0
    psw_idle_exit+0x0/0xc
    default_idle_call+0x3c/0x110
    do_idle+0xd4/0x158
    cpu_startup_entry+0x40/0x48
    rest_init+0xc6/0xc8
    start_kernel+0x3c4/0x5e0
    startup_continue+0x3c/0x50

    The crash is caused by calling arch_vcpu_is_preempted() for an offline
    CPU. To avoid this, select the cpu with cpumask_any_and_distribute()
    to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in
    the pool, skip the assignment.

    tj: This doesn't fully fix the bug as CPUs can still go down between picking
    the target CPU and the wake call. Fixing that likely requires adding
    cpu_online() test to either the sched or s390 arch code. However, regardless
    of how that is fixed, workqueue shouldn't be picking a CPU which isn't
    online as that would result in unpredictable and worse behavior.

    Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
    Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
    Cc: stable@vger.kernel.org # v6.6+
    Signed-off-by: Tejun Heo <tj@kernel.org>
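
The selection fix amounts to intersecting the pool's pod mask with the online mask before picking, and skipping the assignment when the intersection is empty. A sketch with plain 64-bit words standing in for cpumasks (upstream uses cpumask_any_and_distribute()):

```c
#include <stdint.h>

/* Returns a CPU set in (pod & online), or -1 if none is online, in
 * which case the caller leaves wake_cpu untouched. */
static int pick_wake_cpu(uint64_t pod_mask, uint64_t online_mask)
{
    uint64_t usable = pod_mask & online_mask;
    int cpu;

    if (!usable)
        return -1;
    for (cpu = 0; cpu < 64; cpu++)
        if (usable & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}
```

With cpu_possible_mask=0-63 but cpu_online_mask=0-7 as in the oops above, this never hands back an offline CPU.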

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:16 -04:00
Waiman Long efd17b3bb7 workqueue: Drain BH work items on hot-unplugged CPUs
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 1acd92d95fa24edca8f0292b21870025da93e24f
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 26 Feb 2024 15:38:55 -1000

    workqueue: Drain BH work items on hot-unplugged CPUs

    Boqun pointed out that workqueues aren't handling BH work items on offlined
    CPUs. Unlike tasklet which transfers out the pending tasks from
    CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is
    problematic. Note that this behavior is specific to BH workqueues as the
    non-BH per-CPU workers just become unbound when the CPU goes offline.

    This patch fixes the issue by draining the pending BH work items from an
    offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
    it's not as easy to transfer the pending work items from one pool to
    another. Instead, run BH work items which execute the offlined pools on an
    online CPU.

    Note that this assumes that no further BH work items will be queued on the
    offlined CPUs. This assumption is shared with tasklet and should be fine for
    conversions. However, this issue also exists for per-CPU workqueues which
    will just keep executing work items queued after CPU offline on unbound
    workers and workqueue should reject per-CPU and BH work items queued on
    offline CPUs. This will be addressed separately later.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-reviewed-by: Boqun Feng <boqun.feng@gmail.com>
    Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long c26c31e2b9 workqueue: Control intensive warning threshold through cmdline
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit ccdec92198df0c91f45a68f971771b6b0c1ba02d
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Thu, 22 Feb 2024 15:28:08 +0800

    workqueue: Control intensive warning threshold through cmdline

    When CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel reports work
    functions that repeatedly violate intensive_threshold_us. Currently,
    the warning is only triggered once the violation count exceeds 4 and
    is a power of 2.

    However, even a single long work execution may delay other work for a
    long time, which can also cause problems.

    To let users freely control the warning threshold, add a boot argument
    so that the threshold at which warnings are printed can be configured.
    The exponential backoff is kept to prevent excessive reporting.

    By default, the warning threshold is 4.

    tj: Updated kernel-parameters.txt description.

    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long e58ec3ad16 workqueue: Make @flags handling consistent across set_work_data() and friends
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bccdc1faafaf32e00d6e4dddca1ded64e3272189
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:15 -1000

    workqueue: Make @flags handling consistent across set_work_data() and friends

    - set_work_data() takes a separate @flags argument but just ORs it to @data.
      This is more confusing than helpful. Just take @data.

    - Use the name @flags consistently and add the parameter to
      set_work_pool_and_{keep|clear}_pending(). This will be used by the planned
      disable/enable support.

    No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long 89f1d097be workqueue: Remove clear_work_data()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit afe928c1dc611bec155d834020e0631e026aeb8a
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Remove clear_work_data()

    clear_work_data() is only used in one place and immediately followed by
    smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/
    WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long 056e8351f0 workqueue: Factor out work_grab_pending() from __cancel_work_sync()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 978b8409eab15aa733ae3a79c9b5158d34cd3fb7
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Factor out work_grab_pending() from __cancel_work_sync()

    The planned disable/enable support will need the same logic. Let's factor it
    out. No functional changes.

    v2: Update function comment to include @irq_flags.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long c1daf24536 workqueue: Clean up enum work_bits and related constants
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff in include/linux/workqueue.h due to missing
	   upstream commit b2fa8443db32 ("workqueue: Split out
	   workqueue_types.h").

commit e9a8e01f9b133c145dd125021ec47c006d108af4
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Clean up enum work_bits and related constants

    The bits of work->data are used for a few different purposes. How the bits
    are used is determined by enum work_bits. The planned disable/enable support
    will add another use, so let's clean it up a bit in preparation.

    - Let WORK_STRUCT_*_BIT's values be determined by enum definition order.

    - Delimit the different bit sections the same way using SHIFT and BITS
      values.

    - Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.

    - Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and
      WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity.

    - Improve documentation.

    No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 2f6aa88b75 workqueue: Introduce work_cancel_flags
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5f5b9422a49e9bc1c2f992135592ed921ac18e5
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Introduce work_cancel_flags

    The cancel path used bool @is_dwork to distinguish canceling a regular work
    and a delayed one. The planned disable/enable support will need passing
    around another flag in the code path. As passing them around with bools will
    be confusing, let's introduce named flags to pass around in the cancel path.

    WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long b483553d36 workqueue: Use variable name irq_flags for saving local irq flags
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c26e2f2e2fcfb73996fa025a0d3b5695017d65b5
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Use variable name irq_flags for saving local irq flags

    Using the generic term `flags` for irq flags is conventional but can be
    confusing as there's quite a bit of code dealing with work flags which
    involves some subtleties. Let's use a more explicit name `irq_flags` for
    local irq flags. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long e6019eb8f1 workqueue: Reorganize flush and cancel[_sync] functions
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cdc6e4b329bc82676886a758a940b2b6987c2109
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Reorganize flush and cancel[_sync] functions

    They are currently a bit disorganized with flush and cancel functions mixed.
    Reorganize them so that flush functions come first, cancel next and
    cancel_sync last. This way, we won't have to add prototypes for internal
    functions for the planned disable/enable support.

    This is pure code reorganization. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long c92ee4a73c workqueue: Rename __cancel_work_timer() to __cancel_work_sync()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5140688d19a4579f7b01e6ca4b6e5f5d23d3d4d
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Rename __cancel_work_timer() to __cancel_work_sync()

    __cancel_work_timer() is used to implement cancel_work_sync() and
    cancel_delayed_work_sync(), similarly to how __cancel_work() is used to
    implement cancel_work() and cancel_delayed_work(). That is, the _timer part of
    the name is a complete misnomer. The difference from __cancel_work() is the
    fact that it syncs against work item execution not whether it handles timers
    or not.

    Let's rename it to less confusing __cancel_work_sync(). No functional
    change.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 6842b59700 workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d355001fa9370df8fdd6fca0e9ed77063615c7da
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:13 -1000

    workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()

    The different flavors of RCU read critical sections have been unified. Let's
    update the locking assertion macros accordingly to avoid requiring
    unnecessary explicit rcu_read_[un]lock() calls.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 3159667c3e workqueue: Cosmetic changes
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c7a40c49af920fbad2ab6795b6587308ad69de9f
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:13 -1000

    workqueue: Cosmetic changes

    Reorder some global declarations and adjust comments and whitespaces for
    clarity and consistency. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 889c30a2b9 workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit fd0a68a2337b79a7bd4dad5e7d9dc726828527af
Author: Tejun Heo <tj@kernel.org>
Date:   Thu, 15 Feb 2024 19:10:01 -1000

    workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK

    2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues") added
    irq_work usage to workqueue; however, it turns out irq_work is actually
    optional and the change breaks build on configuration which doesn't have
    CONFIG_IRQ_WORK enabled.

    Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
    CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
    be better to just always enable it. However, this still saves a small bit of
    memory for tiny UP configs and also the least amount of change, so, for now,
    let's keep it conditional.

    Verified to do the right thing for x86_64 allnoconfig and defconfig, and
    aarch64 allnoconfig, allnoconfig + printk disabled (SMP but nothing selects
    IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
    IRQ_WORK.

    v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
        selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
        instead.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Tested-by: Anders Roxell <anders.roxell@linaro.org>
    Fixes: 2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues")
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 6bff994865 workqueue: Fix queue_work_on() with BH workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 2f34d7337d98f3eae7bd3d1270efaf9d8a17cfc6
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 14 Feb 2024 08:33:55 -1000

    workqueue: Fix queue_work_on() with BH workqueues

    When queue_work_on() is used to queue a BH work item on a remote CPU, the
    work item is queued on that CPU but kick_pool() raises softirq on the local
    CPU. This leads to stalls as the work item won't be executed until something
    else on the remote CPU schedules a BH work item or tasklet locally.

    Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long e897cf4939 workqueue: Implement workqueue_set_min_active()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8f172181f24bb5df7675225d9b5b66d059613f50
Author: Tejun Heo <tj@kernel.org>
Date:   Thu, 8 Feb 2024 14:11:56 -1000

    workqueue: Implement workqueue_set_min_active()

    Since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
    for unbound workqueues"), unbound workqueues have separate min_active which
    sets the number of interdependent work items that can be handled. This value
    is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high
    enough for some users, let's add an interface to adjust the setting.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long a89bd6c4c6 workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 516d3dc99f4f2ab856d879696cd3a5d7f6db7796
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 9 Feb 2024 12:06:11 -0500

    workqueue: Fix kernel-doc comment of unplug_oldest_pwq()

    Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
    proper processing and formatting of the embedded ASCII diagram.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 3d3fe163af workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 49584bb8ddbe8bcfc276c2d7dd3c8890f45f5970
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:14 -0500

    workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask

    Commit 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound
    cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
    an unbound workqueue to the cpumask in wq->unbound_attrs. However
    the unbound_attrs->cpumask of every workqueue is initialized to
    cpu_possible_mask and is only changed if the workqueue has the WQ_SYSFS
    flag, which exposes a cpumask sysfs file for users to write. So this
    patch doesn't achieve what it is intended to do.

    If an unbound workqueue is created after wq_unbound_cpumask is modified
    and there is no more unbound cpumask update after that, the unbound
    rescuer will be bound to all CPUs unless the workqueue is created
    with the WQ_SYSFS flag and a user explicitly modified its cpumask
    sysfs file.  Fix this problem by binding directly to wq_unbound_cpumask
    in init_rescuer().

    Fixes: 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 950ef7b192 kernel/workqueue: Let rescuers follow unbound wq cpumask changes
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d64f2fa064f8866802e23c8ec95d9d1f601480ee
Author: Juri Lelli <juri.lelli@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:13 -0500

    kernel/workqueue: Let rescuers follow unbound wq cpumask changes

    When workqueue cpumask changes are committed the associated rescuer (if
    one exists) affinity is not touched and this might be a problem down the
    line for isolated setups.

    Make sure rescuers affinity is updated every time a workqueue cpumask
    changes, so that rescuers can't break isolation.

     [longman: set_cpus_allowed_ptr() will block until the designated task
      is enqueued on an allowed CPU, no wake_up_process() needed. Also use
      the unbound_effective_cpumask() helper as suggested by Tejun.]

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 9e9a9764ad workqueue: Enable unbound cpumask update on ordered workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4c065dbce1e8639546ef3612acffb062dd084cfe
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 14:12:20 -0500

    workqueue: Enable unbound cpumask update on ordered workqueues

    Ordered workqueues do not currently follow changes made to the
    global unbound cpumask because per-pool workqueue changes may break
    the ordering guarantee. IOW, a work function in an ordered workqueue
    may run on an isolated CPU.

    This patch enables ordered workqueues to follow changes made to the
    global unbound cpumask by temporarily plugging (suspending) the newly
    allocated pool_workqueue, preventing it from executing newly queued
    work items until the old pwq has been properly drained. For ordered
    workqueues, there should only be one unplugged pwq; the rest should
    be plugged.

    This enables ordered workqueues to follow the unbound cpumask changes
    like other unbound workqueues at the expense of some delay in execution
    of work functions during the transition period.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 343954323b workqueue: Link pwq's into wq->pwqs from oldest to newest
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 26fb7e3dda4c16e2cfe2164a1e7315a9386602db
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:11 -0500

    workqueue: Link pwq's into wq->pwqs from oldest to newest

    Add a new pwq into the tail of wq->pwqs so that pwq iteration will
    start from the oldest pwq to the newest. This ordering will facilitate
    the inclusion of ordered workqueues in a wq_unbound_cpumask update.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long deebfc6ab4 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 3bc1e711c26bff01d41ad71145ecb8dcb4412576
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 14:19:10 -1000

    workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered

    5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
    automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
    workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
    to create ordered workqueues and the new NUMA support broke it. These
    problems can be subtle and the fact that they can only trigger on NUMA
    machines made them even more difficult to debug.

    However, overloading the UNBOUND allocation interface this way creates other
    issues. It's difficult to tell whether a given workqueue actually needs to
    be ordered and users that legitimately want a min concurrency level wq
    unexpectedly get an ordered one instead. With planned UNBOUND workqueue
    updates to improve execution locality and the increasing prevalence of chiplet designs
    which can benefit from such improvements, this isn't a state we wanna be in
    forever.

    There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
    preceding patches audited all and converted them to
    alloc_ordered_workqueue() as appropriate. This patch removes the implicit
    promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.

    v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
        apply_workqueue_attrs_locked() which spuriously triggers WARNING and
        fails workqueue creation. Fix it.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 4c4d4b9049 workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to upstream conflict as shown in merge
	   commit 40911d4457f2 ("Merge branch 'for-6.8-fixes' into
	   for-6.9").

commit 8eb17dc1a6b5db7e89681f59285242af8d182f95
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 3 Feb 2024 10:43:30 -0500

    workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask

    Skip updating workqueues with __WQ_DESTROYING bit set when updating
    global unbound cpumask to avoid unnecessary work and other complications.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long c84568af70 workqueue: fix a typo in comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 96068b6030391082bf0cd97af525d731afa5ad63
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date:   Mon, 5 Feb 2024 08:31:52 +0800

    workqueue: fix a typo in comment

    There should be three, fix it.

    Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 91c025a7a0 Revert "workqueue: make wq_subsys const"
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4f19b8e01e2fb6c97d4307abb7bde4d34a1e601e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 07:18:08 -1000

    Revert "workqueue: make wq_subsys const"

    This reverts commit d412ace11144aa2bf692c7cf9778351efc15c827. This leads to
    build failures as it depends on a driver-core commit 32f78abe59c7 ("driver
    core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
    and route it through driver-core tree.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long da3eaa2838 workqueue: Implement BH workqueues to eventually replace tasklets
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in kernel/workqueue.c due to missing
	   upstream commit 68279f9c9f59 ("treewide: mark stuff as
	   __ro_after_init").

commit 4cb1ef64609f9b0254184b2947824f4b46ccab22
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Implement BH workqueues to eventually replace tasklets

    The only generic interface to execute asynchronously in the BH context is
    tasklet; however, it's marked deprecated and has some design flaws such as
    the execution code accessing the tasklet item after the execution is
    complete which can lead to subtle use-after-free in certain usage scenarios
    and less-developed flush and cancel mechanisms.

    This patch implements BH workqueues which share the same semantics and
    features of regular workqueues but execute their work items in the softirq
    context. As there is always only one BH execution context per CPU, none of
    the concurrency management mechanisms applies and a BH workqueue can be
    thought of as a convenience wrapper around softirq.

    Except for the inability to sleep while executing and lack of max_active
    adjustments, BH workqueues and work items should behave the same as regular
    workqueues and work items.

    Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
    convert all tasklet users over to BH workqueues. Once the conversion is
    complete, tasklet can be removed and BH workqueues can directly take over
    the tasklet softirqs.

    system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
    tasklet, all existing tasklet users should be able to use the system BH
    workqueues without creating their own workqueues.

    v3: - Add missing interrupt.h include.

    v2: - Instead of using tasklets, hook directly into its softirq action
          functions - tasklet[_hi]_action(). This is slightly cheaper and closer
          to the eventual code structure we want to arrive at. Suggested by Lai.

        - Lai also pointed out several places which need NULL worker->task
          handling or can use clarification. Updated.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
    Tested-by: Allen Pais <allen.lkml@gmail.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 3eb62e205f workqueue: Factor out init_cpu_worker_pool()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 2fcdb1b44491e08f5334a92c50e8f362e0d46f91
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Factor out init_cpu_worker_pool()

    Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
    reorganization in preparation of BH workqueue support.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Tested-by: Allen Pais <allen.lkml@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long d6a8a41d51 workqueue: Update lock debugging code
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c35aea39d1e106f61fd2130f0d32a3bac8bd4570
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Update lock debugging code

    These changes are in preparation of BH workqueue which will execute work
    items from BH context.

    - Update lock and RCU depth checks in process_one_work() so that it
      remembers and checks against the starting depths and prints out the depth
      changes.

    - Factor out lockdep annotations in the flush paths into
      touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
      from __flush_work() to its callee - start_flush_work(). This brings it
      closer to the wq counterpart and will allow testing the associated wq's
      flags which will be needed to support BH workqueues. This is not expected
      to cause any functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Tested-by: Allen Pais <allen.lkml@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long bc8bb2d224 workqueue: make wq_subsys const
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d412ace11144aa2bf692c7cf9778351efc15c827
Author: Ricardo B. Marliere <ricardo@marliere.net>
Date:   Sun, 4 Feb 2024 10:47:05 -0300

    workqueue: make wq_subsys const

    Now that the driver core can properly handle constant struct bus_type,
    move the wq_subsys variable to be a constant structure as well,
    placing it into read-only memory which can not be modified at runtime.

    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 1f7fc85664 workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c70e1779b73a39f7648b26bdc835304c60100ce3
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:14:21 -1000

    workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()

    dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
    item handling") relocated pwq_dec_nr_in_flight() after
    set_work_pool_and_keep_pending(). However, the latter destroys information
    contained in work->data that's needed by pwq_dec_nr_in_flight() including
    the flush color. With flush color destroyed, flush_workqueue() can stall
    easily when mixed with cancel_work*() usages.

    This is easily triggered by running xfstests generic/001 test on xfs:

         INFO: task umount:6305 blocked for more than 122 seconds.
         ...
         task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
         Call Trace:
          <TASK>
          __schedule+0x2f6/0xa20
          schedule+0x36/0xb0
          schedule_timeout+0x20b/0x280
          wait_for_completion+0x8a/0x140
          __flush_workqueue+0x11a/0x3b0
          xfs_inodegc_flush+0x24/0xf0
          xfs_unmountfs+0x14/0x180
          xfs_fs_put_super+0x3d/0x90
          generic_shutdown_super+0x7c/0x160
          kill_block_super+0x1b/0x40
          xfs_kill_sb+0x12/0x30
          deactivate_locked_super+0x35/0x90
          deactivate_super+0x42/0x50
          cleanup_mnt+0x109/0x170
          __cleanup_mnt+0x12/0x20
          task_work_run+0x60/0x90
          syscall_exit_to_user_mode+0x146/0x150
          do_syscall_64+0x5d/0x110
          entry_SYSCALL_64_after_hwframe+0x6c/0x74

    Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
    and using the stashed value for pwq_dec_nr_in_flight().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Chandan Babu R <chandanbabu@kernel.org>
    Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
    Fixes: dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 2eeb9eb5ac workqueue: Avoid premature init of wq->node_nr_active[].max
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5f8cd6c62ce02205ced15e9a998103f21ec5455
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 30 Jan 2024 19:06:43 -1000

    workqueue: Avoid premature init of wq->node_nr_active[].max

    System workqueues are allocated early during boot from
    workqueue_init_early(). While allocating unbound workqueues,
    wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
    accesses NUMA topology to initialize wq->node_nr_active[].max.

    However, topology information may not be set up at this point.
    wq_update_node_max_active() is explicitly invoked from
    workqueue_init_topology() later when topology information is known to be
    available.

    This doesn't seem to crash anything but it's doing useless work with dubious
    data. Let's skip the premature and duplicate node_max_active updates by
    initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
    wq_update_node_max_active() noop until workqueue_init_topology().

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00