JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 15930da42f8981dc42c19038042947b475b19f47
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 30 Jan 2024 18:55:55 -1000
workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:
Unable to handle kernel paging request at virtual address ffff0002100296e0
..
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
...
Call trace:
wq_update_node_max_active+0x50/0x1fc
apply_wqattrs_commit+0xf0/0x114
apply_workqueue_attrs_locked+0x58/0xa0
alloc_workqueue+0x5ac/0x774
workqueue_init_early+0x460/0x540
start_kernel+0x258/0x684
__primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Signed-off-by: Waiman Long <longman@redhat.com>
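For reference, a minimal sketch of the guard this fix adds, assuming an early
bail-out in wq_update_node_max_active(); the mask name "effective" stands in
for whatever per-workqueue cpumask the function actually consults:

  /*
   * Sketch only: never feed -1 into cpumask_test_cpu(); treat it as
   * "no CPU is going down" and leave @off_cpu alone in that case.
   */
  if (off_cpu >= 0 && !cpumask_test_cpu(off_cpu, effective))
          off_cpu = -1;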
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5797b1c18919cd9c289ded7954383e499f729ce0
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:25 -1000
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU
for per-cpu workqueues and one per NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound workqueues by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations, which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node (see the sketch below). e.g. A
node with twice as many online effective CPUs gets twice the portion of
max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When one of its pwqs wants to run a work item, it has to
obtain the matching node's nr_active count. If it is over the node's
max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work
items finish, the completion path round-robins the pending pwqs, activating
the first inactive work item of each, which involves some pool lock dancing
and kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
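To illustrate the distribution rule above, a hedged sketch of splitting
max_active per node in proportion to the workqueue's online effective CPUs,
clamped by min_active; the helper name and the exact rounding are assumptions
based on this description, not the verbatim implementation:

  static void wq_node_max_active_sketch(struct workqueue_struct *wq, int node,
                                        int node_cpus, int total_cpus)
  {
          struct wq_node_nr_active *nna = wq->node_nr_active[node];
          int max = READ_ONCE(wq->max_active);
          int min = READ_ONCE(wq->min_active);

          /* proportional share of max_active, but never below min_active */
          nna->max = clamp(DIV_ROUND_UP(max * node_cpus, total_cpus), min, max);
  }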
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 91ccc6e7233bb10a9c176aa4cc70d6f432a441a5
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create an undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structure allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have an
associated NUMA node, such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
- Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
pwq->max_active check. Restored. As the next patch replaces the
max_active enforcement mechanism, this doesn't change the end result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
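A hedged sketch of the data structure at this point in the series, based
solely on the description above (later patches extend it with a per-node max
and a list of pending pwqs):

  /* one instance per (workqueue, NUMA node) pair, plus one for the invalid node */
  struct wq_node_nr_active {
          atomic_t        nr;     /* number of active work items on the node */
  };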
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit dd6c3c5441263723305a9c52c5ccc899a4653000
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need to kick off an inactive work item in another pool.
This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().
This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9f66cff212bb3c1cd25996aaa0dfd0c9e9d8baab
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq,
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.
- Make wq->dfl_pwq RCU protected.
- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
and ->cpu_pwq. The former returns the double pointer that can be used to
access and update the pwqs. The latter performs a locking check and
dereferences the double pointer.
- pwq accesses and updates are converted to use unbound_pwq[_slot]().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
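A hedged sketch of the two accessors described above; the convention of
passing -1 to select the default pwq slot and the exact lockdep condition are
assumptions:

  static struct pool_workqueue __rcu **
  unbound_pwq_slot(struct workqueue_struct *wq, int cpu)
  {
          if (cpu >= 0)
                  return per_cpu_ptr(wq->cpu_pwq, cpu);
          else
                  return &wq->dfl_pwq;
  }

  static struct pool_workqueue *unbound_pwq(struct workqueue_struct *wq, int cpu)
  {
          return rcu_dereference_check(*unbound_pwq_slot(wq, cpu),
                                       lockdep_is_held(&wq->mutex));
  }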
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c5404d4e6df6faba1007544b5f4e62c7c14416dd
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once, activating all
work items that could be activated. While this makes sense with per-pwq
nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 1c270b79ce0b8290f146255ea9057243f6dd3c17
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added (see the sketch below). It increments
nr_active if under the max_active limit and returns a boolean indicating
whether the increment was successful. Note that the function is structured
to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
now we're calling the function unconditionally from
pwq_activate_first_inactive().
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
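A hedged sketch of pwq_tryinc_nr_active() before the shared unbound nr_active
handling lands, following the description above:

  static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq)
  {
          struct workqueue_struct *wq = pwq->wq;

          lockdep_assert_held(&pwq->pool->lock);

          /* only take an nr_active slot if we're still under the limit */
          if (pwq->nr_active >= READ_ONCE(wq->max_active))
                  return false;

          pwq->nr_active++;
          return true;
  }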
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4c6380305d21e36581b451f7337a36c93b64e050
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
test. This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit afa87ce85379e2d93863fce595afdb5771a84004
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
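Given the quoted condition, the factored-out helper presumably looks like the
following sketch:

  static bool pwq_is_empty(struct pool_workqueue *pwq)
  {
          return !pwq->nr_active && list_empty(&pwq->inactive_works);
  }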
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a045a272d887575da17ad86d6573e82871b50c27
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 29 Jan 2024 08:11:24 -1000
workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes the max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patch moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
As wq->max_active is updated while holding wq->mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking rule, WO, is
added for it.
v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
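A hedged sketch of the WO access pattern described above; the function names
are illustrative:

  static void wq_set_max_active_sketch(struct workqueue_struct *wq, int max)
  {
          mutex_lock(&wq->mutex);
          WRITE_ONCE(wq->max_active, max);        /* written under wq->mutex */
          mutex_unlock(&wq->mutex);
  }

  static bool pwq_under_max_active_sketch(struct pool_workqueue *pwq)
  {
          /* hot-path reader, no wq->mutex held */
          return pwq->nr_active < READ_ONCE(pwq->wq->max_active);
  }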
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit e563d0a7cdc1890ff36bb177b5c8c2854d881e4d
Author: Tejun Heo <tj@kernel.org>
Date: Fri, 26 Jan 2024 11:55:50 -1000
workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 6a229b0e2ff6143b65ba4ef42bd71e29ffc2c16d
Author: Tejun Heo <tj@kernel.org>
Date: Fri, 26 Jan 2024 11:55:46 -1000
workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 1a65a6d17cbc58e1aeffb2be962acce49efbef9c
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date: Wed, 10 Jan 2024 11:27:24 +0800
workqueue: Add rcu lock check at the end of work item execution
Currently the workqueue just checks the atomic and locking states after work
execution ends. However, sometimes a work item may fail to release the RCU
read lock after calling rcu_read_lock(). As a result, it causes an RCU
stall, but the RCU stall warning cannot dump the work function because the
work has already finished.
In order to quickly discover those work items that do not call
rcu_read_unlock() after rcu_read_lock(), add an RCU lock check.
Use rcu_preempt_depth() to check the work's RCU status. Normally, this value
is 0. If this value is bigger than 0, it means the work is still holding the
RCU read lock. If so, print error info and the work function.
tj: Reworded the description for clarity. Minor formatting tweak.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
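A hedged sketch of the added check at the end of work item execution; the
message text and surrounding context are illustrative:

  static void check_leaked_rcu_lock_sketch(struct worker *worker)
  {
          /* depth > 0 means the work returned with the RCU read lock held */
          if (unlikely(rcu_preempt_depth() > 0))
                  pr_err("BUG: workqueue leaked rcu_read_lock(): last function: %ps\n",
                         worker->current_func);
  }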
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 85f0ab43f9de62a4b9c1b503b07f1c33e5a6d2ab
Author: Juri Lelli <juri.lelli@redhat.com>
Date: Tue, 16 Jan 2024 17:19:27 +0100
kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND
At the time they are created, rescuers of unbound workqueues currently use
cpu_possible_mask as their affinity, but this can be too wide in case a
workqueue's unbound cpumask has been set as a subset of cpu_possible_mask.
Make new rescuers use their associated workqueue's unbound cpumask from
the start.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
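A hedged sketch of the change in init_rescuer(); whether the mask comes from
wq->unbound_attrs->cpumask or another helper is an assumption:

  /* bind the new rescuer to the workqueue's unbound cpumask when applicable */
  if (wq->flags & WQ_UNBOUND)
          kthread_bind_mask(rescuer->task, wq->unbound_attrs->cpumask);
  else
          kthread_bind_mask(rescuer->task, cpu_possible_mask);
  wake_up_process(rescuer->task);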
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit aac8a59537dfc704ff344f1aacfd143c089ee20f
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 5 Feb 2024 15:43:41 -1000
Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
This reverts commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7.
The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now-removed implicitly ordered workqueues. This was incorrect in that a
system-wide config change shouldn't break the ordering properties of all
workqueues. The reason the apply_workqueue_attrs() path was allowed to do so
is that it was targeting a specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.
The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke the
build. Let's revert the commit, which was incorrect in the first place; this
also fixes the build.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c26b ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9ad ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Sun, 24 Sep 2023 17:07:02 +0200
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result, the workqueue-related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B, such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
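A hedged sketch of the per-callsite key pattern; the work_on_cpu_key() helper
name is an assumption based on the description above:

  /* each expansion gets its own static lock_class_key, i.e. one per call site */
  #define work_on_cpu(cpu, fn, arg)                               \
  ({                                                              \
          static struct lock_class_key __key;                     \
          work_on_cpu_key((cpu), (fn), (arg), &__key);            \
  })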
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5d9c7a1e3e8e18db8e10c546de648cda2a57be52
Author: Lucy Mielke <lucymielke@icloud.com>
Date: Mon, 9 Oct 2023 19:09:46 +0200
workqueue: fix -Wformat-truncation in create_worker
Compiling with W=1 emitted the following warning
(Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
"Treat warnings as errors" turned off):
kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
truncated writing between 1 and 10 bytes into a region of size
between 5 and 14 [-Wformat-truncation=]
kernel/workqueue.c:2188:50: note: directive argument in the range
[0, 2147483647]
kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
into a destination of size 16
setting "id_buf" to size 23 will silence the warning, since GCC
determines snprintf's output to be max. 23 bytes in line 2188.
Please let me know if there are any mistakes in my patch!
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 7b42f401fc6571b6604441789d892d440829e33c
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Wed, 11 Oct 2023 16:27:59 +0800
workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
Currently, kfree() is used to free pwq objects allocated with
kmem_cache_alloc() in alloc_and_link_pwqs(). This isn't wrong, but
trace_kmem_cache_alloc/trace_kmem_cache_free are usually used to track
such memory allocations and frees. This commit therefore uses
kmem_cache_free() instead of kfree() in alloc_and_link_pwqs(), which is also
consistent with how the pwq is released in rcu_free_pwq().
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit dd64c873ed11cdae340be06dcd2364870fd3e4fc
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Mon, 11 Sep 2023 16:27:22 +0800
workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init()
Currently, if wq_cpu_intensive_thresh_us is set to a specific value,
wq_cpu_intensive_thresh_init() exits early and the creation of
pwq_release_worker is missed. This commit therefore creates the
pwq_release_worker in advance, before checking wq_cpu_intensive_thresh_us.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues")
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a6828214480e2f00a8a7e64c7a55fc42b0f54e1c
Author: Steven Rostedt (Google) <rostedt@goodmis.org>
Date: Tue, 5 Sep 2023 17:49:35 -0400
workqueue: Removed double allocation of wq_update_pod_attrs_buf
First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in
the boot") added the initialization of wq_update_pod_attrs_buf to
workqueue_init_early(), and then latter on, commit 84193c07105c6
("workqueue: Generalize unbound CPU pods") added it as well. This appeared
in a kmemleak run where the second allocation made the first allocation
leak.
Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit b6a46f7263bd8ba0e545d79bd034c412f32b5875
Author: Aaron Tomlin <atomlin@atomlin.com>
Date: Tue, 8 Aug 2023 13:03:29 +0100
workqueue: Rename rescuer kworker
Each CPU-specific and unbound kworker kthread conforms to a particular
naming scheme. However, this does not extend to the rescuer kworker.
At present, a rescuer kworker is simply named according to its
workqueue's name. This can be cryptic.
This patch modifies a rescuer to follow the kworker naming scheme.
The "R" is indicative of a rescuer and after "-" is its workqueue's
name e.g. "kworker/R-ext4-rsv-conver".
tj: Use "R" instead of "r" as the prefix to make it more distinctive and
consistent with how highpri pools are marked.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 523a301e66afd1ea9856660bcf3cee3a7c84c6dd
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Make default affinity_scope dynamically updatable
While workqueue.default_affinity_scope is writable, it only affects
workqueues which are created afterwards and isn't very useful. Instead,
let's introduce explicit "default" scope and update the effective scope
dynamically when workqueue.default_affinity_scope is changed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 8639ecebc9b1796d7074751a350462f5e1c61cd4
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
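A hedged sketch of the wake-up nudge inside kick_pool() for non-strict scopes,
based on the description above; the helper used to pick a CPU within the pod
is an assumption:

  /* inside kick_pool(), once an idle worker has been picked (sketch) */
  struct task_struct *p = worker->task;

  if (!pool->attrs->affn_strict &&
      !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask))
          p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
  wake_up_process(p);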
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9546b29e4a6ad6ed7924dd7980975c8e675740a3
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish between the strict affinity
that they have to follow (because that's the restriction coming from the
user) and the soft affinity that they want to apply when dispatching work
items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumasks
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
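A hedged sketch of pool_allowed_cpus() at this point in the series, where all
boundaries are still strict (the following non-strict affinity patch makes
the choice conditional on affn_strict):

  static const struct cpumask *pool_allowed_cpus(struct worker_pool *pool)
  {
          return pool->attrs->__pod_cpumask;
  }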
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 0219a3528d72143d8d2c4c793b61541d03518b59
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Factor out need_more_worker() check and worker wake-up
Checking need_more_worker() and calling wake_up_worker() is a repeated
pattern. Let's add kick_pool(), which checks need_more_worker() and
open-codes wake_up_worker(), and replace wake_up_worker() uses. The following
conversions aren't one-to-one:
* __queue_work() was using __need_more_work() because it knows that
pool->worklist isn't empty. Switching to kick_pool() adds an extra
list_empty() test.
* create_worker() always needs to wake up the newly minted worker whether
there's more work to do or not to avoid triggering hung task check on the
new task. Keep the current wake_up_process() and still add kick_pool().
This may lead to an extra wakeup which isn't harmful.
* pwq_adjust_max_active() was explicitly checking whether it needs to wake
up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up
a worker when necessary, this explicit check is no longer necessary and
dropped.
* unbind_workers() now calls kick_pool() instead of wake_up_worker() adding
a need_more_worker() test. This avoids spurious wakeups and shouldn't
break anything.
wake_up_worker() is dropped as kick_pool() replaces all its users. After
this patch, all paths that wakes up a non-rescuer worker to initiate work
item execution use kick_pool(). This will enable future changes to improve
locality.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
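A hedged sketch of kick_pool() as described above; first_idle_worker() is the
pre-existing helper for picking a worker to wake:

  static bool kick_pool(struct worker_pool *pool)
  {
          struct worker *worker = first_idle_worker(pool);

          lockdep_assert_held(&pool->lock);

          if (!need_more_worker(pool) || !worker)
                  return false;

          wake_up_process(worker->task);
          return true;
  }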
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 873eaca6eaf84b1d1ed5b7259308c6a4fca70fdc
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:25 -1000
workqueue: Factor out work to worker assignment and collision handling
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_scheduled_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.
This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to execute the work once assigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.
This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collision
handling is handled earlier, process_one_work() no longer needs to worry
about them.
After this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.
This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
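A hedged sketch of assign_work() following the description above; details
such as the @nextp argument mirror move_linked_works() usage and are
assumptions:

  static bool assign_work(struct work_struct *work, struct worker *worker,
                          struct work_struct **nextp)
  {
          struct worker_pool *pool = worker->pool;
          struct worker *collision;

          lockdep_assert_held(&pool->lock);

          /* already running elsewhere? chain it behind that worker instead */
          collision = find_worker_executing_work(pool, work);
          if (unlikely(collision)) {
                  move_linked_works(work, &collision->scheduled, nextp);
                  return false;
          }

          move_linked_works(work, &worker->scheduled, nextp);
          return true;
  }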
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 63c5484e74952f60f5810256bd69814d167b8d22
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Add multiple affinity scopes and interface to select them
Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.
Also add module parameter "workqueue.default_affinity_scope" to override the
default scope and "affinity_scope" sysfs file to configure it per workqueue.
wq_dump.py and documentations are updated accordingly.
This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.
Many modern machines often have multiple L3 caches while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unnecessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.
While dependent on the specifics of workload, there's usually a noticeable
penalty in crossing L3 boundaries, so let's default to CACHE. This issue
will be further addressed and documented with examples in future patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 025e16845877e80cb169274b330c236056ba553c
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Modularize wq_pod_type initialization
While wq_pod_type[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type() which takes a callback to
determine whether two CPUs should share a pod as an argument.
init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa() which tests whether two CPUs belong to the same NUMA node.
This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
is tracked separately in pod_type->pod_node[]. This makes adding new
affinity types pretty easy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
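A hedged sketch of the callback-driven init described above; the call in the
comment assumes the init_pod_type() signature implied by the text:

  /* two CPUs share a pod iff they're on the same NUMA node */
  static bool cpus_share_numa(int cpu0, int cpu1)
  {
          return cpu_to_node(cpu0) == cpu_to_node(cpu1);
  }

  /* e.g. init_pod_type(&wq_pod_type[WQ_AFFN_NUMA], cpus_share_numa); */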
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in the workqueue_apply_unbound_cpumask()
hunk due to the presence of a later upstream commit
ca10d851b9ad ("workqueue: Override implicit ordered attribute
in workqueue_apply_unbound_cpumask()").
commit 84193c07105c62d206fb230b2f29002226628989
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Generalize unbound CPU pods
While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:
* workqueue_attrs->affn_scope is added. Each enum represents the type of
boundaries that define the pods. There are currently two scopes -
WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
- one pod per NUMA node. The latter defines one global pod across the
whole system.
* struct wq_pod_type is added which describes how pods are configured for
each affinity scope. For each pod, it lists the member CPUs and the
preferred NUMA node for memory allocations. The reverse mapping from CPU
to pod is also available.
* wq_pod_enabled is dropped. Pod is now always enabled. The previously
disabled behavior is now implemented through WQ_AFFN_SYSTEM.
* get_unbound_pool() wants to determine the NUMA node to allocate memory
from for the new pool. The variables are renamed from node to pod but the
logic still assumes they're one and the same. Clearly distinguish them -
walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
NUMA node.
* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
node. Take @cpu instead and determine the cpumask to use from the pod_type
matching @attrs.
* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
NULL so that it can indicate -EINVAL on invalid affinity scopes.
This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.
v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
WQ_AFFN_NR_TYPES which indicates that the function is called with a
worker_pool's attrs instead of a workqueue's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
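A hedged sketch of the structures described above; the field names follow the
description and may not match the code exactly:

  enum wq_affn_scope {
          WQ_AFFN_NUMA,           /* one pod per NUMA node */
          WQ_AFFN_SYSTEM,         /* one pod across the whole system */
          WQ_AFFN_NR_TYPES,
  };

  struct wq_pod_type {
          int             nr_pods;        /* number of pods */
          cpumask_var_t   *pod_cpus;      /* pod -> member CPUs */
          int             *pod_node;      /* pod -> preferred NUMA node */
          int             *cpu_pod;       /* CPU -> pod (reverse mapping) */
  };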
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 5de7a03cac14765ba22934b6fb1476456ee36bf8
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Factor out clearing of workqueue-only attrs fields
workqueue_attrs can be used for both workqueues and worker_pools. However,
some fields, currently only ->ordered, only apply to workqueues and should
be cleared to the default / invalid values.
Currently, an unbound workqueue explicitly clears attrs->ordered in
get_unbound_pool() after copying the source workqueue attrs, while per-cpu
workqueues rely on the fact that zeroing on allocation gives us the desired
default value for pool->attrs->ordered.
This is fragile. Let's add wqattrs_clear_for_pool() which clears
attrs->ordered and is called from both init_worker_pool() and
get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
In get_unbound_pool(), pool->node initialization is moved upwards for
readability. This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 0f36ee24cd43c67be07166ddd09866dc7a47cb4c
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
For an unbound pool, multiple cpumasks are involved.
U: The user-specified cpumask (may be filtered with cpu_possible_mask).
A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
leaves no CPU, wq_unbound_cpumask is used.
P: Per-pod subsets of #A.
wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.
This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.
This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool->attrs.
This shouldn't cause any behavior changes in the current code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
Signed-off-by: Waiman Long <longman@redhat.com>
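A hedged sketch of the #U -> #A step described above: filter the user cpumask
by wq_unbound_cpumask, falling back to the latter if the intersection is
empty.

  static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
                                        const cpumask_t *unbound_cpumask)
  {
          cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask);
          if (unlikely(cpumask_empty(attrs->cpumask)))
                  cpumask_copy(attrs->cpumask, unbound_cpumask);
  }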
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts:
1) Minor context diff in init/main.c due to missing upstream commit
7ec7096b8577 ("mm/page_ext: init page_ext early if there are no
deferred struct pages") and commit de57807e6f26 ("init,mm: fold
late call to page_ext_init() to page_alloc_init_late()").
2) Minor context diff in kernel/workqueue.c due to the presence of
a later upstream commit 4a6c5607d450 ("workqueue: Make sure that
wq_unbound_cpumask is never empty").
commit 2930155b2e27232c033970f2e110aaac4187cb9e
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Initialize unbound CPU pods later in the boot
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.
Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().
As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.
This patch changes the initialization sequence but the end result should be
the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to the presence of a later upstream commit
4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
is never empty").
commit a86feae6195ac2148097b063f7fdad8ee1f6dad4
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:24 -1000
workqueue: Move wq_pod_init() below workqueue_init()
wq_pod_init() is called from workqueue_init() and responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().
The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.
Just a relocation. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: context diffs in two of the hunks due to the presence of
later upstream commit 4a6c5607d450 ("workqueue: Make sure that
wq_unbound_cpumask is never empty") and commit 31c89007285d
("workqueue.c: Increase workqueue name length").
commit fef59c9cab6ac5592da54f6c2b1195418f14e4d0
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename NUMA related names to use pod instead
Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.
While "pod" isn't a very common term, it short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.
* wq_numa_possible_cpumask -> wq_pod_cpus
* wq_numa_enabled -> wq_pod_enabled
* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
* workqueue_select_cpu_near -> select_numa_node_cpu
This rename is different from others. The function is only used by
queue_work_node() and specifically tries to find a CPU in the specified
NUMA node. As workqueue affinity will become more flexible and untied from
NUMA, this function's name should specifically describe that it's for
NUMA.
* wq_calc_node_cpumask -> wq_calc_pod_cpumask
* wq_update_unbound_numa -> wq_update_pod
* wq_numa_init -> wq_pod_init
* node -> pod in local variables
Only renames. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit af73f5c9febe5095ee492ae43e9898fca65ced70
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename workqueue_attrs->no_numa to ->ordered
With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.
Just a rename. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 636b927eba5bc633753f8eb80f35e1d5be806e51
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
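As an aside, the conversion that the audit targets is mechanical. A minimal
sketch, with a made-up workqueue name (not taken from this patch):
        /* Before: ordering relied on max_active==1 on an unbound workqueue. */
        wq = alloc_workqueue("example_wq", WQ_UNBOUND, 1);
        /* After: explicitly ordered, independent of how max_active is enforced. */
        wq = alloc_ordered_workqueue("example_wq", 0);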
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4cbfd3de737b9d00544ff0f673cb75fc37bffb6a
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
When a CPU went online or offline, wq_update_unbound_numa() was called only
on the CPU which was going up or down. This works fine because all CPUs on
the same NUMA node share the same pool_workqueue slot - one CPU updating it
updates it for everyone in the node.
However, future changes will make each CPU use a separate pool_workqueue
even when they're sharing the same worker_pool, which requires updating
pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
on hotplug.
To accommodate the planned changes, this patch updates
workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
all CPUs sharing the same NUMA node as the CPU going up or down. In the
current code, the second+ calls would be noops and there shouldn't be any
behavior changes.
* As wq_update_unbound_numa() is now called on multiple CPUs per each
hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
is added. The former indicates the CPU being hot[un]plugged and the latter
the CPU whose pool_workqueue is being updated.
* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
with the new @hotplug_cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 687a9aa56f811b381e63f7f8f9149428ac708a3b
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.
Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.
In preparation, this patch makes per-cpu pwq's allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.
pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu work items.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 967b494e2fd143a9c1a3201422aceadb5fa9fbfc
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Use a kthread_worker to release pool_workqueues
pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus risks causing an A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.
While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too making the current approach
unusable.
Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.
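For illustration, a dedicated kthread_worker is created and used roughly as
below; this is a hedged sketch with made-up names, not the patch itself:
        static struct kthread_worker *pwq_release_worker;
        static struct kthread_work release_work;

        static void release_fn(struct kthread_work *work)
        {
                /* free the pwq here, outside of any worker_pool lock */
        }

        /* boot-time setup; kthread_create_worker() returns ERR_PTR on failure */
        pwq_release_worker = kthread_create_worker(0, "pool_wq_release");
        kthread_init_work(&release_work, release_fn);

        /* later, from the release path */
        kthread_queue_work(pwq_release_worker, &release_work);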
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A merge conflict in the wq_pool_ids_show() removal hunk due
to the presence of a later upstream commit 49277a5b7637
("workqueue: Move workqueue_set_unbound_cpumask() and its
helpers inside CONFIG_SYSFS").
commit fcecfa8f271acdf130acbb30842e7848a138af0f
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 797e8345cbb0d2913300ee9838eb74cce19485cf
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Relocate worker and work management functions
Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit ee1ceef72754427e5167743108c52f826fa4ca5b
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:23 -1000
workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq
wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit fe089f87cccb066e8ad20f49ddf05e95adc1fa8d
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Not all work insertion needs to wake up a worker
insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or a flush work item is queued,
there's no reason to try to wake up a worker.
This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().
While at it:
* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
pool.
* Every caller of insert_work() calls debug_work_activate(). Consolidate the
invocations into insert_work().
* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
to better accommodate future changes.
This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.
v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c0ab017d43f4c4147f7ecf3ca3cb872a416e17c7
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Cleanups around process_scheduled_works()
* Drop the trivial optimization in worker_thread() where it bypasses calling
process_scheduled_works() if the first work item isn't linked. This is a
mostly pointless micro optimization and gets in the way of improving the
work processing path.
* Consolidate pool->watchdog_ts updates in the two callers into
process_scheduled_works().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bc8b50c2dfac946c1beed782c1823e52cf55a352
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 7 Aug 2023 15:57:22 -1000
workqueue: Drop the special locking rule for worker->flags and worker_pool->flags
worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.
Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().
This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assigning work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.
The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 9680540c0c56a1f75a2d6aab31bf38aa429aa9d9
Author: Yang Yingliang <yangyingliang@huawei.com>
Date: Fri, 4 Aug 2023 11:22:15 +0800
workqueue: use LIST_HEAD to initialize cull_list
Use LIST_HEAD() to initialize cull_list instead of open-coding it.
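The two forms are equivalent; an illustrative snippet, not the patch hunk:
        /* open-coded initialization */
        struct list_head cull_list;
        INIT_LIST_HEAD(&cull_list);

        /* the same thing with the LIST_HEAD() helper */
        LIST_HEAD(cull_list);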
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 20bdedafd2f63e0ba70991127f9b5c0826ebdb32
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 30 Jun 2023 21:28:53 +0900
workqueue: Warn attempt to flush system-wide workqueues.
Based on commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using
a macro"), all in-tree users stopped flushing system-wide workqueues.
Therefore, start emitting a runtime message so that all out-of-tree users
will understand that they need to update their code.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit aa6fde93f3a49e42c0fe0490d7f3711bac0d162e
Author: Tejun Heo <tj@kernel.org>
Date: Mon, 17 Jul 2023 12:50:02 -1000
workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.
The default threshold is 10ms which is long enough to do a fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguably spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
multiple orders of magnitude apart doesn't make a whole lot of sense.
This patch scales up wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are a sufficient number of
fast machines reporting, we don't lose much.
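The scaling itself is simple. A rough sketch of the idea, assuming a non-zero
bogomips value from boot-time calibration (not the exact upstream code):
        unsigned long thresh = 10 * USEC_PER_MSEC;      /* 10ms default */

        if (bogomips < 4000)
                /* scale up inversely with speed, capped at one second */
                thresh = min(thresh * 4000 / bogomips,
                             (unsigned long)USEC_PER_SEC);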
Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 18c8ae813156a6855f026de80fffb91e1a28ab3d
Author: Zqiang <qiang.zhang1211@gmail.com>
Date: Thu, 25 May 2023 12:00:38 +0800
workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
for CPU-hogging per-cpu work items will keep triggering spuriously:
workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
This commit therefore disables the CPU-hog detection mechanism when
workqueue.cpu_intensive_thresh_us is set to 0.
tj: Patch description updated and the condition check on
cpu_intensive_thresh_us separated into a separate if statement for
readability.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:09 -1000
workqueue: Track and monitor per-workqueue CPU time usage
Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
v2: wq_monitor.py was using "cputime" as the key when outputting in json
format. Use "cpu_time" instead for consistency with other fields.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 6363845005202148b8409ec3082e80845c19d309
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the threshold, it likely needs to be switched to use an
unbound workqueue.
This patch adds a debug mechanism which tracks the work functions which
trigger the automatic CPU_INTENSIVE mechanism and report them using
pr_warn() with exponential backoff.
v3: Documentation update.
v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
introducing circular locking dependency through printk but not effective
as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
just print directly using printk_deferred().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.
This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick()
which is called from scheduler_tick() will detect the condition and
automatically mark it CPU_INTENSIVE.
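Conceptually, the check compares the CPU time the current work item has
consumed without sleeping against the threshold. A simplified sketch with
illustrative names, not the actual wq_worker_tick() code:
        struct tick_sample {
                u64 work_start_ns;      /* CPU clock when the work item started */
                u64 now_ns;             /* CPU clock at this tick */
                bool slept;             /* did the work item sleep in between? */
        };

        static bool looks_like_cpu_hog(const struct tick_sample *s, u64 thresh_us)
        {
                if (s->slept)
                        return false;
                return s->now_ns - s->work_start_ns > thresh_us * NSEC_PER_USEC;
        }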
The mechanism isn't foolproof:
* Detection depends on tick hitting the work item. Getting preempted at the
right timings may allow a violating work item to evade detection at least
temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if
many CPU-hogging work items are queued at the same time.
However, in the vast majority of cases, this should be able to detect violations
reliably and provide reasonable protection with a small increase in code
complexity.
If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
v4: Documentation for workqueue.cpu_intensive_thresh_us added to
kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from
preemption and rtlock paths and an earlier patch was updated
accordingly. This patch adds a comment describing the risk of infinite
recursions and how they're avoided.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bdf8b9bfc131864f0fcef268b34123acfb6a1b59
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Improve locking rule description for worker fields
* Some worker fields are modified only by the worker itself while holding
pool->lock thus making them safe to read from self, IRQ context if the CPU
is running the worker or while holding pool->lock. Add 'K' locking rule
for them.
* worker->sleeping is currently marked "None" which isn't very descriptive.
It's used only by the worker itself. Add 'S' locking rule for it.
A future patch will depend on the 'K' rule to access worker->current_* from
the scheduler ticks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c54d5046a06b90adb3d1188f0741a88692854354
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Move worker_set/clr_flags() upwards
They are going to be used in wq_worker_stopping(). Move them upwards.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194
Author: Tejun Heo <tj@kernel.org>
Date: Wed, 17 May 2023 17:02:08 -1000
workqueue: Add pwq->stats[] and a monitoring script
Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.
It's arguable whether this needs to be configurable. However, it currently
only has several counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.
v2: Patch reordered to earlier with fewer fields. Field will be added back
gradually. Help message improved.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 854f5cc5b7355ceebf2bdfed97ea8f3c5d47a0c3
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Fri, 28 Apr 2023 16:47:07 -0700
Further upgrade queue_work_on() comment
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, and further says that the
penalty for failing to nail down the specified CPU is that the workqueue
handler might find itself executing on some other CPU. This is true
as far as it goes, but fails to note what happens if the specified CPU
never was online. Therefore, further expand this comment to say that
specifying a CPU that was never online will result in a splat.
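For callers, the safe pattern looks roughly like this; a usage sketch with
made-up workqueue and work item names, not taken from the patch:
        int cpu = 3;

        cpus_read_lock();               /* keep the CPU from going away */
        if (cpu_online(cpu))
                queue_work_on(cpu, example_wq, &example_work);
        else
                queue_work(example_wq, &example_work);
        cpus_read_unlock();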
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit afa4bb778e48d79e4a642ed41e3b4e0de7489a6c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 23 Jun 2023 12:08:14 -0700
workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect some C23-induced enum type handling fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's not how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
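An illustrative sketch of the direction of the cleanup, using example names
rather than the exact upstream helpers: give the mask an explicit unsigned
long type and do the cast in exactly one place:
        #define EXAMPLE_WQ_DATA_MASK    (~0UL << WORK_STRUCT_FLAG_BITS)

        static inline struct pool_workqueue *example_data_to_pwq(unsigned long data)
        {
                return (struct pool_workqueue *)(data & EXAMPLE_WQ_DATA_MASK);
        }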
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 704bc669e1dda3eb8f6d5cb462b21e85558a3912
Author: Jungseung Lee <js07.lee@samsung.com>
Date: Mon, 20 Mar 2023 12:29:05 +0900
workqueue: Introduce show_freezable_workqueues
Currently show_all_workqueues() is called when freezing the workqueues fails;
it shows the status of all workqueues and of all worker pools. In this case
we may only need to dump the state of the workqueues that are freezable
and busy.
This patch defines show_freezable_workqueues(), which uses
show_one_workqueue(), a granular function that shows the state of individual
workqueues, so that only the state of freezable workqueues is dumped
at that time.
tj: Minor message adjustment.
Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit cd2440d66fec7d1bdb4f605b64c27c63c9141989
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:35 +0100
workqueue: Print backtraces from CPUs with hung CPU bound workqueues
The workqueue watchdog reports a lockup when there has not been any progress
in the worker pool for a long time. Progress means that a pending
work item starts being processed.
Worker pools for unbound workqueues always wake up an idle worker and
try to process the work immediately. The last idle worker has to create
new worker first. The stall might happen only when a new worker could
not be created in which case an error should get printed. Another problem
might be too high load. In this case, workers are victims of a global
system problem.
Worker pools for CPU bound workqueues are designed for lightweight
work items that do not need much CPU time. They are processed one by
one by a single worker. A new worker is used only when a work item sleeps.
It creates one additional scenario. The stall might happen when
the CPU-bound workqueue is used for CPU-intensive work.
More precisely, the stall is detected when a CPU-bound worker is in
the TASK_RUNNING state for too long. In this case, it might be useful
to see the backtrace from the problematic worker.
The information about how long a worker has been in the running state is not
available. But the CPU-bound worker pools do not have many workers in the
running state by definition. And only a few pools are typically blocked.
It should be acceptable to print backtraces from all workers in
TASK_RUNNING state in the stalled worker pools. The number of false
positives should be very low.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 4c0736a76a186e5df2cd2afda3e7a04d2a427d1b
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:34 +0100
workqueue: Warn when a rescuer could not be created
Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
It typically happens during the system boot.
systemd switches the root filesystem from initrd to the booted system
during boot. It kills processes that block the switch for too long.
One of the processes might be modprobe, which tries to create a workqueue.
These problems are hard to reproduce. Also alloc_workqueue() does not
pass the error code. Make the debugging easier by printing an error,
similar to create_worker().
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 60f540389a5d2df25ddc7ad511b4fa2880dea521
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:33 +0100
workqueue: Interrupted create_worker() is not a repeated event
kthread_create_on_node() might get interrupted. It is rare but realistic.
For example, when an unbound workqueue is allocated in module_init()
callback. It is done in the context of the "modprobe" process. And,
for example, systemd might kill pending processes when switching root
from initrd to the booted system.
The interrupt is a one-off event and the race might be hard to reproduce.
It is always worth printing.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 3f0ea0b864562c6bd1cee892026067eaea7be242
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:32 +0100
workqueue: Warn when a new worker could not be created
The workqueue watchdog reports a lockup when there has not been any progress
in the worker pool for a long time. Progress means that a pending
work item starts being processed.
The progress is guaranteed by using idle workers or creating new workers
for pending work items.
There are several reasons why a new worker could not be created:
+ there is not enough memory
+ there is no free pool ID (IDR API)
+ the system reached PID limit
+ the process creating the new worker was interrupted
+ the last idle worker (manager) has not been scheduled for a long
time. It was not able to even start creating the kthread.
None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager. It is the last
idle worker that is responsible for creating a new one. But it is not
clear if create_worker() is failing and why.
Make the debugging easier by printing errors in create_worker().
The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons. For example, reaching
memory limit (-ENOMEM), other system limits (-EAGAIN), or process
interrupted (-EINTR).
Use pr_once() to avoid repeating the same error every CREATE_COOLDOWN
for each stuck worker pool.
Ratelimited printk() might be better. It would help to know if the problem
remains. It would be more clear if the create_worker() errors and workqueue
stalls are related. Also old messages might get lost when the internal log
buffer is full. The problem is that printk() might touch the watchdog.
For example, see touch_nmi_watchdog() in serial8250_console_write().
It would require synchronization of the begin and length of the ratelimit
interval with the workqueue watchdog. Otherwise, the error messages
might break the watchdog. This does not look worth the complexity.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113
Author: Petr Mladek <pmladek@suse.com>
Date: Tue, 7 Mar 2023 13:53:31 +0100
workqueue: Fix hung time report of worker pools
The workqueue watchdog prints a warning when there is no progress in
a worker pool, where progress means that the pool has started processing
a pending work item.
Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.
show_one_worker_pool() prints the state of a non-idle worker pool. It shows
the delay since the last pool->watchdog_ts update.
The timestamp is updated when the first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().
The delay is misleading when there is no pending work item. In this
case it shows how long the last work item has been running. Show
zero instead. There is no stall if there is no pending work.
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: Context diff due to the presence of a later upstream commit
4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
is never empty").
commit a8ec5880bd82b834717770cba4596381ffd50545
Author: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Date: Sun, 26 Feb 2023 23:53:20 +0700
workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
Use pr_warn_once() to achieve the same thing. It's simpler.
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit c76feb0d5dfdb90b70fa820bb3181142bb01e980
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Fri, 6 Jan 2023 16:10:24 -0800
workqueue: Make show_pwq() use run-length encoding
The show_pwq() function dumps out a pool_workqueue structure's activity,
including the pending work-queue handlers:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: 2*test_work_func, 5*test_work_func1
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 33e3f0a3358b8f9bb54b2661b9c1d37a75664c79
Author: Richard Clark <richard.xnu.clark@gmail.com>
Date: Tue, 13 Dec 2022 12:39:36 +0800
workqueue: Add a new flag to spot the potential UAF error
Currently if the user queues a new work item unintentionally
into a wq after destroy_workqueue(wq), the work can still
be queued and scheduled without any noticeable kernel message
before the end of an RCU grace period.
As a debug-aid facility, this commit adds a new flag
__WQ_DESTROYING to spot that issue by triggering a kernel WARN
message.
Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit a7e30c0e9a5f95b7f74e6272d9c75fd65c897721
Author: Uladzislau Rezki <urezki@gmail.com>
Date: Sun, 16 Oct 2022 16:23:03 +0000
workqueue: Make queue_rcu_work() use call_rcu_hurry()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
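For context, queue_rcu_work() is used roughly as follows; a usage sketch with
made-up names, unrelated to the patch internals:
        static void reclaim_fn(struct work_struct *work)
        {
                /* runs after a grace period; safe to free RCU-protected data */
        }

        static struct rcu_work example_rwork;

        /* from some teardown path */
        INIT_RCU_WORK(&example_rwork, reclaim_fn);
        queue_rcu_work(system_wq, &example_rwork);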
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit bc35f7ef96284b8c963991357a9278a6beafca54
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:40 +0800
workqueue: Convert the type of pool->nr_running to int
It is only modified on the associated CPU, so it doesn't need to be atomic.
tj: Comment updated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit cc5bff38463e0894dd596befa99f9d6860e15f5e
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:39 +0800
workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
The wakeup code in wq_worker_sleeping() is the same as wake_up_worker().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-25103
commit 443378f0664a78756c3e3aeaab92750fe1e05735
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue, 30 Nov 2021 17:00:30 -0800
workqueue: Upgrade queue_work_on() comment
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, but does not spell out the
consequences, which turn out to be quite mild. Therefore expand this
comment to explicitly say that the penalty for failing to nail down the
specified CPU is that the workqueue handler might find itself executing
on some other CPU.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3534
This patch is a backport of the following upstream commit:
commit 8318d6a6362f5903edb4c904a8dd447e59be4ad1
Author: Audra Mitchell <audra@redhat.com>
Date: Thu Jan 25 14:05:32 2024 -0500
workqueue: Shorten events_freezable_power_efficient name
Since we have set WQ_NAME_LEN to 32, shorten the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Audra Mitchell <audra@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-3534
This patch is a backport of the following upstream commit:
commit 31c89007285d365aa36f71d8fb0701581c770a27
Author: Audra Mitchell <audra@redhat.com>
Date: Mon Jan 15 12:08:22 2024 -0500
workqueue.c: Increase workqueue name length
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len")
Increase the size to 32 characters and print a warning in the event
the requested name is larger than the limit of 32 characters.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Audra Mitchell <audra@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-20254
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/
commit aae17ebb53cd3da37f5dfbde937acd091eb4340c
Author: Leonardo Bras <leobras@redhat.com>
Date: Mon Jan 29 22:00:46 2024 -0300
workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.
This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.
So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.
As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
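A rough sketch of the selection logic, using the existing housekeeping
helpers; this is not the exact __queue_delayed_work() hunk:
        int timer_cpu = raw_smp_processor_id();

        if (housekeeping_enabled(HK_TYPE_TIMER) &&
            !housekeeping_test_cpu(timer_cpu, HK_TYPE_TIMER))
                timer_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);

        add_timer_on(&dwork->timer, timer_cpu);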
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff due to missing upstream commit
fcecfa8f271a ("workqueue: Remove module param disable_numa
and sysfs knobs pool_ids and numa").
commit 49277a5b76373e630075ff7d32fc0f9f51294f24
Author: Waiman Long <longman@redhat.com>
Date: Mon, 20 Nov 2023 21:18:40 -0500
workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
Commit fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask()
to exclude CPUs from wq_unbound_cpumask") makes
workqueue_set_unbound_cpumask() static as it is not used elsewhere in
the kernel. However, this triggers a kernel test robot warning about
'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYSFS
isn't defined. It happens that workqueue_set_unbound_cpumask() is only
called when CONFIG_SYSFS is defined.
Move workqueue_set_unbound_cpumask() and its helpers inside the
CONFIG_SYSFS compilation block to avoid the warning. There is no
functional change.
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts:
1) A merge conflict in the workqueue_unbound_exclude_cpumask() hunk
of kernel/workqueue.c due to missing upstream commit 63c5484e7495
("workqueue: Add multiple affinity scopes and interface to select
them").
2) A merge conflict in the workqueue_init_early() hunk of
kernel/workqueue.c due to upstream merge conflict resolved according
to upstream merge commit 202595663905 ("Merge branch 'for-6.7-fixes'
of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-6.8").
commit fe28f631fa941fba583d1c4f25895284b90af671
Author: Waiman Long <longman@redhat.com>
Date: Wed, 25 Oct 2023 14:25:52 -0400
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
When the "isolcpus" boot command line option is used to add a set
of isolated CPUs, those CPUs will be excluded automatically from
wq_unbound_cpumask to avoid running work functions from unbound
workqueues.
Recently cpuset has been extended to allow the creation of partitions
of isolated CPUs dynamically. To make it closer to the "isolcpus"
in functionality, the CPUs in those isolated cpuset partitions should be
excluded from wq_unbound_cpumask as well. This can be done currently by
explicitly writing to the workqueue's cpumask sysfs file after creating
the isolated partitions. However, this process can be error prone.
Ideally, the cpuset code should be allowed to request the workqueue code
to exclude those isolated CPUs from wq_unbound_cpumask so that this
operation can be done automatically and the isolated CPUs will be returned
to wq_unbound_cpumask after the destruction of the isolated
cpuset partitions.
This patch adds a new workqueue_unbound_exclude_cpumask() function to
enable that. This new function will exclude the specified isolated
CPUs from wq_unbound_cpumask. To be able to restore those isolated
CPUs back after the destruction of isolated cpuset partitions, a new
wq_requested_unbound_cpumask is added to store the user provided unbound
cpumask either from the boot command line options or from writing to
the cpumask sysfs file. This new cpumask provides the basis for CPU
exclusion.
To enable users to understand how the wq_unbound_cpumask is being
modified internally, this patch also exposes the newly introduced
wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to
store the cpumask to be excluded from wq_unbound_cpumask as read-only
sysfs files.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A merge conflict due to missing upstream commit fef59c9cab6a
("workqueue: Rename NUMA related names to use pod instead")
and two other subsequent workqueue commits.
commit 4a6c5607d4502ccd1b15b57d57f17d12b6f257a7
Author: Tejun Heo <tj@kernel.org>
Date: Tue, 21 Nov 2023 11:39:36 -1000
workqueue: Make sure that wq_unbound_cpumask is never empty
During boot, depending on how the housekeeping and workqueue.unbound_cpus
masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9b1
("workqueue: Implement non-strict affinity scope for unbound workqueues"),
this may end up feeding -1 as a CPU number into the scheduler, leading to oopses.
BUG: unable to handle page fault for address: ffffffff8305e9c0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
Call Trace:
<TASK>
select_idle_sibling+0x79/0xaf0
select_task_rq_fair+0x1cb/0x7b0
try_to_wake_up+0x29c/0x5c0
wake_up_process+0x19/0x20
kick_pool+0x5e/0xb0
__queue_work+0x119/0x430
queue_work_on+0x29/0x30
...
An empty wq_unbound_cpumask is a clear misconfiguration and already
disallowed once the system is booted up. Let's warn on and ignore
unbound_cpumask restrictions which lead to no unbound cpus. While at it,
also remove the now unnecessary empty check on wq_unbound_cpumask in
wq_select_unbound_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7
Author: Waiman Long <longman@redhat.com>
Date: Tue, 10 Oct 2023 22:48:42 -0400
workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
to be ordered") enabled implicit ordered attribute to be added to
WQ_UNBOUND workqueues with max_active of 1. This prevented changing
the attributes of these workqueues, leading to fix commit 0a94efb5ac
("workqueue: implicit ordered attribute should be overridable").
However, workqueue_apply_unbound_cpumask() was not updated at that time.
So sysfs changes to wq_unbound_cpumask have no effect on WQ_UNBOUND
workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
workqueues are visible on sysfs, we are not able to make all the
necessary cpumask changes even if we iterate over all the workqueue cpumasks
in sysfs and change them one by one.
Fix this problem by applying the corresponding change made
to apply_workqueue_attrs_locked() in the fix commit to
workqueue_apply_unbound_cpumask().
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff in kernel/workqueue.c due to missing
upstream commit 20bdedafd2f6 ("workqueue: Warn attempt to
flush system-wide workqueues.").
commit ace3c5499e61ef7c0433a7a297227a9bdde54a55
Author: tiozhang <tiozhang@didiglobal.com>
Date: Thu, 29 Jun 2023 11:50:50 +0800
workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time
The motivation for this is to improve boot times for devices when
we want to prevent our workqueue work items from running on some specific CPUs,
e.g. CPUs that are busy with interrupts.
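For example, booting with the following restricts unbound workqueue work
items to CPUs 0-3 (the CPU list here is only an illustration):
        workqueue.unbound_cpus=0-3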
Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-1023
commit 686f669780276da534e93ba769e02bdcf1f89f8d
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Mon Mar 13 19:28:50 2023 +0100
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit c63a2e52d5e08f01140d7b76c08a78e15e801f03
Author: Valentin Schneider <vschneid@redhat.com>
Date: Fri, 13 Jan 2023 17:40:40 +0000
workqueue: Fold rebind_worker() within rebind_workers()
!CONFIG_SMP builds complain about rebind_worker() being unused. Its only
user, rebind_workers() is indeed only defined for CONFIG_SMP, so just fold
the two lines back up there.
Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit e02b93124855cd34b78e61ae44846c8cb5fddfc3
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:31 +0000
workqueue: Unbind kworkers before sending them to exit()
It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.
A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).
Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.
Changing the affinity does require a sleepable context, leverage the newly
introduced pool->idle_cull_work to get that.
Remove dying workers from pool->workers and keep track of them in a
separate list. This intentionally prevents for_each_loop_worker() from
iterating over workers that are marked for death.
Rename destroy_worker() to set_worker_dying() to better reflect its
effects and relationship with wake_dying_workers().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 9ab03be42b8f9136dcc01a90ecc9ac71bc6149ef
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:30 +0000
workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
put_unbound_pool() currently passes wq_manager_inactive() as exit condition
to rcuwait_wait_event(), which grabs pool->lock to check for
pool->flags & POOL_MANAGER_ACTIVE
A later patch will require destroy_worker() to be invoked with
wq_pool_attach_mutex held, which needs to be acquired before
pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
it could clobber the task state set by rcuwait_wait_event().
Instead, restructure the waiting logic to acquire any necessary lock
outside of rcuwait_wait_event().
Since further work cannot be inserted into unbound pwqs that have reached
->refcnt==0, this is bound to make forward progress as eventually the
worklist will be drained and need_more_worker(pool) will remain false,
preventing any worker from stealing the manager position from us.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 3f959aa3b33829acfcd460c6c656d54dfebe8d1e
Author: Valentin Schneider <vschneid@redhat.com>
Date: Thu, 12 Jan 2023 16:14:29 +0000
workqueue: Convert the idle_timer to a timer + work_struct
A later patch will require a sleepable context in the idle worker timeout
function. Converting worker_pool.idle_timer to a delayed_work gives us just
that, however this would imply turning all idle_timer expiries into
scheduler events (waking up a worker to handle the dwork).
Instead, implement a "custom dwork" where the timer callback does some
extra checks before queuing the associated work.
No change in functionality intended.
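Schematically, the "custom dwork" pattern looks like this; a conceptual
sketch with made-up helper names, not the worker_pool code:
        static struct timer_list idle_timer;
        static struct work_struct idle_cull_work;

        static void idle_timer_fn(struct timer_list *t)
        {
                if (should_cull_idle_workers())         /* illustrative check */
                        queue_work(system_unbound_wq, &idle_cull_work);
                else
                        mod_timer(&idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
        }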
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 99c621ef243bda726fb8d982a274ded96570b410
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Thu, 12 Jan 2023 16:14:27 +0000
workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit c0feea594e058223973db94c1c32a830c9807c86
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 29 Jul 2022 13:30:23 +0900
workqueue: don't skip lockdep work dependency in cancel_work_sync()
Like Hillf Danton mentioned
syzbot should have been able to catch cancel_work_sync() in work context
by checking lockdep_map in __flush_work() for both flush and cancel.
in [1], being unable to report the obvious deadlock scenario shown below is
broken. From a locking dependency perspective, the sync version of a cancel
request should behave like a flush request, since it waits for completion of
the work if that work has already started execution.
----------
#include <linux/module.h>
#include <linux/sched.h>

static DEFINE_MUTEX(mutex);

static void work_fn(struct work_struct *work)
{
        schedule_timeout_uninterruptible(HZ / 5);
        mutex_lock(&mutex);
        mutex_unlock(&mutex);
}

static DECLARE_WORK(work, work_fn);

static int __init test_init(void)
{
        schedule_work(&work);
        schedule_timeout_uninterruptible(HZ / 10);
        mutex_lock(&mutex);
        cancel_work_sync(&work);
        mutex_unlock(&mutex);
        return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------
The check this patch restores was added by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()").
Then, lockdep's crossrelease feature was added by commit b09be676e0
("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
this check was once removed by commit fd1a5b04df ("workqueue: Remove
now redundant lock acquisitions wrt. workqueue flushes").
But lockdep's crossrelease feature was removed by commit e966eaeeb6
("locking/lockdep: Remove the cross-release locking checks"). At this
point, this check should have been restored.
Then, commit d6e89786be ("workqueue: skip lockdep wq dependency in
cancel_work_sync()") introduced a boolean flag to distinguish flush_work()
from cancel_work_sync(), because checking the "struct workqueue_struct"
dependency when called from cancel_work_sync() was causing false positives.
Then, commit 87915adc3f ("workqueue: re-add lockdep dependencies for
flushing") tried to restore the "struct work_struct" dependency check, but
erroneously tested this boolean flag. As the example above shows, the
"struct work_struct" dependency needs to be checked for both flush_work()
and cancel_work_sync().
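Concretely, the restored annotation in __flush_work() is expected to look
roughly like this (a sketch, not a verbatim quote of the patch): acquire and
release the work item's lockdep_map unconditionally, i.e. without the
from_cancel test, so the dependency is recorded for flush_work() and
cancel_work_sync() alike.
----------
/* record the "struct work_struct" dependency for flush and cancel alike */
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
----------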
Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
Reported-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 2c1f1a9180bfacbc3c8e5b10075640cc810cf9c0
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:38 +0800
workqueue: Change the comments of the synchronization about the idle_list
The access to idle_list in wq_worker_sleeping() is now protected by
pool->lock, so the comment above idle_list can be changed to "L:", which
means "access with pool->lock held".
The outdated comment in wq_worker_sleeping() is also removed, since the
function is no longer called with the rq lock held; idle_list is now
dereferenced with pool->lock held.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 21b195c05cf6a6cc49777d6992772bcf01502186
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Thu, 23 Dec 2021 20:31:37 +0800
workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
In wq_worker_sleeping(), the access to worklist is protected by the
pool->lock, so the memory barrier is unneeded.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 84f91c62d675480ffd3d870ee44c07965cbd8b21
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:42 +0800
workqueue: Remove the cacheline_aligned for nr_running
nr_running is no longer modified remotely now that the schedule callback in
the wakeup path has been removed. Moreover, nr_running is usually accessed
together with other fields of the pool, so the cacheline alignment for
nr_running is no longer needed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
Conflicts: A merge conflict requiring manual merge due to the presence
of a later upstream commit 46a4d679ef88 ("workqueue: Avoid
a false warning in unbind_workers()").
commit 989442d73757868118a73b92732b549a73c9ce35
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:41 +0800
workqueue: Move the code of waking a worker up in unbind_workers()
In unbind_workers(), there are two pool->lock held sections separated by the
code that zaps nr_running. wake_up_worker() needs to be in a pool->lock held
section and after nr_running has been zapped. Zapping nr_running also had to
come after schedule() while the local wake-up functionality was in use. Now
that the call to schedule() has been removed along with the local wake-up
functionality, the code can be merged into a single pool->lock held section.
The diffstat shows other code being moved down only because the diff tools
cannot recognize that the two code blocks were swapped in order to merge the
lock-held sections.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit ccf45156fd167a234baf038c11c1f367c7ccabd4
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date: Tue, 7 Dec 2021 15:35:37 +0800
workqueue: Remove the outdated comment before wq_worker_sleeping()
It is no longer called with preemption disabled.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
commit 45c753f5f24d2d4717acb38ce35e604ff9abcb50
Author: Frederic Weisbecker <frederic@kernel.org>
Date: Wed, 1 Dec 2021 16:19:45 +0100
workqueue: Fix unbind_workers() VS wq_worker_sleeping() race
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
going to sleep. In that case the following scenario can happen:
unbind_workers()                      wq_worker_sleeping()
----------------                      --------------------
                                      if (worker->flags & WORKER_NOT_RUNNING)
                                          return;
                                      //PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
                                      //resume to worker
                                      atomic_dec_and_test(&pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool is eventually rebound, should cpu_up() be called on
the target CPU in the future. But here the race leaves pool->nr_running at
-1, triggering the following warning when the worker goes idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also, due to this incorrect "nr_running == -1", all sorts of hazards can
follow, starting with queued work being ignored because no worker is woken
up at insert_work() time.
Fix this by re-checking the worker flags while pool->lock is held.
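A sketch of that fix in wq_worker_sleeping() (simplified; the elided part is
the existing nr_running/wakeup handling): re-check the flags once pool->lock
is held, and bail out if the worker was unbound while it was preempted.
----------
raw_spin_lock_irq(&pool->lock);

/*
 * Re-check in case unbind_workers() preempted us after the lockless
 * WORKER_NOT_RUNNING test above: nr_running must not be decremented
 * once it has been reset to 0 for an unbound worker.
 */
if (worker->flags & WORKER_NOT_RUNNING) {
        raw_spin_unlock_irq(&pool->lock);
        return;
}

/* ... existing nr_running decrement and wakeup handling ... */
raw_spin_unlock_irq(&pool->lock);
----------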
Fixes: b945efcdd07d ("sched: Remove pointless preemption disable in sched_submit_work()")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit 46a4d679ef88285ea17c3e1e4fed330be2044f21
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date: Fri Jul 29 17:44:38 2022 +0800
workqueue: Avoid a false warning in unbind_workers()
Calling set_cpus_allowed_ptr() with wq_unbound_cpumask can fail and trigger
a false warning.
Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.
It is very easy to trigger the warning:
Set wq_unbound_cpumask to a small set of CPUs.
Offline all the CPUs of wq_unbound_cpumask.
Offline an extra CPU and trigger the warning.
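A sketch of the resulting logic in unbind_workers() (approximate, not the
verbatim diff): only use wq_unbound_cpumask when it still intersects the
active CPUs, otherwise fall back to cpu_possible_mask so that
set_cpus_allowed_ptr() cannot fail and trip the warning.
----------
for_each_pool_worker(worker, pool) {
        kthread_set_per_cpu(worker->task, -1);
        if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                  wq_unbound_cpumask) < 0);
        else
                WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                  cpu_possible_mask) < 0);
}
----------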
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit c4f135d643823a869becfa87539f7820ef9d5bfa
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed Jun 1 16:32:47 2022 +0900
workqueue: Wrap flush_workqueue() using a macro
Since a flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) may introduce the possibility of deadlock
due to an unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs, as the
caller has no idea what it will end up waiting for.
Although there is flush_scheduled_work(), which flushes the system_wq WQ
with a "Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing the system_wq WQ [2].
Therefore, let's change direction so that developers use their own local WQs
whenever flush_scheduled_work()/flush_workqueue(system_*_wq) would otherwise
be unavoidable.
Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism to prevent developers who are unaware of this conversion
from starting to flush system-wide WQs again.
Since I found that WARN_ON() is a complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() as an
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() once all in-tree users have stopped
calling flush_scheduled_work().
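For illustration of the mechanism only (a hedged sketch with assumed helper
names such as __warn_flushing_systemwide_wq() and __flush_workqueue(); the
real header change may differ in detail): the warning text is attached to a
function declaration via __compiletime_warning(), and a macro wrapper routes
offending callers through that declaration while behavior stays the same.
----------
/*
 * Any call site of this declaration that survives optimization makes the
 * compiler emit the message below at build time.
 */
extern void __warn_flushing_systemwide_wq(void)
        __compiletime_warning("Please avoid flushing system-wide workqueues.");

/*
 * Sketch of the wrapper: flush_scheduled_work() always targets system_wq,
 * so every caller is warned while the flush itself still happens.
 */
#define flush_scheduled_work()                          \
({                                                      \
        __warn_flushing_systemwide_wq();                \
        __flush_workqueue(system_wq);                   \
})
----------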
Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2115520
commit 10a5a651e3afc9b0b381f47e8930972e4e918397
Author: Zqiang <qiang1.zhang@intel.com>
Date: Thu Mar 31 13:57:17 2022 +0800
workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs
When a CPU goes offline, all workers on the CPU's pool have their
cpus_allowed cleared to cpu_possible_mask and can then run on any CPU,
including isolated ones. Instead, set cpus_allowed to wq_unbound_cpumask so
that they avoid isolated CPUs.
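The essence of the change is a one-line substitution in unbind_workers()
(sketched here from the description above; the later fix 46a4d679ef88
additionally guards against wq_unbound_cpumask having no active CPUs):
----------
- WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
----------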
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>