Commit Graph

875 Commits

Author SHA1 Message Date
CKI Backport Bot 80162867aa workqueue: Put the pwq after detaching the rescuer from the pool
JIRA: https://issues.redhat.com/browse/RHEL-81472
CVE: CVE-2025-21786

commit e76946110137703c16423baf6ee177b751a34b7e
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Thu Jan 23 16:25:35 2025 +0800

    workqueue: Put the pwq after detaching the rescuer from the pool

    The commit 68f83057b913 ("workqueue: Reap workers via kthread_stop() and
    remove detach_completion") adds code to reap the normal workers but
    mistakenly does not handle the rescuer and also removes the code waiting
    for the rescuer in put_unbound_pool(), which caused a use-after-free bug
    reported by Cheung Wall.

    To avoid the use-after-free bug, the pool’s reference must be held until
    the detachment is complete. Therefore, move the code that puts the pwq
    after detaching the rescuer from the pool.

    Reported-by: cheung wall <zzqq0103.hey@gmail.com>
    Cc: cheung wall <zzqq0103.hey@gmail.com>
    Link: https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/
    Fixes: 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion")
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-02-27 22:46:55 +00:00
Waiman Long ec40aa3ed8 workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker
JIRA: https://issues.redhat.com/browse/RHEL-74107
CVE: CVE-2024-57888

commit de35994ecd2dd6148ab5a6c5050a1670a04dec77
Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Date:   Thu, 19 Dec 2024 09:30:30 +0000

    workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker

    After commit
    746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
    amdgpu started seeing the following warning:

     [ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
    ...
     [ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched]
    ...
     [ ] Call Trace:
     [ ]  <TASK>
    ...
     [ ]  ? check_flush_dependency+0xf5/0x110
    ...
     [ ]  cancel_delayed_work_sync+0x6e/0x80
     [ ]  amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
     [ ]  amdgpu_ring_alloc+0x40/0x50 [amdgpu]
     [ ]  amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
     [ ]  ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
     [ ]  amdgpu_job_run+0xaa/0x1f0 [amdgpu]
     [ ]  drm_sched_run_job_work+0x257/0x430 [gpu_sched]
     [ ]  process_one_work+0x217/0x720
    ...
     [ ]  </TASK>

    The intent of the verification done in check_flush_dependency() is to ensure
    forward progress during memory reclaim, by flagging cases when either a
    memory reclaim process, or a memory reclaim work item is flushed from a
    context not marked as memory reclaim safe.

    This is correct when flushing, but when called from the
    cancel(_delayed)_work_sync() paths it is a false positive because work is
    either already running, or will not be running at all. Therefore
    cancelling it is safe and we can relax the warning criteria by letting the
    helper know of the calling context.

    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
    Fixes: fca839c00a ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
    References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Matthew Brost <matthew.brost@intel.com>
    Cc: <stable@vger.kernel.org> # v4.5+
    Signed-off-by: Tejun Heo <tj@kernel.org>
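
The relaxed check can be sketched in a small userspace model. The predicate name and its boolean arguments below are illustrative; upstream, check_flush_dependency() gains knowledge of whether it was reached from a cancel path:

```c
#include <stdbool.h>

/* Simplified model of the dependency check: a warning is only justified
 * when a WQ_MEM_RECLAIM context *flushes* (i.e. waits on) work owned by
 * a !WQ_MEM_RECLAIM workqueue, since that can stall memory reclaim.
 * From the cancel paths the work is either already running or will never
 * run, so no forward-progress dependency exists and we stay quiet. */
static bool flush_dependency_warns(bool caller_mem_reclaim,
                                   bool target_mem_reclaim,
                                   bool from_cancel)
{
    if (from_cancel)
        return false;   /* cancelling is always safe */
    return caller_mem_reclaim && !target_mem_reclaim;
}
```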

Signed-off-by: Waiman Long <longman@redhat.com>
2025-01-17 09:22:12 -05:00
Rado Vrbovsky 780560f78a Merge: RHEL9.6 drm backport dependencies
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5437

# Merge Request Required Information

## Summary of Changes

Depends: !5592

Depends: !5692

JIRA: https://issues.redhat.com/browse/RHEL-53569

Omitted-fix: 48ffe2074c2864ab64ee2004e7ebf3d6a6730fbf

Omitted-fix: 06e7139a034f26804904368fe4af2ceb70724756

Omitted-fix: 5278ca048d93eac74e9a81b3e672da2b2264bce4

Omitted-fix: 8dffaec34dd55473adcbc924a4c9b04aaa0d4278

Signed-off-by: Robert Foss <rfoss@redhat.com>

## Approved Development Ticket
All submissions to CentOS Stream must reference an approved ticket in [Red Hat Jira](https://issues.redhat.com/). Please follow the CentOS Stream [contribution documentation](https://docs.centos.org/en-US/stream-contrib/quickstart/) for how to file this ticket and have it approved.

Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2025-01-06 08:26:14 +00:00
Robert Foss 85c4f6b0ed workqueue: Don't call va_start / va_end twice
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit 9b59a85a84dc37ca4f2c54df5e06aff4c1eae5d3
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Tue Aug 20 12:38:08 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 20 09:38:39 2024 -1000

    Calling va_start / va_end multiple times is undefined and causes
    problems with certain compiler / platforms.

    Change alloc_ordered_workqueue_lockdep_map to a macro and update
    __alloc_workqueue to take a va_list argument.

    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
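
The safe pattern the patch adopts is standard C: call va_start/va_end exactly once in the variadic entry point and pass the started va_list down to the internal function. A minimal userspace sketch (function names are illustrative):

```c
#include <stdarg.h>
#include <stdio.h>

/* Internal worker takes an already-started va_list; it never calls
 * va_start/va_end itself, so the list is started and ended exactly
 * once, in the variadic wrapper below. */
static int format_name(char *buf, size_t len, const char *fmt, va_list args)
{
    return vsnprintf(buf, len, fmt, args);
}

/* Variadic entry point: the single va_start/va_end pair lives here.
 * Calling va_start twice on the same list without an intervening
 * va_end is undefined behavior. */
static int make_name(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = format_name(buf, len, fmt, args);
    va_end(args);
    return n;
}
```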

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:24 +01:00
Robert Foss e29258cd10 workqueue: Add interface for user-defined workqueue lockdep map
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit ec0a7d44b358afaaf52856d03c72e20587bc888b
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:25 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:51 2024 -1000

    Add an interface for a user-defined workqueue lockdep map, which is
    helpful when multiple workqueues are created for the same purpose. This
    also helps avoid leaking lockdep maps on each workqueue creation.

    v2:
     - Add alloc_workqueue_lockdep_map (Tejun)
    v3:
     - Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
     - static inline alloc_ordered_workqueue_lockdep_map (Tejun)

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:24 +01:00
Robert Foss c2a6c84a1c workqueue: Change workqueue lockdep map to pointer
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

commit 4f022f430e21e456893283036bc2ea78ac6bd2a1
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:24 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:40 2024 -1000

    Will help enable user-defined lockdep maps for workqueues.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:23 +01:00
Robert Foss 461cd6afdd workqueue: Split alloc_workqueue into internal function and lockdep init
JIRA: https://issues.redhat.com/browse/RHEL-53569
Upstream Status: v6.12-rc1

Conflicts: Does not apply cleanly due to many unrelated changes being introduced
	   in these functions, but the changes introduced in this patch are
	   simple and well contained.

	   Relevant prior changes to this commit can be found with git blame
	   but will not be included here due to being wide-reaching changes.

	   git blame linux/master kernel/workqueue.c

        kernel/workqueue.c

commit b188c57af2b5c17a1e8f71a0358f330446a4f788
Author:     Matthew Brost <matthew.brost@intel.com>
AuthorDate: Fri Aug  9 15:28:23 2024 -0700
Commit:     Tejun Heo <tj@kernel.org>
CommitDate: Tue Aug 13 09:05:28 2024 -1000

    Will help enable user-defined lockdep maps for workqueues.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Robert Foss <rfoss@redhat.com>
2024-12-17 22:59:23 +01:00
Bastien Nocera 562729d980 workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 38f7e14519d39cf524ddc02d4caee9b337dad703
Author: Will Deacon <will@kernel.org>
Date:   Tue Jul 30 12:44:31 2024 +0100

    workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()

    UBSAN reports the following 'subtraction overflow' error when booting
    in a virtual machine on Android:

     | Internal error: UBSAN: integer subtraction overflow: 00000000f2005515 [#1] PREEMPT SMP
     | Modules linked in:
     | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-00006-g3cbe9e5abd46-dirty #4
     | Hardware name: linux,dummy-virt (DT)
     | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
     | pc : cancel_delayed_work+0x34/0x44
     | lr : cancel_delayed_work+0x2c/0x44
     | sp : ffff80008002ba60
     | x29: ffff80008002ba60 x28: 0000000000000000 x27: 0000000000000000
     | x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
     | x23: 0000000000000000 x22: 0000000000000000 x21: ffff1f65014cd3c0
     | x20: ffffc0e84c9d0da0 x19: ffffc0e84cab3558 x18: ffff800080009058
     | x17: 00000000247ee1f8 x16: 00000000247ee1f8 x15: 00000000bdcb279d
     | x14: 0000000000000001 x13: 0000000000000075 x12: 00000a0000000000
     | x11: ffff1f6501499018 x10: 00984901651fffff x9 : ffff5e7cc35af000
     | x8 : 0000000000000001 x7 : 3d4d455453595342 x6 : 000000004e514553
     | x5 : ffff1f6501499265 x4 : ffff1f650ff60b10 x3 : 0000000000000620
     | x2 : ffff80008002ba78 x1 : 0000000000000000 x0 : 0000000000000000
     | Call trace:
     |  cancel_delayed_work+0x34/0x44
     |  deferred_probe_extend_timeout+0x20/0x70
     |  driver_register+0xa8/0x110
     |  __platform_driver_register+0x28/0x3c
     |  syscon_init+0x24/0x38
     |  do_one_initcall+0xe4/0x338
     |  do_initcall_level+0xac/0x178
     |  do_initcalls+0x5c/0xa0
     |  do_basic_setup+0x20/0x30
     |  kernel_init_freeable+0x8c/0xf8
     |  kernel_init+0x28/0x1b4
     |  ret_from_fork+0x10/0x20
     | Code: f9000fbf 97fffa2f 39400268 37100048 (d42aa2a0)
     | ---[ end trace 0000000000000000 ]---
     | Kernel panic - not syncing: UBSAN: integer subtraction overflow: Fatal exception

    This is due to shift_and_mask() using a signed immediate to construct
    the mask and being called with a shift of 31 (WORK_OFFQ_POOL_SHIFT) so
    that it ends up decrementing from INT_MIN.

    Use an unsigned constant '1U' to generate the mask in shift_and_mask().

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Fixes: 1211f3b21c2a ("workqueue: Preserve OFFQ bits in cancel[_sync] paths")
    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>
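
The fixed helper can be reconstructed roughly as follows (signature approximated from the commit text). With a signed `1`, `(1 << 31) - 1` overflows a signed int from INT_MIN; with `1U` the subtraction wraps in well-defined unsigned arithmetic:

```c
/* Reconstruction of the fixed mask helper: the constant is unsigned,
 * so ((1U << 31) - 1) evaluates to 0x7fffffff instead of tripping
 * UBSAN on signed integer overflow. */
static unsigned long shift_and_mask(unsigned long v, unsigned int shift,
                                    unsigned int bits)
{
    return (v >> shift) & ((1U << bits) - 1);
}
```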

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:24 +01:00
Bastien Nocera 4a8bc1a1a6 workqueue: Implement disable/enable for (delayed) work items
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 86898fa6b8cd942505860556f3a0bf52eae57fe8
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 25 07:21:03 2024 -1000

    workqueue: Implement disable/enable for (delayed) work items

    While (delayed) work items could be flushed and canceled, there was no way
    to prevent them from being queued in the future. While this didn't lead to
    functional deficiencies, it sometimes required a bit more effort from the
    workqueue users to e.g. sequence shutdown steps with more care.

    Workqueue is currently in the process of replacing tasklet which does
    support disabling and enabling. The feature is used relatively widely to,
    for example, temporarily suppress main path while a control plane operation
    (reset or config change) is in progress.

    To enable easy conversion of tasklet users and as it seems like an
    inherently useful feature, this patch implements disabling and enabling of
    work items.

    - A work item carries 16bit disable count in work->data while not queued.
      The access to the count is synchronized by the PENDING bit like all other
      parts of work->data.

    - If the count is non-zero, the work item cannot be queued. Any attempt to
      queue the work item fails and returns %false.

    - disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and
      enable_delayed_work() are added.

    v3: enable_work() was using local_irq_enable() instead of
        local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is
        awkward now and will become incorrect as enable_work() will later be
        used from IRQ context too. (Lai)

    v2: Lai noticed that queue_work_node() wasn't checking the disable count.
        Fixed. queue_rcu_work() is updated to trigger warning if the inner work
        item is disabled.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
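
A userspace model of the disable-count semantics described above (the struct and function names are illustrative, not the kernel API; in the kernel the 16-bit count lives in work->data under the PENDING bit):

```c
#include <stdbool.h>
#include <assert.h>

/* Toy work item: a disable count gates queueing. */
struct work_model {
    unsigned short disable_count;   /* 16-bit count, as in the commit */
    bool queued;
};

/* Queueing fails and returns false while the count is non-zero. */
static bool queue_work_model(struct work_model *w)
{
    if (w->disable_count)
        return false;
    w->queued = true;
    return true;
}

/* Disabling bumps the count; a pending item is modeled as cancelled. */
static void disable_work_model(struct work_model *w)
{
    w->disable_count++;
    w->queued = false;
}

static void enable_work_model(struct work_model *w)
{
    assert(w->disable_count > 0);
    w->disable_count--;
}
```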

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:24 +01:00
Bastien Nocera 85215bab77 workqueue: Preserve OFFQ bits in cancel[_sync] paths
JIRA: https://issues.redhat.com/browse/RHEL-61734

commit 1211f3b21c2aa0d22d8d7f050e3a5930a91cd0e4
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 25 07:21:02 2024 -1000

    workqueue: Preserve OFFQ bits in cancel[_sync] paths

    The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and
    manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit
    values except for the pool ID are statically known and don't preserve them,
    which is not wrong in the current code as the pool ID and CANCELING are the
    only information carried. However, the planned disable/enable support will
    add more fields and need them to be preserved.

    This patch updates work data handling so that only the bits which need
    updating are updated.

    - struct work_offq_data is added along with work_offqd_unpack() and
      work_offqd_pack_flags() to help manipulating multiple fields contained in
      work->data. Note that the helpers look a bit silly right now as there
      isn't that much to pack. The next patch will add more.

    - mark_work_canceling() which is used only by __cancel_work_sync() is
      replaced by open-coded usage of work_offq_data and
      set_work_pool_and_keep_pending() in __cancel_work_sync().

    - __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits
      when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end.

    - This removes all users of get_work_pool_id() which is dropped. Note that
      get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ
      cases; however, it was only being called after try_to_grab_pending()
      succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe
      to use work_offqd_unpack() instead.

    No behavior changes intended.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
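
The unpack/pack helpers can be modeled in userspace. The field widths below are illustrative, not the kernel's actual WORK_OFFQ_* layout; the point is that modifying the flags and repacking preserves the pool ID rather than assuming the rest of the word is statically known:

```c
/* Toy off-queue word: pool ID in the high bits, flag bits below. */
#define OFFQ_FLAG_BITS 4UL
#define OFFQ_FLAG_MASK ((1UL << OFFQ_FLAG_BITS) - 1)

struct work_offq_data {
    unsigned long pool_id;
    unsigned long flags;
};

static void work_offqd_unpack(struct work_offq_data *offqd, unsigned long data)
{
    offqd->flags = data & OFFQ_FLAG_MASK;
    offqd->pool_id = data >> OFFQ_FLAG_BITS;
}

static unsigned long work_offqd_pack(const struct work_offq_data *offqd)
{
    return (offqd->pool_id << OFFQ_FLAG_BITS) | (offqd->flags & OFFQ_FLAG_MASK);
}
```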

Signed-off-by: Bastien Nocera <bnocera@redhat.com>
2024-12-11 15:25:23 +01:00
Rado Vrbovsky 5cb9527389 Merge: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5346

JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5346

The 2nd patch fixes the CVE. The first patch is included as it is
related to the 2nd patch.

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Audra Mitchell <aubaker@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-12-09 08:21:12 +00:00
Waiman Long 8ec893ae89 workqueue: Improve scalability of workqueue watchdog touch
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839

commit 98f887f820c993e05a12e8aa816c80b8661d4c87
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Tue, 25 Jun 2024 21:42:45 +1000

    workqueue: Improve scalability of workqueue watchdog touch

    On a ~2000 CPU powerpc system, hard lockups have been observed in the
    workqueue code when stop_machine runs (in this case due to CPU hotplug).
    This is due to lots of CPUs spinning in multi_cpu_stop, calling
    touch_nmi_watchdog() which ends up calling wq_watchdog_touch().
    wq_watchdog_touch() writes to the global variable wq_watchdog_touched,
    and that can find itself in the same cacheline as other important
    workqueue data, which slows down operations to the point of lockups.

    In the case of the following abridged trace, worker_pool_idr was in
    the hot line, causing the lockups to always appear at idr_find.

      watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find
      Call Trace:
      get_work_pool
      __queue_work
      call_timer_fn
      run_timer_softirq
      __do_softirq
      do_softirq_own_stack
      irq_exit
      timer_interrupt
      decrementer_common_virt
      * interrupt: 900 (timer) at multi_cpu_stop
      multi_cpu_stop
      cpu_stopper_thread
      smpboot_thread_fn
      kthread

    Fix this by having wq_watchdog_touch() only write to the line if the
    time since the last recorded touch exceeds 1/4 of the watchdog threshold.

    Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>
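
The fix turns most touches into reads of the shared cacheline instead of writes. A minimal model of the throttling (time and threshold passed explicitly for illustration; upstream works on jiffies and wq_watchdog_thresh):

```c
/* Last recorded touch, in ticks. Stands in for wq_watchdog_touched,
 * which shares a hot cacheline with other workqueue data. */
static unsigned long wq_watchdog_touched;

/* Only dirty the line when the recorded stamp has aged past a quarter
 * of the watchdog threshold; otherwise the stamp is fresh enough. */
static void wq_watchdog_touch_model(unsigned long now, unsigned long thresh)
{
    if (now - wq_watchdog_touched >= thresh / 4)
        wq_watchdog_touched = now;
}
```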

Signed-off-by: Waiman Long <longman@redhat.com>
2024-10-17 08:58:24 -04:00
Waiman Long b0d5b181c0 workqueue: wq_watchdog_touch is always called with valid CPU
JIRA: https://issues.redhat.com/browse/RHEL-60747
CVE: CVE-2024-46839

commit 18e24deb1cc92f2068ce7434a94233741fbd7771
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Tue, 25 Jun 2024 21:42:44 +1000

    workqueue: wq_watchdog_touch is always called with valid CPU

    Warn in the case it is called with cpu == -1. This does not appear
    to happen anywhere.

    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-10-17 08:58:23 -04:00
Phil Auld 88d1c5d2ed sched/balancing: Rename scheduler_tick() => sched_tick()
JIRA: https://issues.redhat.com/browse/RHEL-56494
Conflicts:  Dropped CN documentation since not in RHEL, context
diffs in sched-domains.rst. Skipped hunk in func_set_ftrace_file.tc
due to not having 6fec1ab67f8 ("selftests/ftrace: Do not
trace do_softirq because of PREEMPT_RT") in tree.

commit 86dd6c04ef9f213e14d60c9f64bce1cc019f816e
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Mar 8 12:18:08 2024 +0100

    sched/balancing: Rename scheduler_tick() => sched_tick()

    - Standardize on prefixing scheduler-internal functions defined
      in <linux/sched.h> with sched_*() prefix. scheduler_tick() was
      the only function using the scheduler_ prefix. Harmonize it.

    - The other reason to rename it is the NOHZ scheduler tick
      handling functions are already named sched_tick_*().
      Make the 'git grep sched_tick' more meaningful.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-09-20 04:38:48 -04:00
Waiman Long baa0a3b48c workqueue: Always queue work items to the newest PWQ for order workqueues
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 58629d4871e8eb2c385b16a73a8451669db59f39
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Wed, 3 Jul 2024 17:27:41 +0800

    workqueue: Always queue work items to the newest PWQ for order workqueues

    To ensure non-reentrancy, __queue_work() attempts to enqueue a work
    item to the pool of the currently executing worker. This is not only
    unnecessary for an ordered workqueue, where order inherently suggests
    non-reentrancy, but it could also disrupt the sequence if the item is
    not enqueued on the newest PWQ.

    Just queue it to the newest PWQ and let order management guarantee
    non-reentrancy.

    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
    Cc: stable@vger.kernel.org # v6.9+
    Signed-off-by: Tejun Heo <tj@kernel.org>
    (cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:18 -04:00
Waiman Long 3adbdc1e6a workqueue: Update cpumasks after only applying it successfully
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 841658832335a32dd86f4e4d3aab7d14188b268b
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Tue, 2 Jul 2024 12:14:55 +0800

    workqueue: Update cpumasks after only applying it successfully

    Make workqueue_unbound_exclude_cpumask() and workqueue_set_unbound_cpumask()
    only update wq_isolated_cpumask and wq_requested_unbound_cpumask when
    workqueue_apply_unbound_cpumask() returns successfully.

    Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
    Cc: Waiman Long <longman@redhat.com>
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
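
The commit-on-success pattern the fix applies can be sketched as follows (cpumasks modeled as plain words, and all names illustrative): the cached masks are only updated after the apply step reports success, so a failed apply leaves them consistent with the configuration actually in effect.

```c
/* Cached masks; stand-ins for wq_requested_unbound_cpumask and
 * wq_isolated_cpumask. */
static unsigned long requested_mask, isolated_mask;

/* Stand-in for the apply step; fails when the effective mask is empty. */
static int apply_mask(unsigned long effective)
{
    return effective ? 0 : -1;
}

static int set_unbound_mask(unsigned long req, unsigned long isol)
{
    int ret = apply_mask(req & ~isol);

    if (ret == 0) {         /* commit the caches only on success */
        requested_mask = req;
        isolated_mask = isol;
    }
    return ret;
}
```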

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:18 -04:00
Waiman Long beb7c33dd4 workqueue: Cleanup subsys attribute registration
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 79202591a55a365251496162ced3004a0a1fa1cf
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Thu, 7 Mar 2024 21:39:32 -0800

    workqueue: Cleanup subsys attribute registration

    While reviewing users of subsys_virtual_register() I noticed that
    wq_sysfs_init() ignores the @groups argument. This looks like a
    historical artifact as the original wq_subsys only had one attribute to
    register.

    On the way to building up an @groups argument to pass to
    subsys_virtual_register() a few more cleanups fell out:

    * Use DEVICE_ATTR_RO() and DEVICE_ATTR_RW() for
      cpumask_{isolated,requested} and cpumask respectively. Rename the
      @show and @store methods accordingly.

    * Co-locate the attribute definition with the methods. This required
      moving wq_unbound_cpumask_show down next to wq_unbound_cpumask_store
      (renamed to cpumask_show() and cpumask_store())

    * Use ATTRIBUTE_GROUPS() to skip some boilerplate declarations

    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:17 -04:00
Waiman Long 8dfa13be90 workqueue: Fix divide error in wq_update_node_max_active()
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 91f098704c25106d88706fc9f8bcfce01fdb97df
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Wed, 24 Apr 2024 21:51:54 +0800

    workqueue: Fix divide error in wq_update_node_max_active()

    Yue Sun and xingwei lee reported a divide error bug in
    wq_update_node_max_active():

    divide error: 0000 [#1] PREEMPT SMP KASAN PTI
    CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
    RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605
    Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41
    81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7
    fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89
    RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293
    RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000
    RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72
    R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000
    R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff
    FS:  0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
     <TASK>
     workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525
     cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194
     cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092
     smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164
     kthread+0x2ed/0x390 kernel/kthread.c:388
     ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244
     </TASK>
    Modules linked in:
    ---[ end trace 0000000000000000 ]---

    After analysis, it happens when all of the CPUs in a workqueue's affinity
    get offline.

    The problem can be easily reproduced by:

     # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask
     # echo 0 > /sys/devices/system/cpu/cpu3/online

    To fix the problem, use the default max_active for nodes when all of the
    CPUs in the workqueue's affinity get offline.

    Reported-by: Yue Sun <samsun1006219@gmail.com>
    Reported-by: xingwei lee <xrivendell7@gmail.com>
    Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/
    Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
    Cc: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
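
The guard can be sketched as below. The real computation in wq_update_node_max_active() is more involved; the names and the proportional formula here are illustrative. The point is that the divisor, the number of the workqueue's affinity CPUs still online, can reach zero under hotplug:

```c
/* Compute a node's share of max_active; fall back to a default when
 * every CPU in the workqueue's affinity has gone offline, instead of
 * dividing by zero. */
static int node_max_active(int max_active, int node_cpus,
                           int online_affinity_cpus, int default_max)
{
    if (online_affinity_cpus == 0)
        return default_max;
    return max_active * node_cpus / online_affinity_cpus;
}
```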

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:17 -04:00
Waiman Long 1d6310526e workqueue: The default node_nr_active should have its max set to max_active
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit d40f92020c7a225b77e68599e4b099a4a0823408
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 22 Apr 2024 14:43:48 -1000

    workqueue: The default node_nr_active should have its max set to max_active

    The default nna (node_nr_active) is used when the pool isn't tied to a
    specific NUMA node. This can happen in the following cases:

     1. On NUMA, if per-node pwq init failure and the fallback pwq is used.
     2. On NUMA, if a pool is configured to span multiple nodes.
     3. On single node setups.

    5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for
    unbound workqueues") set the default nna->max to min_active because only #1
    was being considered. For #2 and #3, using min_active means that the max
    concurrency in normal operation is pushed down to min_active which is
    currently 8, which can obviously lead to performance issues.

    The exact value nna->max is set to doesn't really matter. #2 can only happen if
    the workqueue is intentionally configured to ignore NUMA boundaries and
    there's no good way to distribute max_active in this case. #3 is the default
    behavior on single node machines.

    Let's set the default nna->max to max_active. This fixes the artificially
    lowered concurrency problem on single node machines and shouldn't hurt
    anything for other cases.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
    Link: https://lore.kernel.org/dm-devel/20240410084531.2134621-1-shinichiro.kawasaki@wdc.com/
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:16 -04:00
Waiman Long 91dac9de45 workqueue: Fix selection of wake_cpu in kick_pool()
JIRA: https://issues.redhat.com/browse/RHEL-49500

commit 57a01eafdcf78f6da34fad9ff075ed5dfdd9f420
Author: Sven Schnelle <svens@linux.ibm.com>
Date:   Tue, 23 Apr 2024 08:19:05 +0200

    workqueue: Fix selection of wake_cpu in kick_pool()

    With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
    kernel oops was observed:

    smp: Bringing up secondary CPUs ...
    smp: Brought up 1 node, 8 CPUs
    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000803
    [..]
     Call Trace:
    arch_vcpu_is_preempted+0x12/0x80
    select_idle_sibling+0x42/0x560
    select_task_rq_fair+0x29a/0x3b0
    try_to_wake_up+0x38e/0x6e0
    kick_pool+0xa4/0x198
    __queue_work.part.0+0x2bc/0x3a8
    call_timer_fn+0x36/0x160
    __run_timers+0x1e2/0x328
    __run_timer_base+0x5a/0x88
    run_timer_softirq+0x40/0x78
    __do_softirq+0x118/0x388
    irq_exit_rcu+0xc0/0xd8
    do_ext_irq+0xae/0x168
    ext_int_handler+0xbe/0xf0
    psw_idle_exit+0x0/0xc
    default_idle_call+0x3c/0x110
    do_idle+0xd4/0x158
    cpu_startup_entry+0x40/0x48
    rest_init+0xc6/0xc8
    start_kernel+0x3c4/0x5e0
    startup_continue+0x3c/0x50

    The crash is caused by calling arch_vcpu_is_preempted() for an offline
    CPU. To avoid this, select the cpu with cpumask_any_and_distribute()
    to mask __pod_cpumask with cpu_online_mask. In case no cpu is left in
    the pool, skip the assignment.

    tj: This doesn't fully fix the bug as CPUs can still go down between picking
    the target CPU and the wake call. Fixing that likely requires adding
    cpu_online() test to either the sched or s390 arch code. However, regardless
    of how that is fixed, workqueue shouldn't be picking a CPU which isn't
    online as that would result in unpredictable and worse behavior.

    Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
    Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
    Cc: stable@vger.kernel.org # v6.6+
    Signed-off-by: Tejun Heo <tj@kernel.org>
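
The selection fix amounts to intersecting the pool's pod mask with the online mask before picking, and skipping the assignment when the intersection is empty. A sketch with plain 64-bit words standing in for cpumasks (upstream uses cpumask_any_and_distribute()):

```c
#include <stdint.h>

/* Returns a CPU set in (pod & online), or -1 if none is online, in
 * which case the caller leaves wake_cpu untouched. */
static int pick_wake_cpu(uint64_t pod_mask, uint64_t online_mask)
{
    uint64_t usable = pod_mask & online_mask;
    int cpu;

    if (!usable)
        return -1;
    for (cpu = 0; cpu < 64; cpu++)
        if (usable & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}
```

With cpu_possible_mask=0-63 but cpu_online_mask=0-7 as in the oops above, this never hands back an offline CPU.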

Signed-off-by: Waiman Long <longman@redhat.com>
2024-07-17 09:55:16 -04:00
Waiman Long efd17b3bb7 workqueue: Drain BH work items on hot-unplugged CPUs
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 1acd92d95fa24edca8f0292b21870025da93e24f
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 26 Feb 2024 15:38:55 -1000

    workqueue: Drain BH work items on hot-unplugged CPUs

    Boqun pointed out that workqueues aren't handling BH work items on offlined
    CPUs. Unlike tasklet which transfers out the pending tasks from
    CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is
    problematic. Note that this behavior is specific to BH workqueues as the
    non-BH per-CPU workers just become unbound when the CPU goes offline.

    This patch fixes the issue by draining the pending BH work items from an
    offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context,
    it's not as easy to transfer the pending work items from one pool to
    another. Instead, run BH work items which execute the offlined pools on an
    online CPU.

    Note that this assumes that no further BH work items will be queued on the
    offlined CPUs. This assumption is shared with tasklet and should be fine for
    conversions. However, this issue also exists for per-CPU workqueues which
    will just keep executing work items queued after CPU offline on unbound
    workers and workqueue should reject per-CPU and BH work items queued on
    offline CPUs. This will be addressed separately later.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-reviewed-by: Boqun Feng <boqun.feng@gmail.com>
    Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long c26c31e2b9 workqueue: Control intensive warning threshold through cmdline
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit ccdec92198df0c91f45a68f971771b6b0c1ba02d
Author: Xuewen Yan <xuewen.yan@unisoc.com>
Date:   Thu, 22 Feb 2024 15:28:08 +0800

    workqueue: Control intensive warning threshold through cmdline

    When CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel reports work
    functions that repeatedly violate intensive_threshold_us. Currently,
    the warning is only triggered once the violation count exceeds 4 and
    is a power of 2.

    However, even a single long work execution may delay other work for a
    long time, which can also cause problems.

    To let users freely control the warning threshold, add a boot argument
    so that the threshold at which warnings are printed can be configured.
    The exponential backoff is kept to prevent excessive reporting.

    By default, the warning threshold is 4.

    tj: Updated kernel-parameters.txt description.

    Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long e58ec3ad16 workqueue: Make @flags handling consistent across set_work_data() and friends
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bccdc1faafaf32e00d6e4dddca1ded64e3272189
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:15 -1000

    workqueue: Make @flags handling consistent across set_work_data() and friends

    - set_work_data() takes a separate @flags argument but just ORs it to @data.
      This is more confusing than helpful. Just take @data.

    - Use the name @flags consistently and add the parameter to
      set_work_pool_and_{keep|clear}_pending(). This will be used by the planned
      disable/enable support.

    No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long 89f1d097be workqueue: Remove clear_work_data()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit afe928c1dc611bec155d834020e0631e026aeb8a
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Remove clear_work_data()

    clear_work_data() is only used in one place and immediately followed by
    smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/
    WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:33 -04:00
Waiman Long 056e8351f0 workqueue: Factor out work_grab_pending() from __cancel_work_sync()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 978b8409eab15aa733ae3a79c9b5158d34cd3fb7
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Factor out work_grab_pending() from __cancel_work_sync()

    The planned disable/enable support will need the same logic. Let's factor it
    out. No functional changes.

    v2: Update function comment to include @irq_flags.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long c1daf24536 workqueue: Clean up enum work_bits and related constants
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff in include/linux/workqueue.h due to missing
	   upstream commit b2fa8443db32 ("workqueue: Split out
	   workqueue_types.h").

commit e9a8e01f9b133c145dd125021ec47c006d108af4
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Clean up enum work_bits and related constants

    The bits of work->data are used for a few different purposes. How the bits
    are used is determined by enum work_bits. The planned disable/enable support
    will add another use, so let's clean it up a bit in preparation.

    - Let WORK_STRUCT_*_BIT's values be determined by enum definition order.

    - Delimit the different bit sections the same way using SHIFT and BITS
      values.

    - Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.

    - Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and
      WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity.

    - Improve documentation.

    No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 2f6aa88b75 workqueue: Introduce work_cancel_flags
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5f5b9422a49e9bc1c2f992135592ed921ac18e5
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Introduce work_cancel_flags

    The cancel path used bool @is_dwork to distinguish canceling a regular work
    and a delayed one. The planned disable/enable support will need passing
    around another flag in the code path. As passing them around with bools will
    be confusing, let's introduce named flags to pass around in the cancel path.

    WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long b483553d36 workqueue: Use variable name irq_flags for saving local irq flags
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c26e2f2e2fcfb73996fa025a0d3b5695017d65b5
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Use variable name irq_flags for saving local irq flags

    Using the generic term `flags` for irq flags is conventional but can be
    confusing as there's quite a bit of code dealing with work flags which
    involves some subtleties. Let's use a more explicit name `irq_flags` for
    local irq flags. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long e6019eb8f1 workqueue: Reorganize flush and cancel[_sync] functions
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cdc6e4b329bc82676886a758a940b2b6987c2109
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Reorganize flush and cancel[_sync] functions

    They are currently a bit disorganized with flush and cancel functions mixed.
    Reorganize them so that flush functions come first, cancel next and
    cancel_sync last. This way, we won't have to add prototypes for internal
    functions for the planned disable/enable support.

    This is pure code reorganization. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long c92ee4a73c workqueue: Rename __cancel_work_timer() to __cancel_work_sync()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5140688d19a4579f7b01e6ca4b6e5f5d23d3d4d
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:14 -1000

    workqueue: Rename __cancel_work_timer() to __cancel_work_sync()

    __cancel_work_timer() is used to implement cancel_work_sync() and
    cancel_delayed_work_sync(), similarly to how __cancel_work() is used to
    implement cancel_work() and cancel_delayed_work(). That is, the _timer part of
    the name is a complete misnomer. The difference from __cancel_work() is the
    fact that it syncs against work item execution not whether it handles timers
    or not.

    Let's rename it to less confusing __cancel_work_sync(). No functional
    change.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 6842b59700 workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d355001fa9370df8fdd6fca0e9ed77063615c7da
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:13 -1000

    workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()

    The different flavors of RCU read critical sections have been unified. Let's
    update the locking assertion macros accordingly to avoid requiring
    unnecessary explicit rcu_read_[un]lock() calls.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 3159667c3e workqueue: Cosmetic changes
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c7a40c49af920fbad2ab6795b6587308ad69de9f
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 20 Feb 2024 19:36:13 -1000

    workqueue: Cosmetic changes

    Reorder some global declarations and adjust comments and whitespaces for
    clarity and consistency. No functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 889c30a2b9 workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit fd0a68a2337b79a7bd4dad5e7d9dc726828527af
Author: Tejun Heo <tj@kernel.org>
Date:   Thu, 15 Feb 2024 19:10:01 -1000

    workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK

    2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues") added
    irq_work usage to workqueue; however, it turns out irq_work is actually
    optional and the change breaks build on configuration which doesn't have
    CONFIG_IRQ_WORK enabled.

    Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
    CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
    be better to just always enable it. However, this still saves a small bit of
    memory for tiny UP configs and also the least amount of change, so, for now,
    let's keep it conditional.

    Verified to do the right thing for x86_64 allnoconfig and defconfig, and
    aarch64 allnoconfig, allnoconfig + printk disabled (SMP but nothing selects
    IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
    IRQ_WORK.

    v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
        selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
        instead.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Tested-by: Anders Roxell <anders.roxell@linaro.org>
    Fixes: 2f34d7337d98 ("workqueue: Fix queue_work_on() with BH workqueues")
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long 6bff994865 workqueue: Fix queue_work_on() with BH workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 2f34d7337d98f3eae7bd3d1270efaf9d8a17cfc6
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 14 Feb 2024 08:33:55 -1000

    workqueue: Fix queue_work_on() with BH workqueues

    When queue_work_on() is used to queue a BH work item on a remote CPU, the
    work item is queued on that CPU but kick_pool() raises softirq on the local
    CPU. This leads to stalls as the work item won't be executed until something
    else on the remote CPU schedules a BH work item or tasklet locally.

    Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long e897cf4939 workqueue: Implement workqueue_set_min_active()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8f172181f24bb5df7675225d9b5b66d059613f50
Author: Tejun Heo <tj@kernel.org>
Date:   Thu, 8 Feb 2024 14:11:56 -1000

    workqueue: Implement workqueue_set_min_active()

    Since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
    for unbound workqueues"), unbound workqueues have separate min_active which
    sets the number of interdependent work items that can be handled. This value
    is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high
    enough for some users, let's add an interface to adjust the setting.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:32 -04:00
Waiman Long a89bd6c4c6 workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 516d3dc99f4f2ab856d879696cd3a5d7f6db7796
Author: Waiman Long <longman@redhat.com>
Date:   Fri, 9 Feb 2024 12:06:11 -0500

    workqueue: Fix kernel-doc comment of unplug_oldest_pwq()

    Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
    proper processing and formatting of the embedded ASCII diagram.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 3d3fe163af workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 49584bb8ddbe8bcfc276c2d7dd3c8890f45f5970
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:14 -0500

    workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask

    Commit 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound
    cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
    an unbound workqueue to the cpumask in wq->unbound_attrs. However
    the unbound_attrs->cpumask of every workqueue is initialized to
    cpu_possible_mask and is only changed if the workqueue has the WQ_SYSFS
    flag, which exposes a cpumask sysfs file for users to write. So this
    patch doesn't achieve what it is intended to do.

    If an unbound workqueue is created after wq_unbound_cpumask is modified
    and there is no more unbound cpumask update after that, the unbound
    rescuer will be bound to all CPUs unless the workqueue is created
    with the WQ_SYSFS flag and a user explicitly modified its cpumask
    sysfs file.  Fix this problem by binding directly to wq_unbound_cpumask
    in init_rescuer().

    Fixes: 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 950ef7b192 kernel/workqueue: Let rescuers follow unbound wq cpumask changes
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d64f2fa064f8866802e23c8ec95d9d1f601480ee
Author: Juri Lelli <juri.lelli@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:13 -0500

    kernel/workqueue: Let rescuers follow unbound wq cpumask changes

    When workqueue cpumask changes are committed the associated rescuer (if
    one exists) affinity is not touched and this might be a problem down the
    line for isolated setups.

    Make sure rescuers affinity is updated every time a workqueue cpumask
    changes, so that rescuers can't break isolation.

     [longman: set_cpus_allowed_ptr() will block until the designated task
      is enqueued on an allowed CPU, no wake_up_process() needed. Also use
      the unbound_effective_cpumask() helper as suggested by Tejun.]

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 9e9a9764ad workqueue: Enable unbound cpumask update on ordered workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4c065dbce1e8639546ef3612acffb062dd084cfe
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 14:12:20 -0500

    workqueue: Enable unbound cpumask update on ordered workqueues

    Ordered workqueues do not currently follow changes made to the
    global unbound cpumask because per-pool workqueue changes may break
    the ordering guarantee. IOW, a work function in an ordered workqueue
    may run on an isolated CPU.

    This patch enables ordered workqueues to follow changes made to the
    global unbound cpumask by temporarily plugging (suspending) the newly
    allocated pool_workqueue, preventing it from executing newly queued
    work items until the old pwq has been properly drained. For ordered
    workqueues, there should only be one unplugged pwq; the rest should
    be plugged.

    This enables ordered workqueues to follow the unbound cpumask changes
    like other unbound workqueues at the expense of some delay in execution
    of work functions during the transition period.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 343954323b workqueue: Link pwq's into wq->pwqs from oldest to newest
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 26fb7e3dda4c16e2cfe2164a1e7315a9386602db
Author: Waiman Long <longman@redhat.com>
Date:   Thu, 8 Feb 2024 11:10:11 -0500

    workqueue: Link pwq's into wq->pwqs from oldest to newest

    Add a new pwq into the tail of wq->pwqs so that pwq iteration will
    start from the oldest pwq to the newest. This ordering will facilitate
    the inclusion of ordered workqueues in a wq_unbound_cpumask update.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long deebfc6ab4 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 3bc1e711c26bff01d41ad71145ecb8dcb4412576
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 14:19:10 -1000

    workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered

    5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
    automatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
    workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
    to create ordered workqueues and the new NUMA support broke it. These
    problems can be subtle and the fact that they can only trigger on NUMA
    machines made them even more difficult to debug.

    However, overloading the UNBOUND allocation interface this way creates other
    issues. It's difficult to tell whether a given workqueue actually needs to
    be ordered and users that legitimately want a min concurrency level wq
    unexpectedly get an ordered one instead. With planned UNBOUND workqueue
    updates to improve execution locality and the increasing prevalence of chiplet designs
    which can benefit from such improvements, this isn't a state we wanna be in
    forever.

    There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
    preceding patches audited all and converted them to
    alloc_ordered_workqueue() as appropriate. This patch removes the implicit
    promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.

    v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
        apply_workqueue_attrs_locked() which spuriously triggers WARNING and
        fails workqueue creation. Fix it.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 4c4d4b9049 workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A context diff due to upstream conflict as shown in merge
	   commit 40911d4457f2 ("Merge branch 'for-6.8-fixes' into
	   for-6.9").

commit 8eb17dc1a6b5db7e89681f59285242af8d182f95
Author: Waiman Long <longman@redhat.com>
Date:   Sat, 3 Feb 2024 10:43:30 -0500

    workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask

    Skip updating workqueues with __WQ_DESTROYING bit set when updating
    global unbound cpumask to avoid unnecessary work and other complications.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long c84568af70 workqueue: fix a typo in comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 96068b6030391082bf0cd97af525d731afa5ad63
Author: Wang Jinchao <wangjinchao@xfusion.com>
Date:   Mon, 5 Feb 2024 08:31:52 +0800

    workqueue: fix a typo in comment

    There should be three, fix it.

    Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 91c025a7a0 Revert "workqueue: make wq_subsys const"
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4f19b8e01e2fb6c97d4307abb7bde4d34a1e601e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 5 Feb 2024 07:18:08 -1000

    Revert "workqueue: make wq_subsys const"

    This reverts commit d412ace11144aa2bf692c7cf9778351efc15c827. This leads to
    build failures as it depends on a driver-core commit 32f78abe59c7 ("driver
    core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
    and route it through driver-core tree.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long da3eaa2838 workqueue: Implement BH workqueues to eventually replace tasklets
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: A minor context diff in kernel/workqueue.c due to missing
	   upstream commit 68279f9c9f59 ("treewide: mark stuff as
	   __ro_after_init").

commit 4cb1ef64609f9b0254184b2947824f4b46ccab22
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Implement BH workqueues to eventually replace tasklets

    The only generic interface to execute asynchronously in the BH context is
    tasklet; however, it's marked deprecated and has some design flaws such as
    the execution code accessing the tasklet item after the execution is
    complete which can lead to subtle use-after-free in certain usage scenarios
    and less-developed flush and cancel mechanisms.

    This patch implements BH workqueues which share the same semantics and
    features of regular workqueues but execute their work items in the softirq
    context. As there is always only one BH execution context per CPU, none of
    the concurrency management mechanisms applies and a BH workqueue can be
    thought of as a convenience wrapper around softirq.

    Except for the inability to sleep while executing and lack of max_active
    adjustments, BH workqueues and work items should behave the same as regular
    workqueues and work items.

    Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
    convert all tasklet users over to BH workqueues. Once the conversion is
    complete, tasklet can be removed and BH workqueues can directly take over
    the tasklet softirqs.

    system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
    tasklet, all existing tasklet users should be able to use the system BH
    workqueues without creating their own workqueues.

    v3: - Add missing interrupt.h include.

    v2: - Instead of using tasklets, hook directly into its softirq action
          functions - tasklet[_hi]_action(). This is slightly cheaper and closer
          to the eventual code structure we want to arrive at. Suggested by Lai.

        - Lai also pointed out several places which need NULL worker->task
          handling or can use clarification. Updated.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
    Tested-by: Allen Pais <allen.lkml@gmail.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long 3eb62e205f workqueue: Factor out init_cpu_worker_pool()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 2fcdb1b44491e08f5334a92c50e8f362e0d46f91
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Factor out init_cpu_worker_pool()

    Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
    reorganization in preparation of BH workqueue support.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Tested-by: Allen Pais <allen.lkml@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long d6a8a41d51 workqueue: Update lock debugging code
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c35aea39d1e106f61fd2130f0d32a3bac8bd4570
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:28:06 -1000

    workqueue: Update lock debugging code

    These changes are in preparation of BH workqueue which will execute work
    items from BH context.

    - Update lock and RCU depth checks in process_one_work() so that it
      remembers and checks against the starting depths and prints out the depth
      changes.

    - Factor out lockdep annotations in the flush paths into
      touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
      from __flush_work() to its callee - start_flush_work(). This brings it
      closer to the wq counterpart and will allow testing the associated wq's
      flags which will be needed to support BH workqueues. This is not expected
      to cause any functional changes.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Tested-by: Allen Pais <allen.lkml@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:31 -04:00
Waiman Long bc8bb2d224 workqueue: make wq_subsys const
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit d412ace11144aa2bf692c7cf9778351efc15c827
Author: Ricardo B. Marliere <ricardo@marliere.net>
Date:   Sun, 4 Feb 2024 10:47:05 -0300

    workqueue: make wq_subsys const

    Now that the driver core can properly handle constant struct bus_type,
    move the wq_subsys variable to be a constant structure as well,
    placing it into read-only memory which can not be modified at runtime.

    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 1f7fc85664 workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c70e1779b73a39f7648b26bdc835304c60100ce3
Author: Tejun Heo <tj@kernel.org>
Date:   Sun, 4 Feb 2024 11:14:21 -1000

    workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()

    dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
    item handling") relocated pwq_dec_nr_in_flight() after
    set_work_pool_and_keep_pending(). However, the latter destroys information
    contained in work->data that's needed by pwq_dec_nr_in_flight() including
    the flush color. With flush color destroyed, flush_workqueue() can stall
    easily when mixed with cancel_work*() usages.

    This is easily triggered by running xfstests generic/001 test on xfs:

         INFO: task umount:6305 blocked for more than 122 seconds.
         ...
         task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
         Call Trace:
          <TASK>
          __schedule+0x2f6/0xa20
          schedule+0x36/0xb0
          schedule_timeout+0x20b/0x280
          wait_for_completion+0x8a/0x140
          __flush_workqueue+0x11a/0x3b0
          xfs_inodegc_flush+0x24/0xf0
          xfs_unmountfs+0x14/0x180
          xfs_fs_put_super+0x3d/0x90
          generic_shutdown_super+0x7c/0x160
          kill_block_super+0x1b/0x40
          xfs_kill_sb+0x12/0x30
          deactivate_locked_super+0x35/0x90
          deactivate_super+0x42/0x50
          cleanup_mnt+0x109/0x170
          __cleanup_mnt+0x12/0x20
          task_work_run+0x60/0x90
          syscall_exit_to_user_mode+0x146/0x150
          do_syscall_64+0x5d/0x110
          entry_SYSCALL_64_after_hwframe+0x6c/0x74

    Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
    and using the stashed value for pwq_dec_nr_in_flight().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Chandan Babu R <chandanbabu@kernel.org>
    Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
    Fixes: dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00
Waiman Long 2eeb9eb5ac workqueue: Avoid premature init of wq->node_nr_active[].max
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c5f8cd6c62ce02205ced15e9a998103f21ec5455
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 30 Jan 2024 19:06:43 -1000

    workqueue: Avoid premature init of wq->node_nr_active[].max

    System workqueues are allocated early during boot from
    workqueue_init_early(). While allocating unbound workqueues,
    wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
    accesses NUMA topology to initialize wq->node_nr_active[].max.

    However, topology information may not be set up at this point.
    wq_update_node_max_active() is explicitly invoked from
    workqueue_init_topology() later when topology information is known to be
    available.

    This doesn't seem to crash anything but it's doing useless work with dubious
    data. Let's skip the premature and duplicate node_max_active updates by
    initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
    wq_update_node_max_active() noop until workqueue_init_topology().

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:30 -04:00