Commit Graph

875 Commits

Waiman Long 5df3631c9c workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit aa6fde93f3a49e42c0fe0490d7f3711bac0d162e
Author: Tejun Heo <tj@kernel.org>
Date:   Mon, 17 Jul 2023 12:50:02 -1000

    workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000

    wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
    Once detected, they're excluded from concurrency management to prevent them
    from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
    enabled, repeat offenders are also reported so that the code can be updated.

    The default threshold is 10ms, which is long enough to do a fair bit of
    work on modern CPUs while short enough to be usually not noticeable. This
    unfortunately leads to a lot of arguably spurious detections on very slow
    CPUs. Using the same threshold across CPUs whose performance levels may
    be orders of magnitude apart doesn't make a whole lot of sense.

    This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
    is below 4000. This is obviously very inaccurate but it doesn't have to be
    accurate to be useful. The mechanism is still useful when the threshold is
    fully scaled up, and the benefits of the reports are usually shared with
    everyone regardless of who's reporting, so as long as there is a
    sufficient number of fast machines reporting, we don't lose much.

    Some (or is it all?) ARM CPUs systematically report significantly lower
    BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
    are, it's at least a missed opportunity and it probably would be a good idea
    to teach workqueue about it.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:25 -04:00
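
For illustration, a minimal C sketch of the scaling described above, assuming
a boot-time init helper and the usual lpj-to-BogoMIPS conversion; the names
follow the upstream commit but the body is a simplified approximation:

    static void __init wq_cpu_intensive_thresh_init(void)
    {
            unsigned long thresh = 10 * USEC_PER_MSEC;      /* 10ms default */
            unsigned long bogo;

            /* see init/calibrate.c for the lpj -> BogoMIPS conversion */
            bogo = max_t(unsigned long, loops_per_jiffy / 500000 * HZ, 1);

            /* scale the threshold up on slow CPUs, capped at 1 second */
            if (bogo < 4000)
                    thresh = min_t(unsigned long,
                                   thresh * 4000 / bogo, USEC_PER_SEC);

            wq_cpu_intensive_thresh_us = thresh;
    }
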
Waiman Long c9a9cddde4 workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 18c8ae813156a6855f026de80fffb91e1a28ab3d
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Thu, 25 May 2023 12:00:38 +0800

    workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0

    If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
    for CPU-hogging per-cpu work items will keep triggering spuriously:

      workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
      workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
      workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
      workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND

    This commit therefore disables the CPU-hog detection mechanism when
    workqueue.cpu_intensive_thresh_us is set to 0.

    tj: Patch description updated and the condition check on
        cpu_intensive_thresh_us separated into a separate if statement for
        readability.

    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
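
As a rough sketch, assuming the placement described above, the fix amounts to
an early return in wq_worker_tick() once the tick has been charged:

    /* inside wq_worker_tick(), after the per-pwq CPU time accounting */
    if (!wq_cpu_intensive_thresh_us)
            return;         /* 0 now disables CPU-hog detection entirely */
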
Waiman Long ebdb8e47b2 workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c8f6219be2e58d7f676935ae90b64abef5d0966a
Author: Zqiang <qiang.zhang1211@gmail.com>
Date:   Wed, 24 May 2023 11:53:39 +0800

    workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()

    Currently, pool->nr_running can be modified from the timer tick; that
    means the tick can run nested inside a non-irq-protected section that's
    in the process of modifying nr_running. Consider the following scenario:

    CPU0
    kworker/0:2 (events)
       worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
       ->pool->nr_running++;  (1)

       process_one_work()
       ->worker->current_func(work);
         ->schedule()
           ->wq_worker_sleeping()
             ->worker->sleeping = 1;
             ->pool->nr_running--;  (0)
               ....
           ->wq_worker_running()
                   ....
                   CPU0 by interrupt:
                   wq_worker_tick()
                   ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
                     ->pool->nr_running--;  (-1)
                     ->worker->flags |= WORKER_CPU_INTENSIVE;
                   ....
             ->if (!(worker->flags & WORKER_NOT_RUNNING))
               ->pool->nr_running++;    (will not execute)
             ->worker->sleeping = 0;
             ....
        ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
          ->pool->nr_running++;  (0)
        ....
        worker_set_flags(worker, WORKER_PREP);
        ->pool->nr_running--;   (-1)
        ....
        worker_enter_idle()
        ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

    If nr_workers is equal to nr_idle, then because nr_running is not zero,
    the WARN_ON_ONCE() triggers.

    [    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
    [    2.462163] Modules linked in:
    [    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
    [    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [    2.465127] Workqueue:  0x0 (events)
    [    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
    ...
    [    2.472614] Call Trace:
    [    2.473152]  <TASK>
    [    2.474182]  worker_thread+0x71/0x430
    [    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
    [    2.475263]  kthread+0x103/0x120
    [    2.475493]  ? __pfx_worker_thread+0x10/0x10
    [    2.476355]  ? __pfx_kthread+0x10/0x10
    [    2.476635]  ret_from_fork+0x2c/0x50
    [    2.477051]  </TASK>

    This commit therefore adds a check of worker->sleeping in wq_worker_tick():
    if worker->sleeping is not zero, return directly.

    tj: Updated comment and description.

    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Anders Roxell <anders.roxell@linaro.org>
    Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
    Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
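
A sketch of the added guard, using the field names from the scenario above:
wq_worker_tick() bails out while wq_worker_sleeping() has decremented
nr_running and wq_worker_running() has not yet restored it:

    /* inside wq_worker_tick(); worker->sleeping is written only by the worker */
    if ((worker->flags & WORKER_NOT_RUNNING) || READ_ONCE(worker->sleeping))
            return;         /* nr_running handoff in flight, don't touch it */
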
Waiman Long de650632ad workqueue: Track and monitor per-workqueue CPU time usage
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 8a1dd1e547c1a037692e7a6da6a76108108c72b1
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:09 -1000

    workqueue: Track and monitor per-workqueue CPU time usage

    Now that wq_worker_tick() is there, we can easily track the rough CPU time
    consumption of each workqueue by charging the whole tick whenever a tick
    hits an active workqueue. While not super accurate, it provides reasonable
    visibility into the workqueues that consume a lot of CPU cycles.
    wq_monitor.py is updated to report the per-workqueue CPU times.

    v2: wq_monitor.py was using "cputime" as the key when outputting in json
        format. Use "cpu_time" instead for consistency with other fields.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
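
The charging reduces to a single increment in the tick path. A sketch, with
PWQ_STAT_CPU_TIME and TICK_USEC following the upstream naming:

    /* inside wq_worker_tick(): charge the whole tick to the active workqueue */
    struct pool_workqueue *pwq = worker->current_pwq;

    if (pwq)
            pwq->stats[PWQ_STAT_CPU_TIME] += TICK_USEC;
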
Waiman Long b2d36126d6 workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 6363845005202148b8409ec3082e80845c19d309
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism

    Workqueue now automatically marks per-cpu work items that hog CPU for too
    long as CPU_INTENSIVE, which excludes them from concurrency management and
    prevents stalling other concurrency-managed work items. If a work function
    keeps running over the threshold, it likely needs to be switched to use an
    unbound workqueue.

    This patch adds a debug mechanism which tracks the work functions that
    trigger the automatic CPU_INTENSIVE mechanism and reports them using
    pr_warn() with exponential backoff.

    v3: Documentation update.

    v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
        introducing circular locking dependency through printk but not effective
        as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
        just print directly using printk_deferred().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
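
A simplified sketch of the backoff: the per-function tracking entry (ent) is
an assumed hash-table record, and reporting only at power-of-two counts
produces the "4 times", "8 times" cadence seen in such warnings:

    /* report a repeat offender only at power-of-two hit counts */
    cnt = atomic64_inc_return_relaxed(&ent->cnt);
    if (is_power_of_2(cnt))
            printk_deferred(KERN_WARNING
                    "workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND\n",
                    ent->func, wq_cpu_intensive_thresh_us, cnt);
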
Waiman Long 1665f6ac9c workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 616db8779b1e3f93075df691432cccc5ef3c3ba0
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE

    If a per-cpu work item hogs the CPU, it can prevent other work items from
    starting through concurrency management. A per-cpu workqueue which intends
    to host such CPU-hogging work items can choose to not participate in
    concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
    error-prone and difficult to debug when missed.

    This patch adds an automatic CPU usage based detection. If a
    concurrency-managed work item consumes more CPU time than the threshold
    (10ms by default) continuously without intervening sleeps, wq_worker_tick()
    which is called from scheduler_tick() will detect the condition and
    automatically mark it CPU_INTENSIVE.

    The mechanism isn't foolproof:

    * Detection depends on the tick hitting the work item. Getting preempted
      at the right moments may allow a violating work item to evade detection,
      at least temporarily.

    * nohz_full CPUs may not be running ticks and thus can fail detection.

    * Even when detection is working, the 10ms detection delays can add up if
      many CPU-hogging work items are queued at the same time.

    However, in the vast majority of cases, this should be able to detect
    violations reliably and provide reasonable protection with a small
    increase in code complexity.

    If some work items trigger this condition repeatedly, the bigger problem
    likely is the CPU being saturated with such per-cpu work items and the
    solution would be making them UNBOUND. The next patch will add a debug
    mechanism to help spot such cases.

    v4: Documentation for workqueue.cpu_intensive_thresh_us added to
        kernel-parameters.txt.

    v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
        suggested by Peter.

    v2: Lai pointed out that wq_worker_stopping() also needs to be called from
        preemption and rtlock paths and an earlier patch was updated
        accordingly. This patch adds a comment describing the risk of infinite
        recursion and how it is avoided.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
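
The detection condition can be sketched as below, assuming the upstream
bookkeeping in which worker->current_at snapshots the task's
se.sum_exec_runtime when a work item starts executing:

    /* inside wq_worker_tick(): did the current work item hog the CPU? */
    if ((worker->flags & WORKER_NOT_RUNNING) ||
        worker->task->se.sum_exec_runtime - worker->current_at <
        wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
            return;

    raw_spin_lock(&pool->lock);
    worker_set_flags(worker, WORKER_CPU_INTENSIVE);
    if (need_more_worker(pool))
            wake_up_worker(pool);   /* unblock concurrency management */
    raw_spin_unlock(&pool->lock);
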
Waiman Long d067533aa7 workqueue: Improve locking rule description for worker fields
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bdf8b9bfc131864f0fcef268b34123acfb6a1b59
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Improve locking rule description for worker fields

    * Some worker fields are modified only by the worker itself while holding
      pool->lock, thus making them safe to read from the worker itself, from
      IRQ context if the CPU is running the worker, or while holding
      pool->lock. Add the 'K' locking rule for them.

    * worker->sleeping is currently marked "None", which isn't very
      descriptive. It's used only by the worker itself. Add the 'S' locking
      rule for it.

    A future patch will depend on the 'K' rule to access worker->current_* from
    the scheduler ticks.

    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:24 -04:00
Waiman Long bdad1a320c workqueue: Move worker_set/clr_flags() upwards
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c54d5046a06b90adb3d1188f0741a88692854354
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Move worker_set/clr_flags() upwards

    They are going to be used in wq_worker_stopping(). Move them upwards.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 20a387c381 workqueue: Add pwq->stats[] and a monitoring script
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 725e8ec59c56c65fb92e343c10a8842cd0d4f194
Author: Tejun Heo <tj@kernel.org>
Date:   Wed, 17 May 2023 17:02:08 -1000

    workqueue: Add pwq->stats[] and a monitoring script

    Currently, the only way to peer into workqueue operations is through
    tracing. While possible, it isn't easy or convenient to monitor
    per-workqueue behaviors over time this way. Let's add pwq->stats[] that
    track relevant events and a drgn monitoring script -
    tools/workqueue/wq_monitor.py.

    It's arguable whether this needs to be configurable. However, it currently
    only has a handful of counters and the runtime overhead shouldn't be
    noticeable given that they're on pwqs, which are per-cpu on per-cpu
    workqueues and per-numa-node on unbound ones. Let's keep it simple for the
    time being.

    v2: Patch reordered to earlier with fewer fields. Fields will be added
        back gradually. Help message improved.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
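
A sketch of the counter plumbing; the enum entries here are illustrative
rather than the exact upstream set:

    enum pool_workqueue_stats {
            PWQ_STAT_STARTED,       /* work items started execution */
            PWQ_STAT_COMPLETED,     /* work items completed execution */
            PWQ_STAT_CM_WAKEUP,     /* concurrency-management wake-ups */
            PWQ_NR_STATS,
    };

    /* bumped from hot paths, e.g. when a work item starts executing */
    pwq->stats[PWQ_STAT_STARTED]++;
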
Waiman Long 0d4b8874cf Further upgrade queue_work_on() comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 854f5cc5b7355ceebf2bdfed97ea8f3c5d47a0c3
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 28 Apr 2023 16:47:07 -0700

    Further upgrade queue_work_on() comment

    The current queue_work_on() docbook comment says that the caller must
    ensure that the specified CPU can't go away, and further says that the
    penalty for failing to nail down the specified CPU is that the workqueue
    handler might find itself executing on some other CPU.  This is true
    as far as it goes, but fails to note what happens if the specified CPU
    never was online.  Therefore, further expand this comment to say that
    specifying a CPU that was never online will result in a splat.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long e8df0001f6 workqueue: clean up WORK_* constant types, clarify masking
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit afa4bb778e48d79e4a642ed41e3b4e0de7489a6c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri, 23 Jun 2023 12:08:14 -0700

    workqueue: clean up WORK_* constant types, clarify masking

    Dave Airlie reports that gcc-13.1.1 has started complaining about some
    of the workqueue code in 32-bit arm builds:

      kernel/workqueue.c: In function ‘get_work_pwq’:
      kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
        713 |                 return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
            |                        ^
      [ ... a couple of other cases ... ]

    and while it's not immediately clear exactly why gcc started complaining
    about it now, I suspect some C23-induced enum type handling fixup in
    gcc-13 is the cause.

    Whatever the reason for starting to complain, the code and data types
    are indeed disgusting enough that the complaint is warranted.

    The wq code ends up creating various "helper constants" (like that
    WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
    confused.  The mask needs to be 'unsigned long', not some unspecified
    enum type.

    To make matters worse, the actual "mask and cast to a pointer" is
    repeated a couple of times, and the cast isn't even always done to the
    right pointer, but - as in the error case above - to a 'void *' with the
    compiler then finishing the job.

    That's not how we roll in the kernel.

    So create the masks using the proper types rather than some ambiguous
    enumeration, and use a nice helper that actually does the type
    conversion in one well-defined place.

    Incidentally, this magically makes clang generate better code.  That,
    admittedly, is really just a sign of clang having been seriously
    confused before, and cleaning up the typing unconfuses the compiler too.

    Reported-by: Dave Airlie <airlied@gmail.com>
    Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
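
The shape of the cleanup, abbreviated from the commit: make the mask a real
unsigned long and do the mask-and-cast in one typed helper:

    /* an unsigned long constant, not an unspecified enum type */
    #define WORK_STRUCT_WQ_DATA_MASK  (~(unsigned long)WORK_STRUCT_FLAG_MASK)

    static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
    {
            /* one well-defined place for the conversion */
            return (struct pool_workqueue *)(data & WORK_STRUCT_WQ_DATA_MASK);
    }
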
Waiman Long 2a1c329725 workqueue: Introduce show_freezable_workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 704bc669e1dda3eb8f6d5cb462b21e85558a3912
Author: Jungseung Lee <js07.lee@samsung.com>
Date:   Mon, 20 Mar 2023 12:29:05 +0900

    workqueue: Introduce show_freezable_workqueues

    Currently, show_all_workqueues() is called if freezing fails at the time
    the workqueues are frozen; it shows the status of all workqueues and of
    all worker pools. In this case we may only need to dump the state of the
    workqueues that are freezable and busy.

    This patch defines show_freezable_workqueues, which uses
    show_one_workqueue, a granular function that shows the state of individual
    workqueues, so that only the state of freezable workqueues is dumped at
    that time.

    tj: Minor message adjustment.

    Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
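
A hedged sketch of the new dump function: walk the workqueue list and reuse
show_one_workqueue() for the freezable ones (RCU locking elided):

    void show_freezable_workqueues(void)
    {
            struct workqueue_struct *wq;

            pr_info("Showing freezable workqueues that are still busy:\n");

            list_for_each_entry_rcu(wq, &workqueues, list) {
                    if (!(wq->flags & WQ_FREEZABLE))
                            continue;
                    show_one_workqueue(wq);
            }
    }
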
Waiman Long 058229c9f6 workqueue: Print backtraces from CPUs with hung CPU bound workqueues
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cd2440d66fec7d1bdb4f605b64c27c63c9141989
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:35 +0100

    workqueue: Print backtraces from CPUs with hung CPU bound workqueues

    The workqueue watchdog reports a lockup when there has not been any
    progress in the worker pool for a long time. Progress means that a
    pending work item starts being processed.

    Worker pools for unbound workqueues always wake up an idle worker and
    try to process the work immediately. The last idle worker has to create
    a new worker first. The stall can happen only when a new worker could
    not be created, in which case an error should get printed. Another
    possible problem is too high a load; in that case, workers are victims
    of a global system problem.

    Worker pools for CPU-bound workqueues are designed for lightweight
    work items that do not need much CPU time. They are processed one by
    one by a single worker. A new worker is used only when a work item
    sleeps. This creates one additional scenario: the stall can happen when
    the CPU-bound workqueue is used for CPU-intensive work.

    More precisely, the stall is detected when a CPU-bound worker is in
    the TASK_RUNNING state for too long. In this case, it might be useful
    to see the backtrace from the problematic worker.

    Information about how long a worker has been in the running state is not
    available. But the CPU-bound worker pools do not have many workers in the
    running state by definition. And only a few pools are typically blocked.

    It should be acceptable to print backtraces from all workers in
    TASK_RUNNING state in the stalled worker pools. The number of false
    positives should be very low.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 4f4620189e workqueue: Warn when a rescuer could not be created
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4c0736a76a186e5df2cd2afda3e7a04d2a427d1b
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:34 +0100

    workqueue: Warn when a rescuer could not be created

    Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
    It typically happens during the system boot.

    systemd switches the root filesystem from the initrd to the booted system
    during boot. It kills processes that block the switch for too long.
    One of those processes might be a modprobe that tries to create a
    workqueue.

    These problems are hard to reproduce. Also, alloc_workqueue() does not
    pass on the error code. Make the debugging easier by printing an error,
    similar to create_worker().

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long e97b3edb25 workqueue: Interrupted create_worker() is not a repeated event
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 60f540389a5d2df25ddc7ad511b4fa2880dea521
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:33 +0100

    workqueue: Interrupted create_worker() is not a repeated event

    kthread_create_on_node() might get interrupted. It is rare but realistic.
    For example, when an unbound workqueue is allocated in a module_init()
    callback, it is done in the context of the "modprobe" process, and
    systemd might kill pending processes when switching root from the initrd
    to the booted system.

    The interrupt is a one-off event and the race might be hard to reproduce.
    It is always worth printing.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
Waiman Long 48949a02ba workqueue: Warn when a new worker could not be created
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 3f0ea0b864562c6bd1cee892026067eaea7be242
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:32 +0100

    workqueue: Warn when a new worker could not be created

    The workqueue watchdog reports a lockup when there has not been any
    progress in the worker pool for a long time. Progress means that a
    pending work item starts being processed.

    The progress is guaranteed by using idle workers or creating new workers
    for pending work items.

    There are several reasons why a new worker could not be created:

       + there is not enough memory

       + there is no free pool ID (IDR API)

       + the system reached PID limit

       + the process creating the new worker was interrupted

       + the last idle worker (manager) has not been scheduled for a long
         time. It was not able to even start creating the kthread.

    None of these failures is reported at the moment. The only clue is that
    show_one_worker_pool() prints that there is a manager. It is the last
    idle worker that is responsible for creating a new one. But it is not
    clear whether create_worker() is failing, and why.

    Make the debugging easier by printing errors in create_worker().

    The error code is important, especially from kthread_create_on_node().
    It helps to distinguish the various reasons. For example, reaching
    memory limit (-ENOMEM), other system limits (-EAGAIN), or process
    interrupted (-EINTR).

    Use pr_err_once() to avoid repeating the same error every CREATE_COOLDOWN
    for each stuck worker pool.

    A ratelimited printk() might be better. It would help to know whether the
    problem persists, and it would make it clearer whether the create_worker()
    errors and workqueue stalls are related. Also, old messages might get lost
    when the internal log buffer is full. The problem is that printk() might
    touch the watchdog. For example, see touch_nmi_watchdog() in
    serial8250_console_write(). It would require synchronizing the beginning
    and length of the ratelimit interval with the workqueue watchdog.
    Otherwise, the error messages might break the watchdog. This does not
    look worth the complexity.

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
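
A sketch of the resulting error reporting in create_worker(), folding in the
"Interrupted create_worker()" change above: -EINTR is printed every time
while other errors use a _once variant:

    worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
                                          "kworker/%s", id_buf);
    if (IS_ERR(worker->task)) {
            if (PTR_ERR(worker->task) == -EINTR)
                    /* a one-off event, always worth printing */
                    pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
                           id_buf);
            else
                    pr_err_once("workqueue: Failed to create a worker thread: %pe",
                                worker->task);
            goto fail;
    }
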
Waiman Long c3296e8c7e workqueue: Fix hung time report of worker pools
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113
Author: Petr Mladek <pmladek@suse.com>
Date:   Tue, 7 Mar 2023 13:53:31 +0100

    workqueue: Fix hung time report of worker pools

    The workqueue watchdog prints a warning when there is no progress in
    a worker pool, where progress means that the pool started processing
    a pending work item.

    Note that it is perfectly fine to process work items for much longer.
    Progress should be guaranteed by waking up or creating idle
    workers.

    show_one_worker_pool() prints the state of a non-idle worker pool. It
    shows a delay since the last pool->watchdog_ts.

    The timestamp is updated when the first pending work item is queued in
    __queue_work(). It is also updated when a work item is dequeued for
    processing in worker_thread() and rescuer_thread().

    The delay is misleading when there is no pending work item. In this
    case it shows how long the last work item has been processed. Show
    zero instead. There is no stall if there is no pending work.

    Fixes: 82607adcf9 ("workqueue: implement lockup detector")
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:23 -04:00
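
The fix can be sketched as computing the hung time only when something is
actually pending, in the show_one_worker_pool() context:

    /* no pending work means no stall, so report zero */
    unsigned long hung = 0;

    if (!list_empty(&pool->worklist))
            hung = jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000;

    pr_cont(" hung=%lus", hung);
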
Waiman Long cf8b90187f workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
JIRA: https://issues.redhat.com/browse/RHEL-25103
Conflicts: Context diff due to the presence of a later upstream commit
	   4a6c5607d450 ("workqueue: Make sure that wq_unbound_cpumask
	   is never empty").

commit a8ec5880bd82b834717770cba4596381ffd50545
Author: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Date:   Sun, 26 Feb 2023 23:53:20 +0700

    workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()

    Use pr_warn_once() to achieve the same thing. It's simpler.

    Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long a088b39f32 workqueue: Make show_pwq() use run-length encoding
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit c76feb0d5dfdb90b70fa820bb3181142bb01e980
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri, 6 Jan 2023 16:10:24 -0800

    workqueue: Make show_pwq() use run-length encoding

    The show_pwq() function dumps out a pool_workqueue structure's activity,
    including the pending work-queue handlers:

     Showing busy workqueues and worker pools:
     workqueue events: flags=0x0
       pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
         in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
         pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1

    When large systems are facing certain types of hang conditions, it is not
    unusual for this "pending" list to contain runs of hundreds of identical
    function names.  This "wall of text" is difficult to read, and worse yet,
    it can be interleaved with other output such as stack traces.

    Therefore, make show_pwq() use run-length encoding so that the above
    printout instead looks like this:

     Showing busy workqueues and worker pools:
     workqueue events: flags=0x0
       pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
         in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
         pending: 2*test_work_func, 5*test_work_func1

    When no comma would be printed, including the WORK_STRUCT_LINKED case,
    a new run is started unconditionally.

    This output is more readable, places less stress on the hardware,
    firmware, and software on the console-log path, and reduces interference
    with other output.

    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Dave Jones <davej@codemonkey.org.uk>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
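
Conceptually the encoder counts consecutive identical functions and flushes
a run on change; flush_run() below is a hypothetical helper standing in for
the commit's pr_cont()-based printing:

    work_func_t prev = NULL;
    long ctr = 0;

    list_for_each_entry(work, &pool->worklist, entry) {
            if (work->func == prev) {
                    ctr++;
                    continue;
            }
            flush_run(prev, ctr);   /* prints "func" or "N*func" */
            prev = work->func;
            ctr = 1;
    }
    flush_run(prev, ctr);
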
Waiman Long 7de5240e80 workqueue: Add a new flag to spot the potential UAF error
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 33e3f0a3358b8f9bb54b2661b9c1d37a75664c79
Author: Richard Clark <richard.xnu.clark@gmail.com>
Date:   Tue, 13 Dec 2022 12:39:36 +0800

    workqueue: Add a new flag to spot the potential UAF error

    Currently, if the user unintentionally queues a new work item into a wq
    after destroy_workqueue(wq), the work can still be queued and scheduled
    without any noticeable kernel message before the end of an RCU grace
    period.

    As a debug-aid facility, this commit adds a new flag
    __WQ_DESTROYING to spot that issue by triggering a kernel WARN
    message.

    Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
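
The mechanism in outline, per the description above (a sketch, not the full
patch): destroy_workqueue() sets the flag and __queue_work() warns on late
arrivals:

    /* destroy_workqueue(): mark the wq as dying before draining it */
    mutex_lock(&wq->mutex);
    wq->flags |= __WQ_DESTROYING;
    mutex_unlock(&wq->mutex);

    /* __queue_work(): reject and warn about queueing after destruction */
    if (WARN_ON_ONCE(wq->flags & __WQ_DESTROYING))
            return;
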
Waiman Long 50f44cde6c workqueue: Make queue_rcu_work() use call_rcu_hurry()
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit a7e30c0e9a5f95b7f74e6272d9c75fd65c897721
Author: Uladzislau Rezki <urezki@gmail.com>
Date:   Sun, 16 Oct 2022 16:23:03 +0000

    workqueue: Make queue_rcu_work() use call_rcu_hurry()

    Earlier commits in this series allow battery-powered systems to build
    their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
    This Kconfig option causes call_rcu() to delay its callbacks in order
    to batch them.  This means that a given RCU grace period covers more
    callbacks, thus reducing the number of grace periods, in turn reducing
    the amount of energy consumed, which increases battery lifetime which
    can be a very good thing.  This is not a subtle effect: In some important
    use cases, the battery lifetime is increased by more than 10%.

    This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
    callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
    parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.

    Delaying callbacks is normally not a problem because most callbacks do
    nothing but free memory.  If the system is short on memory, a shrinker
    will kick all currently queued lazy callbacks out of their laziness,
    thus freeing their memory in short order.  Similarly, the rcu_barrier()
    function, which blocks until all currently queued callbacks are invoked,
    will also kick lazy callbacks, thus enabling rcu_barrier() to complete
    in a timely manner.

    However, there are some cases where laziness is not a good option.
    For example, synchronize_rcu() invokes call_rcu(), and blocks until
    the newly queued callback is invoked.  It would not be good for
    synchronize_rcu() to block for ten seconds, even on an idle system.
    Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
    call_rcu().  The arrival of a non-lazy call_rcu_hurry() callback on a
    given CPU kicks any lazy callbacks that might be already queued on that
    CPU.  After all, if there is going to be a grace period, all callbacks
    might as well get full benefit from it.

    Yes, this could be done the other way around by creating a
    call_rcu_lazy(), but earlier experience with this approach and
    feedback at the 2022 Linux Plumbers Conference shifted the approach
    to call_rcu() being lazy with call_rcu_hurry() for the few places
    where laziness is inappropriate.

    And another call_rcu() instance that cannot be lazy is the one
    in queue_rcu_work(), given that callers to queue_rcu_work() are
    not necessarily OK with long delays.

    Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
    to the old behavior.

    [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

    Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
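
The change is a one-call substitution inside queue_rcu_work(), sketched here
with the surrounding pending-bit logic assumed from the existing function:

    bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
    {
            struct work_struct *work = &rwork->work;

            if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                    rwork->wq = wq;
                    call_rcu_hurry(&rwork->rcu, rcu_work_rcufn);    /* was call_rcu() */
                    return true;
            }
            return false;
    }
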
Waiman Long 50392f94c5 treewide: Drop WARN_ON_FUNCTION_MISMATCH
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 4b24356312fbe1bace72f9905d529b14fc34c1c3
Author: Sami Tolvanen <samitolvanen@google.com>
Date:   Thu, 8 Sep 2022 14:54:56 -0700

    treewide: Drop WARN_ON_FUNCTION_MISMATCH

    CONFIG_CFI_CLANG no longer breaks cross-module function address
    equality, which makes WARN_ON_FUNCTION_MISMATCH unnecessary. Remove
    the definition and switch back to WARN_ON_ONCE.

    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Tested-by: Kees Cook <keescook@chromium.org>
    Tested-by: Nathan Chancellor <nathan@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220908215504.3686827-15-samitolvanen@google.com

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long c023e9b1d1 workqueue: Convert the type of pool->nr_running to int
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit bc35f7ef96284b8c963991357a9278a6beafca54
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:40 +0800

    workqueue: Convert the type of pool->nr_running to int

    It is only modified on its associated CPU, so it doesn't need to be atomic.

    tj: Comment updated.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long ec7c270fc1 workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit cc5bff38463e0894dd596befa99f9d6860e15f5e
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:39 +0800

    workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code

    The wakeup code in wq_worker_sleeping() is the same as wake_up_worker().

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Waiman Long 5ea3f1a8eb workqueue: Upgrade queue_work_on() comment
JIRA: https://issues.redhat.com/browse/RHEL-25103

commit 443378f0664a78756c3e3aeaab92750fe1e05735
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue, 30 Nov 2021 17:00:30 -0800

    workqueue: Upgrade queue_work_on() comment

    The current queue_work_on() docbook comment says that the caller must
    ensure that the specified CPU can't go away, but does not spell out the
    consequences, which turn out to be quite mild.  Therefore expand this
    comment to explicitly say that the penalty for failing to nail down the
    specified CPU is that the workqueue handler might find itself executing
    on some other CPU.

    Cc: Tejun Heo <tj@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-05-03 13:39:22 -04:00
Audra Mitchell 3b21196ba3 workqueue: Shorten events_freezable_power_efficient name
JIRA: https://issues.redhat.com/browse/RHEL-3534

This patch is a backport of the following upstream commit:
commit 8318d6a6362f5903edb4c904a8dd447e59be4ad1
Author: Audra Mitchell <audra@redhat.com>
Date:   Thu Jan 25 14:05:32 2024 -0500

    workqueue: Shorten events_freezable_power_efficient name

    Since we have set WQ_NAME_LEN to 32, shorten the name of
    events_freezable_power_efficient so that it does not trip the name-length
    warning when the workqueue is created.

    Signed-off-by: Audra Mitchell <audra@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-05-03 09:45:58 -04:00
Audra Mitchell f45c2f9160 workqueue.c: Increase workqueue name length
JIRA: https://issues.redhat.com/browse/RHEL-3534

This patch is a backport of the following upstream commit:
commit 31c89007285d365aa36f71d8fb0701581c770a27
Author: Audra Mitchell <audra@redhat.com>
Date:   Mon Jan 15 12:08:22 2024 -0500

    workqueue.c: Increase workqueue name length

    Currently we limit the size of the workqueue name to 24 characters due to
    commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len").
    Increase the size to 32 characters and print a warning in the event that
    the requested name is larger than the new 32-character limit.

    Signed-off-by: Audra Mitchell <audra@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-05-03 09:45:58 -04:00
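
A sketch of the truncation warning in alloc_workqueue(), assuming the
vsnprintf()-based name formatting:

    /* warn once if the requested name no longer fits in wq->name[] */
    if (vsnprintf(wq->name, sizeof(wq->name), fmt, args) >= WQ_NAME_LEN)
            pr_warn_once("workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s\n",
                         wq->name);
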
Leonardo Bras 6f7f4ba4b1 workqueue: Avoid using isolated cpus' timers on queue_delayed_work
JIRA: https://issues.redhat.com/browse/RHEL-20254
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/

commit aae17ebb53cd3da37f5dfbde937acd091eb4340c
Author: Leonardo Bras <leobras@redhat.com>
Date:   Mon Jan 29 22:00:46 2024 -0300

    workqueue: Avoid using isolated cpus' timers on queue_delayed_work

    When __queue_delayed_work() is called, it chooses a CPU for handling the
    timer interrupt. As of today, it will pick either the CPU passed as a
    parameter or the last CPU used for this.

    This is not good if the system uses CPU isolation, because it can take
    away valuable CPU time to:
    1 - deal with the timer interrupt,
    2 - schedule-out the desired task,
    3 - queue work on a random workqueue, and
    4 - schedule the desired task back to the CPU.

    So to fix this, during __queue_delayed_work(), if cpu isolation is in
    place, pick a random non-isolated cpu to handle the timer interrupt.

    As an optimization, if the current cpu is not isolated, use it instead
    of looking for another candidate.

    Signed-off-by: Leonardo Bras <leobras@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Leonardo Bras <leobras@redhat.com>
2024-02-22 16:47:15 -03:00
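
A rough sketch of the CPU pick in __queue_delayed_work(), using the
housekeeping API; the exact interaction with an explicitly passed cpu
follows the commit rather than this simplification:

    if (housekeeping_enabled(HK_TYPE_TIMER)) {
            /* optimization: keep the timer local if we're not isolated */
            cpu = smp_processor_id();
            if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
                    cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
            add_timer_on(timer, cpu);
    } else if (likely(cpu == WORK_CPU_UNBOUND)) {
            add_timer(timer);
    } else {
            add_timer_on(timer, cpu);
    }
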
Waiman Long 6524bc7b74 workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff due to missing upstream commit
	   fcecfa8f271a ("workqueue: Remove module param disable_numa
	   and sysfs knobs pool_ids and numa").

commit 49277a5b76373e630075ff7d32fc0f9f51294f24
Author: Waiman Long <longman@redhat.com>
Date:   Mon, 20 Nov 2023 21:18:40 -0500

    workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS

    Commit fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask()
    to exclude CPUs from wq_unbound_cpumask") makes
    workqueue_set_unbound_cpumask() static as it is not used elsewhere in
    the kernel. However, this triggers a kernel test robot warning about
    'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYSFS
    isn't defined. It happens that workqueue_set_unbound_cpumask() is only
    called when CONFIG_SYSFS is defined.

    Move workqueue_set_unbound_cpumask() and its helpers inside the
    CONFIG_SYSFS compilation block to avoid the warning. There is no
    functional change.

    Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:47 -05:00
Waiman Long 24be7e35b7 workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts:
 1) A merge conflict in the workqueue_unbound_exclude_cpumask() hunk
    of kernel/workqueue.c due to missing upstream commit 63c5484e7495
    ("workqueue: Add multiple affinity scopes and interface to select
    them").
 2) A merge conflict in the workqueue_init_early() hunk of
    kernel/workqueue.c due to upstream merge conflict resolved according
    to upstream merge commit 202595663905 ("Merge branch 'for-6.7-fixes'
    of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-6.8").

commit fe28f631fa941fba583d1c4f25895284b90af671
Author: Waiman Long <longman@redhat.com>
Date:   Wed, 25 Oct 2023 14:25:52 -0400

    workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask

    When the "isolcpus" boot command line option is used to add a set
    of isolated CPUs, those CPUs will be excluded automatically from
    wq_unbound_cpumask to avoid running work functions from unbound
    workqueues.

    Recently cpuset has been extended to allow the creation of partitions
    of isolated CPUs dynamically. To make it closer to the "isolcpus"
    in functionality, the CPUs in those isolated cpuset partitions should be
    excluded from wq_unbound_cpumask as well. This can be done currently by
    explicitly writing to the workqueue's cpumask sysfs file after creating
    the isolated partitions. However, this process can be error prone.

    Ideally, the cpuset code should be allowed to request the workqueue code
    to exclude those isolated CPUs from wq_unbound_cpumask so that this
    operation can be done automatically and the isolated CPUs will be
    returned to wq_unbound_cpumask after the destruction of the isolated
    cpuset partitions.

    This patch adds a new workqueue_unbound_exclude_cpumask() function to
    enable that. This new function will exclude the specified isolated
    CPUs from wq_unbound_cpumask. To be able to restore those isolated
    CPUs back after the destruction of isolated cpuset partitions, a new
    wq_requested_unbound_cpumask is added to store the user provided unbound
    cpumask either from the boot command line options or from writing to
    the cpumask sysfs file. This new cpumask provides the basis for CPU
    exclusion.

    To enable users to understand how the wq_unbound_cpumask is being
    modified internally, this patch also exposes the newly introduced
    wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to
    store the cpumask to be excluded from wq_unbound_cpumask as read-only
    sysfs files.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:47 -05:00
Waiman Long 1d28ea804a workqueue: Make sure that wq_unbound_cpumask is never empty
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A merge conflict due to missing upstream commit fef59c9cab6a
	   ("workqueue: Rename NUMA related names to use pod instead")
	   and two other subsequent workqueue commits.

commit 4a6c5607d4502ccd1b15b57d57f17d12b6f257a7
Author: Tejun Heo <tj@kernel.org>
Date:   Tue, 21 Nov 2023 11:39:36 -1000

    workqueue: Make sure that wq_unbound_cpumask is never empty

    During boot, depending on how the housekeeping and workqueue.unbound_cpus
    masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9b1
    ("workqueue: Implement non-strict affinity scope for unbound workqueues"),
    this may end up feeding -1 as a CPU number into scheduler leading to oopses.

      BUG: unable to handle page fault for address: ffffffff8305e9c0
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      ...
      Call Trace:
       <TASK>
       select_idle_sibling+0x79/0xaf0
       select_task_rq_fair+0x1cb/0x7b0
       try_to_wake_up+0x29c/0x5c0
       wake_up_process+0x19/0x20
       kick_pool+0x5e/0xb0
       __queue_work+0x119/0x430
       queue_work_on+0x29/0x30
      ...

    An empty wq_unbound_cpumask is a clear misconfiguration and is already
    disallowed once the system is booted up. Let's warn on and ignore
    unbound_cpumask restrictions which lead to no unbound CPUs. While at it,
    also remove the now unnecessary empty check on wq_unbound_cpumask in
    wq_select_unbound_cpu().

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
    Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
    Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
    Cc: stable@vger.kernel.org # v6.6+
    Reviewed-by: Waiman Long <longman@redhat.com>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:46 -05:00
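
The warn-and-ignore logic reads naturally as a small boot-time helper
applied to each restriction source; a sketch along the commit's lines:

    static void __init restrict_unbound_cpumask(const char *name,
                                                const struct cpumask *mask)
    {
            if (!cpumask_intersects(wq_unbound_cpumask, mask)) {
                    pr_warn("workqueue: Restricting unbound_cpumask (%*pbl) by %s (%*pbl) leaves no CPU, ignoring\n",
                            cpumask_pr_args(wq_unbound_cpumask), name,
                            cpumask_pr_args(mask));
                    return;
            }
            cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, mask);
    }
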
Waiman Long bed8f3efe3 workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
JIRA: https://issues.redhat.com/browse/RHEL-21798

commit ca10d851b9ad0338c19e8e3089e24d565ebfffd7
Author: Waiman Long <longman@redhat.com>
Date:   Tue, 10 Oct 2023 22:48:42 -0400

    workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()

    Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
    to be ordered") enabled implicit ordered attribute to be added to
    WQ_UNBOUND workqueues with max_active of 1. This prevented the changing
    of attributes to these workqueues leading to fix commit 0a94efb5ac
    ("workqueue: implicit ordered attribute should be overridable").

    However, workqueue_apply_unbound_cpumask() was not updated at that time,
    so sysfs changes to wq_unbound_cpumask have no effect on WQ_UNBOUND
    workqueues with the implicit ordered attribute. Since not all WQ_UNBOUND
    workqueues are visible in sysfs, we are not able to make all the
    necessary cpumask changes even if we iterate over all the workqueue
    cpumasks in sysfs and change them one by one.

    Fix this problem by applying the corresponding change made
    to apply_workqueue_attrs_locked() in the fix commit to
    workqueue_apply_unbound_cpumask().

    Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:46 -05:00
Waiman Long 4356616088 workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time
JIRA: https://issues.redhat.com/browse/RHEL-21798
Conflicts: A minor context diff in kernel/workqueue.c due to missing
	   upstream commit 20bdedafd2f6 ("workqueue: Warn attempt to
	   flush system-wide workqueues.").

commit ace3c5499e61ef7c0433a7a297227a9bdde54a55
Author: tiozhang <tiozhang@didiglobal.com>
Date:   Thu, 29 Jun 2023 11:50:50 +0800

    workqueue: add cmdline parameter `workqueue.unbound_cpus` to further constrain wq_unbound_cpumask at boot time

    The motivation for doing this is to improve boot times for devices when
    we want to prevent our workqueue works from running on some specific
    CPUs, e.g., CPUs that are busy with interrupts.

    Signed-off-by: tiozhang <tiozhang@didiglobal.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-01-16 14:24:45 -05:00
Mark Langsdorf d4e81a63a3 workqueue: move to use bus_get_dev_root()
JIRA: https://issues.redhat.com/browse/RHEL-1023

commit 686f669780276da534e93ba769e02bdcf1f89f8d
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Mon Mar 13 19:28:50 2023 +0100

    workqueue: move to use bus_get_dev_root()

    Direct access to the struct bus_type dev_root pointer is going away soon
    so replace that with a call to bus_get_dev_root() instead, which is what
    it is there for.

    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2023-11-01 11:12:31 -05:00
Waiman Long 8cbdd24861 workqueue: Fold rebind_worker() within rebind_workers()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit c63a2e52d5e08f01140d7b76c08a78e15e801f03
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri, 13 Jan 2023 17:40:40 +0000

    workqueue: Fold rebind_worker() within rebind_workers()

    !CONFIG_SMP builds complain about rebind_worker() being unused. Its only
    user, rebind_workers() is indeed only defined for CONFIG_SMP, so just fold
    the two lines back up there.

    Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:31 -04:00
Waiman Long 107339e408 workqueue: Unbind kworkers before sending them to exit()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit e02b93124855cd34b78e61ae44846c8cb5fddfc3
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:31 +0000

    workqueue: Unbind kworkers before sending them to exit()

    It has been reported that isolated CPUs can suffer from interference due to
    per-CPU kworkers waking up just to die.

    A surge of workqueue activity during initial setup of a latency-sensitive
    application (refresh_vm_stats() being one of the culprits) can cause extra
    per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
    running merrily on an isolated CPU only to be interrupted sometime later by
    a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
    kworker activity).

    Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
    contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
    them with WORKER_DIE.

    Changing the affinity does require a sleepable context, leverage the newly
    introduced pool->idle_cull_work to get that.

    Remove dying workers from pool->workers and keep track of them in a
    separate list. This intentionally prevents for_each_loop_worker() from
    iterating over workers that are marked for death.

    Rename destroy_worker() to set_worker_dying() to better reflect its
    effects and its relationship with wake_dying_workers().

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:31 -04:00
Waiman Long 813b945165 workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 9ab03be42b8f9136dcc01a90ecc9ac71bc6149ef
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:30 +0000

    workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE

    put_unbound_pool() currently passes wq_manager_inactive() as exit condition
    to rcuwait_wait_event(), which grabs pool->lock to check for

      pool->flags & POOL_MANAGER_ACTIVE

    A later patch will require destroy_worker() to be invoked with
    wq_pool_attach_mutex held, which needs to be acquired before
    pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
    it could clobber the task state set by rcuwait_wait_event()

    Instead, restructure the waiting logic to acquire any necessary lock
    outside of rcuwait_wait_event().

    Since further work cannot be inserted into unbound pwqs that have reached
    ->refcnt==0, this is bound to make forward progress as eventually the
    worklist will be drained and need_more_worker(pool) will remain false,
    preventing any worker from stealing the manager position from us.

    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:30 -04:00
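
The restructured wait in put_unbound_pool() can be sketched as a loop that
sleeps without any lock held and re-checks the flag under the locks (names
per the upstream code):

    /* become the manager: wait lockless, then claim under the locks */
    while (true) {
            rcuwait_wait_event(&manager_wait,
                               !(pool->flags & POOL_MANAGER_ACTIVE),
                               TASK_UNINTERRUPTIBLE);

            mutex_lock(&wq_pool_attach_mutex);
            raw_spin_lock_irq(&pool->lock);
            if (!(pool->flags & POOL_MANAGER_ACTIVE)) {
                    pool->flags |= POOL_MANAGER_ACTIVE;
                    break;          /* exit with both locks held */
            }
            raw_spin_unlock_irq(&pool->lock);
            mutex_unlock(&wq_pool_attach_mutex);
    }
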
Waiman Long 7ea6709544 workqueue: Convert the idle_timer to a timer + work_struct
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 3f959aa3b33829acfcd460c6c656d54dfebe8d1e
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:29 +0000

    workqueue: Convert the idle_timer to a timer + work_struct

    A later patch will require a sleepable context in the idle worker timeout
    function. Converting worker_pool.idle_timer to a delayed_work gives us just
    that, however this would imply turning all idle_timer expiries into
    scheduler events (waking up a worker to handle the dwork).

    Instead, implement a "custom dwork" where the timer callback does some
    extra checks before queuing the associated work.

    No change in functionality intended.

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:30 -04:00
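
The "custom dwork" can be sketched as a timer callback that only turns into
a scheduler event when culling is actually due; the expiry bookkeeping is
simplified here:

    static void idle_worker_timeout(struct timer_list *t)
    {
            struct worker_pool *pool = from_timer(pool, t, idle_timer);
            bool do_cull = false;

            raw_spin_lock_irq(&pool->lock);
            if (too_many_workers(pool)) {
                    /*
                     * The real code re-arms the timer instead if the
                     * oldest idle worker has not expired yet.
                     */
                    do_cull = true;
            }
            raw_spin_unlock_irq(&pool->lock);

            if (do_cull)
                    queue_work(system_unbound_wq, &pool->idle_cull_work);
    }
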
Waiman Long 2535806d83 workqueue: Factorize unbind/rebind_workers() logic
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 793777bc193b658f01924fd09b388eead26d741f
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu, 12 Jan 2023 16:14:28 +0000

    workqueue: Factorize unbind/rebind_workers() logic

    Later patches will reuse this code, move it into reusable functions.

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:29 -04:00
Waiman Long d653c805fc workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 99c621ef243bda726fb8d982a274ded96570b410
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Thu, 12 Jan 2023 16:14:27 +0000

    workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex

    When unbind_workers() reads wq_unbound_cpumask to set the affinity of
    freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
    sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.

    Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
    remove the need of temporary saved_cpumask.

    Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
    Reported-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:29 -04:00
Waiman Long 4e109dbd6a workqueue: don't skip lockdep work dependency in cancel_work_sync()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit c0feea594e058223973db94c1c32a830c9807c86
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Fri, 29 Jul 2022 13:30:23 +0900

    workqueue: don't skip lockdep work dependency in cancel_work_sync()

    Like Hillf Danton mentioned

      syzbot should have been able to catch cancel_work_sync() in work context
      by checking lockdep_map in __flush_work() for both flush and cancel.

    in [1], being unable to report the obvious deadlock scenario shown below
    means that the check is broken. From a locking dependency perspective,
    the sync version of a cancel request should behave like a flush request,
    for it waits for completion of the work if that work has already started
    execution.

      ----------
      #include <linux/module.h>
      #include <linux/sched.h>
      #include <linux/mutex.h>
      #include <linux/workqueue.h>
      static DEFINE_MUTEX(mutex);
      static void work_fn(struct work_struct *work)
      {
        schedule_timeout_uninterruptible(HZ / 5);
        mutex_lock(&mutex);   /* the work waits for the mutex... */
        mutex_unlock(&mutex);
      }
      static DECLARE_WORK(work, work_fn);
      static int __init test_init(void)
      {
        schedule_work(&work);
        schedule_timeout_uninterruptible(HZ / 10);
        mutex_lock(&mutex);
        cancel_work_sync(&work);  /* ...while we hold it and wait for the work */
        mutex_unlock(&mutex);
        return -EINVAL;
      }
      module_init(test_init);
      MODULE_LICENSE("GPL");
      ----------

    The check this patch restores was added by commit 0976dfc1d0
    ("workqueue: Catch more locking problems with flush_work()").

    Then, lockdep's crossrelease feature was added by commit b09be676e0
    ("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
    this check was once removed by commit fd1a5b04df ("workqueue: Remove
    now redundant lock acquisitions wrt. workqueue flushes").

    But lockdep's crossrelease feature was removed by commit e966eaeeb6
    ("locking/lockdep: Remove the cross-release locking checks"). At this
    point, this check should have been restored.

    Then, commit d6e89786be ("workqueue: skip lockdep wq dependency in
    cancel_work_sync()") introduced a boolean flag in order to distinguish
    flush_work() and cancel_work_sync(), for checking "struct workqueue_struct"
    dependency when called from cancel_work_sync() was causing false positives.

    Then, commit 87915adc3f ("workqueue: re-add lockdep dependencies for
    flushing") tried to restore the "struct work_struct" dependency check,
    but mistakenly checked this boolean flag. As the example shown above
    indicates, the "struct work_struct" dependency needs to be checked for
    both flush_work() and cancel_work_sync().

    Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
    Reported-by: Hillf Danton <hdanton@sina.com>
    Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Fixes: 87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
    Cc: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:28 -04:00
Waiman Long 867850e9d0 workqueue: Change the comments of the synchronization about the idle_list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 2c1f1a9180bfacbc3c8e5b10075640cc810cf9c0
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:38 +0800

    workqueue: Change the comments of the synchronization about the idle_list

    The access to idle_list in wq_worker_sleeping() is changed to be
    protected by pool->lock, so the comments above idle_list can be changed
    to "L:", which means "access with pool->lock held".

    The outdated comments in wq_worker_sleeping() are also removed, since
    the function is no longer called with the rq lock held; idle_list is
    dereferenced under the pool lock now.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:28 -04:00
Waiman Long 94ebfcf09b workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 21b195c05cf6a6cc49777d6992772bcf01502186
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Thu, 23 Dec 2021 20:31:37 +0800

    workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()

    In wq_worker_sleeping(), the access to worklist is protected by the
    pool->lock, so the memory barrier is unneeded.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:27 -04:00
Waiman Long f710816729 workqueue: Remove the cacheline_aligned for nr_running
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 84f91c62d675480ffd3d870ee44c07965cbd8b21
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:42 +0800

    workqueue: Remove the cacheline_aligned for nr_running

    nr_running is never modified remotely after the schedule callback in the
    wakeup path was removed.

    Rather, nr_running is often accessed together with other fields in the
    pool, so the cacheline_aligned for nr_running isn't needed.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:27 -04:00
Waiman Long 0df1d79e38 workqueue: Move the code of waking a worker up in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337
Conflicts: A merge conflict requiring manual merge due to the presence
	   of a later upstream commit 46a4d679ef88 ("workqueue: Avoid
	   a false warning in unbind_workers()").

commit 989442d73757868118a73b92732b549a73c9ce35
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:41 +0800

    workqueue: Move the code of waking a worker up in unbind_workers()

    In unbind_workers(), there are two pool->lock held sections separated
    by the code that zaps nr_running.  wake_up_worker() needs to be in a
    pool->lock held section and after nr_running is zapped.  And zapping
    nr_running had to come after schedule() when the local wake-up
    functionality was in use.  Now that the call to schedule() has been
    removed along with the local wake-up functionality, the code can be
    merged into a single pool->lock held section.

    The diffstat shows other code being moved down because the diff tools
    cannot recognize merging lock sections by swapping two code blocks.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Waiman Long 41d61eff9a workqueue: Remove the outdated comment before wq_worker_sleeping()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit ccf45156fd167a234baf038c11c1f367c7ccabd4
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue, 7 Dec 2021 15:35:37 +0800

    workqueue: Remove the outdated comment before wq_worker_sleeping()

    It is no longer called with preemption disabled.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Waiman Long b12ee57248 workqueue: Fix unbind_workers() VS wq_worker_sleeping() race
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2182337

commit 45c753f5f24d2d4717acb38ce35e604ff9abcb50
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed, 1 Dec 2021 16:19:45 +0100

    workqueue: Fix unbind_workers() VS wq_worker_sleeping() race

    At CPU-hotplug time, unbind_workers() may preempt a worker while it is
    going to sleep. In that case the following scenario can happen:

        unbind_workers()                     wq_worker_sleeping()
        --------------                      -------------------
                                          if (worker->flags & WORKER_NOT_RUNNING)
                                              return;
                                          //PREEMPTED by unbind_workers
        worker->flags |= WORKER_UNBOUND;
        [...]
        atomic_set(&pool->nr_running, 0);
        //resume to worker
                                           atomic_dec_and_test(&pool->nr_running);

    After unbind_workers() resets pool->nr_running, the value is expected to
    remain 0 until the pool ever gets rebound in case cpu_up() is called on
    the target CPU in the future. But here the race leaves pool->nr_running
    with a value of -1, triggering the following warning when the worker goes
    idle:

            WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
            Modules linked in:
            CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
            Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
            Workqueue:  0x0 (rcu_par_gp)
            RIP: 0010:worker_enter_idle+0x95/0xc0
            Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93  0
            RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
            RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
            RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
            RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
            R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
            R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
            FS:  0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            Call Trace:
              <TASK>
              worker_thread+0x89/0x3d0
              ? process_one_work+0x400/0x400
              kthread+0x162/0x190
              ? set_kthread_struct+0x40/0x40
              ret_from_fork+0x22/0x30
              </TASK>

    Also due to this incorrect "nr_running == -1", all sorts of hazards can
    happen, starting with queued works being ignored because no workers are
    awakened at insert_work() time.

    Fix this by checking the worker flags again while pool->lock is held.
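
    The shape of the fix in wq_worker_sleeping(), sketched and simplified
    from the actual diff:

      raw_spin_lock_irq(&pool->lock);
      /*
       * Recheck under pool->lock: unbind_workers() may have set
       * WORKER_UNBOUND and zapped nr_running after the first,
       * unlocked check above.
       */
      if (worker->flags & WORKER_NOT_RUNNING) {
              raw_spin_unlock_irq(&pool->lock);
              return;
      }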

    Fixes: b945efcdd07d ("sched: Remove pointless preemption disable in sched_submit_work()")
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2023-04-07 15:26:26 -04:00
Phil Auld 1b770a00ec workqueue: Avoid a false warning in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/2115520

commit 46a4d679ef88285ea17c3e1e4fed330be2044f21
Author: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Date:   Fri Jul 29 17:44:38 2022 +0800

    workqueue: Avoid a false warning in unbind_workers()

    Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can possibly fail
    and trigger a false warning.

    Use cpu_possible_mask instead when wq_unbound_cpumask has no active
    CPUs; see the sketch after the reproduction steps below.

    It is very easy to trigger the warning:
      Set wq_unbound_cpumask to a small set of CPUs.
      Offline all the CPUs of wq_unbound_cpumask.
      Offline an extra CPU and trigger the warning.
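
    The shape of the fix in unbind_workers(), sketched and simplified:

      /* fall back to cpu_possible_mask if no unbound CPU is active */
      if (cpumask_intersects(wq_unbound_cpumask, cpu_active_mask))
              WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                wq_unbound_cpumask) < 0);
      else
              WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                cpu_possible_mask) < 0);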

    Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
    Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:41 -04:00
Phil Auld 51a20c6ae4 workqueue: Wrap flush_workqueue() using a macro
Bugzilla: https://bugzilla.redhat.com/2115520

commit c4f135d643823a869becfa87539f7820ef9d5bfa
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Wed Jun 1 16:32:47 2022 +0900

    workqueue: Wrap flush_workqueue() using a macro

    Since a flush operation synchronously waits for completion, flushing
    system-wide WQs (e.g. system_wq) might introduce the possibility of
    deadlock due to an unexpected locking dependency. Tejun Heo commented at
    [1] that it
    makes no sense at all to call flush_workqueue() on the shared WQs as the
    caller has no idea what it's gonna end up waiting for.

    Although there is flush_scheduled_work() which flushes system_wq WQ with
    "Think twice before calling this function! It's very easy to get into
    trouble if you don't take great care." warning message, syzbot found a
    circular locking dependency caused by flushing system_wq WQ [2].

    Therefore, let's change direction so that developers use their own local
    WQs where flush_scheduled_work()/flush_workqueue(system_*_wq) would
    otherwise be inevitable.

    Steps for converting system-wide WQs into local WQs are explained at [3],
    and a conversion to stop flushing system-wide WQs is in progress. Now we
    want some mechanism for preventing developers who are not aware of this
    conversion from starting to flush system-wide WQs again.

    Since WARN_ON() turned out to be a complete but awkward approach for
    teaching developers about this problem, let's use __compiletime_warning()
    as an incomplete but handy approach. For completeness, we will also insert
    WARN_ON() into __flush_workqueue() after all in-tree users have stopped
    calling flush_scheduled_work().
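
    A sketch of the technique, simplified; the real macro also checks the
    other system-wide WQs such as system_highpri_wq:

      extern void __warn_flushing_systemwide_wq(void)
              __compiletime_warning("Please avoid flushing system-wide workqueues.");

      #define flush_workqueue(wq)                                     \
      ({                                                              \
              struct workqueue_struct *_wq = (wq);                    \
                                                                      \
              if (__builtin_constant_p(_wq == system_wq) &&           \
                  _wq == system_wq)                                   \
                      __warn_flushing_systemwide_wq();                \
              __flush_workqueue(_wq);                                 \
      })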

    Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
    Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
    Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:39 -04:00
Phil Auld 5eeb631add workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs
Bugzilla: https://bugzilla.redhat.com/2115520

commit 10a5a651e3afc9b0b381f47e8930972e4e918397
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Thu Mar 31 13:57:17 2022 +0800

    workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs

    When a CPU is going offline, all workers on the CPU's pool will have
    their cpus_allowed cleared to cpu_possible_mask and can run on any CPU,
    including the isolated ones. Instead, set cpus_allowed to
    wq_unbound_cpumask so that they can avoid isolated CPUs.

    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:36 -04:00
Phil Auld 6dcdf4f5d6 workqueue: Remove schedule() in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/2115520

commit b4ac9384ac057c5bf035fbe82fc162fa2f7b15a9
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Dec 7 15:35:40 2021 +0800

    workqueue: Remove schedule() in unbind_workers()

    The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
    accounting from rq lock") changed the schedule callbacks for workqueue
    and moved the schedule callback from the wakeup code to at end of
    schedule() in the worker's process context.

    It means that the callback wq_worker_running() is guaranteed to see the
    %WORKER_UNBOUND flag after being scheduled, since unbind_workers() runs
    on the same CPU that all the pool's workers are bound to.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:35 -04:00
Phil Auld 012a9c8157 workqueue: Remove outdated comment about exceptional workers in unbind_workers()
Bugzilla: https://bugzilla.redhat.com/2115520

commit 11b45b0bf402b53c94c86737a440363fc36f03cd
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Dec 7 15:35:39 2021 +0800

    workqueue: Remove outdated comment about exceptional workers in unbind_workers()

    A long time ago, workers were not ALL bound after CPU_ONLINE; they
    could still be running on other CPUs before rebinding themselves.

    But the commit a9ab775bca ("workqueue: directly restore CPU affinity
    of workers from CPU_ONLINE") makes rebind_workers() bind them all.

    So all workers are on the CPU before the CPU is brought down.

    The comment in unbind_workers() referring to the workers "which are
    still executing works from before the last CPU down" is therefore
    outdated. Just remove it.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:35 -04:00
Phil Auld 8c23be3925 workqueue: Remove the advanced kicking of the idle workers in rebind_workers()
Bugzilla: https://bugzilla.redhat.com/2115520

commit 3e5f39ea33b1189ccaa4ae2a9de2bce07753d2e0
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Dec 7 15:35:38 2021 +0800

    workqueue: Remove the advanced kicking of the idle workers in rebind_workers()

    The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
    accounting from rq lock") changed the schedule callbacks for workqueue
    and removed the local-wake-up functionality.

    Now the waking up of workers is done in the normal fashion, and workers
    not yet migrated to the specific CPU in a concurrency-managed pool can
    also be woken up by workers already bound to that CPU.

    So this advance kicking of the idle workers to migrate them to the
    associated CPU is no longer needed.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:35 -04:00
Phil Auld 4dc7dc47f4 workqueue: Fix unbind_workers() VS wq_worker_running() race
Bugzilla: https://bugzilla.redhat.com/2115520

commit 07edfece8bcb0580a1828d939e6f8d91a8603eb2
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Wed Dec 1 16:19:44 2021 +0100

    workqueue: Fix unbind_workers() VS wq_worker_running() race

    At CPU-hotplug time, unbind_workers() may preempt a worker while it is
    waking up. In that case the following scenario can happen:

            unbind_workers()                     wq_worker_running()
            --------------                      -------------------
                                          if (!(worker->flags & WORKER_NOT_RUNNING))
                                              //PREEMPTED by unbind_workers
            worker->flags |= WORKER_UNBOUND;
            [...]
            atomic_set(&pool->nr_running, 0);
            //resume to worker
                                                  atomic_inc(&worker->pool->nr_running);

    After unbind_workers() resets pool->nr_running, the value is expected to
    remain 0 until the pool ever gets rebound in case cpu_up() is called on
    the target CPU in the future. But here the race leaves pool->nr_running
    with a value of 1, triggering the following warning when the worker goes
    idle:

            WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
            Modules linked in:
            CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
            Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
            Workqueue:  0x0 (rcu_par_gp)
            RIP: 0010:worker_enter_idle+0x95/0xc0
            Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93  0
            RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
            RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
            RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
            RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
            R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
            R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
            FS:  0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            Call Trace:
              <TASK>
              worker_thread+0x89/0x3d0
              ? process_one_work+0x400/0x400
              kthread+0x162/0x190
              ? set_kthread_struct+0x40/0x40
              ret_from_fork+0x22/0x30
              </TASK>

    Also due to this incorrect "nr_running == 1", further queued work may
    end up not being served, because no worker is awakened at work insert
    time. This raises rcutorture writer stalls, for example.

    Fix this by disabling preemption in the right place in
    wq_worker_running(), as sketched below.
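
    The shape of the fix, sketched and simplified from wq_worker_running():

      preempt_disable();
      /*
       * The flag test and the increment must not be preempted in
       * between, or unbind_workers() can zap nr_running after the
       * test but before the atomic_inc().
       */
      if (!(worker->flags & WORKER_NOT_RUNNING))
              atomic_inc(&worker->pool->nr_running);
      preempt_enable();
      worker->sleeping = 0;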

    It's worth noting that if the worker migrates and runs concurrently with
    unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update
    due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.

    Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:34 -04:00
Karol Herbst 902456a79d Revert "workqueue: remove unused cancel_work()"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2115866
Upstream Status: v6.0-rc1

commit 73b4b53276a1d6290cd4f47dbbc885b6e6e59ac6
Author:     Andrey Grodzovsky <andrey.grodzovsky@amd.com>
AuthorDate: Thu May 19 09:47:28 2022 -0400
Commit:     Alex Deucher <alexander.deucher@amd.com>
CommitDate: Fri Jun 10 15:24:38 2022 -0400

    This reverts commit 6417250d3f.

    amdgpu needs this function in order to prematurely stop pending
    reset works when another reset work is already in progress.

    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Reviewed-by: Lai Jiangshan<jiangshanlai@gmail.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Signed-off-by: Karol Herbst <kherbst@redhat.com>
2022-10-25 13:19:44 +02:00
Phil Auld 1cf795c344 sched/isolation: Use single feature type while referring to housekeeping cpumask
Bugzilla: http://bugzilla.redhat.com/2065222

commit 04d4e665a60902cf36e7ad39af1179cb5df542ad
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Feb 7 16:59:06 2022 +0100

    sched/isolation: Use single feature type while referring to housekeeping cpumask

    Refer to housekeeping APIs using single feature types instead of flags.
    This prevents passing multiple isolation features at once to
    housekeeping interfaces, which soon won't be possible anymore as each
    isolation feature will have its own cpumask.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-31 10:40:39 -04:00
Phil Auld 4133c32b7f workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
Bugzilla: http://bugzilla.redhat.com/2065222

commit 7b45b51e778021cd7817b8f0d743a2c73205c011
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Mon Feb 7 16:59:04 2022 +0100

    workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch

    To prepare for supporting each feature of the housekeeping cpumask
    toward cpuset, prepare each of the HK_FLAG_* entries to move to their
    own cpumask, enforcing that they be fetched individually. The new
    constraint is that multiple HK_FLAG_* entries can't be mixed together
    anymore in a single call to housekeeping_cpumask().

    This will later allow, for example, to runtime modify the cpulist passed
    through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
    parameters.

    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220207155910.527133-3-frederic@kernel.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-31 10:38:57 -04:00
Phil Auld c6737c0e58 workqueue, kasan: avoid alloc_pages() when recording stack
Bugzilla: http://bugzilla.redhat.com/2022894

commit f70da745be4d4c367568b8345f50db1ae04efcb2
Author: Marco Elver <elver@google.com>
Date:   Fri Nov 5 13:35:50 2021 -0700

    workqueue, kasan: avoid alloc_pages() when recording stack

    Shuah Khan reported:

     | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
     | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
     | it tries to allocate memory attempting to acquire spinlock in page
     | allocation code while holding workqueue pool raw_spinlock.
     |
     | There are several instances of this problem when block layer tries
     | to __queue_work(). Call trace from one of these instances is below:
     |
     |     kblockd_mod_delayed_work_on()
     |       mod_delayed_work_on()
     |         __queue_delayed_work()
     |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
     |             insert_work()
     |               kasan_record_aux_stack()
     |                 kasan_save_stack()
     |                   stack_depot_save()
     |                     alloc_pages()
     |                       __alloc_pages()
     |                         get_page_from_freelist()
     |                           rm_queue()
     |                             rm_queue_pcplist()
     |                               local_lock_irqsave(&pagesets.lock, flags);
     |                               [ BUG: Invalid wait context triggered ]

    The default kasan_record_aux_stack() calls stack_depot_save() with
    GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
    In general, however, it is not even possible to use either GFP_ATOMIC
    or GFP_NOWAIT in certain non-preemptive contexts, including
    raw_spin_locks (see gfp.h and commit ab00db216c).

    Fix it by using kasan_record_aux_stack_noalloc(), which instructs
    stackdepot not to expand the stack storage via alloc_pages() in case
    it runs out.

    While there is an increased risk of failing to insert the stack trace,
    this is typically unlikely, especially if the same insertion had already
    succeeded previously (stack depot hit).

    For frequent calls from the same location, it therefore becomes
    extremely unlikely that kasan_record_aux_stack_noalloc() fails.
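
    Sketched, the fix is a one-line switch at the insert_work() call site:

      /* pool->lock (a raw spinlock) is held here; the _noalloc variant
       * records the queuing stack but never expands the stack depot via
       * alloc_pages() */
      kasan_record_aux_stack_noalloc(work);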

    Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
    Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
    Signed-off-by: Marco Elver <elver@google.com>
    Reported-by: Shuah Khan <skhan@linuxfoundation.org>
    Tested-by: Shuah Khan <skhan@linuxfoundation.org>
    Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Taras Madan <tarasmadan@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vijayanand Jitta <vjitta@codeaurora.org>
    Cc: Vinayak Menon <vinmenon@codeaurora.org>
    Cc: Walter Wu <walter-zh.wu@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:01 -05:00
Phil Auld 7fef064d7c workqueue: Introduce show_one_worker_pool and show_one_workqueue.
Bugzilla: http://bugzilla.redhat.com/2022894

commit 55df0933be74bd2e52aba0b67eb743ae0feabe7e
Author: Imran Khan <imran.f.khan@oracle.com>
Date:   Wed Oct 20 14:09:00 2021 +1100

    workqueue: Introduce show_one_worker_pool and show_one_workqueue.

    Currently show_workqueue_state shows the state of all workqueues and of
    all worker pools. In certain cases we may need to dump state of only a
    specific workqueue or worker pool. For example in destroy_workqueue we
    only need to show state of the workqueue which is getting destroyed.

    So rename show_workqueue_state to show_all_workqueues(to signify it
    dumps state of all busy workqueues) and divide it into more granular
    functions (show_one_workqueue and show_one_worker_pool), that would show
    states of individual workqueues and worker pools and can be used in
    cases such as the one mentioned above.

    Also, as mentioned earlier, make destroy_workqueue dump data pertaining
    only to the workqueue that is being destroyed, and make users of the
    earlier interface (show_workqueue_state) use the new interface
    (show_all_workqueues).
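
    For example, destroy_workqueue() can then dump only the workqueue being
    destroyed; a sketch, not the verbatim diff:

      /* previously: show_workqueue_state(), i.e. dump everything */
      pr_warn("destroy_workqueue: %s has the following busy pwq\n", wq->name);
      show_one_workqueue(wq);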

    Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:01 -05:00
Phil Auld 59acfcf9f1 workqueue: make sysfs of unbound kworker cpumask more clever
Bugzilla: http://bugzilla.redhat.com/2022894

commit d25302e46592c97d29f70ccb1be558df31a9a360
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Oct 17 20:04:02 2021 +0800

    workqueue: make sysfs of unbound kworker cpumask more clever

    Some unfriendly components, such as dpdk, write the same mask to the
    unbound kworker cpumask again and again. Every write to this interface
    queues work to CPUs, even though the mask is the same as the original
    one.

    So fix it by returning success and doing nothing if the new cpumask is
    equal to the old one.
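
    The shape of the fix in workqueue_set_unbound_cpumask(), sketched:

      /* nothing to do if the requested mask equals the current one */
      if (cpumask_equal(cpumask, wq_unbound_cpumask))
              return 0;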

    Signed-off-by: Mengen Sun <mengensun@tencent.com>
    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:01 -05:00
Phil Auld 37888a0f4e workqueue: fix state-dump console deadlock
Bugzilla: http://bugzilla.redhat.com/2022894

commit 57116ce17b04fde2fe30f0859df69d8dbe5809f6
Author: Johan Hovold <johan@kernel.org>
Date:   Wed Oct 6 13:58:52 2021 +0200

    workqueue: fix state-dump console deadlock

    Console drivers often queue work while holding locks also taken in their
    console write paths, something which can lead to deadlocks on SMP when
    dumping workqueue state (e.g. sysrq-t or on suspend failures).

    For serial console drivers this could look like:

            CPU0                            CPU1
            ----                            ----

            show_workqueue_state();
              lock(&pool->lock);            <IRQ>
                                              lock(&port->lock);
                                              schedule_work();
                                                lock(&pool->lock);
              printk();
                lock(console_owner);
                lock(&port->lock);

    where workqueues are, for example, used to push data to the line
    discipline, process break signals and handle modem-status changes. Line
    disciplines and serdev drivers can also queue work on write-wakeup
    notifications, etc.

    Reworking every console driver to avoid queuing work while holding locks
    also taken in their write paths would complicate drivers and is neither
    desirable nor feasible.

    Instead use the deferred-printk mechanism to avoid printing while
    holding pool locks when dumping workqueue state.
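
    Sketched, each dump helper brackets its pool->lock section like this:

      raw_spin_lock_irqsave(&pool->lock, flags);
      printk_deferred_enter();  /* defer console output while locked */
      /* ... print the pool/workqueue state ... */
      printk_deferred_exit();
      raw_spin_unlock_irqrestore(&pool->lock, flags);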

    Note that there are a few WARN_ON() assertions in the workqueue code
    which could potentially also trigger a deadlock. Hopefully the ongoing
    printk rework will provide a general solution for this eventually.

    This was originally reported after a lockdep splat when executing
    sysrq-t with the imx serial driver.

    Fixes: 3494fc3084 ("workqueue: dump workqueues on sysrq-t")
    Cc: stable@vger.kernel.org      # 4.0
    Reported-by: Fabio Estevam <festevam@denx.de>
    Tested-by: Fabio Estevam <festevam@denx.de>
    Signed-off-by: Johan Hovold <johan@kernel.org>
    Reviewed-by: John Ogness <john.ogness@linutronix.de>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:01 -05:00
Phil Auld 8c055a75de workqueue: Assign a color to barrier work items
Bugzilla: http://bugzilla.redhat.com/2022894

commit d812796eb3906424cd3c0eea530983961e2f88bd
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Aug 17 09:32:38 2021 +0800

    workqueue: Assign a color to barrier work items

    There was no strong reason to flush or not to flush barrier work items
    in flush_workqueue().  But we have to keep barrier work items from
    participating in nr_active, so we had been using WORK_NO_COLOR for
    them, which also makes them unflushable by flush_workqueue().

    And the users of flush_workqueue() often do not intend to wait for
    barrier work items issued by flush_work().  That made the choice sound
    perfect.

    But barrier work items hold a reference to an internal structure
    (pool_workqueue), and the worker thread[s] is/are still busy for the
    workqueue user while the barrier work items are not done.  So it is
    reasonable to make flush_workqueue() also watch for flush_work() to
    make it more robust.

    And a problem[1] reported by Li Zhe shows that we need such robustness.
    The warning logs are listed below:

    WARNING: CPU: 0 PID: 19336 at kernel/workqueue.c:4430 destroy_workqueue+0x11a/0x2f0
    *****
    destroy_workqueue: test_workqueue9 has the following busy pwq
      pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=0/1 refcnt=2
          in-flight: 5658:wq_barrier_func
    Showing busy workqueues and worker pools:
    *****

    It shows that even after drain_workqueue() returns, the barrier work item
    is still in flight and the pwq (and a worker) is still busy on it.

    The problem is caused by flush_workqueue() not watching flush_work():

    Thread A                                Worker
                                            /* normal work item with linked */
                                            process_scheduled_works()
    destroy_workqueue()                       process_one_work()
      drain_workqueue()                         /* run normal work item */
                                     /--        pwq_dec_nr_in_flight()
        flush_workqueue()       <---/
                    /* the last normal work item is done */
      sanity_check                            process_one_work()
                                           /--  raw_spin_unlock_irq(&pool->lock)
        raw_spin_lock_irq(&pool->lock)  <-/     /* maybe preempt */
        *WARNING*                               wq_barrier_func()
                                                /* maybe preempt by cond_resched() */

    Thread A can get the pool lock after the Worker unlocks the pool lock but
    before running wq_barrier_func().  And if any preemption happens around
    wq_barrier_func(), destroy_workqueue()'s sanity check is more likely to
    get the lock and catch it.  (Note: preemption is not necessary to cause
    the bug; the unlocking alone is enough to possibly trigger the WARNING.)

    A simple solution might be just executing all linked barrier work items
    once without releasing pool lock after the head work item's
    pwq_dec_nr_in_flight().  But this solution has two problems:

      1) the head work item might also be a barrier work item when the
         user-queued work item is cancelled. For example:
            thread 1:               thread 2:
            queue_work(wq, &my_work)
            flush_work(&my_work)
                                    cancel_work_sync(&my_work);
            /* Neither my_work nor the barrier work is scheduled. */
                                    destroy_workqueue(wq);
            /* This is an easier way to catch the WARNING. */

      2) there might be too many linked barrier work items, and running them
         all at once without releasing the pool lock just causes trouble.

    The only solution is to make flush_workqueue() also watch barrier work
    items.  So we have to assign a color to these barrier work items, namely
    the color of the head (user-queued) work item.

    Assigning a color doesn't cause any problem in active management,
    because the previous patch made barrier work items not participate in
    nr_active via WORK_STRUCT_INACTIVE rather than relying on the (old)
    WORK_NO_COLOR.

    [1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
    Reported-by: Li Zhe <lizhe.67@bytedance.com>
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld d20bd5ccfa workqueue: Mark barrier work with WORK_STRUCT_INACTIVE
Bugzilla: http://bugzilla.redhat.com/2022894

commit 018f3a13dd6300701103f268b6bfec0a56beea57
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Aug 17 09:32:37 2021 +0800

    workqueue: Mark barrier work with WORK_STRUCT_INACTIVE

    Currently, WORK_NO_COLOR has two meanings:
            Not participate in flushing
            Not participate in nr_active

    And only non-barrier work items are marked with WORK_STRUCT_INACTIVE
    when they are in the inactive_works list.  The barrier work items are
    not marked INACTIVE even when linked in the inactive_works list, since
    these tail items are always moved together with the head work item.

    These definitions are simple, clean and practical. (Except a small
    blemish that only the first meaning of WORK_NO_COLOR is documented in
    include/linux/workqueue.h while both meanings are in workqueue.c)

    But the dual-purpose WORK_NO_COLOR used for barrier work items has
    proven to be problematic [1].  Only the second purpose is obligatory.
    So we plan to make barrier work items participate in flushing while
    still keeping them from participating in nr_active.

    So the plan is to mark barrier work items inactive without using
    WORK_NO_COLOR in this patch, so that we can assign a flushing color to
    them in the next patch.

    The reasonable way is to add or reuse a bit in the work data of the
    work item.  But adding a bit will double the size of pool_workqueue.

    Currently, WORK_STRUCT_INACTIVE is only used in try_to_grab_pending()
    for user-queued work items, and try_to_grab_pending() can't work on
    barrier work items.  So we extend WORK_STRUCT_INACTIVE to also mark
    barrier work items no matter which list they are in, because we don't
    need to determine which list a barrier work item is in.

    So the meaning of WORK_STRUCT_INACTIVE becomes just "the work item does
    not participate in nr_active" (no matter whether it is a barrier work
    item or a user-queued work item).  And WORK_STRUCT_INACTIVE for
    user-queued work items means they are in the inactive_works list.

    This patch does it by setting WORK_STRUCT_INACTIVE for barrier work items
    in insert_wq_barrier() and checking WORK_STRUCT_INACTIVE first in
    pwq_dec_nr_in_flight().  And the meaning of WORK_NO_COLOR is reduced to
    only "not participating in flushing".

    There is no functional change intended in this patch, because
    WORK_NO_COLOR+WORK_STRUCT_INACTIVE carries the meaning of the previous
    WORK_NO_COLOR, and try_to_grab_pending() isn't used for barrier work
    items and so avoids being confused by this extended WORK_STRUCT_INACTIVE.

    A bunch of comments for nr_active & WORK_STRUCT_INACTIVE are also added,
    documenting how WORK_STRUCT_INACTIVE works in nr_active management.

    [1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld 283105745e workqueue: Change the code of calculating work_flags in insert_wq_barrier()
Bugzilla: http://bugzilla.redhat.com/2022894

commit d21cece0dbb424ad3ff9e49bde6954632b8efede
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Aug 17 09:32:36 2021 +0800

    workqueue: Change the code of calculating work_flags in insert_wq_barrier()

    Add a local var @work_flags to calculate work_flags step by step, so that
    we don't need to squeeze several flags in only the last line of code.

    Prepare for the next patch, which adds a bit to the barrier work item's
    flags.  Not squashing this into the next patch makes it clear what that
    patch changes.

    No functional change intended.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld d43bd6174b workqueue: Change arguement of pwq_dec_nr_in_flight()
Bugzilla: http://bugzilla.redhat.com/2022894

commit c4560c2c88a4c809800ba8e76faabaf80bf6ee89
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Aug 17 09:32:35 2021 +0800

    workqueue: Change arguement of pwq_dec_nr_in_flight()

    Make pwq_dec_nr_in_flight() use work_data rather than just work_color.

    Prepare for later patch to get WORK_STRUCT_INACTIVE bit from work_data
    in pwq_dec_nr_in_flight().
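
    The signature change, sketched:

      /* before */
      static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color);
      /* after: the whole work_data travels, so flag bits such as
       * WORK_STRUCT_INACTIVE can be extracted later */
      static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq,
                                       unsigned long work_data);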

    No functional change intended.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld 373441ce25 workqueue: Rename "delayed" (delayed by active management) to "inactive"
Bugzilla: http://bugzilla.redhat.com/2022894

commit f97a4a1a3f8769e3452885967955e21c88f3f263
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Aug 17 09:32:34 2021 +0800

    workqueue: Rename "delayed" (delayed by active management) to "inactive"

    There are two kinds of "delayed" work items in workqueue subsystem.

    One is for timer-delayed work items which are visible to workqueue users.
    The other kind is for work items delayed by active management which can
    not be directly visible to workqueue users.  We mixed the word "delayed"
    for both kinds and caused somewhat ambiguity.

    This patch renames the later one (delayed by active management) to
    "inactive", because it is used for workqueue active management and
    most of its related symbols are named with "active" or "activate".

    All "delayed" and "DELAYED" are carefully checked and renamed one by
    one to avoid accidentally changing the name of the other kind for
    timer-delayed.

    No functional change intended.

    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld 274ea6d685 workqueue: Replace deprecated ida_simple_*() with ida_alloc()/ida_free()
Bugzilla: http://bugzilla.redhat.com/2022894

commit e441b56fe438fd126b9eea7d30c57d3cd3f34e14
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Wed Aug 4 11:50:36 2021 +0800

    workqueue: Replace deprecated ida_simple_*() with ida_alloc()/ida_free()

    Replace ida_simple_get() with ida_alloc() and ida_simple_remove() with
    ida_free(), the latter is more concise and intuitive.

    In addition, if ida_alloc() fails, the caller now returns NULL directly.
    This eliminates the unnecessary initialization of two local variables
    and an 'if' check.
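
    The conversion pattern, sketched:

      /* before */
      id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
      ida_simple_remove(&pool->worker_ida, id);

      /* after */
      id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
      ida_free(&pool->worker_ida, id);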

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld f681ac25c1 workqueue: Fix typo in comments
Bugzilla: http://bugzilla.redhat.com/2022894

commit 67dc8325370844ffce92aa59abe8b453aa6aa83c
Author: Cai Huoqing <caihuoqing@baidu.com>
Date:   Sat Jul 31 08:01:29 2021 +0800

    workqueue: Fix typo in comments

    Fix typo:
    *assing  ==> assign
    *alloced  ==> allocated
    *Retun  ==> Return
    *excute  ==> execute

    v1->v2:
    *reverse 'iff'
    *update changelog

    Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Phil Auld c75c0980e5 workqueue: Fix possible memory leaks in wq_numa_init()
Bugzilla: http://bugzilla.redhat.com/2022894

commit f728c4a9e8405caae69d4bc1232c54ff57b5d20f
Author: Zhen Lei <thunder.leizhen@huawei.com>
Date:   Thu Jul 22 11:03:52 2021 +0800

    workqueue: Fix possible memory leaks in wq_numa_init()

    In the error-handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the
    previously allocated memory is not released. Performing this check
    before allocating memory eliminates the memory leaks.

    tj: Note that the condition only occurs when the arch code is pretty broken
    and the WARN_ON might as well be BUG_ON().

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-01-17 10:39:00 -05:00
Prarit Bhargava 28de1bbb6e workqueue: Replace deprecated CPU-hotplug functions.
Bugzilla: http://bugzilla.redhat.com/2023079

commit ffd8bea81fbb5abe6518bce8d6297a149b935cf7
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Aug 3 16:16:20 2021 +0200

    workqueue: Replace deprecated CPU-hotplug functions.

    The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.
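
    The mechanical replacement, sketched:

      - get_online_cpus();            + cpus_read_lock();
      - put_online_cpus();            + cpus_read_unlock();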

    Cc: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2021-12-09 09:04:09 -05:00
Yang Yingliang b42b0bddcb workqueue: fix UAF in pwq_unbound_release_workfn()
I got a UAF report when doing fuzz testing:

[  152.880091][ T8030] ==================================================================
[  152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190
[  152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030
[  152.883578][ T8030]
[  152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249
[  152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[  152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn
[  152.887358][ T8030] Call Trace:
[  152.887837][ T8030]  dump_stack_lvl+0x75/0x9b
[  152.888525][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.889371][ T8030]  print_address_description.constprop.10+0x48/0x70
[  152.890326][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.891163][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.891999][ T8030]  kasan_report.cold.15+0x82/0xdb
[  152.892740][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[  152.893594][ T8030]  __asan_load4+0x69/0x90
[  152.894243][ T8030]  pwq_unbound_release_workfn+0x50/0x190
[  152.895057][ T8030]  process_one_work+0x47b/0x890
[  152.895778][ T8030]  worker_thread+0x5c/0x790
[  152.896439][ T8030]  ? process_one_work+0x890/0x890
[  152.897163][ T8030]  kthread+0x223/0x250
[  152.897747][ T8030]  ? set_kthread_struct+0xb0/0xb0
[  152.898471][ T8030]  ret_from_fork+0x1f/0x30
[  152.899114][ T8030]
[  152.899446][ T8030] Allocated by task 8884:
[  152.900084][ T8030]  kasan_save_stack+0x21/0x50
[  152.900769][ T8030]  __kasan_kmalloc+0x88/0xb0
[  152.901416][ T8030]  __kmalloc+0x29c/0x460
[  152.902014][ T8030]  alloc_workqueue+0x111/0x8e0
[  152.902690][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[  152.903459][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[  152.904198][ T8030]  scrub_workers_get+0x1e8/0x490
[  152.904929][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[  152.905599][ T8030]  btrfs_ioctl+0x122c/0x4e50
[  152.906247][ T8030]  __x64_sys_ioctl+0x137/0x190
[  152.906916][ T8030]  do_syscall_64+0x34/0xb0
[  152.907535][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  152.908365][ T8030]
[  152.908688][ T8030] Freed by task 8884:
[  152.909243][ T8030]  kasan_save_stack+0x21/0x50
[  152.909893][ T8030]  kasan_set_track+0x20/0x30
[  152.910541][ T8030]  kasan_set_free_info+0x24/0x40
[  152.911265][ T8030]  __kasan_slab_free+0xf7/0x140
[  152.911964][ T8030]  kfree+0x9e/0x3d0
[  152.912501][ T8030]  alloc_workqueue+0x7d7/0x8e0
[  152.913182][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[  152.913949][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[  152.914703][ T8030]  scrub_workers_get+0x1e8/0x490
[  152.915402][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[  152.916077][ T8030]  btrfs_ioctl+0x122c/0x4e50
[  152.916729][ T8030]  __x64_sys_ioctl+0x137/0x190
[  152.917414][ T8030]  do_syscall_64+0x34/0xb0
[  152.918034][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  152.918872][ T8030]
[  152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00
[  152.919203][ T8030]  which belongs to the cache kmalloc-512 of size 512
[  152.921155][ T8030] The buggy address is located 256 bytes inside of
[  152.921155][ T8030]  512-byte region [ffff88810d31bc00, ffff88810d31be00)
[  152.922993][ T8030] The buggy address belongs to the page:
[  152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318
[  152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0
[  152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
[  152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80
[  152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[  152.929890][ T8030] page dumped because: kasan: bad access detected
[  152.930759][ T8030]
[  152.931076][ T8030] Memory state around the buggy address:
[  152.931851][ T8030]  ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.932967][ T8030]  ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.935189][ T8030]                    ^
[  152.935763][ T8030]  ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  152.936847][ T8030]  ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  152.937940][ T8030] ==================================================================

If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq(),
which queues a work item that calls pwq_unbound_release_workfn() and uses the
'wq'. The 'wq' allocated in alloc_workqueue() will be freed in the error path
when apply_wqattrs_prepare() fails, so this leads to a UAF.

CPU0                                          CPU1
alloc_workqueue()
alloc_and_link_pwqs()
apply_wqattrs_prepare() fails
apply_wqattrs_cleanup()
schedule_work(&pwq->unbound_release_work)
kfree(wq)
                                              worker_thread()
                                              pwq_unbound_release_workfn() <- trigger uaf here

If apply_wqattrs_prepare() fails, the new pwq is not linked and doesn't hold
any reference to the 'wq', so the 'wq' is invalid to access in the worker.
Fix this by checking whether the pwq is linked.
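
The shape of the fix in pwq_unbound_release_workfn(), sketched:

	/*
	 * When @pwq is not linked, it holds no reference to @wq,
	 * and @wq is invalid to access.
	 */
	if (!list_empty(&pwq->pwqs_node)) {
		mutex_lock(&wq->mutex);
		list_del_rcu(&pwq->pwqs_node);
		is_last = list_empty(&wq->pwqs);
		mutex_unlock(&wq->mutex);
	}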

Fixes: 2d5f0764b5 ("workqueue: split apply_workqueue_attrs() into 3 stages")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-07-21 06:42:31 -10:00
Sergey Senozhatsky 940d71c646 wq: handle VM suspension in stall detection
If a VCPU is suspended (VM suspend) in wq_watchdog_timer_fn(), then
once this VCPU resumes it will see the new jiffies value, while it
may take a while before IRQ detects PVCLOCK_GUEST_STOPPED on this
VCPU and updates all the watchdogs via pvclock_touch_watchdogs().
There is a small chance of misreported WQ stalls in the meantime,
because the new jiffies value is time_after() the old 'ts + thresh'.

wq_watchdog_timer_fn()
{
	for_each_pool(pool, pi) {
		if (time_after(jiffies, ts + thresh)) {
			pr_emerg("BUG: workqueue lockup - pool");
		}
	}
}

Save jiffies at the beginning of this function and use that value
for stall detection. If the VM gets suspended, then we continue using
the "old" jiffies value and the old WQ touch timestamps. If an IRQ at
some point restarts the stall detection cycle (pvclock_touch_watchdogs()),
then the old jiffies value will always be before the new 'ts + thresh'.
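
Sketched against the pseudo-code above:

wq_watchdog_timer_fn()
{
	unsigned long now = jiffies;    /* snapshot once, up front */

	for_each_pool(pool, pi) {
		if (time_after(now, ts + thresh)) {
			pr_emerg("BUG: workqueue lockup - pool");
		}
	}
}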

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-05-20 12:58:30 -04:00
Linus Torvalds 57fa2369ab CFI on arm64 series for v5.13-rc1
- Clean up list_sort prototypes (Sami Tolvanen)
 
 - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)

Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull CFI on arm64 support from Kees Cook:
 "This builds on last cycle's LTO work, and allows the arm64 kernels to
  be built with Clang's Control Flow Integrity feature. This feature has
  happily lived in Android kernels for almost 3 years[1], so I'm excited
  to have it ready for upstream.

  The wide diffstat is mainly due to the treewide fixing of mismatched
  list_sort prototypes. Other things in core kernel are to address
  various CFI corner cases. The largest code portion is the CFI runtime
  implementation itself (which will be shared by all architectures
  implementing support for CFI). The arm64 pieces are Acked by arm64
  maintainers rather than coming through the arm64 tree since carrying
  this tree over there was going to be awkward.

  CFI support for x86 is still under development, but is pretty close.
  There are a handful of corner cases on x86 that need some improvements
  to Clang and objtool, but it otherwise works well.

  Summary:

   - Clean up list_sort prototypes (Sami Tolvanen)

   - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"

* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: allow CONFIG_CFI_CLANG to be selected
  KVM: arm64: Disable CFI for nVHE
  arm64: ftrace: use function_nocfi for ftrace_call
  arm64: add __nocfi to __apply_alternatives
  arm64: add __nocfi to functions that jump to a physical address
  arm64: use function_nocfi with __pa_symbol
  arm64: implement function_nocfi
  psci: use function_nocfi for cpu_resume
  lkdtm: use function_nocfi
  treewide: Change list_sort to use const pointers
  bpf: disable CFI in dispatcher functions
  kallsyms: strip ThinLTO hashes from static functions
  kthread: use WARN_ON_FUNCTION_MISMATCH
  workqueue: use WARN_ON_FUNCTION_MISMATCH
  module: ensure __cfi_check alignment
  mm: add generic function_nocfi macro
  cfi: add __cficanonical
  add support for Clang CFI
2021-04-27 10:16:46 -07:00
Sami Tolvanen 981731129e workqueue: use WARN_ON_FUNCTION_MISMATCH
With CONFIG_CFI_CLANG, a callback function passed to
__queue_delayed_work from a module points to a jump table entry
defined in the module instead of the one used in the core kernel,
which breaks function address equality in this check:

  WARN_ON_ONCE(timer->function != delayed_work_timer_fn);

Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.
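
Before and after (a sketch):

	/* before: address comparison breaks under CFI + modules */
	WARN_ON_ONCE(timer->function != delayed_work_timer_fn);

	/* after: the check is disabled when CONFIG_CFI_CLANG && CONFIG_MODULES */
	WARN_ON_FUNCTION_MISMATCH(timer->function, delayed_work_timer_fn);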

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-6-samitolvanen@google.com
2021-04-08 16:04:21 -07:00
Wang Qing 89e28ce60c workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()
There are two workqueue-specific watchdog timestamps:

    + @wq_watchdog_touched_cpu (per-CPU) updated by
      touch_softlockup_watchdog()

    + @wq_watchdog_touched (global) updated by
      touch_all_softlockup_watchdogs()

watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by known slow code.

Solution:
touch_softlockup_watchdog() must also touch the global @wq_watchdog_touched
timestamp.

The global timestamp can no longer be used for bound workqueues because
it is now updated from all CPUs. Instead, bound workqueues have to check
only @wq_watchdog_touched_cpu and these timestamps have to be updated for
all CPUs in touch_all_softlockup_watchdogs().

Beware:
The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.

It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
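
The touch path then ends up looking roughly like this (a sketch;
upstream updates both timestamps in wq_watchdog_touch()):

	notrace void wq_watchdog_touch(int cpu)
	{
		if (cpu >= 0)
			per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;

		/* also touch the global stamp so unbound wqs see it */
		wq_watchdog_touched = jiffies;
	}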

V3:
- Modify the commit message clearly according to Petr's suggestion.

Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:26:49 -04:00
Zqiang 0687c66b5f workqueue: Move the position of debug_work_activate() in __queue_work()
debug_work_activate() should be called only once it is known that the
work can be inserted, because if the wq is in WQ_DRAINING state the
insertion may fail. Move the call after that check.
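
The reordering in __queue_work() is essentially (a sketch):

	/* debug_work_activate(work) used to be called up here */
	if (unlikely(wq->flags & __WQ_DRAINING) &&
	    WARN_ON_ONCE(!is_chained_work(wq)))
		return;

	debug_work_activate(work);	/* only once insertion is known to proceed */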

Fixes: e41e704bc4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:26:46 -04:00
Linus Torvalds ac9e806c9c Merge branch 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Tracepoint and comment updates only"

* 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Use %s instead of function name
  workqueue: tracing the name of the workqueue instead of it's address
  workqueue: fix annotation for WQ_SYSFS
2021-02-22 17:06:54 -08:00
Stephen Zhang e9ad2eb3d9 workqueue: Use %s instead of function name
It is better to replace the function name with %s, in case the function
name changes.

Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-01-27 09:42:48 -05:00
Peter Zijlstra 640f17c824 workqueue: Restrict affinity change to rescuer
create_worker() will already set the right affinity using
kthread_bind_mask(), which means only the rescuer needs to change
its affinity.

However, during cpu-hot-unplug a regular task is not allowed to run
on a CPU that is online && !active, as it would be pushed away quite
aggressively. We need KTHREAD_IS_PER_CPU to survive in that environment.

Therefore set the affinity after getting that magic flag.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
2021-01-22 15:09:43 +01:00
Peter Zijlstra 5c25b5ff89 workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU.

Workqueues have unfortunate semantics in that per-cpu workers are not
flushed and parked by default during hotplug; however, a subset does a
manual flush on hotplug and hard-relies on them for correctness.

Therefore play silly games..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org
2021-01-22 15:09:42 +01:00
Lai Jiangshan 547a77d02f workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
The scheduler won't break affinity for us any more, so we should
"emulate" the behavior it used to provide when it broke affinity for
us.  That behavior is "changing the cpumask to cpu_possible_mask".

And there might be some other CPUs online later while the worker is
still running with the pending work items.  The worker should be allowed
to use the later online CPUs as before and process the work items ASAP.
If we use cpu_active_mask here, we can't achieve this goal but
using cpu_possible_mask can.
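
In code, the hotplug path then does roughly (a sketch):

	for_each_pool_worker(worker, pool)
		/* emulate what the scheduler used to do for us */
		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
						  cpu_possible_mask) < 0);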

Fixes: 06249738a4 ("workqueue: Manually break affinity on hotplug")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210111152638.2417-4-jiangshanlai@gmail.com
2021-01-22 15:09:41 +01:00
Linus Torvalds c76e02c59e Merge branch 'for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
 "The same as the cgroup tree - one commit which was scheduled for the
  5.11 merge window.

  All the commit does is avoid spurious worker wakeups from the
  workqueue allocation / config change path to help cpuisol use cases"

* 'for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Kick a worker based on the actual activation of delayed works
2020-12-28 11:23:02 -08:00
Linus Torvalds ac73e3dc8a Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few random little subsystems

 - almost all of the MM patches which are staged ahead of linux-next
   material. I'll trickle the post-linux-next work in as the
   dependencies get merged up.

Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
  mm: cleanup kstrto*() usage
  mm: fix fall-through warnings for Clang
  mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
  mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
  mm:backing-dev: use sysfs_emit in macro defining functions
  mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
  mm: use sysfs_emit for struct kobject * uses
  mm: fix kernel-doc markups
  zram: break the strict dependency from lzo
  zram: add stat to gather incompressible pages since zram set up
  zram: support page writeback
  mm/process_vm_access: remove redundant initialization of iov_r
  mm/zsmalloc.c: rework the list_add code in insert_zspage()
  mm/zswap: move to use crypto_acomp API for hardware acceleration
  mm/zswap: fix passing zero to 'PTR_ERR' warning
  mm/zswap: make struct kernel_param_ops definitions const
  userfaultfd/selftests: hint the test runner on required privilege
  userfaultfd/selftests: fix retval check for userfaultfd_open()
  userfaultfd/selftests: always dump something in modes
  userfaultfd: selftests: make __{s,u}64 format specifiers portable
  ...
2020-12-15 12:53:37 -08:00
Walter Wu e89a85d63f workqueue: kasan: record workqueue stack
Patch series "kasan: add workqueue stack for generic KASAN", v5.

Syzbot reports many UAF issues for workqueue, see [1].

In some of these, the access/allocation happened in process_one_work(),
and the free stack in the KASAN report is useless; it doesn't help
programmers solve the workqueue UAF.

This patchset improves KASAN reports by making them include the
workqueue queueing stack.  It is useful for programmers solving
use-after-free or double-free memory issues.

Generic KASAN also records the last two workqueue stacks and prints
them in the KASAN report.  This is only implemented for generic KASAN.

[1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
[2] https://bugzilla.kernel.org/show_bug.cgi?id=198437

This patch (of 4):

When analyzing use-after-free or double-free issue, recording the
enqueuing work stacks is helpful to preserve usage history which
potentially gives a hint about the affected code.

For workqueue it has turned out to be useful to record the enqueuing
work call stacks, because the user can then look at the KASAN report to
determine the root cause.  They don't need to enable debugobjects, yet
they still have a chance to find the root cause.

Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.com
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:42 -08:00
Yunfeng Ye 01341fbd0d workqueue: Kick a worker based on the actual activation of delayed works
In realtime scenarios we do not want interference on the isolated cpu
cores, but invoking alloc_workqueue() for a percpu wq on a housekeeping
cpu kicks a kworker on the isolated cpu.

  alloc_workqueue
    pwq_adjust_max_active
      wake_up_worker

The comment in pwq_adjust_max_active() said:
  "Need to kick a worker after thawed or an unbound wq's
   max_active is bumped"

So it is unnecessary to kick a kworker for a percpu wq when invoking
alloc_workqueue(). This patch kicks a worker only based on the actual
activation of delayed works.
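
The shape of the change in pwq_adjust_max_active() (a sketch; at the
time the list was still called 'delayed_works'):

	bool kick = false;

	while (!list_empty(&pwq->delayed_works) &&
	       pwq->nr_active < pwq->max_active) {
		pwq_activate_first_delayed(pwq);
		kick = true;
	}

	/* only wake a worker if something was actually activated */
	if (kick)
		wake_up_worker(pwq->pool);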

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-11-25 17:10:28 -05:00
Peter Zijlstra 06249738a4 workqueue: Manually break affinity on hotplug
Don't rely on the scheduler to force break affinity for us -- it will
stop doing that for per-cpu-kthreads.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.464718669@infradead.org
2020-11-10 18:38:58 +01:00
Mauro Carvalho Chehab 3eb6b31bfb workqueue: fix a kernel-doc warning
As warned by Sphinx:

	./Documentation/core-api/workqueue:400: ./kernel/workqueue.c:1218: WARNING: Unexpected indentation.

the return code table is currently not recognized, as it lacks
markups.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-16 07:28:20 +02:00
Stephen Boyd f9e62f318f treewide: Make all debug_obj_descriptors const
This should make it harder for the kernel to corrupt the debug object
descriptor, used to call functions to fixup state and track debug objects,
by moving the structure to read-only memory.

Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200815004027.2046113-3-swboyd@chromium.org
2020-09-24 21:56:25 +02:00
Christoph Hellwig fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Lai Jiangshan 10cdb15759 workqueue: use BUILD_BUG_ON() for compile time test instead of WARN_ON()
Any runtime WARN_ON() has to be fixed, and BUILD_BUG_ON() can
help you notice it earlier.
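
For example, an alignment invariant can move from runtime to compile
time like this (a sketch; the exact expression follows the upstream
change):

	/* runtime: only fires if the bad config actually boots */
	WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));

	/* compile time: the build fails immediately */
	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));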

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-06-01 11:02:42 -04:00
Lai Jiangshan b8f06b0444 workqueue: remove useless unlock() and lock() in series
There is no point in unlock()ing and then lock()ing the same mutex
back to back.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:25:23 -04:00
Lai Jiangshan 4f3f4cf388 workqueue: void unneeded requeuing the pwq in rescuer thread
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration.  Unfortunately, it checks only whether the pool needs
help from rescuers, but it doesn't check whether the pwq has work items
in the pool (the real reason that this rescuer can help the pool).

The patch adds the check and avoids the unneeded requeuing.
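
The added condition in rescuer_thread() (a sketch):

	/* requeue only if this pwq still has work items needing rescue */
	if (pwq->nr_active && need_to_create_worker(pool)) {
		/* re-add the pwq to the mayday list for another pass */
	}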

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:22:10 -04:00
Sebastian Andrzej Siewior a9b8a98529 workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t
The workqueue code has its internal spinlocks (pool::lock), which
are acquired on most workqueue operations. These spinlocks are
converted to 'sleeping' spinlocks on a RT-kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

The pool::lock hold times are bounded and the code sections are
relatively short, which allows converting pool::lock, and as a
consequence wq_mayday_lock, to raw spinlocks which are truly spinning
locks even on a PREEMPT_RT kernel.

With the previous conversion of the manager waitqueue to a simple
waitqueue workqueues are now fully RT compliant.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:03:47 -04:00
Sebastian Andrzej Siewior d8bb65ab70 workqueue: Use rcuwait for wq_manager_wait
The workqueue code has its internal spinlock (pool::lock) and also
implicit spinlock usage in the wq_manager waitqueue. These spinlocks
are converted to 'sleeping' spinlocks on a RT-kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

pool::lock can be converted to a raw spinlock as the lock held times
are short. But the workqueue manager waitqueue is handled inside of
pool::lock held regions which again violates the lock nesting rules
of raw and regular spinlocks.

The manager waitqueue has no special requirements like custom wakeup
callbacks or mass wakeups. While it does not use exclusive wait mode
explicitly there is no strict requirement to queue the waiters in a
particular order as there is only one waiter at a time.

This allows replacing the waitqueue with rcuwait, which solves the
locking problem because rcuwait relies on existing locking.
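
The resulting pattern (a sketch of the rcuwait API as used here):

	static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER(manager_wait);

	/* waiter side */
	rcuwait_wait_event(&manager_wait,
			   !(pool->flags & POOL_MANAGER_ACTIVE),
			   TASK_UNINTERRUPTIBLE);

	/* waker side */
	rcuwait_wake_up(&manager_wait);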

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:00:35 -04:00
Zhang Qiang 342ed2400b workqueue: Remove unnecessary kfree() call in rcu_free_wq()
The data structure member "wq->rescuer" was reset to a null pointer
in one if branch and later passed to kfree() in the callback function
rcu_free_wq() (which was eventually executed).  kfree() is a no-op for
a null pointer, so delete this function call, which became unnecessary
with the referenced software update.

Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")

Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-27 09:52:41 -04:00
Dan Carpenter b92b36eadf workqueue: Fix an use after free in init_rescuer()
We need to preserve error code before freeing "rescuer".
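
The fix in init_rescuer() (a sketch):

	if (IS_ERR(rescuer->task)) {
		ret = PTR_ERR(rescuer->task);	/* save before the free */
		kfree(rescuer);
		return ret;
	}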

Fixes: f187b6974f ("workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-11 10:25:42 -04:00
Sean Fu f187b6974f workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.
Replace inline function PTR_ERR_OR_ZERO with IS_ERR and PTR_ERR to
remove redundant parameter definitions and checks.
Reduce code size.
Before:
   text	   data	    bss	    dec	    hex	filename
  47510	   5979	    840	  54329	   d439	kernel/workqueue.o
After:
   text	   data	    bss	    dec	    hex	filename
  47474	   5979	    840	  54293	   d415	kernel/workqueue.o

Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-05 11:56:07 -04:00
Sebastian Andrzej Siewior 62849a9612 workqueue: Remove the warning in wq_worker_sleeping()
The kernel test robot triggered a warning with the following race:
   task-ctx A                            interrupt-ctx B
 worker
  -> process_one_work()
    -> work_item()
      -> schedule();
         -> sched_submit_work()
           -> wq_worker_sleeping()
             -> ->sleeping = 1
               atomic_dec_and_test(nr_running)
         __schedule();                *interrupt*
                                       async_page_fault()
                                       -> local_irq_enable();
                                       -> schedule();
                                          -> sched_submit_work()
                                            -> wq_worker_sleeping()
                                               -> if (WARN_ON(->sleeping)) return
                                          -> __schedule()
                                            ->  sched_update_worker()
                                              -> wq_worker_running()
                                                 -> atomic_inc(nr_running);
                                                 -> ->sleeping = 0;

      ->  sched_update_worker()
        -> wq_worker_running()
          if (!->sleeping) return

In this context the warning is pointless; everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 -> 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) but incremented only once (by B;
A will skip it). This remains the case until ->sleeping is zeroed again
in wq_worker_running().

Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.
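
The caller then provides that protection (a sketch of the
sched_submit_work() path):

	preempt_disable();	/* protects ->sleeping; not an optimisation */
	wq_worker_sleeping(tsk);
	preempt_enable_no_resched();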

Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
2020-04-08 11:35:20 +02:00
Linus Torvalds 0adb8bc039 Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Nothing too interesting. Just two trivial patches"

* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Mark up unlocked access to wq->first_flusher
  workqueue: Make workqueue_init*() return void
2020-04-03 12:27:36 -07:00
Chris Wilson 00d5d15b06 workqueue: Mark up unlocked access to wq->first_flusher
[ 7329.671518] BUG: KCSAN: data-race in flush_workqueue / flush_workqueue
[ 7329.671549]
[ 7329.671572] write to 0xffff8881f65fb250 of 8 bytes by task 37173 on cpu 2:
[ 7329.671607]  flush_workqueue+0x3bc/0x9b0 (kernel/workqueue.c:2844)
[ 7329.672527]
[ 7329.672540] read to 0xffff8881f65fb250 of 8 bytes by task 37175 on cpu 0:
[ 7329.672571]  flush_workqueue+0x28d/0x9b0 (kernel/workqueue.c:2835)
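
The annotation amounts to (a sketch):

	/* lockless peek at the first flusher */
	if (READ_ONCE(wq->first_flusher) != &this_flusher)
		return;
	...
	WRITE_ONCE(wq->first_flusher, NULL);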

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-12 14:26:50 -04:00
Hillf Danton aa202f1f56 workqueue: don't use wq_select_unbound_cpu() for bound works
wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.

Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.

Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement.  So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.
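
The corrected selection in __queue_work() behaves like this (a sketch):

	if (req_cpu == WORK_CPU_UNBOUND) {
		if (wq->flags & WQ_UNBOUND)
			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
		else
			cpu = raw_smp_processor_id();	/* bound: stay local */
	}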

Fixes: ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton <hdanton@sina.com>
[dj: massage changelog]
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-10 10:30:51 -04:00
Yu Chen 2333e82995 workqueue: Make workqueue_init*() return void
The return values of workqueue_init() and workqueue_early_init() are
always 0, and there is no usage of their return value.  So just make
them return void.

Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-04 11:21:49 -05:00
Linus Torvalds c677124e63 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...
2020-01-28 10:07:09 -08:00
Daniel Jordan 1c5da0ec7f workqueue: add worker function to workqueue_execute_end tracepoint
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.

You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.

Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-01-15 08:02:47 -08:00
Ingo Molnar 1e5f8a3085 Linux 5.5-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4AEiYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGR3sH/ixrBBYUVyjRPOxS
 ce4iVoTqphGSoAzq/3FA1YZZOPQ/Ep0NXL4L2fTGxmoiqIiuy8JPp07/NKbHQjj1
 Rt6PGm6cw2pMJHaK9gRdlTH/6OyXkp06OkH1uHqKYrhPnpCWDnj+i2SHAX21Hr1y
 oBQh4/XKvoCMCV96J2zxRsLvw8OkQFE0ouWWfj6LbpXIsmWZ++s0OuaO1cVdP/oG
 j+j2Voi3B3vZNQtGgJa5W7YoZN5Qk4ZIj9bMPg7bmKRd3wNB228AiJH2w68JWD/I
 jCA+JcITilxC9ud96uJ6k7SMS2ufjQlnP0z6Lzd0El1yGtHYRcPOZBgfOoPU2Euf
 33WGSyI=
 =iEwx
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc3' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:41:37 +01:00
Sebastian Andrzej Siewior 025f50f386 sched/rt, workqueue: Use PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Update the comment to use PREEMPTION because it is true for both
preemption models.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20191015191821.11479-35-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:37 +01:00
Kefeng Wang 1d9a6159bd workqueue: Use pr_warn instead of pr_warning
Use pr_warn() instead of the remaining pr_warning() calls.

Link: http://lkml.kernel.org/r/20191128004752.35268-2-wangkefeng.wang@huawei.com
To: joe@perches.com
To: linux-kernel@vger.kernel.org
Cc: gregkh@linuxfoundation.org
Cc: tj@kernel.org
Cc: arnd@arndb.de
Cc: sergey.senozhatsky@gmail.com
Cc: rostedt@goodmis.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-12-06 09:59:30 +01:00
Linus Torvalds 1ae78780ed Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Dynamic tick (nohz) updates, perhaps most notably changes to force
     the tick on when needed due to lengthy in-kernel execution on CPUs
     on which RCU is waiting.

   - Linux-kernel memory consistency model updates.

   - Replace rcu_swap_protected() with rcu_replace_pointer().

   - Torture-test updates.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
  fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
  drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
  drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
  x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
  rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
  rcu: Suppress levelspread uninitialized messages
  rcu: Fix uninitialized variable in nocb_gp_wait()
  rcu: Update descriptions for rcu_future_grace_period tracepoint
  rcu: Update descriptions for rcu_nocb_wake tracepoint
  rcu: Remove obsolete descriptions for rcu_barrier tracepoint
  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
  workqueue: Convert for_each_wq to use built-in list check
  rcu: Several rcu_segcblist functions can be static
  rcu: Remove unused function hlist_bl_del_init_rcu()
  Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
  ...
2019-11-26 15:42:43 -08:00
Sebastian Andrzej Siewior 49e9d1a9fa workqueue: Add RCU annotation for pwq list walk
An additional check has recently been added to ensure that an RCU-related
lock is held while the RCU list is iterated.
The `pwqs' are sometimes iterated without an RCU lock but with the
&wq->mutex acquired, leading to a warning.

Teach list_for_each_entry_rcu() that the RCU usage is okay if &wq->mutex
is acquired during the list traversal.
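
The traversal then carries the lockdep condition (a sketch, as in the
for_each_pwq() helper):

	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,
				lockdep_is_held(&(wq)->mutex))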

Fixes: 28875945ba ("rcu: Add support for consolidated-RCU reader checking")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-15 11:53:35 -08:00
Joel Fernandes (Google) 5a6446626d workqueue: Convert for_each_wq to use built-in list check
Because list_for_each_entry_rcu() can now check for holding a
lock as well as for being in an RCU read-side critical section,
this commit replaces the workqueue_sysfs_unregister() function's
use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with
list_for_each_entry_rcu() augmented with a lockdep_is_held() optional
argument.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30 08:34:10 -07:00
Tejun Heo e66b39af00 workqueue: Fix pwq ref leak in rescuer_thread()
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration.  Unfortunately, it doesn't check whether the pwq is
already on the mayday list and unconditionally gets the ref and moves
it onto the list.  This doesn't corrupt the list but creates an
additional reference to the pwq.  It got queued twice but will only be
removed once.
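
The guard added on the requeue path (a sketch):

	/* one mayday entry, one ref: don't queue the pwq twice */
	if (wq->rescuer && list_empty(&pwq->mayday_node)) {
		get_pwq(pwq);
		list_add_tail(&pwq->mayday_node, &wq->maydays);
	}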

This leak later can trigger pwq refcnt warning on workqueue
destruction and prevent freeing of the workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # v3.19+
2019-10-04 10:23:11 -07:00
Tejun Heo c29eb85386 workqueue: more destroy_workqueue() fixes
destroy_workqueue() warnings still trigger spuriously, albeit at a
lower frequency.  The problem seems to be in-flight operations which
haven't reached put_pwq() yet.

* Make the sanity check grab all the related locks so that it's
  synchronized against operations which put the pwq at the end.

* Always print out the offending pwq.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
2019-10-04 10:23:01 -07:00
Tejun Heo 30ae2fc0a7 workqueue: Minor follow-ups to the rescuer destruction change
* Now that wq->rescuer may be cleared while the rescuer is still there,
  switch the show_pwq() debug printout to test worker->rescue_wq to
  identify rescuers instead of testing wq->rescuer.

* Update comment on ->rescuer locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
2019-09-20 14:09:14 -07:00
Tejun Heo 8efe1223d7 workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Qian Cai <cai@lca.pw>
Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")
2019-09-20 13:39:57 -07:00
Tejun Heo def98c84b6 workqueue: Fix spurious sanity check failures in destroy_workqueue()
Before actually destroying a workqueue, destroy_workqueue() checks
whether it's actually idle.  If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling.  It unfortunately
has a couple of issues.

* Mayday list queueing increments pwq's refcnts which gets detected as
  busy and fails the sanity checks.  However, because mayday list
  queueing is asynchronous, this condition can happen without any
  actual work items left in the workqueue.

* Sanity check failure leaves the sysfs interface behind too which can
  lead to init failure of newer instances of the workqueue.

This patch fixes the above two by

* If a workqueue has a rescuer, disable and kill the rescuer before
  sanity checks.  Disabling and killing is guaranteed to flush the
  existing mayday list.

* Remove sysfs interface before sanity checks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
2019-09-18 18:45:23 -07:00
Daniel Jordan 509b320489 workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.

Avoids lockdep complaints about nested calls to get_online_cpus in a
future patch where padata calls apply_workqueue_attrs when changing
other CPU-hotplug-sensitive data structures with the CPU read lock
already held.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan 513c98d086 workqueue: unconfine alloc/apply/free_workqueue_attrs()
padata will use these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:39 +10:00
Thomas Gleixner be69d00d97 workqueue: Remove GFP argument from alloc_workqueue_attrs()
All callers use GFP_KERNEL. No point in having that argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:19 -07:00
Thomas Gleixner 2c9858ecbe workqueue: Make alloc/apply/free_workqueue_attrs() static
None of those functions have any users outside of workqueue.c. Confine
them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:15 -07:00
Thomas Gleixner 457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Linus Torvalds 23c970608a Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Only three commits, of which two are trivial.

  The non-trivial change is Thomas's patch to switch workqueue from
  sched RCU to regular one. The use of sched RCU is mostly historic and
  doesn't really buy us anything noticeable"

* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Use normal rcu
  kernel/workqueue: Document wq_worker_last_func() argument
  kernel/workqueue: Use __printf markup to silence compiler in function 'alloc_workqueue'
2019-05-09 13:48:52 -07:00
Linus Torvalds 0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Thomas Gleixner 6d25be5782 sched/core, workqueues: Distangle worker accounting from rq lock
The worker accounting for CPU bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.

Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. nr_running is updated when the
task returns from schedule(), and it is later compared when it is done
from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-16 16:55:15 +02:00
Sakari Ailus d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Thomas Gleixner 24acfb7182 workqueue: Use normal rcu
There is no need for sched_rcu. The undocumented reason why sched_rcu
is used is to avoid a few explicit rcu_read_lock()/unlock() pairs, by
relying on the fact that sched_rcu reader-side critical sections are
also protected by preempt- or irq-disabled regions.

Replace rcu_read_lock_sched with rcu_read_lock and acquire the RCU lock
where it is not yet explicitly acquired. Replace local_irq_disable() with
rcu_read_lock(). Update asserts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: mangle changelog a little]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-04-08 12:37:43 -07:00
Bart Van Assche 82efcab3b9 workqueue: Only unregister a registered lockdep key
The recent change to prevent use after free and a memory leak introduced an
unconditional call to wq_unregister_lockdep() in the error handling
path. If the lockdep key had not been registered yet, then the lockdep core
emits a warning.

Only call wq_unregister_lockdep() if wq_register_lockdep() has been
called first.

Fixes: 009bb421b6 ("workqueue, lockdep: Fix an alloc_workqueue() error path")
Reported-by: syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org
2019-03-21 12:00:18 +01:00
Bart Van Assche 8194fe94ab kernel/workqueue: Document wq_worker_last_func() argument
This patch avoids that the following warning is reported when building
with W=1:

kernel/workqueue.c:938: warning: Function parameter or member 'task' not described in 'wq_worker_last_func'

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-19 10:48:20 -07:00
Mathieu Malaterre a2775bbc1d kernel/workqueue: Use __printf markup to silence compiler in function 'alloc_workqueue'
Silence warnings (triggered at W=1) by adding relevant __printf attributes.

  kernel/workqueue.c:4249:2: warning: function 'alloc_workqueue' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-15 08:47:22 -07:00
Linus Torvalds 9e55f87c0e Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A few fixes for lockdep:

   - initialize lockdep internal RCU head after initializing RCU

   - prevent use after free in an alloc_workqueue() error handling path

   - plug a memory leak in the workqueue core which fails to free a
     dynamically allocated lock name.

   - make Clang happy"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  workqueue, lockdep: Fix a memory leak in wq->lock_name
  workqueue, lockdep: Fix an alloc_workqueue() error path
  locking/lockdep: Only call init_rcu_head() after RCU has been initialized
  locking/lockdep: Avoid a Clang warning
2019-03-10 13:48:14 -07:00
Qian Cai 69a106c00e workqueue, lockdep: Fix a memory leak in wq->lock_name
The following commit:

  669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys for workqueues")

introduced a memory leak as wq_free_lockdep() calls kfree(wq->lock_name),
but wq_init_lockdep() does not point wq->lock_name to the newly allocated
slab object.

This can be reproduced by running LTP fallocate04 followed by oom01 tests:

 unreferenced object 0xc0000005876384d8 (size 64):
  comm "fallocate04", pid 26972, jiffies 4297139141 (age 40370.480s)
  hex dump (first 32 bytes):
    28 77 71 5f 63 6f 6d 70 6c 65 74 69 6f 6e 29 65  (wq_completion)e
    78 74 34 2d 72 73 76 2d 63 6f 6e 76 65 72 73 69  xt4-rsv-conversi
  backtrace:
    [<00000000cb452883>] kvasprintf+0x6c/0xe0
    [<000000004654ddac>] kasprintf+0x34/0x60
    [<000000001c68f311>] alloc_workqueue+0x1f8/0x6ac
    [<0000000003c2ad83>] ext4_fill_super+0x23d4/0x3c80 [ext4]
    [<0000000006610538>] mount_bdev+0x25c/0x290
    [<00000000bcf955ec>] ext4_mount+0x28/0x50 [ext4]
    [<0000000016e08fd3>] legacy_get_tree+0x4c/0xb0
    [<0000000042b6a5fc>] vfs_get_tree+0x6c/0x190
    [<00000000268ab022>] do_mount+0xb9c/0x1100
    [<00000000698e6898>] ksys_mount+0x158/0x180
    [<0000000064e391fd>] sys_mount+0x20/0x30
    [<00000000ba378f12>] system_call+0x5c/0x70
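
The one-line fix in wq_init_lockdep() (a sketch):

	lock_name = kasprintf(GFP_KERNEL, "%s%s", "(wq_completion)", wq->name);
	if (!lock_name)
		lock_name = wq->name;

	wq->lock_name = lock_name;	/* so wq_free_lockdep() can kfree() it */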

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: catalin.marinas@arm.com
Cc: jiangshanlai@gmail.com
Cc: tj@kernel.org
Fixes: 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys for workqueues")
Link: https://lkml.kernel.org/r/20190307002731.47371-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-09 14:15:52 +01:00
Bart Van Assche 009bb421b6 workqueue, lockdep: Fix an alloc_workqueue() error path
This patch fixes a use-after-free and a memory leak in an alloc_workqueue()
error path.

Reported by syzkaller and KASAN:

  BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:197 [inline]
  BUG: KASAN: use-after-free in lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023
  Read of size 8 at addr ffff888090fc2698 by task syz-executor134/7858

  CPU: 1 PID: 7858 Comm: syz-executor134 Not tainted 5.0.0-rc8-next-20190301 #1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
   kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
   __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
   __read_once_size include/linux/compiler.h:197 [inline]
   lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023
   wq_init_lockdep kernel/workqueue.c:3444 [inline]
   alloc_workqueue+0x427/0xe70 kernel/workqueue.c:4263
   ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
   misc_open+0x398/0x4c0 drivers/char/misc.c:141
   chrdev_open+0x247/0x6b0 fs/char_dev.c:417
   do_dentry_open+0x488/0x1160 fs/open.c:771
   vfs_open+0xa0/0xd0 fs/open.c:880
   do_last fs/namei.c:3416 [inline]
   path_openat+0x10e9/0x46e0 fs/namei.c:3533
   do_filp_open+0x1a1/0x280 fs/namei.c:3563
   do_sys_open+0x3fe/0x5d0 fs/open.c:1063
   __do_sys_openat fs/open.c:1090 [inline]
   __se_sys_openat fs/open.c:1084 [inline]
   __x64_sys_openat+0x9d/0x100 fs/open.c:1084
   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Allocated by task 7789:
   save_stack+0x45/0xd0 mm/kasan/common.c:75
   set_track mm/kasan/common.c:87 [inline]
   __kasan_kmalloc mm/kasan/common.c:497 [inline]
   __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
   kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511
   __do_kmalloc mm/slab.c:3726 [inline]
   __kmalloc+0x15c/0x740 mm/slab.c:3735
   kmalloc include/linux/slab.h:553 [inline]
   kzalloc include/linux/slab.h:743 [inline]
   alloc_workqueue+0x13c/0xe70 kernel/workqueue.c:4236
   ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
   misc_open+0x398/0x4c0 drivers/char/misc.c:141
   chrdev_open+0x247/0x6b0 fs/char_dev.c:417
   do_dentry_open+0x488/0x1160 fs/open.c:771
   vfs_open+0xa0/0xd0 fs/open.c:880
   do_last fs/namei.c:3416 [inline]
   path_openat+0x10e9/0x46e0 fs/namei.c:3533
   do_filp_open+0x1a1/0x280 fs/namei.c:3563
   do_sys_open+0x3fe/0x5d0 fs/open.c:1063
   __do_sys_openat fs/open.c:1090 [inline]
   __se_sys_openat fs/open.c:1084 [inline]
   __x64_sys_openat+0x9d/0x100 fs/open.c:1084
   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 7789:
   save_stack+0x45/0xd0 mm/kasan/common.c:75
   set_track mm/kasan/common.c:87 [inline]
   __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
   kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
   __cache_free mm/slab.c:3498 [inline]
   kfree+0xcf/0x230 mm/slab.c:3821
   alloc_workqueue+0xc3e/0xe70 kernel/workqueue.c:4295
   ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
   misc_open+0x398/0x4c0 drivers/char/misc.c:141
   chrdev_open+0x247/0x6b0 fs/char_dev.c:417
   do_dentry_open+0x488/0x1160 fs/open.c:771
   vfs_open+0xa0/0xd0 fs/open.c:880
   do_last fs/namei.c:3416 [inline]
   path_openat+0x10e9/0x46e0 fs/namei.c:3533
   do_filp_open+0x1a1/0x280 fs/namei.c:3563
   do_sys_open+0x3fe/0x5d0 fs/open.c:1063
   __do_sys_openat fs/open.c:1090 [inline]
   __se_sys_openat fs/open.c:1084 [inline]
   __x64_sys_openat+0x9d/0x100 fs/open.c:1084
   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  The buggy address belongs to the object at ffff888090fc2580
   which belongs to the cache kmalloc-512 of size 512
  The buggy address is located 280 bytes inside of
   512-byte region [ffff888090fc2580, ffff888090fc2780)

Reported-by: syzbot+17335689e239ce135d8b@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys for workqueues")
Link: https://lkml.kernel.org/r/20190303220046.29448-1-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-09 14:15:52 +01:00
Linus Torvalds b5dd0c658c Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - some of the rest of MM

 - various misc things

 - dynamic-debug updates

 - checkpatch

 - some epoll speedups

 - autofs

 - rapidio

 - lib/, lib/lzo/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
  samples/mic/mpssd/mpssd.h: remove duplicate header
  kernel/fork.c: remove duplicated include
  include/linux/relay.h: fix percpu annotation in struct rchan
  arch/nios2/mm/fault.c: remove duplicate include
  unicore32: stop printing the virtual memory layout
  MAINTAINERS: fix GTA02 entry and mark as orphan
  mm: create the new vm_fault_t type
  arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
  arch: simplify several early memory allocations
  openrisc: simplify pte_alloc_one_kernel()
  sh: prefer memblock APIs returning virtual address
  microblaze: prefer memblock API returning virtual address
  powerpc: prefer memblock APIs returning virtual address
  lib/lzo: separate lzo-rle from lzo
  lib/lzo: implement run-length encoding
  lib/lzo: fast 8-byte copy on arm64
  lib/lzo: 64-bit CTZ on arm64
  lib/lzo: tidy-up ifdefs
  ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
  ipc: annotate implicit fall through
  ...
2019-03-07 19:25:37 -08:00
Johannes Weiner 4b04700275 kernel: workqueue: clarify wq_worker_last_func() caller requirements
This function can only be called safely from very specific scheduler
contexts.  Document those.

Link: http://lkml.kernel.org/r/20190206150528.31198-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07 18:32:01 -08:00
Linus Torvalds abf7c3d8dd Merge branch 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "All trivial. Two comment updates and one more initialization sanity
  check in flush_work()"

* 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix spelling in source code comments
  workqueue: fix typo in comment
  workqueue: Try to catch flush_work() without INIT_WORK().
2019-03-07 10:09:52 -08:00
Linus Torvalds e431f2d74e Driver core patches for 5.1-rc1
Here is the big driver core patchset for 5.1-rc1
 
 More patches than "normal" here this merge window, due to some work in
 the driver core by Alexander Duyck to rework the async probe
 functionality to work better for a number of devices, and independent
 work from Rafael for the device link functionality to make it work
 "correctly".
 
 Also in here is:
 	- lots of BUS_ATTR() removals, the macro is about to go away
 	- firmware test fixups
 	- ihex fixups and simplification
 	- component additions (also includes i915 patches)
 	- lots of minor coding style fixups and cleanups.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXH+euQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynyTgCfbV8CLums843sBnT8NnWrTMTdTCcAn1K4re0m
 ep8g+6oRLxJy414hogxQ
 =bLs2
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the big driver core patchset for 5.1-rc1

  More patches than "normal" here this merge window, due to some work in
  the driver core by Alexander Duyck to rework the async probe
  functionality to work better for a number of devices, and independent
  work from Rafael for the device link functionality to make it work
  "correctly".

  Also in here is:

   - lots of BUS_ATTR() removals, the macro is about to go away

   - firmware test fixups

   - ihex fixups and simplification

   - component additions (also includes i915 patches)

   - lots of minor coding style fixups and cleanups.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits)
  driver core: platform: remove misleading err_alloc label
  platform: set of_node in platform_device_register_full()
  firmware: hardcode the debug message for -ENOENT
  driver core: Add missing description of new struct device_link field
  driver core: Fix PM-runtime for links added during consumer probe
  drivers/component: kerneldoc polish
  async: Add cmdline option to specify drivers to be async probed
  driver core: Fix possible supplier PM-usage counter imbalance
  PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
  driver: platform: Support parsing GpioInt 0 in platform_get_irq()
  selftests: firmware: fix verify_reqs() return value
  Revert "selftests: firmware: remove use of non-standard diff -Z option"
  Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config"
  device: Fix comment for driver_data in struct device
  kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
  sysfs: remove unused include of kernfs-internal.h
  driver core: Postpone DMA tear-down until after devres release
  driver core: Document limitation related to DL_FLAG_RPM_ACTIVE
  PM-runtime: Take suppliers into account in __pm_runtime_set_status()
  device.h: Add __cold to dev_<level> logging functions
  ...
2019-03-06 14:52:48 -08:00
Bart Van Assche bf393fd4a3 workqueue: Fix spelling in source code comments
Change "execuing" into "executing" and "guarnateed" into "guaranteed".

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-05 07:52:39 -08:00
Bart Van Assche 669de8bda8 kernel/workqueue: Use dynamic lockdep keys for workqueues
The following commit:

  87915adc3f ("workqueue: re-add lockdep dependencies for flushing")

improved deadlock checking in the workqueue implementation. Unfortunately
that patch also introduced a few false positive lockdep complaints.

This patch suppresses these false positives by allocating the workqueue mutex
lockdep key dynamically.

An example of a false positive lockdep complaint suppressed by this patch
can be found below. The root cause of the lockdep complaint shown below
is that the direct I/O code can call alloc_workqueue() from inside a work
item created by another alloc_workqueue() call and that both workqueues
share the same lockdep key. This patch avoids triggering that lockdep
complaint by allocating the workqueue lockdep keys dynamically.

In other words, this patch guarantees that a unique lockdep key is
associated with each work queue mutex.
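
The mechanism, condensed (helper and field names follow the patch;
kasprintf() failure handling simplified):

  static void wq_init_lockdep(struct workqueue_struct *wq)
  {
  	char *lock_name;

  	/* Register a fresh lockdep key owned by this workqueue... */
  	lockdep_register_key(&wq->key);
  	lock_name = kasprintf(GFP_KERNEL, "(wq_completion)%s", wq->name);
  	if (!lock_name)
  		lock_name = wq->name;
  	/* ...and initialize the wq lockdep map with that unique key. */
  	lockdep_init_map(&wq->lockdep_map, lock_name, &wq->key, 0);
  }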

  ======================================================
  WARNING: possible circular locking dependency detected
  4.19.0-dbg+ #1 Not tainted
  fio/4129 is trying to acquire lock:
  00000000a01cfe1a ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970

  but task is already holding lock:
  00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (&sb->s_type->i_mutex_key#14){+.+.}:
         down_write+0x3d/0x80
         __generic_file_fsync+0x77/0xf0
         ext4_sync_file+0x3c9/0x780
         vfs_fsync_range+0x66/0x100
         dio_complete+0x2f5/0x360
         dio_aio_complete_work+0x1c/0x20
         process_one_work+0x481/0x9f0
         worker_thread+0x63/0x5a0
         kthread+0x1cf/0x1f0
         ret_from_fork+0x24/0x30

  -> #1 ((work_completion)(&dio->complete_work)){+.+.}:
         process_one_work+0x447/0x9f0
         worker_thread+0x63/0x5a0
         kthread+0x1cf/0x1f0
         ret_from_fork+0x24/0x30

  -> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}:
         lock_acquire+0xc5/0x200
         flush_workqueue+0xf3/0x970
         drain_workqueue+0xec/0x220
         destroy_workqueue+0x23/0x350
         sb_init_dio_done_wq+0x6a/0x80
         do_blockdev_direct_IO+0x1f33/0x4be0
         __blockdev_direct_IO+0x79/0x86
         ext4_direct_IO+0x5df/0xbb0
         generic_file_direct_write+0x119/0x220
         __generic_file_write_iter+0x131/0x2d0
         ext4_file_write_iter+0x3fa/0x710
         aio_write+0x235/0x330
         io_submit_one+0x510/0xeb0
         __x64_sys_io_submit+0x122/0x340
         do_syscall_64+0x71/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:

  Chain exists of:
    (wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#14

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&sb->s_type->i_mutex_key#14);
                                 lock((work_completion)(&dio->complete_work));
                                 lock(&sb->s_type->i_mutex_key#14);
    lock((wq_completion)"dio/%s"sb->s_id);

   *** DEADLOCK ***

  1 lock held by fio/4129:
   #0: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710

  stack backtrace:
  CPU: 3 PID: 4129 Comm: fio Not tainted 4.19.0-dbg+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  Call Trace:
   dump_stack+0x86/0xc5
   print_circular_bug.isra.32+0x20a/0x218
   __lock_acquire+0x1c68/0x1cf0
   lock_acquire+0xc5/0x200
   flush_workqueue+0xf3/0x970
   drain_workqueue+0xec/0x220
   destroy_workqueue+0x23/0x350
   sb_init_dio_done_wq+0x6a/0x80
   do_blockdev_direct_IO+0x1f33/0x4be0
   __blockdev_direct_IO+0x79/0x86
   ext4_direct_IO+0x5df/0xbb0
   generic_file_direct_write+0x119/0x220
   __generic_file_write_iter+0x131/0x2d0
   ext4_file_write_iter+0x3fa/0x710
   aio_write+0x235/0x330
   io_submit_one+0x510/0xeb0
   __x64_sys_io_submit+0x122/0x340
   do_syscall_64+0x71/0x220
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190214230058.196511-20-bvanassche@acm.org
[ Reworked the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28 07:55:47 +01:00
Liu Song 8bdc620178 workqueue: fix typo in comment
qeueue/queue

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-02-21 08:03:38 -08:00
Greg Kroah-Hartman 9481caf39b Merge 5.0-rc6 into driver-core-next
We need the debugfs fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-11 09:09:02 +01:00
Johannes Weiner 1b69ac6b40 psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
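
Condensed, the two halves of the mechanism look roughly like this
(worker->last_func is the tag added by this patch; the psi hook is
abbreviated):

  work_func_t wq_worker_last_func(struct task_struct *task)
  {
  	struct worker *worker = kthread_data(task);

  	return worker->last_func;	/* recorded in process_one_work() */
  }

  /* In psi's scheduler callback, roughly: */
  	if ((task->flags & PF_WQ_WORKER) &&
  	    wq_worker_last_func(task) == psi_update_work)
  		wake_clock = false;	/* aggregator going idle: leave clock off */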

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Alexander Duyck 8204e0c111 workqueue: Provide queue_work_node to queue work near a given NUMA node
Provide a new function, queue_work_node, which is meant to schedule work on
a "random" CPU of the requested NUMA node. The main motivation for this is
to help assist asynchronous init to better improve boot times for devices
that are local to a specific node.

For now we just default to the first CPU in the intersection of the node's
cpumask and the online cpumask. The only exception is that if the current
CPU is already local to the requested node, we use it directly. This should
work for our purposes, as we are currently only using this for unbound work,
so the CPU will be translated back to a node anyway rather than being used
directly.

As we are only using the first CPU to represent the NUMA node for now I am
limiting the scope of the function so that it can only be used with unbound
workqueues.
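
Condensed sketch of the CPU selection described above (following the
patch; node validity checks abbreviated):

  static int workqueue_select_cpu_near(int node)
  {
  	int cpu;

  	/* Fall back to an unbound CPU for invalid or offline nodes */
  	if (!wq_numa_enabled || node < 0 || node >= MAX_NUMNODES ||
  	    !node_online(node))
  		return WORK_CPU_UNBOUND;

  	/* Use the current CPU if it is already on the requested node */
  	cpu = raw_smp_processor_id();
  	if (node == cpu_to_node(cpu))
  		return cpu;

  	/* Otherwise take the first online CPU of that node */
  	cpu = cpumask_first_and(cpumask_of_node(node), cpu_online_mask);
  	return cpu < nr_cpu_ids ? cpu : WORK_CPU_UNBOUND;
  }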

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 14:20:54 +01:00
Tetsuo Handa 4d43d395fe workqueue: Try to catch flush_work() without INIT_WORK().
syzbot found a flush_work() caller who forgot to call INIT_WORK()
because that work_struct was allocated by kzalloc() [1]. But the message

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.

printed by lock_map_acquire() fails to make clear that INIT_WORK() is missing.

Since flush_work() without INIT_WORK() is a bug, and INIT_WORK() should
set the ->func field to a non-zero value, let's warn if the ->func field
is zero.

[1] https://syzkaller.appspot.com/bug?id=a5954455fcfa51c29ca2ab55b203076337e1c770
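
The check itself is small; condensed from __flush_work() after this
patch:

  	if (WARN_ON(!work->func))	/* kzalloc()'d but never INIT_WORK()'d */
  		return false;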

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-01-25 07:28:29 -08:00
Paul E. McKenney 25b0077511 workqueue: Replace call_rcu_sched() with call_rcu()
Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched().  This commit therefore makes that change.
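
The conversion has the same shape at every call site, e.g. for the
pool_workqueue free path in kernel/workqueue.c:

  -	call_rcu_sched(&pwq->rcu, rcu_free_pwq);
  +	call_rcu(&pwq->rcu, rcu_free_pwq);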

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
2018-11-27 09:21:44 -08:00
Vincent Whitchurch cb9d7fd51d watchdog: Mark watchdog touch functions as notrace
Some architectures need to use stop_machine() to patch functions for
ftrace, and the assumption is that the stopped CPUs do not make function
calls to traceable functions when they are in the stopped state.

Commit ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after
MULTI_STOP_PREPARE") added calls to the watchdog touch functions from
the stopped CPUs and those functions lack notrace annotations.  This
leads to crashes when enabling/disabling ftrace on ARM kernels built
with the Thumb-2 instruction set.

Fix it by adding the necessary notrace annotations.
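
The shape of the fix, using one of the annotated helpers as an example
(as in kernel/watchdog.c):

  notrace void touch_softlockup_watchdog(void)
  {
  	touch_softlockup_watchdog_sched();
  	wq_watchdog_touch(raw_smp_processor_id());
  }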

Fixes: ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: oleg@redhat.com
Cc: tj@kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180821152507.18313-1-vincent.whitchurch@axis.com
2018-08-30 12:56:40 +02:00
Linus Torvalds 9022ada8ab Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Over the lockdep cross-release churn, workqueue lost some of the
  existing annotations. Johannes Berg restored and also improved
  them"

* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: re-add lockdep dependencies for flushing
  workqueue: skip lockdep wq dependency in cancel_work_sync()
2018-08-24 13:16:36 -07:00
Johannes Berg 87915adc3f workqueue: re-add lockdep dependencies for flushing
In flush_work(), we need to create a lockdep dependency so that
the following scenario is appropriately tagged as a problem:

  work_function()
  {
    mutex_lock(&mutex);
    ...
  }

  other_function()
  {
    mutex_lock(&mutex);
    flush_work(&work); // or cancel_work_sync(&work);
  }

This is a problem since the work might be running and be blocked
on trying to acquire the mutex.

Similarly, in flush_workqueue().
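
Concretely, the dependency is recorded with a dummy acquire/release
pair on the work's lockdep map (condensed; flush_workqueue() does the
same with the workqueue's own map):

  	lock_map_acquire(&work->lockdep_map);
  	lock_map_release(&work->lockdep_map);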

These annotations were removed after cross-release partially caught
such problems, but cross-release has since been reverted. IMHO the
removal was erroneous regardless, since lockdep should be able to
catch potential problems, not just actual ones, and cross-release
would only have caught the problem when wait_for_completion() was
actually invoked.

Fixes: fd1a5b04df ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 08:31:38 -07:00
Johannes Berg d6e89786be workqueue: skip lockdep wq dependency in cancel_work_sync()
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
 * the work isn't running, just cancelled before it started
 * the work is running, but then nothing else can be on the
   workqueue before it

Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.

  work1_function() { mutex_lock(&mutex); ... }
  work2_function() { /* nothing */ }

  other_function() {
    queue_work(ordered_wq, &work1);
    queue_work(ordered_wq, &work2);
    mutex_lock(&mutex);
    cancel_work_sync(&work2);
  }

As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
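
Condensed sketch of the mechanism: flushes initiated by
cancel_work_sync() carry a from_cancel flag and skip the
workqueue-level dependency. The helper name below is illustrative; the
real check sits in start_flush_work():

  static void wq_record_flush_dep(struct workqueue_struct *wq, bool from_cancel)
  {
  	if (from_cancel)
  		return;		/* cancel_work_sync(): no wq-level dependency */
  	lock_map_acquire(&wq->lockdep_map);
  	lock_map_release(&wq->lockdep_map);
  }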

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 08:31:37 -07:00
Kees Cook 6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a, b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kcalloc(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
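
Illustrative effect on a (hypothetical) call site:

  -	items = kzalloc(count * sizeof(struct item), GFP_KERNEL);
  +	items = kcalloc(count, sizeof(struct item), GFP_KERNEL);

The kcalloc() form checks the multiplication for overflow instead of
silently wrapping around.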

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds 5f85942c2e SCSI misc on 20180610
This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
 zfcp, hisi_sas, cxlflash, qla2xxx.  In the absence of Nic, we're also
 taking target updates which are mostly minor except for the tcmu
 refactor. The only real core change to worry about is the removal of
 high page bouncing (in sas, storvsc and iscsi).  This has been well
 tested and no problems have shown up so far.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWx1pbCYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishUucAP42pccS
 ziKyiOizuxv9fZ4Q+nXd1A9zhI5tqqpkHjcQegEA40qiZSi3EKGKR8W0UpX7Ntmo
 tqrZJGojx9lnrAM2RbQ=
 =NMXg
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
  zfcp, hisi_sas, cxlflash, qla2xxx.

  In the absence of Nic, we're also taking target updates which are
  mostly minor except for the tcmu refactor.

  The only real core change to worry about is the removal of high page
  bouncing (in sas, storvsc and iscsi). This has been well tested and no
  problems have shown up so far"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (268 commits)
  scsi: lpfc: update driver version to 12.0.0.4
  scsi: lpfc: Fix port initialization failure.
  scsi: lpfc: Fix 16gb hbas failing cq create.
  scsi: lpfc: Fix crash in blk_mq layer when executing modprobe -r lpfc
  scsi: lpfc: correct oversubscription of nvme io requests for an adapter
  scsi: lpfc: Fix MDS diagnostics failure (Rx < Tx)
  scsi: hisi_sas: Mark PHY as in reset for nexus reset
  scsi: hisi_sas: Fix return value when get_free_slot() failed
  scsi: hisi_sas: Terminate STP reject quickly for v2 hw
  scsi: hisi_sas: Add v2 hw force PHY function for internal ATA command
  scsi: hisi_sas: Include TMF elements in struct hisi_sas_slot
  scsi: hisi_sas: Try wait commands before controller reset
  scsi: hisi_sas: Init disks after controller reset
  scsi: hisi_sas: Create a scsi_host_template per HW module
  scsi: hisi_sas: Reset disks when discovered
  scsi: hisi_sas: Add LED feature for v3 hw
  scsi: hisi_sas: Change common allocation mode of device id
  scsi: hisi_sas: change slot index allocation mode
  scsi: hisi_sas: Introduce hisi_sas_phy_set_linkrate()
  scsi: hisi_sas: fix a typo in hisi_sas_task_prep()
  ...
2018-06-10 13:01:12 -07:00
Linus Torvalds 2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
 8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
 2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
 jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
 YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
 u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
 m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
 bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
 jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
 S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
 3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
 KCDVLYPxwQQqK1Mqig==
 =/3L8
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"
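
For example (helpers from <linux/overflow.h>; struct foo and its
flexible array member elems are hypothetical):

  	size_t bytes;

  	if (check_mul_overflow(n, sizeof(struct foo), &bytes))
  		return -EOVERFLOW;	/* n * sizeof(struct foo) would wrap */

  	/* struct_size() saturates to SIZE_MAX, making the allocation fail */
  	p = kmalloc(struct_size(p, elems, n), GFP_KERNEL);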

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00