Commit Graph

145 Commits

Phil Auld 62c982d739 sched/debug: Change need_resched warnings to pr_err
JIRA: https://issues.redhat.com/browse/RHEL-78821

commit 8061b9f5e111a3012f8b691e5b75dd81eafbb793
Author: David Rientjes <rientjes@google.com>
Date:   Thu Jan 9 16:24:33 2025 -0800

    sched/debug: Change need_resched warnings to pr_err

    need_resched warnings, if enabled, are treated as WARNINGs.  If
    kernel.panic_on_warn is enabled, then this causes a kernel panic.

    It's highly unlikely that a panic is desired for these warnings; only a
    stack trace is normally required to debug and resolve the issue.

    Thus, switch need_resched warnings to simply be a printk with an
    associated stack trace so they are no longer in scope for panic_on_warn.

    Signed-off-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
    Acked-by: Josh Don <joshdon@google.com>
    Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2025-02-27 15:13:11 +00:00
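
A minimal sketch of the change described in the commit above: a rate-limited
pr_err() plus an explicit stack dump instead of a WARN, so panic_on_warn is not
triggered. The function shape and the ticks_without_resched field follow the
existing need_resched warning path, but the details here are illustrative:

    #include <linux/printk.h>
    #include <linux/ratelimit.h>

    static void resched_latency_warn(int cpu, u64 latency)
    {
            static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);

            if (likely(!__ratelimit(&latency_check_ratelimit)))
                    return;

            /* pr_err() + dump_stack() instead of WARN(): no panic_on_warn */
            pr_err("sched: CPU %d need_resched set for > %llu ns (%d ticks) without schedule\n",
                   cpu, latency, cpu_rq(cpu)->ticks_without_resched);
            dump_stack();
    }
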
Phil Auld 3a06ac6575 sched/debug: Dump domains' level
JIRA: https://issues.redhat.com/browse/RHEL-48226

commit 287372fa39f579a61e17b000aa74c8418d230528
Author: Vitalii Bursov <vitaly@bursov.com>
Date:   Tue Apr 30 18:05:24 2024 +0300

    sched/debug: Dump domains' level

    Knowing a domain's level exactly can be useful when setting
    relax_domain_level or cpuset.sched_relax_domain_level.

    Usage:

      cat /debug/sched/domains/cpu0/domain1/level

    to dump cpu0 domain1's level.

    The SDM macro is not used because sd->level is 'int' and
    the macro would hide the type mismatch between 'int' and 'u32'.

    Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-07-15 11:12:47 -04:00
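
A minimal sketch (names assumed) of exposing the domain level as a read-only
debugfs file as described above; since sd->level is an 'int', the cast to
'u32 *' is written out explicitly rather than hidden behind the SDM() helper:

    #include <linux/debugfs.h>

    static void sd_debugfs_add_level(struct dentry *parent, struct sched_domain *sd)
    {
            /* read-only file: cat .../domainN/level */
            debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
    }
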
Phil Auld 19156dbd93 sched/fair: Simplify util_est
JIRA: https://issues.redhat.com/browse/RHEL-25535
Conflicts: Hunk in include/linux/sched.h applied by hand
 due to kABI changes.

commit 11137d384996bb05cf33c8163db271e1bac3f4bf
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date:   Fri Dec 1 17:16:52 2023 +0100

    sched/fair: Simplify util_est

    With UTIL_EST_FASTUP now being permanent, we can take advantage of the
    fact that the ewma jumps directly to a higher utilization at dequeue to
    simplify util_est and remove the enqueued field.

    Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
    Reviewed-by: Alex Shi <alexs@kernel.org>
    Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 15:47:17 -04:00
Phil Auld 660107a034 sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 5fe7765997b139e2d922b58359dea181efe618f9
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Sep 28 17:02:51 2023 +0200

    sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded

    dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has
    nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory
    includes a dl_rq's current task. This means a dl_rq can have a migratable
    current, N non-migratable queued tasks, and be flagged as overloaded and have
    its CPU set in the dlo_mask, despite having an empty pushable_tasks tree.

    Make a dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(),
    in other words make DL RQs only be flagged as overloaded if they have at
    least one runnable-but-not-current migratable task.

     o push_dl_task() is unaffected, as it is a no-op if there are no pushable
       tasks.

     o pull_dl_task() now no longer scans runqueues whose sole migratable task is
       their current one, which it can't do anything about anyway.
       It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its
       current task is migratable.

    Since dl_rq->dl_nr_migratory becomes unused, remove it.

    RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped
    in favour of relying on rt_rq->pushable_tasks, see:

      612f769edd06 ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")

    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Juri Lelli <juri.lelli@redhat.com>
    Link: https://lore.kernel.org/r/20230928150251.463109-1-vschneid@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
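
A sketch of the approach described above, with the insert/erase details elided;
the field and helper names follow the existing deadline code, but the bodies
here are illustrative: the overload flag is set when the first pushable task is
added to the pushable_dl_tasks tree and cleared when the tree becomes empty,
instead of being derived from a separate dl_nr_migratory counter.

    static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
    {
            bool was_empty = RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root.rb_root);

            /* ... insert p into rq->dl.pushable_dl_tasks_root ... */

            if (was_empty)
                    dl_set_overload(rq);    /* first runnable-but-not-current migratable task */
    }

    static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
    {
            /* ... erase p from rq->dl.pushable_dl_tasks_root ... */

            if (RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root.rb_root))
                    dl_clear_overload(rq);  /* no pushable tasks left on this rq */
    }
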
Phil Auld 8883ff7c00 sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
JIRA: https://issues.redhat.com/browse/RHEL-25535

commit 612f769edd06a6e42f7cd72425488e68ddaeef0a
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Fri Aug 11 12:20:44 2023 +0100

    sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

    Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
    that has an empty pushable_tasks list, which means nothing useful will be
    done in the IPI other than queue the work for the next CPU on the rto_mask.

    rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
    but the conditions for that irq_work to be queued (and for a CPU to be
    added to the rto_mask) rely on rq_rt->nr_migratory instead.

    nr_migratory is increased whenever an RT task entity is enqueued and it has
    nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a
    rt_rq's current task. This means a rt_rq can have a migratable current, N
    non-migratable queued tasks, and be flagged as overloaded / have its CPU
    set in the rto_mask, despite having an empty pushable_tasks list.

    Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
    Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.

    Note that the case where the current task is pushed away to make way for a
    migration-disabled task remains unchanged: the migration-disabled task has
    to be in the pushable_tasks list in the first place, which means it has
    nr_cpus_allowed > 1.

    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2024-04-08 09:40:56 -04:00
Phil Auld c079632f22 sched/debug: Dump domains' sched group flags
JIRA: https://issues.redhat.com/browse/RHEL-17497

commit ed74cc4995d314ea6cbf406caf978c442f451fa5
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 7 15:57:05 2023 -0700

    sched/debug: Dump domains' sched group flags

    There has been a case where the SD_SHARE_CPUCAPACITY sched group flag
    in a parent domain was not set and propagated properly when a degenerate
    domain was removed.

    Add a dump of a CPU's domain sched group flags to make debugging easier
    in the future.

    Usage:
    cat /debug/sched/domains/cpu0/domain1/groups_flags
    to dump cpu0 domain1's sched group flags.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-12-06 09:31:10 -05:00
Phil Auld 3363c6964a sched/debug: Correct printing for rq->nr_uninterruptible
JIRA: https://issues.redhat.com/browse/RHEL-1536

commit a6fcdd8d95f7486150b3faadfea119fc3dfc3b74
Author: 晏艳(采苓) <yanyan.yan@antgroup.com>
Date:   Sat May 6 15:42:53 2023 +0800

    sched/debug: Correct printing for rq->nr_uninterruptible

    Commit e6fe3f422b ("sched: Make multiple runqueue task counters
    32-bit") changed the type for rq->nr_uninterruptible from "unsigned
    long" to "unsigned int", but left wrong cast print to
    /sys/kernel/debug/sched/debug and to the console.

    For example, nr_uninterruptible's value is fffffff7 with type
    "unsigned int", (long)nr_uninterruptible shows 4294967287 while
    (int)nr_uninterruptible prints -9. So using an int cast fixes the wrong
    printing.

    Signed-off-by: Yan Yan <yanyan.yan@antgroup.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-09-07 14:26:07 -04:00
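
A standalone snippet showing the cast difference described above for the
unsigned int value 0xfffffff7, on a typical 64-bit system:

    #include <stdio.h>

    int main(void)
    {
            unsigned int nr_uninterruptible = 0xfffffff7;

            /* (long) zero-extends the 32-bit value: prints 4294967287 */
            printf("%ld\n", (long)nr_uninterruptible);

            /* (int) reinterprets the same bits as signed: prints -9 */
            printf("%d\n", (int)nr_uninterruptible);

            return 0;
    }
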
Phil Auld 57b680d4b2 sched/debug: Put sched/domains files under the verbose flag
Bugzilla: https://bugzilla.redhat.com/2053117

commit 34320745dfc9b71234139720bc8b602cdc8c086a
Author: Phil Auld <pauld@redhat.com>
Date:   Fri Mar 3 13:37:54 2023 -0500

    sched/debug: Put sched/domains files under the verbose flag

    The debug files under sched/domains can take a long time to regenerate,
    especially when updates are done one at a time. Move these files under
    the sched verbose debug flag. Allow changes to verbose to trigger
    generation of the files. This lets a user batch the updates but still
    have the information available.  The detailed topology printk messages
    are also under verbose.

    Discussion that led to this approach can be found in the link below.

    Simplified code to maintain use of debugfs bool routines suggested by
    Michael Ellerman <mpe@ellerman.id.au>.

    Signed-off-by: Phil Auld <pauld@redhat.com>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Tested-by: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Cc: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/
    Link: https://lore.kernel.org/r/20230303183754.3076321-1-pauld@redhat.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2023-05-04 15:47:19 -04:00
Chris von Recklinghausen b9ef7d0aa3 memory tiering: hot page selection with hint page fault latency
Bugzilla: https://bugzilla.redhat.com/2160210

commit 33024536bafd9129f1d16ade0974671c648700ac
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Jul 13 16:39:51 2022 +0800

    memory tiering: hot page selection with hint page fault latency

    Patch series "memory tiering: hot page selection", v4.

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory nodes need to be identified.
    Essentially, the original NUMA balancing implementation selects the most
    recently accessed (MRU) pages to promote.  But this isn't a perfect
    algorithm to identify the hot pages, because pages with quite low access
    frequency may eventually be accessed, given that the NUMA balancing page
    table scanning period can be quite long (e.g.  60 seconds).  So in this
    patchset, we implement a new hot page identification algorithm based on
    the latency between NUMA balancing page table scanning and the hint page
    fault, which is a kind of most frequently accessed (MFU) algorithm.

    In NUMA balancing memory tiering mode, if there are hot pages in slow
    memory node and cold pages in fast memory node, we need to promote/demote
    hot/cold pages between the fast and cold memory nodes.

    A choice is to promote/demote as fast as possible.  But the CPU cycles and
    memory bandwidth consumed by the high promoting/demoting throughput will
    hurt the latency of some workloads because of inflated access latency and
    slow memory bandwidth contention.

    A way to resolve this issue is to restrict the max promoting/demoting
    throughput.  It will take longer to finish the promoting/demoting.  But
    the workload latency will be better.  This is implemented in this patchset
    as the page promotion rate limit mechanism.

    The promotion hot threshold is workload and system configuration
    dependent.  So in this patchset, a method to adjust the hot threshold
    automatically is implemented.  The basic idea is to control the number of
    the candidate promotion pages to match the promotion rate limit.

    We used the pmbench memory accessing benchmark to test the patchset on a
    2-socket server system with DRAM and PMEM installed.  The test results are
    as follows,

                    pmbench score           promote rate
                     (accesses/s)                   MB/s
                    -------------           ------------
    base              146887704.1                  725.6
    hot selection     165695601.2                  544.0
    rate limit        162814569.8                  165.2
    auto adjustment   170495294.0                  136.9

    From the results above,

    With hot page selection patch [1/3], the pmbench score increases about
    12.8%, and promote rate (overhead) decreases about 25.0%, compared with
    base kernel.

    With rate limit patch [2/3], pmbench score decreases about 1.7%, and
    promote rate decreases about 69.6%, compared with hot page selection
    patch.

    With threshold auto adjustment patch [3/3], pmbench score increases about
    4.7%, and promote rate decreases about 17.1%, compared with rate limit
    patch.

    Baolin helped to test the patchset with MySQL on a machine which contains
    1 DRAM node (30G) and 1 PMEM node (126G).

    sysbench /usr/share/sysbench/oltp_read_write.lua \
    ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120

    The tps can be improved by about 5%.

    This patch (of 3):

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory node need to be identified.  Essentially,
    the original NUMA balancing implementation selects the most recently
    accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
    identify the hot pages, because pages with quite low access frequency
    may eventually be accessed, given that the NUMA balancing page table
    scanning period can be quite long (e.g.  60 seconds).  The most frequently
    accessed (MFU) algorithm is better.

    So, in this patch we implemented a better hot page selection algorithm,
    which is based on NUMA balancing page table scanning and the hint page
    fault, as follows,

    - When the page tables of the processes are scanned to change PTE/PMD
      to be PROT_NONE, the current time is recorded in struct page as scan
      time.

    - When the page is accessed, a hint page fault will occur.  The scan
      time is read from the struct page, and the hint page fault
      latency is defined as

        hint page fault time - scan time

    The shorter the hint page fault latency of a page is, the more likely its
    access frequency is high.  So the hint page fault latency is a better
    estimation of whether a page is hot or cold.

    It's hard to find some extra space in struct page to hold the scan time.
    Fortunately, we can reuse some bits used by the original NUMA balancing.

    NUMA balancing uses some bits in struct page to store the page accessing
    CPU and PID (referring to page_cpupid_xchg_last()), which are used by the
    multi-stage node selection algorithm to avoid migrating pages that are
    shared across NUMA nodes back and forth.  But for pages in the slow
    memory node, even if they are shared by multiple NUMA nodes, as
    long as the pages are hot, they need to be promoted to the fast memory
    node.  So the accessing CPU and PID information is unnecessary for the
    slow memory pages.  We can reuse these bits in struct page to record the
    scan time.  For the fast memory pages, these bits are used as before.

    For the hot threshold, the default value is 1 second, which works well in
    our performance test.  All pages with hint page fault latency < hot
    threshold will be considered hot.

    It's hard for users to determine the hot threshold.  So we don't provide a
    kernel ABI to set it, just provide a debugfs interface for advanced users
    to experiment.  We will continue to work on a hot threshold automatic
    adjustment mechanism.

    The downside of the above method is that the response time to workload
    hot spot changes may be much longer.  For example,

    - A previous cold memory area becomes hot

    - The hint page fault will be triggered.  But the hint page fault
      latency isn't shorter than the hot threshold.  So the pages will
      not be promoted.

    - When the memory area is scanned again, maybe after a scan period,
      the hint page fault latency measured will be shorter than the hot
      threshold and the pages will be promoted.

    To mitigate this, if there is enough free space in the fast memory node,
    the hot threshold will not be used and all pages will be promoted upon the
    hint page fault for a fast response.

    Thanks to Zhong Jiang, who reported and tested the fix for a bug when
    disabling memory tiering mode dynamically.

    Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: osalvador <osalvador@suse.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
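
A minimal sketch of the hot-page test described in the commit above; the helper
name and parameters are hypothetical, only the latency definition and the
threshold comparison come from the description:

    /* hint page fault latency = hint page fault time - scan time */
    static bool numa_page_is_hot(unsigned long scan_time,
                                 unsigned long fault_time,
                                 unsigned long hot_threshold)
    {
            unsigned long latency = fault_time - scan_time;

            /* shorter latency => accessed soon after scanning => hotter page */
            return latency < hot_threshold;   /* default threshold: 1 second */
    }
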
Phil Auld 7c4f5b3384 sched/debug: fix dentry leak in update_sched_domain_debugfs
Bugzilla: https://bugzilla.redhat.com/2115520

commit c2e406596571659451f4b95e37ddfd5a8ef1d0dc
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Fri Sep 2 14:31:07 2022 +0200

    sched/debug: fix dentry leak in update_sched_domain_debugfs

    Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup())
    leaks a dentry and with a hotplug stress test, the machine eventually
    runs out of memory.

    Fix this up by using the newly created debugfs_lookup_and_remove() call
    instead which properly handles the dentry reference counting logic.

    Cc: Major Chen <major.chen@samsung.com>
    Cc: stable <stable@kernel.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: Valentin Schneider <vschneid@redhat.com>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
    Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-11-04 13:14:41 -04:00
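
The leaky pattern versus the fix described above, as a minimal sketch (the
directory name and parent dentry are placeholders):

    #include <linux/debugfs.h>

    static void remove_cpu_dir(struct dentry *parent, const char *name)
    {
            /*
             * Before: debugfs_lookup() takes a dentry reference that is never
             * dropped, so each hotplug cycle leaks a dentry:
             *
             *     debugfs_remove(debugfs_lookup(name, parent));
             */

            /* After: lookup and removal in one call, reference handled internally */
            debugfs_lookup_and_remove(name, parent);
    }
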
Phil Auld deee3a961c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there
Bugzilla: http://bugzilla.redhat.com/2069275

commit 801c141955108fb7cf1244dda76e6de8b16fd3ae
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Feb 22 13:23:24 2022 +0100

    sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there

    Collect all utility functionality source code files into a single kernel/sched/build_utility.c file,
    via #include-ing the .c files:

        kernel/sched/clock.c
        kernel/sched/completion.c
        kernel/sched/loadavg.c
        kernel/sched/swait.c
        kernel/sched/wait_bit.c
        kernel/sched/wait.c

    CONFIG_CPU_FREQ:
        kernel/sched/cpufreq.c

    CONFIG_CPU_FREQ_GOV_SCHEDUTIL:
        kernel/sched/cpufreq_schedutil.c

    CONFIG_CGROUP_CPUACCT:
        kernel/sched/cpuacct.c

    CONFIG_SCHED_DEBUG:
        kernel/sched/debug.c

    CONFIG_SCHEDSTATS:
        kernel/sched/stats.c

    CONFIG_SMP:
       kernel/sched/cpupri.c
       kernel/sched/stop_task.c
       kernel/sched/topology.c

    CONFIG_SCHED_CORE:
       kernel/sched/core_sched.c

    CONFIG_PSI:
       kernel/sched/psi.c

    CONFIG_MEMBARRIER:
       kernel/sched/membarrier.c

    CONFIG_CPU_ISOLATION:
       kernel/sched/isolation.c

    CONFIG_SCHED_AUTOGROUP:
       kernel/sched/autogroup.c

    The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into
    a single build unit.

    The build time of build_utility.c also roughly matches the build time of core.c and
    fair.c - allowing better load-balancing of scheduler-only rebuilds.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-04-11 17:38:21 -04:00
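
A trimmed sketch of the build_utility.c layout the commit describes: the
utility .c files are pulled into one translation unit, with the optional ones
gated by their Kconfig symbols (only a subset of the list is shown):

    /* kernel/sched/build_utility.c (abridged sketch) */
    #include "clock.c"
    #include "completion.c"
    #include "loadavg.c"
    #include "swait.c"
    #include "wait_bit.c"
    #include "wait.c"

    #ifdef CONFIG_CPU_FREQ
    # include "cpufreq.c"
    #endif

    #ifdef CONFIG_SCHED_DEBUG
    # include "debug.c"
    #endif

    #ifdef CONFIG_SMP
    # include "cpupri.c"
    # include "stop_task.c"
    # include "topology.c"
    #endif
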
Phil Auld a074ad4a40 sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa
Bugzilla: http://bugzilla.redhat.com/2062831

commit 28c988c3ec29db74a1dda631b18785958d57df4f
Author: Bharata B Rao <bharata@amd.com>
Date:   Tue Jan 18 10:35:15 2022 +0530

    sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa

    The older format of /proc/pid/sched printed home node info, which
    required the mempolicy and task lock around mpol_get(). However,
    the format has changed since then and there is no longer any need for
    sched_show_numa() to have a mempolicy argument and the
    associated mpol_get/put and task_lock/unlock. Remove them.

    Fixes: 397f2378f1 ("sched/numa: Fix numa balancing stats in /proc/pid/sched")
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:38 -04:00
Phil Auld b5be4938b7 sched/core: Forced idle accounting
Bugzilla: http://bugzilla.redhat.com/2062831
Conflicts: fuzz due to kabi padding in struct rq

commit 4feee7d12603deca8775f9f9ae5e121093837444
Author: Josh Don <joshdon@google.com>
Date:   Mon Oct 18 13:34:28 2021 -0700

    sched/core: Forced idle accounting

    Adds accounting for "forced idle" time, which is time where a cookie'd
    task forces its SMT sibling to idle, despite the presence of runnable
    tasks.

    Forced idle time is one means to measure the cost of enabling core
    scheduling (ie. the capacity lost due to the need to force idle).

    Forced idle time is attributed to the thread responsible for causing
    the forced idle.

    A few details:
     - Forced idle time is displayed via /proc/PID/sched. It also requires
       that schedstats is enabled.
     - Forced idle is only accounted when a sibling hyperthread is held
       idle despite the presence of runnable tasks. No time is charged if
       a sibling is idle but has no runnable tasks.
     - Tasks with 0 cookie are never charged forced idle.
     - For SMT > 2, we scale the amount of forced idle charged based on the
       number of forced idle siblings. Additionally, we split the time up and
       evenly charge it to all running tasks, as each is equally responsible
       for the forced idle.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-03-28 09:28:35 -04:00
Phil Auld 464d8dbf61 sched: Fix DEBUG && !SCHEDSTATS warn
Bugzilla: http://bugzilla.redhat.com/2020279

commit 769fdf83df57b373660343ef4270b3ada91ef434
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Oct 6 10:12:05 2021 +0200

    sched: Fix DEBUG && !SCHEDSTATS warn

    When !SCHEDSTATS, schedstat_enabled() is an unconditional 0 and the
    whole block doesn't exist; however, GCC figures the scoped variable
    'stats' is unused and complains about it.

    Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable
    by writing it in two statements. This fixes the build because the new
    warning is only enabled at W=1.

    Given that whole if(0) {} thing, I don't feel motivated to change
    things overly much and quite strongly feel this is the compiler being
    daft.

    Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class")
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:50 -05:00
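
A fragment illustrating the two-statement trick described above; the helper
name follows the surrounding schedstats series, but the exact context is
assumed:

    /* Before: -Wunused-variable fires when the schedstat block compiles away */
    /*     struct sched_statistics *stats = __schedstats_from_se(se);         */

    /* After: declaration and assignment split into two statements, which only
     * triggers -Wunused-but-set-variable, and that warning is W=1-only. */
    struct sched_statistics *stats;

    stats = __schedstats_from_se(se);
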
Phil Auld 77753d58ed sched: Introduce task block time in schedstats
Bugzilla: http://bugzilla.redhat.com/2020279

commit 847fc0cd0664fcb2a08ac66df6b85935361ec454
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:43 2021 +0000

    sched: Introduce task block time in schedstats

    Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
    there's no metric to show how long the task is in D state.  Once a task is
    in D state, it means the task is blocked in the kernel; for example, the
    task may be waiting for a mutex. The D state is more frequent than
    iowait, and it is more critical than S state. So it is worth adding a
    metric to measure it.

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld fbc84644bc sched: Make struct sched_statistics independent of fair sched class
Bugzilla: http://bugzilla.redhat.com/2020279

commit ceeadb83aea28372e54857bf88ab7e17af48ab7b
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sun Sep 5 14:35:41 2021 +0000

    sched: Make struct sched_statistics independent of fair sched class

    If we want to use the schedstats facility to trace other sched classes, we
    should make it independent of fair sched class. The struct sched_statistics
    is the scheduler statistics of a task_struct or a task_group. So we can
    move it into struct task_struct and struct task_group to achieve the goal.

    After the patch, schedstats are organized as follows,

        struct task_struct {
           ...
           struct sched_entity se;
           struct sched_rt_entity rt;
           struct sched_dl_entity dl;
           ...
           struct sched_statistics stats;
           ...
       };

    Regarding the task group, schedstats is only supported for fair group
    sched, and a new struct sched_entity_stats is introduced, suggested by
    Peter -

        struct sched_entity_stats {
            struct sched_entity     se;
            struct sched_statistics stats;
        } __no_randomize_layout;

    Then with the se in a task_group, we can easily get the stats.

    The sched_statistics members may be frequently modified when schedstats is
    enabled. In order to avoid impacting random data which may be in the same
    cacheline with them, the struct sched_statistics is defined as cacheline
    aligned.

    As this patch changes a core struct of the scheduler, I verified the
    performance impact it may have on the scheduler with 'perf bench sched
    pipe', as suggested by Mel. Below is the result, in which all the values
    are in usecs/op.
                                      Before               After
          kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
          kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
    [These data differ a little from the earlier version because my old test
     machine was destroyed, so I had to use a new, different test machine.]

    Almost no impact on the sched performance.

    No functional change.

    [lkp@intel.com: reported build failure in earlier version]

    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:46 -05:00
Phil Auld c01a712dff sched: reduce sched slice for SCHED_IDLE entities
Bugzilla: http://bugzilla.redhat.com/2020279

commit 51ce83ed523b00d58f2937ec014b12daaad55185
Author: Josh Don <joshdon@google.com>
Date:   Thu Aug 19 18:04:02 2021 -0700

    sched: reduce sched slice for SCHED_IDLE entities

    Use a small, non-scaled min granularity for SCHED_IDLE entities, when
    competing with normal entities. This reduces the latency of getting
    a normal entity back on cpu, at the expense of increased context
    switch frequency of SCHED_IDLE entities.

    The benefit of this change is to reduce the round-robin latency for
    normal entities when competing with a SCHED_IDLE entity.

    Example: on a machine with HZ=1000, spawned two threads, one of which is
    SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
    thread runs for 4ms then waits for 1.4s. With this patch, it runs for
    1ms and waits 340ms (as it round-robins with the other thread).

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:45 -05:00
Phil Auld b6e5728c3c sched: Account number of SCHED_IDLE entities on each cfs_rq
Bugzilla: http://bugzilla.redhat.com/2020279

commit a480addecc0d89c200ec0b41da62ae8ceddca8d7
Author: Josh Don <joshdon@google.com>
Date:   Thu Aug 19 18:04:01 2021 -0700

    sched: Account number of SCHED_IDLE entities on each cfs_rq

    Adds cfs_rq->idle_nr_running, which accounts the number of idle entities
    directly enqueued on the cfs_rq.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-12-13 16:07:45 -05:00
Phil Auld fb6b70bc1b sched/fair: Null terminate buffer when updating tunable_scaling
Bugzilla: http://bugzilla.redhat.com/1992256

commit 703066188f63d66cc6b9d678e5b5ef1213c5938e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Mon Sep 27 12:46:35 2021 +0100

    sched/fair: Null terminate buffer when updating tunable_scaling

    This patch null-terminates the temporary buffer in sched_scaling_write()
    so that kstrtouint() does not fail, and checks that the value is valid.

    Before:
      $ cat /sys/kernel/debug/sched/tunable_scaling
      1
      $ echo 0 > /sys/kernel/debug/sched/tunable_scaling
      -bash: echo: write error: Invalid argument
      $ cat /sys/kernel/debug/sched/tunable_scaling
      1

    After:
      $ cat /sys/kernel/debug/sched/tunable_scaling
      1
      $ echo 0 > /sys/kernel/debug/sched/tunable_scaling
      $ cat /sys/kernel/debug/sched/tunable_scaling
      0
      $ echo 3 > /sys/kernel/debug/sched/tunable_scaling
      -bash: echo: write error: Invalid argument

    Fixes: 8a99b6833c ("sched: Move SCHED_DEBUG sysctl to debugfs")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-02 07:53:42 -04:00
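
A sketch of the pattern described above, assuming a debugfs write handler of
the usual shape; the handler and symbol names follow the upstream file, but the
details here are illustrative:

    static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
                                       size_t cnt, loff_t *ppos)
    {
            char buf[16];
            unsigned int scaling;

            if (cnt > 15)
                    cnt = 15;

            if (copy_from_user(&buf, ubuf, cnt))
                    return -EFAULT;
            buf[cnt] = '\0';                    /* missing terminator broke kstrtouint() */

            if (kstrtouint(buf, 10, &scaling))
                    return -EINVAL;

            if (scaling >= SCHED_TUNABLESCALING_END)    /* reject out-of-range values */
                    return -EINVAL;

            sysctl_sched_tunable_scaling = scaling;

            *ppos += cnt;
            return cnt;
    }
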
Phil Auld cc1e56f2eb sched: Cgroup SCHED_IDLE support
Bugzilla: http://bugzilla.redhat.com/1992256

commit 304000390f88d049c85e9a0958ac5567f38816ee
Author: Josh Don <joshdon@google.com>
Date:   Thu Jul 29 19:00:18 2021 -0700

    sched: Cgroup SCHED_IDLE support

    This extends SCHED_IDLE to cgroups.

    Interface: cgroup/cpu.idle.
     0: default behavior
     1: SCHED_IDLE

    Extending SCHED_IDLE to cgroups means that we incorporate the existing
    aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
    descendant threads towards the idle_h_nr_running count of all of its
    ancestor cgroups. Thus, sched_idle_rq() will work properly.
    Additionally, SCHED_IDLE cgroups are configured with minimum weight.

    There are two key differences between the per-task and per-cgroup
    SCHED_IDLE interface:

      - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
        maintain their relative weights. The entity that is "idle" is the
        cgroup, not the tasks themselves.

      - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
        decision is not made by comparing the current task with the woken
        task, but rather by comparing their matching sched_entity.

    A typical use-case for this is a user that creates an idle and a
    non-idle subtree. The non-idle subtree will dominate competition vs
    the idle subtree, but the idle subtree will still be high priority vs
    other users on the system. The latter is accomplished via comparing
    matching sched_entities in the wakeup preemption path (this could also be
    improved by making the sched_idle_rq() decision dependent on the
    perspective of a specific task).

    For now, we maintain the existing SCHED_IDLE semantics. Future patches
    may make improvements that extend how we treat SCHED_IDLE entities.

    The per-task_group idle field is an integer that currently only holds
    either a 0 or a 1. This is explicitly typed as an integer to allow for
    further extensions to this API. For example, a negative value may
    indicate a highly latency-sensitive cgroup that should be preferred
    for preemption/placement/etc.

    Signed-off-by: Josh Don <joshdon@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-01 18:14:46 -04:00
Phil Auld e47f9c4e00 sched/debug: Don't update sched_domain debug directories before sched_debug_init()
Bugzilla: http://bugzilla.redhat.com/1992256

commit 459b09b5a3254008b63382bf41a9b36d0b590f57
Author: Valentin Schneider <valentin.schneider@arm.com>
Date:   Tue May 18 14:07:25 2021 +0100

    sched/debug: Don't update sched_domain debug directories before sched_debug_init()

    Since CPU capacity asymmetry can stem purely from maximum frequency
    differences (e.g. Pixel 1), a rebuild of the scheduler topology can be
    issued upon loading cpufreq, see:

      arch_topology.c::init_cpu_capacity_callback()

    Turns out that if this rebuild happens *before* sched_debug_init() is
    run (which is a late initcall), we end up messing up the sched_domain debug
    directory: passing a NULL parent to debugfs_create_dir() ends up creating
    the directory at the debugfs root, which in this case creates
    /sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains).

    This currently doesn't happen on asymmetric systems which use cpufreq-scpi
    or cpufreq-dt drivers, as those are loaded via
    deferred_probe_initcall() (it is also a late initcall, but appears to be
    ordered *after* sched_debug_init()).

    Ionela has been working on detecting maximum frequency asymmetry via ACPI,
    and that actually happens via a *device* initcall, thus before
    sched_debug_init(), and causes the aforementioned debugfs mayhem.

    One option would be to punt sched_debug_init() down to
    fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running
    before sched_debug_init() appears to be the safer option.

    Fixes: 3b87f136f8 ("sched,debug: Convert sysctl sched_domains to debugfs")
    Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com

Signed-off-by: Phil Auld <pauld@redhat.com>
2021-11-01 18:14:45 -04:00
Ingo Molnar b2c0931a07 Merge branch 'sched/urgent' into sched/core, to resolve conflicts
This commit in sched/urgent moved the cfs_rq_is_decayed() function:

  a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

and this fresh commit in sched/core modified it in the old location:

  9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

Merge the two variants.

Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-18 11:31:25 +02:00
Dietmar Eggemann 68d7a19068 sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
unnecessary util_est updates uses the LSB of util_est.enqueued. It is
exposed via _task_util_est() (and task_util_est()).

Commit 92a801e5d5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
mentions that the LSB is lost for util_est resolution but
find_energy_efficient_cpu() checks if task_util_est() returns 0 to
return prev_cpu early.

_task_util_est() returns the max value of util_est.ewma and
util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
So task_util_est() returning the max of task_util() and
_task_util_est() will never return 0 under the default
SCHED_FEAT(UTIL_EST, true).

To fix this use the MSB of util_est.enqueued instead and keep the flag
util_est internal, i.e. don't export it via _task_util_est().

The maximal possible util_avg value for a task is 1024 so the MSB of
'unsigned int util_est.enqueued' isn't used to store a util value.

As a caveat the code behind the util_est_se trace point has to filter
UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
be easy to do.

This also fixes an issue reported by Xuewen Yan that util_est_update()
only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:

  last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)

Fixes: b89997aa88 ("sched/pelt: Fix task util_est update filtering")
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
2021-06-03 15:47:23 +02:00
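
A sketch of the flag move described above; the exact constant values are
assumptions, only the LSB-to-MSB shift and the 1024 headroom argument come from
the commit message:

    /* Before: flag kept in the LSB of util_est.enqueued, visible to callers */
    /*     #define UTIL_AVG_UNCHANGED 0x1        */

    /* After: flag kept in the MSB; the maximal task util_avg is 1024, so the
     * MSB of the 32-bit 'enqueued' field never carries a utilization value. */
    #define UTIL_AVG_UNCHANGED 0x80000000
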
Peter Zijlstra 5cb9eaa3d2 sched: Wrap rq::lock access
In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
2021-05-12 11:43:26 +02:00
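
A minimal sketch of the accessor pattern described above (the field and helper
names are assumptions):

    static inline raw_spinlock_t *rq_lockp(struct rq *rq)
    {
            return &rq->__lock;
    }

    static inline void raw_spin_rq_lock(struct rq *rq)
    {
            raw_spin_lock(rq_lockp(rq));
    }

    static inline void raw_spin_rq_unlock(struct rq *rq)
    {
            raw_spin_unlock(rq_lockp(rq));
    }
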
Waiman Long ad789f84c9 sched/debug: Fix cgroup_path[] serialization
The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other means).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupts disabled for the whole duration of the calls to
print_*_stats() and print_rq(), which could last for quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actual serial console implementation of the
system.

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't
need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible because of the limited size of the kernel stack.

The solution implemented in this patch is to allow only one caller at a
time to use the full size group_path[], while other simultaneous callers
will have to use shorter stack buffers with the possibility of path
name truncation. A "..." suffix will be printed if truncation may have
happened.  The cgroup path name is provided for informational purpose
only, so occasional path name truncation should not be a big problem.

Fixes: efe25c2c7b ("sched: Reinstate group names in /proc/sched_debug")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com
2021-04-21 13:55:42 +02:00
Paul Turner c006fac556 sched: Warn on long periods of pending need_resched
CPU scheduler marks need_resched flag to signal a schedule() on a
particular CPU. But, schedule() may not happen immediately in cases
where the current task is executing in the kernel mode (no
preemption state) for extended periods of time.

This patch adds a warn_on if need_resched is pending for more than the
time specified in sysctl resched_latency_warn_ms. If it goes off, it is
likely that there is a missing cond_resched() somewhere. Monitoring is
done via the tick and the accuracy is hence limited to jiffy scale. This
also means that we won't trigger the warning if the tick is disabled.

This feature (LATENCY_WARN) is default disabled.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
2021-04-21 13:55:41 +02:00
Peter Zijlstra 9406415f46 sched/debug: Rename the sched_debug parameter to sched_verbose
CONFIG_SCHED_DEBUG is the build-time Kconfig knob, the boot param
sched_debug and the /debug/sched/debug_enabled knobs control the
sched_debug_enabled variable, but what they really do is make
SCHED_DEBUG more verbose, so rename the lot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-04-17 13:22:44 +02:00
Peter Zijlstra d27e9ae2f2 sched: Move /proc/sched_debug to debugfs
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.548833671@infradead.org
2021-04-16 17:06:35 +02:00
Peter Zijlstra 3b87f136f8 sched,debug: Convert sysctl sched_domains to debugfs
Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net
2021-04-16 17:06:35 +02:00
Peter Zijlstra 1011dcce99 sched,preempt: Move preempt_dynamic to debug.c
Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to
collect all the debugfs bits.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
2021-04-16 17:06:34 +02:00
Peter Zijlstra 8a99b6833c sched: Move SCHED_DEBUG sysctl to debugfs
Stop polluting sysctl with undocumented knobs that really are debug
only, move them all to /debug/sched/ along with the existing
/debug/sched_* files that already exist.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-04-16 17:06:34 +02:00
Ingo Molnar 3b03706fa6 sched: Fix various typos
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org
2021-03-22 00:11:52 +01:00
Hui Su 65bcf072e2 sched: Use task_current() instead of 'rq->curr == p'
Use the task_current() function where appropriate.

No functional change.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2021-01-14 11:20:11 +01:00
Colin Ian King 8d4d9c7b43 sched/debug: Fix memory corruption caused by multiple small reads of flags
Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times
with small reads causes oopses with slub corruption issues because kfree() is
freeing an offset from a previous allocation. Fix this by adding a new
pointer 'buf' for the allocation and kfree(), and use the temporary pointer tmp
to handle memory copies of the buf offsets.

Fixes: 5b9f8ff7b3 ("sched/debug: Output SD flag names rather than their values")
Reported-by: Jeff Bastian <jbastian@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com
2020-11-10 18:38:49 +01:00
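
A fragment sketching the fix described above: kfree() must receive the pointer
originally returned by kmalloc(), so the allocation is kept in 'buf' and only
'tmp' is advanced while copying out chunks (variable names as in the
description, surrounding context assumed):

    char *buf, *tmp;

    buf = kmalloc(len, GFP_KERNEL);
    if (!buf)
            return -ENOMEM;

    tmp = buf;
    /* ... format flag names into the buffer, advancing tmp per partial read ... */

    kfree(buf);     /* previously kfree() was handed the advanced offset pointer */
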
Valentin Schneider 848785df48 sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
The last sd_flag_debug shuffle inadvertently moved its definition within
an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to
produce the sched domain ctl interface (which uses sd_flag_debug to output
flag names), it isn't required to run any assertion on the sched_domain
hierarchy itself.

Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of
topology.c.

Now at long last we have:

- sd_flag_debug declared in include/linux/sched/topology.h iff
  CONFIG_SCHED_DEBUG=y
- sd_flag_debug defined in kernel/sched/topology.c, conditioned by:
  - CONFIG_SCHED_DEBUG, with an explicit #ifdef block
  - CONFIG_SMP, as a requirement to compile topology.c

With this change, all symbols pertaining to SD flag metadata (with the
exception of __SD_FLAG_CNT) are now defined exclusively within topology.c

Fixes: 8fca9494d4 ("sched/topology: Move sd_flag_debug out of linux/sched/topology.h")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200908184956.23369-1-valentin.schneider@arm.com
2020-09-09 10:09:03 +02:00
Valentin Schneider 8fca9494d4 sched/topology: Move sd_flag_debug out of linux/sched/topology.h
Defining an array in a header imported all over the place clearly is a daft
idea; that still didn't stop me from doing it.

Leave a declaration of sd_flag_debug in topology.h and move its definition
to sched/debug.c.

Fixes: b6e862f386 ("sched/topology: Define and assign sched_domain flag metadata")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Valentin Schneider 5b9f8ff7b3 sched/debug: Output SD flag names rather than their values
Decoding the output of /proc/sys/kernel/sched_domain/cpu*/domain*/flags has
always been somewhat annoying, as one needs to go fetch the bit -> name
mapping from the source code itself. This encoding can be saved in a script
somewhere, but that isn't safe from flags being added, removed or even
shuffled around.

What matters for debugging purposes is to get *which* flags are set in a
given domain, their associated value is pretty much meaningless.

Make the sd flags debug file output flag names.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-7-valentin.schneider@arm.com
2020-08-19 10:49:48 +02:00
Peter Zijlstra 126c2092e5 sched: Add rq::ttwu_pending
In preparation of removing rq->wake_list, replace the
!list_empty(rq->wake_list) with rq->ttwu_pending. This is not fully
equivalent as this new variable is racy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org
2020-05-28 10:54:16 +02:00
Peter Zijlstra 9013196a46 Merge branch 'sched/urgent' 2020-05-19 20:34:12 +02:00
Pavankumar Kondeti ad32bb41fc sched/debug: Fix requested task uclamp values shown in procfs
The intention of commit 96e74ebf8d ("sched/debug: Add task uclamp
values to SCHED_DEBUG procfs") was to print requested and effective
task uclamp values. The requested values printed are read from p->uclamp,
which holds the last effective values. Fix this by printing the values
from p->uclamp_req.

Fixes: 96e74ebf8d ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
2020-05-19 20:34:10 +02:00
Valentin Schneider 9818427c62 sched/debug: Make sd->flags sysctl read-only
Writing to the sysctl of a sched_domain->flags directly updates the value of
the field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to doesn't have the relevant flag set anymore).

Explicit domain walks that check for flags will be affected by
the write, but this won't be in sync with the cached pointers which will
still point to the domains that were cached at the last sched_domain
build.

In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers when
written to, but this does not seem to be worth the trouble. Make it
read-only.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Xie XiuQi f080d93e1d sched/debug: Fix trival print_task() format
Ensure there is one space between the state and the task name.

w/o patch:
runnable tasks:
 S           task   PID         tree-key  switches  prio     wait
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200414125721.195801-1-xiexiuqi@huawei.com
2020-04-30 20:14:37 +02:00
Valentin Schneider 96e74ebf8d sched/debug: Add task uclamp values to SCHED_DEBUG procfs
Requested and effective uclamp values can be a bit tricky to decipher when
playing with cgroup hierarchies. Add them to a task's procfs when
SCHED_DEBUG is enabled.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com
2020-04-08 11:35:27 +02:00
Valentin Schneider 9e3bf9469c sched/debug: Factor out printing formats into common macros
The printing macros in debug.c keep redefining the same output
format. Collect each output format in a single definition, and reuse that
definition in the other macros. While at it, add a layer of parentheses and
replace printf calls with the newly introduced macros.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com
2020-04-08 11:35:26 +02:00
Valentin Schneider c745a6212c sched/debug: Remove redundant macro define
Most printing macros for procfs are defined globally in debug.c, and they
are re-defined (to the exact same thing) within proc_sched_show_task().

Get rid of the duplicate defines.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com
2020-04-08 11:35:24 +02:00
Vincent Guittot 9f68395333 sched/pelt: Add a new runnable average signal
Now that runnable_load_avg has been removed, we can replace it with a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
tracks the waiting time of tasks on the rq and can help to better define the
state of rqs.

Currently, only util_avg is used to define the state of a rq:
  A rq with more than around 80% utilization and more than 1 task is
  considered overloaded.

But the util_avg signal of a rq can become temporarily low after a task
has migrated onto another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply
adds the waiting time to the running time. The runnable_avg of a cfs_rq
will be the sum of its se's runnable_avg, and the runnable_avg of a group
entity will follow the one of the rq, similarly to util_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net
2020-02-24 11:36:36 +01:00
Vincent Guittot 0dacee1bfa sched/pelt: Remove unused runnable load average
Now that runnable_load_avg is no longer used, we can remove it to make
space for a new signal.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net
2020-02-24 11:36:36 +01:00
Wei Li 02d4ac5885 sched/debug: Reset watchdog on all CPUs while processing sysrq-t
Lengthy output of sysrq-t may take a lot of time on a slow serial console
with lots of processes and CPUs.

So we need to reset NMI-watchdog to avoid spurious lockup messages, and
we also reset softlockup watchdogs on all other CPUs since another CPU
might be blocked waiting for us to process an IPI or stop_machine.

Add this to sysrq_sched_debug_show(), as we did in show_state_filter().

Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
2020-01-17 10:19:20 +01:00
Ingo Molnar d2abae71eb Linux 5.2-rc6
Merge tag 'v5.2-rc6' into sched/core, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:53 +02:00
Thomas Gleixner d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00