415 Commits

Author SHA1 Message Date
Rafael Aquini 5d187dd21a mm: remove redundant K() macro definition
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 61f297380118060a70888e0c1f5c534b74ab78fe
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Fri Aug 4 09:25:53 2023 +0800

    mm: remove redundant K() macro definition

    Patch series "cleanup with helper macro K()".

    Use helper macro K() to improve code readability.  No functional
    modification involved.  Remove redundant K() macro definition.

    This patch (of 7):

    Since commit eb8589b4f8c1 ("mm: move mem_init_print_info() to mm_init.c"),
    the K() macro definition has been moved to mm/internal.h.  Therefore, the
    definitions in mm/memcontrol.c, mm/backing-dev.c and mm/oom_kill.c are
    redundant.  Drop redundant definitions.

    [akpm@linux-foundation.org: oom_kill.c: remove "#undef K", per Kefeng]
    Link: https://lkml.kernel.org/r/20230804012559.2617515-1-zhangpeng362@huawei.com
    Link: https://lkml.kernel.org/r/20230804012559.2617515-2-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:53 -04:00
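
A minimal user-space sketch of what the shared K() helper does, assuming 4 KiB pages (PAGE_SHIFT of 12). The format_rss() helper and its output string are hypothetical and only illustrate why a single definition suffices for every caller:

```c
#include <stdio.h>

/* Assumption: 4 KiB pages; the kernel takes PAGE_SHIFT from the architecture. */
#define PAGE_SHIFT 12

/* Convert a page count to KiB, which is the job of the shared K() helper. */
#define K(x) ((x) << (PAGE_SHIFT - 10))

/* Hypothetical helper: print a page count in kB, as memory reports do. */
static int format_rss(char *buf, size_t len, unsigned long pages)
{
    return snprintf(buf, len, "anon-rss:%lukB", K(pages));
}
```

With one definition in a shared header, per-file copies such as the ones removed from mm/memcontrol.c, mm/backing-dev.c and mm/oom_kill.c add nothing and can drift apart.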
Rafael Aquini fdf497fb3e mm, oom: do not check 0 mask in out_of_memory()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 4822acb1369637938c1252d534d3356c5e313cde
Author: Haifeng Xu <haifeng.xu@shopee.com>
Date:   Mon May 8 07:35:38 2023 +0000

    mm, oom: do not check 0 mask in out_of_memory()

    Since commit 60e2793d440a ("mm, oom: do not trigger out_of_memory from the
    #PF"), no user sets gfp_mask to 0.  Remove the 0 mask check and update the
    comments.

    Link: https://lkml.kernel.org/r/20230508073538.1168-1-haifeng.xu@shopee.com
    Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:19 -04:00
Aristeu Rozanski 5455c3da6d mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7d4a8be0c4b2b7ffb367929d2b352651f083806b
Author: Alistair Popple <apopple@nvidia.com>
Date:   Tue Jan 10 13:57:22 2023 +1100

    mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

    mmu_notifier_range_update_to_read_only() was originally introduced in
    commit c6d23413f8 ("mm/mmu_notifier:
    mmu_notifier_range_update_to_read_only() helper") as an optimisation for
    device drivers that know a range has only been mapped read-only.  However
    there are no users of this feature so remove it.  As it is the only user
    of the struct mmu_notifier_range.vma field remove that also.

    Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:05 -04:00
Chris von Recklinghausen f52e563186 mm: reduce noise in show_mem for lowmem allocations
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 974f4367dd315acc15ad4a6453f8304aea60dfbd
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Aug 23 11:22:30 2022 +0200

    mm: reduce noise in show_mem for lowmem allocations

    While discussing early DMA pool pre-allocation failure with Christoph [1]
    I have realized that the allocation failure warning is rather noisy for
    constrained allocations like GFP_DMA{32}.  Those zones are usually not
    populated on all nodes, as their memory ranges are constrained.

    This is an attempt to reduce the ballast that doesn't provide any relevant
    information when investigating those allocation failures.  Please note that
    I have only compile tested it (in my default config setup) and I am
    throwing it mostly to see what people think about it.

    [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de

    [mhocko@suse.com: update]
      Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz
    [mhocko@suse.com: fix build]
    [akpm@linux-foundation.org: fix it for mapletree]
    [akpm@linux-foundation.org: update it for Michal's update]
    [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c]
      Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c]
    Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:59 -04:00
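
One way to picture the noise reduction is a report that skips zones the failed allocation could never have used. The sketch below is a user-space model under that assumption; the commit message does not spell out the mechanism, and struct zone_info, the enum, and zones_to_report() are hypothetical names:

```c
#include <stddef.h>

/* Hypothetical zone ordering, lowest first. */
enum zone_idx { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };

struct zone_info {
    enum zone_idx idx;
    unsigned long free_pages;
};

/* Count the zones worth reporting for an allocation limited to max_zone_idx. */
static size_t zones_to_report(const struct zone_info *zones, size_t n,
                              enum zone_idx max_zone_idx)
{
    size_t shown = 0;

    for (size_t i = 0; i < n; i++) {
        if (zones[i].idx > max_zone_idx)
            continue; /* above the allocation's limit: skip it as noise */
        shown++;
    }
    return shown;
}
```

For a GFP_DMA32-style request, only the DMA and DMA32 entries would be reported in this model.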
Chris von Recklinghausen bdf4f0a2df mm/oom_kill: use vma iterators instead of vma linked list
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit e1c2c775d448be0503a3ac90681d86980919bad0
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:03 2022 +0000

    mm/oom_kill: use vma iterators instead of vma linked list

    Use vma iterator in preparation of removing the linked list.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-62-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:55 -04:00
Nico Pache db3644c677 mm: delete unused MMF_OOM_VICTIM flag
commit b3541d912a84dc40cabb516f2deeac9ae6fa30da
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Tue May 31 15:31:00 2022 -0700

    mm: delete unused MMF_OOM_VICTIM flag

    With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is now
    unused and can be removed.

    [akpm@linux-foundation.org: remove comment about now-removed mm_is_oom_victim()]
    Link: https://lkml.kernel.org/r/20220531223100.510392-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Liam Howlett <liam.howlett@oracle.com>

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Nico Pache a1e8cb93bf mm: drop oom code from exit_mmap
Conflicts:
       mm/mmap.c: slight differences in unmap_vmas and free_pgtables
        arguments.

commit bf3980c85212fc71512d27a46f5aab66f46ca284
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Tue May 31 15:30:59 2022 -0700

    mm: drop oom code from exit_mmap

    The primary reason to invoke the oom reaper from the exit_mmap path used
    to be to prevent excessive oom killing if the oom victim's exit
    races with the oom reaper (see [1] for more details).  The invocation has
    moved around since then because of the interaction with the munlock logic
    but the underlying reason has remained the same (see [2]).

    Munlock code is no longer a problem since [3] and there shouldn't be any
    blocking operation before the memory is unmapped by exit_mmap so the oom
    reaper invocation can be dropped.  The unmapping part can be done with the
    non-exclusive mmap_sem and the exclusive one is only required when page
    tables are freed.

    Remove the oom_reaper from exit_mmap which will make the code easier to
    read.  This is really unlikely to make any observable difference although
    some microbenchmarks could benefit from one less branch that needs to be
    evaluated even though it almost never is true.

    [1] 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    [2] 27ae357fa8 ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
    [3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")

    [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization]
    Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:01 -06:00
Chris von Recklinghausen 678e36d15c mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery
Bugzilla: https://bugzilla.redhat.com/2160210

commit a19cad0691597eb79c123b8a19a9faba5ab7d90e
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Wed Jun 1 15:57:16 2022 -0700

    mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery

    arm allnoconfig:

    mm/oom_kill.c:60:25: warning: 'vm_oom_kill_table' defined but not used [-Wunused-variable]
       60 | static struct ctl_table vm_oom_kill_table[] = {
          |                         ^~~~~~~~~~~~~~~~~

    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
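
The warning above is the usual result of a static table whose only user is compiled out. A minimal user-space sketch of the pattern, assuming the fix is to keep the table under the same guard as its user; struct ctl_entry, oom_table and oom_table_len() are hypothetical stand-ins, and CONFIG_SYSCTL is defined by hand here rather than by a kernel config:

```c
#include <stddef.h>

/* Stand-in for the kernel config option; remove this line to model allnoconfig. */
#define CONFIG_SYSCTL 1

struct ctl_entry {
    const char *name;
    int *data;
};

int sysctl_oom_kill_allocating_task;

#ifdef CONFIG_SYSCTL
/* Table and its only user share one guard, so neither is ever left unused. */
static struct ctl_entry oom_table[] = {
    { "oom_kill_allocating_task", &sysctl_oom_kill_allocating_task },
    { NULL, NULL }, /* sentinel terminating the array */
};

static size_t oom_table_len(void)
{
    size_t n = 0;

    while (oom_table[n].name != NULL)
        n++;
    return n;
}
#else
static size_t oom_table_len(void)
{
    return 0;
}
#endif
```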
Chris von Recklinghausen 94b77ec848 mm: move oom_kill sysctls to their own file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 43fe219aa56a2fdd8f0623c9470a32b14b0617a5
Author: sujiaxun <sujiaxun@uniontech.com>
Date:   Thu Feb 17 18:51:48 2022 -0800

    mm: move oom_kill sysctls to their own file

    kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
    dishes, this makes it very difficult to maintain.

    To help with this maintenance let's start by moving sysctls to places
    where they actually belong.  The proc sysctl maintainers do not want to
    know what sysctl knobs you wish to add for your own piece of code, we just
    care about the core logic.

    So move the oom_kill sysctls to their own file, mm/oom_kill.c

    [sfr@canb.auug.org.au: null-terminate the array]
      Link: https://lkml.kernel.org/r/20220216193202.28838626@canb.auug.org.au

    Link: https://lkml.kernel.org/r/20220215093203.31032-1-sujiaxun@uniontech.com
    Signed-off-by: sujiaxun <sujiaxun@uniontech.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:50 -04:00
Waiman Long cfab40ea83 mm, oom: do not trigger out_of_memory from the #PF
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139747
Tested: A test kernel with this patch was able to pass the zswap test
	on an 8GB ppcle64 system with a lot of memory cgroup oom messages
	at the console.

commit 60e2793d440a3ec95abb5d6d4fc034a4b480472d
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri, 5 Nov 2021 13:38:06 -0700

    mm, oom: do not trigger out_of_memory from the #PF

    Any allocation failure during the #PF path will return with VM_FAULT_OOM
    which in turn results in pagefault_out_of_memory.  This can happen for 2
    different reasons.  a) Memcg is out of memory and we rely on
    mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
    normal allocation fails.

    The latter is quite problematic because allocation paths already trigger
    out_of_memory and the page allocator tries really hard to not fail
    allocations.  Anyway, if the OOM killer has been already invoked there
    is no reason to invoke it again from the #PF path.  Especially when the
    OOM condition might be gone by that time and we have no way to find out
    other than allocate.

    Moreover if the allocation failed and the OOM killer hasn't been invoked
    then we are unlikely to do the right thing from the #PF context because
    we have already lost the allocation context and restrictions and
    therefore might oom kill a task from a different NUMA domain.

    This all suggests that there is no legitimate reason to trigger
    out_of_memory from pagefault_out_of_memory so drop it.  Just to be sure
    that no #PF path returns with VM_FAULT_OOM without allocation print a
    warning that this is happening before we restart the #PF.

    [VvS: #PF allocation can hit the limit of the cgroup v1 kmem controller.
    This is a local problem related to memcg, however, it causes unnecessary
    global OOM kills that are repeated over and over again and escalate into a
    real disaster.  This has been broken since kmem accounting has been
    introduced for cgroup v1 (3.8).  There was no kmem specific reclaim for
    the separate limit so the only way to handle kmem hard limit was to return
    with ENOMEM.  In upstream the problem will be fixed by removing the
    outdated kmem limit, however stable and LTS kernels cannot do it and are
    still affected.  This patch fixes the problem and should be backported
    into stable/LTS.]

    Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-11-15 21:29:36 -05:00
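
The resulting decision in the #PF out-of-memory path can be modelled as below. This is a user-space sketch, not the kernel function: the enum and pagefault_oom_model() are hypothetical, and the dying-task check is taken from the separate "don't force global OOM for dying tasks" commit in this log:

```c
#include <stdbool.h>

enum pf_oom_action {
    PF_OOM_MEMCG_HANDLED,  /* memcg OOM handling dealt with the failure */
    PF_OOM_LET_TASK_DIE,   /* the task is being killed anyway: do nothing */
    PF_OOM_WARN_AND_RETRY  /* warn about the leaked VM_FAULT_OOM, restart the #PF */
};

/* No branch invokes the global OOM killer any more. */
static enum pf_oom_action pagefault_oom_model(bool memcg_oom_handled,
                                              bool fatal_signal_pending)
{
    if (memcg_oom_handled)
        return PF_OOM_MEMCG_HANDLED;
    if (fatal_signal_pending)
        return PF_OOM_LET_TASK_DIE;
    return PF_OOM_WARN_AND_RETRY;
}
```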
Chris von Recklinghausen a6c94949d2 mm/oom_kill: remove unneeded is_memcg_oom check
Bugzilla: https://bugzilla.redhat.com/2120352

commit bd8b77d653e84cf1387b8046c61315af8b7513fb
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:46:02 2022 -0700

    mm/oom_kill: remove unneeded is_memcg_oom check

    oom_cpuset_eligible() is always called when !is_memcg_oom().  Remove this
    unnecessary check.

    Link: https://lkml.kernel.org/r/20220224115933.20154-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
Chris von Recklinghausen e36c1bf190 mm, oom: OOM sysrq should always kill a process
Bugzilla: https://bugzilla.redhat.com/2120352

commit f530243a172d2ff03f88d0056f838928d6445c6d
Author: Jann Horn <jannh@google.com>
Date:   Fri Jan 14 14:08:27 2022 -0800

    mm, oom: OOM sysrq should always kill a process

    The OOM kill sysrq (alt+sysrq+F) should allow the user to kill the
    process with the highest OOM badness with a single execution.

    However, at the moment, the OOM kill can bail out if an OOM notifier
    (e.g.  the i915 one) says that it reclaimed a tiny amount of memory from
    somewhere.  That's probably not what the user wants, so skip the bailout
    if the OOM was triggered via sysrq.

    Link: https://lkml.kernel.org/r/20220106102605.635656-1-jannh@google.com
    Signed-off-by: Jann Horn <jannh@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
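
The changed bailout condition can be sketched as a single predicate. This is an illustrative model with a hypothetical function name and parameters, assuming the notifiers report how many pages they freed:

```c
#include <stdbool.h>

/*
 * May OOM handling stop early after the OOM notifiers ran?
 * Only if they freed something and the OOM was not requested via sysrq.
 */
static bool may_bail_out_after_notifiers(unsigned long freed_pages,
                                         bool is_sysrq_oom)
{
    return freed_pages > 0 && !is_sysrq_oom;
}
```

With the sysrq case excluded, a notifier reclaiming a tiny amount of memory no longer prevents the requested kill.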
Chris von Recklinghausen f97fdf33af mm/memcg: add oom_group_kill memory event
Bugzilla: https://bugzilla.redhat.com/2120352

commit b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a
Author: Dan Schatzberg <schatzberg.dan@gmail.com>
Date:   Fri Jan 14 14:05:35 2022 -0800

    mm/memcg: add oom_group_kill memory event

    Our container agent wants to know when a container exits if it was OOM
    killed or not to report to the user.  We use memory.oom.group = 1 to
    ensure that OOM kills within the container's cgroup kill everything.
    Existing memory.events are insufficient for knowing if this triggered:

    1) Our current approach reads memory.events oom_kill and reports the
       container was killed if the value is non-zero. This is erroneous in
       some cases where containers create their children cgroups with
       memory.oom.group=1 as such OOM kills will get counted against the
       container cgroup's oom_kill counter despite not actually OOM killing
       the entire container.

    2) Reading memory.events.local will fail to identify OOM kills in leaf
       cgroups (that don't set memory.oom.group) within the container
       cgroup.

    This patch adds a new oom_group_kill event when memory.oom.group
    triggers to allow userspace to cleanly identify when an entire cgroup is
    oom killed.

    [schatzberg.dan@gmail.com: changes from Johannes and Chris]
      Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com

    Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
    Reviewed-by: Roman Gushchin <guro@fb.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Chris Down <chris@chrisdown.name>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Alex Shi <alexs@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
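
A container agent could pick up the new counter by parsing the cgroup's memory.events text, which holds one "key value" pair per line. The following is a user-space sketch; memory_event_value() is a hypothetical helper, and reading the file from the cgroup directory is left to the caller:

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value stored for `key` in memory.events-style text,
 * or -1 if the key is absent or its value does not parse.
 */
static long long memory_event_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match the whole key, so "oom_kill" does not match "oom_kill_x". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            long long v;

            if (sscanf(line + klen + 1, "%lld", &v) == 1)
                return v;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next entry */
    }
    return -1;
}
```

A non-zero oom_group_kill value then means the whole cgroup was OOM killed, which the oom_kill counter alone cannot tell apart from kills in child cgroups.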
Chris von Recklinghausen 089efd59ca signal: Have the oom killer detect coredumps using signal->core_state
Bugzilla: https://bugzilla.redhat.com/2120352

commit 98b24b16b2aebffabf5b8670f44f19666c1e029f
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Nov 19 11:29:48 2021 -0600

    signal: Have the oom killer detect coredumps using signal->core_state

    In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
    __task_will_free_mem to test signal->core_state instead of the flag
    SIGNAL_GROUP_COREDUMP.

    Both fields are protected by siglock and both live in signal_struct so
    there are no real tradeoffs here, just a change to which field is
    being tested.

    Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen 8bc744b383 mm: mark the OOM reaper thread as freezable
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3723929eb0f50e2101de739cdb66458a4f1f4b27
Author: Sultan Alsawaf <sultan@kerneltoast.com>
Date:   Fri Nov 5 13:43:25 2021 -0700

    mm: mark the OOM reaper thread as freezable

    The OOM reaper alters user address space which might theoretically alter
    the snapshot if reaping is allowed to happen after the freezer quiescent
    state.  To this end, the reaper kthread uses wait_event_freezable()
    while waiting for any work so that it cannot run while the system
    freezes.

    However, the current implementation doesn't respect the freezer because
    all kernel threads are created with the PF_NOFREEZE flag, so they are
    automatically excluded from freezing operations.  This means that the
    OOM reaper can race with system snapshotting if it has work to do while
    the system is being frozen.

    Fix this by adding a set_freezable() call which will clear the
    PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer.

    Please note that the OOM reaper altering the snapshot this way is mostly
    a theoretical concern and has not been observed in practice.

    Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com
    Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com
    Fixes: aac4536355 ("mm, oom: introduce oom reaper")
    Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
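
The flag handling described above can be modelled in user space as follows. struct task_model and the *_model() helpers are hypothetical, and the PF_NOFREEZE bit value is illustrative rather than authoritative:

```c
#include <stdbool.h>

/* Illustrative bit value; the real one lives in the kernel headers. */
#define PF_NOFREEZE 0x00008000u

struct task_model {
    unsigned int flags;
};

/* Kernel threads start out excluded from freezing. */
static void kthread_create_model(struct task_model *t)
{
    t->flags |= PF_NOFREEZE;
}

/* What set_freezable() achieves for the calling thread in this model. */
static void set_freezable_model(struct task_model *t)
{
    t->flags &= ~PF_NOFREEZE;
}

/* The freezer only considers tasks without PF_NOFREEZE. */
static bool freezer_can_freeze(const struct task_model *t)
{
    return !(t->flags & PF_NOFREEZE);
}
```

Without the set_freezable step, waiting with a freezable wait primitive has no effect because the freezer never looks at the thread.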
Chris von Recklinghausen 9820873c88 mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
Bugzilla: https://bugzilla.redhat.com/2120352

commit 0b28179a6138a5edd9d82ad2687c05b3773c387b
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Fri Nov 5 13:38:02 2021 -0700

    mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks

    Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.

    Memory cgroup charging allows killed or exiting tasks to exceed the hard
    limit.  It can be misused and allowed to trigger global OOM from inside
    a memcg-limited container.  On the other hand if memcg fails allocation,
    called from inside #PF handler it triggers global OOM from inside
    pagefault_out_of_memory().

    To prevent these problems this patchset:
     (a) removes execution of out_of_memory() from
         pagefault_out_of_memory(), because nobody can explain why it is
         necessary.
     (b) allows memcg to fail allocation of dying/killed tasks.

    This patch (of 3):

    Any allocation failure during the #PF path will return with VM_FAULT_OOM
    which in turn results in pagefault_out_of_memory which in turn executes
    out_of_memory() and can kill a random task.

    An allocation might fail when the current task is the oom victim and
    there are no memory reserves left.  The OOM killer is already handled at
    the page allocator level for the global OOM and at the charging level
    for the memcg one.  Both have much more information about the scope of
    allocation/charge request.  This means that either the OOM killer has
    been invoked properly and didn't lead to the allocation success or it
    has been skipped because it couldn't have been invoked.  In both cases
    triggering it from here is pointless and even harmful.

    It makes much more sense to let the killed task die rather than to wake
    up an eternally hungry oom-killer and send him to choose a fatter victim
    for breakfast.

    Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen 84c6164e5a mm: use pidfd_get_task()
Conflicts: mm/oom_kill.c - We already have
	337546e83fc7 ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
	so keep current assignment of reap

Bugzilla: https://bugzilla.redhat.com/2120352

commit ee9955d61a0a770152f9c3af470bd1689f034c74
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Mon Oct 11 15:32:45 2021 +0200

    mm: use pidfd_get_task()

    Instead of duplicating the same code in two places use the newly added
    pidfd_get_task() helper. This fixes an (unimportant for now) bug where
    PIDTYPE_PID is used whereas PIDTYPE_TGID should have been used.

    Link: https://lore.kernel.org/r/20211004125050.1153693-3-christian.brauner@ubuntu.com
    Link: https://lore.kernel.org/r/20211011133245.1703103-3-brauner@kernel.org
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Matthew Bobrowski <repnop@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Matthew Bobrowski <repnop@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Chris von Recklinghausen f7fb43f6b1 coredump: Don't perform any cleanups before dumping core
Bugzilla: https://bugzilla.redhat.com/2120352

commit 92307383082daff5df884a25df9e283efb7ef261
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 1 11:33:50 2021 -0500

    coredump:  Don't perform any cleanups before dumping core

    Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
    before PTRACE_EVENT_EXIT, and before any cleanup work for a task
    happens.  This ensures that an accurate copy of the process can be
    captured in the coredump as no cleanup for the process happens before
    the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
    will not be visited by any thread until the coredump is complete.

    Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
    coredump_task_exit can be recognized and ignored in zap_process.

    Now that all of the coredumping happens before exit_mm remove code to
    test for a coredump in progress from mm_release.

    Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
    The other tests in may_ptrace_stop all concern avoiding stopping
    during a coredump.  These tests are no longer necessary as it is now
    guaranteed that fatal_signal_pending will be set if the code enters
    ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
    not to stop if fatal_signal_pending returns true.

    Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
    ptrace_stop without fatal_signal_pending being true, as signals are
    dequeued in get_signal before calling do_exit.  This is no longer
    an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
    until after the coredump completes.

    Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Chris von Recklinghausen 00d1c4c826 exit: Factor coredump_exit_mm out of exit_mm
Bugzilla: https://bugzilla.redhat.com/2120352

commit d67e03e361619b20c51aaef3b7dd1497617c371d
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 1 11:23:38 2021 -0500

    exit: Factor coredump_exit_mm out of exit_mm

    Separate the coredump logic from the ordinary exit_mm logic
    by moving the coredump logic out of exit_mm into its own
    function coredump_exit_mm.

    Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:24 -04:00
Aristeu Rozanski 4ba8fd7ec7 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due to missing prototype for pmd_install()

commit a213e5cf71cbcea4b23caedcb8fe6629a333b275
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:23:29 2022 -0800

    mm/munlock: delete munlock_vma_pages_all(), allow oomreap

    munlock_vma_pages_range() will still be required, when munlocking but
    not munmapping a set of pages; but when unmapping a pte, the mlock count
    will be maintained in much the same way as it will be maintained when
    mapping in the pte.  Which removes the need for munlock_vma_pages_all()
    on mlocked vmas when munmapping or exiting: eliminating the catastrophic
    contention on i_mmap_rwsem, and the need for page lock on the pages.

    There is still a need to update locked_vm accounting according to the
    munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
    exit_mmap() does not need locked_vm updates, so delete unlock_range().

    And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
    because of the uncertainty in blocking on all those page locks?
    No fear of that now, so permit the OOM reaper on mlocked vmas.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski f65490413e mm/oom_kill: allow process_mrelease to run under mmap_lock protection
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due to missing ee9955d61a0a7

commit ba535c1caf3ee78aa7719e9e4b07a0dc1d153b9e
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Jan 14 14:06:22 2022 -0800

    mm/oom_kill: allow process_mrelease to run under mmap_lock protection

    With exit_mmap holding mmap_write_lock during free_pgtables call,
    process_mrelease does not need to elevate mm->mm_users in order to
    prevent exit_mmap from destroying pagetables while __oom_reap_task_mm is
    walking the VMA tree.  The change prevents process_mrelease from calling
    the last mmput, which can lead to waiting for IO completion in exit_aio.

    Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Christian Brauner <christian@brauner.io>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Jan Engelhardt <jengelh@inai.de>
    Cc: Jann Horn <jannh@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill@shutemov.name>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:11 -04:00
Nico Pache 4743a77671 oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
commit e4a38402c36e42df28eb1a5394be87e6571fb48a
Author: Nico Pache <npache@redhat.com>
Date:   Thu Apr 21 16:36:01 2022 -0700

    oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup

    The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
    can be targeted by the oom reaper.  This mapping is used to store the
    futex robust list head; the kernel does not keep a copy of the robust
    list and instead references a userspace address to maintain the
    robustness during a process death.

    A race can occur between exit_mm and the oom reaper that allows the oom
    reaper to free the memory of the futex robust list before the exit path
    has handled the futex death:

        CPU1                               CPU2
        --------------------------------------------------------------------
        page_fault
        do_exit "signal"
        wake_oom_reaper
                                            oom_reaper
                                            oom_reap_task_mm (invalidates mm)
        exit_mm
        exit_mm_release
        futex_exit_release
        futex_cleanup
        exit_robust_list
        get_user (EFAULT- can't access memory)

    If the get_user EFAULT's, the kernel will be unable to recover the
    waiters on the robust_list, leaving userspace mutexes hung indefinitely.

    Delay the OOM reaper, allowing more time for the exit path to perform
    the futex cleanup.

    Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

    Based on a patch by Michal Hocko.

    Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
    Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
    Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Joel Savitz <jsavitz@redhat.com>
    Signed-off-by: Nico Pache <npache@redhat.com>
    Co-developed-by: Joel Savitz <jsavitz@redhat.com>
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Rafael Aquini <aquini@redhat.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: Herton R. Krzesinski <herton@redhat.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Segall <bsegall@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Joel Savitz <jsavitz@redhat.com>
    Cc: Darren Hart <dvhart@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1951330
Signed-off-by: Nico Pache <npache@redhat.com>
2022-04-25 10:45:11 -04:00
Rafael Aquini e214048db6 mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 337546e83fc7e50917f44846beee936abb9c9f1f
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Oct 28 14:36:14 2021 -0700

    mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap

    Race between process_mrelease and exit_mmap, where free_pgtables is
    called while __oom_reap_task_mm is in progress, leads to kernel crash
    during pte_offset_map_lock call.  oom-reaper avoids this race by setting
    MMF_OOM_VICTIM flag and causing exit_mmap to take and release
    mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.

    Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
    fix this race, however that would be considered a hack.  Fix this race
    by elevating mm->mm_users and preventing exit_mmap from executing until
    process_mrelease is finished.  Patch slightly refactors the code to
    adapt for a possible mmget_not_zero failure.

    This fix has considerable negative impact on process_mrelease
    performance and will likely need later optimization.

    Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com
    Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call")
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Christian Brauner <christian@brauner.io>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Jan Engelhardt <jengelh@inai.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:15 -05:00
Rafael Aquini b3f424aa00 mm: introduce process_mrelease system call
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 884a7e5964e06ed93c7771c0d7cf19c09a8946f1
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Sep 2 15:00:29 2021 -0700

    mm: introduce process_mrelease system call

    In modern systems it's not unusual to have a system component monitoring
    memory conditions of the system and tasked with keeping system memory
    pressure under control.  One way to accomplish that is to kill
    non-essential processes to free up memory for more important ones.
    Examples of this are Facebook's OOM killer daemon called oomd and
    Android's low memory killer daemon called lmkd.

    For such a system component it's important to be able to free memory quickly
    and efficiently.  Unfortunately the time a process takes to free up its
    memory after receiving a SIGKILL might vary based on the state of the
    process (uninterruptible sleep), its size and the OPP level of the core the
    process is running on.  A mechanism to free resources of the target process
    in a more predictable way would improve system's ability to control its
    memory pressure.

    Introduce process_mrelease system call that releases memory of a dying
    process from the context of the caller.  This way the memory is freed in a
    more controllable way with CPU affinity and priority of the caller.  The
    workload of freeing the memory will also be charged to the caller.  The
    operation is allowed only on a dying process.

    After previous discussions [1, 2, 3] the decision was made [4] to
    introduce a dedicated system call to cover this use case.

    The API is as follows,

              int process_mrelease(int pidfd, unsigned int flags);

            DESCRIPTION
              The process_mrelease() system call is used to free the memory of
              an exiting process.

              The pidfd selects the process referred to by the PID file
              descriptor.
              (See pidfd_open(2) for further information)

              The flags argument is reserved for future use; currently, this
              argument must be specified as 0.

            RETURN VALUE
              On success, process_mrelease() returns 0. On error, -1 is
              returned and errno is set to indicate the error.

            ERRORS
              EBADF  pidfd is not a valid PID file descriptor.

              EAGAIN Failed to release part of the address space.

              EINTR  The call was interrupted by a signal; see signal(7).

              EINVAL flags is not 0.

              EINVAL The memory of the task cannot be released because the
                     process is not exiting, the address space is shared
                     with another live process or there is a core dump in
                     progress.

              ENOSYS This system call is not supported, for example, without
                     MMU support built into Linux.

              ESRCH  The target process does not exist (i.e., it has terminated
                     and been waited on).

    [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
    [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
    [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
    [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/

    Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Jan Engelhardt <jengelh@inai.de>
    Cc: Tim Murray <timmurray@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:25 -05:00
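The process_mrelease() API described in the commit above can be exercised from userspace roughly as follows. This is a minimal sketch, not part of the patch: it assumes Python 3.9+ on Linux, a kernel that provides the call, and the syscall number 448 (the value on x86_64 and arm64; other architectures may differ). The helper name kill_and_reap() is hypothetical.

```python
import ctypes
import os
import signal

# Assumption: syscall number on x86_64 and arm64; check your architecture.
SYS_PROCESS_MRELEASE = 448


def process_mrelease(pidfd: int, flags: int = 0) -> None:
    """Ask the kernel to release the memory of the dying process behind pidfd."""
    if flags != 0:
        # The commit message states flags is reserved and must be 0.
        raise ValueError("flags must be 0")
    # Load libc lazily so importing this module has no side effects.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(ctypes.c_long(SYS_PROCESS_MRELEASE),
                       ctypes.c_int(pidfd), ctypes.c_uint(flags))
    if ret != 0:
        err = ctypes.get_errno()
        # EBADF, EAGAIN, EINTR, EINVAL, ENOSYS or ESRCH per the text above.
        raise OSError(err, os.strerror(err))


def kill_and_reap(pid: int) -> None:
    """Hypothetical helper: SIGKILL a process, then reap its memory."""
    pidfd = os.pidfd_open(pid)
    try:
        # The target must already be dying, so send SIGKILL first.
        signal.pidfd_send_signal(pidfd, signal.SIGKILL)
        process_mrelease(pidfd)
    finally:
        os.close(pidfd)
```

A monitoring daemon in the style of oomd or lmkd would call kill_and_reap(pid) on its chosen victim; the reaping work is then charged to the daemon and runs with its CPU affinity and priority, as the commit message describes.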
Linus Torvalds 28e92f9903 Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:

 - Bitmap parsing support for "all" as an alias for all bits

 - Documentation updates

 - Miscellaneous fixes, including some that overlap into mm and lockdep

 - kvfree_rcu() updates

 - mem_dump_obj() updates, with acks from one of the slab-allocator
   maintainers

 - RCU NOCB CPU updates, including limited deoffloading

 - SRCU updates

 - Tasks-RCU updates

 - Torture-test updates

* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
  tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
  rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
  rcu: Add missing __releases() annotation
  rcu: Remove obsolete rcu_read_unlock() deadlock commentary
  rcu: Improve comments describing RCU read-side critical sections
  rcu: Create an unrcu_pointer() to remove __rcu from a pointer
  srcu: Early test SRCU polling start
  rcu: Fix various typos in comments
  rcu/nocb: Unify timers
  rcu/nocb: Prepare for fine-grained deferred wakeup
  rcu/nocb: Only cancel nocb timer if not polling
  rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
  rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
  rcu/nocb: Allow de-offloading rdp leader
  rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
  rcu: Don't penalize priority boosting when there is nothing to boost
  rcu: Point to documentation of ordering guarantees
  rcu: Make rcu_gp_cleanup() be noinline for tracing
  rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
  rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
  ...
2021-07-04 12:58:33 -07:00
Feng Tang b26e517a05 mm/mempolicy: cleanup nodemask intersection check for oom
Patch series "mm/mempolicy: some fix and semantics cleanup", v4.

Current memory policy code has some confusing and ambiguous parts about
MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and
there are many places having to distinguish them.  Also the nodemask
intersection check needs cleanup to be more explicit for OOM use, and
handle MPOL_INTERLEAVE correctly.  This patchset cleans up these and
unifies the parameter sanity check for mbind() and set_mempolicy().

This patch (of 3):

mempolicy_nodemask_intersects seems to be a general purpose mempolicy
function.  In fact it is partially tailored for the OOM purpose
instead.  The oom proper is the only existing user so rename the
function to make that purpose explicit.

While at it drop the MPOL_INTERLEAVE as those allocations never have a
nodemask defined (see alloc_page_interleave) so this is dead code and
a confusing one because MPOL_INTERLEAVE is a hint rather than a hard
requirement so it shouldn't be considered during the OOM.

The final code can be reduced to a check for MPOL_BIND which is the
only memory policy that is a hard requirement and thus relevant to a
constrained OOM logic.

[mhocko@suse.com: changelog edits]

Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
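The reduced check described in the commit above can be pictured with a small model. This is only a sketch in Python rather than the kernel's C: the function name policy_in_oom_domain(), the Mode enum and the use of sets for nodemasks are all illustrative assumptions, and treating every non-MPOL_BIND policy as eligible is this sketch's reading of "only MPOL_BIND is a hard requirement".

```python
from enum import Enum, auto
from typing import Optional, Set


class Mode(Enum):
    """Illustrative stand-ins for the kernel's memory policy modes."""
    MPOL_PREFERRED = auto()
    MPOL_BIND = auto()
    MPOL_INTERLEAVE = auto()
    MPOL_LOCAL = auto()


def policy_in_oom_domain(mode: Optional[Mode],
                         policy_nodes: Set[int],
                         oom_nodes: Set[int]) -> bool:
    """Return True if a task with this policy is a candidate for a
    nodemask-constrained OOM kill."""
    if mode is Mode.MPOL_BIND:
        # Hard requirement: the task can only use policy_nodes, so killing
        # it helps only if those nodes overlap the OOM nodemask.
        return bool(policy_nodes & oom_nodes)
    # No policy, or a policy that is merely a hint (e.g. MPOL_INTERLEAVE):
    # the task may have memory anywhere, so it stays eligible.
    return True
```

For example, a task bound to nodes {2, 3} is skipped when the OOM is constrained to nodes {0, 1}, while an interleaving task with the same nodes remains a candidate.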
Rolf Eike Beer 4c9c3809ae rcu: Fix typo in comment: kthead -> kthread
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-10 15:45:58 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Zhiyuan Dai 68d68ff6eb mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Randy Dunlap 845be1cd34 mm: eliminate "expecting prototype" kernel-doc warnings
Fix stray kernel-doc warnings in mm/ due to mis-typed or missing function
names.

Quietens these kernel-doc warnings:

  mm/mmu_gather.c:264: warning: expecting prototype for tlb_gather_mmu(). Prototype was for __tlb_gather_mmu() instead
  mm/oom_kill.c:180: warning: expecting prototype for Check whether unreclaimable slab amount is greater than(). Prototype was for should_dump_unreclaim_slab() instead
  mm/shuffle.c:155: warning: expecting prototype for shuffle_free_memory(). Prototype was for __shuffle_free_memory() instead

Link: https://lkml.kernel.org/r/20210411210642.11362-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16 16:10:36 -07:00
Tang Yizhou f8159c1390 mm, oom: fix a comment in dump_task()
If p is a kthread, it will be checked in oom_unkillable_task() so
we can delete the corresponding comment.

Link: https://lkml.kernel.org/r/20210125133006.7242-1-tangyizhou@huawei.com
Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Will Deacon a72afd8730 tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
needed now that there is a separate function for 'fullmm' flushing.

Remove the unused arguments and update all callers.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
2021-01-29 20:02:29 +01:00
Will Deacon ae8eba8b5d tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
Since commit 7a30df49f6 ("mm: mmu_gather: remove __tlb_reset_range()
for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
are no longer used, since we flush the whole mm in case of a nested
invalidation.

Remove the unused arguments and update all callers.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
2021-01-29 20:02:28 +01:00
Hui Su 259b3633e7 mm/oom_kill: change comment and rename is_dump_unreclaim_slabs()
Change the comment of is_dump_unreclaim_slabs(): it just checks whether the
nr_unreclaimable slabs amount is greater than user memory, and explain why
we dump unreclaimable slabs.

Renaming it to should_dump_unreclaim_slab() may be better.

Link: https://lkml.kernel.org/r/20201030182704.GA53949@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Suren Baghdasaryan 67197a4f28 mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more than one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a task changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec9 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:35 -07:00
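The flag test from the commit above, (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK), can be checked in isolation. The following is a sketch in Python, not the kernel code: the function name marks_mm_multiprocess() is hypothetical, and the CLONE_* values are the ones from the Linux UAPI headers as far as this sketch assumes.

```python
# Clone flag values as defined in the Linux UAPI headers.
CLONE_VM = 0x00000100      # child shares the parent's address space
CLONE_VFORK = 0x00004000   # parent sleeps until the child releases the mm
CLONE_THREAD = 0x00010000  # child joins the parent's thread group


def marks_mm_multiprocess(clone_flags: int) -> bool:
    """Return True if this clone() would cause the mm to be marked
    MMF_MULTIPROCESS under the rule in the commit message."""
    shares_mm = bool(clone_flags & CLONE_VM)
    is_thread = bool(clone_flags & CLONE_THREAD)
    is_vfork = bool(clone_flags & CLONE_VFORK)
    # Only separate processes that share an mm need oom_score_adj to be
    # synchronized; threads share the signal struct, and vfork() is
    # skipped to avoid a performance regression.
    return shares_mm and not is_thread and not is_vfork
```

Under this rule a pthread-style clone (CLONE_VM | CLONE_THREAD) and a vfork (CLONE_VM | CLONE_VFORK) leave the mm unmarked, so only the rare bare CLONE_VM case pays for the loop over all tasks in __set_oom_adj.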
Yafang Shao 619b5b469b mm, oom: show process exiting information in __oom_kill_process()
When the OOM killer finds a victim and tries to kill it, if the victim is
already exiting, the task mm will be NULL and no process will be killed.
But the dump_header() has been already executed, so it will be strange to
dump so much information without killing a process.  We'd better show some
helpful information to indicate why this happens.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Yafang Shao 9066e5cfb7 mm, oom: make the calculation of oom badness more accurate
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't choose the process with largest
resident memory but chooses the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should choose the process with largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process, 5740 (pause), was killed even
though its rss is only one page.  That is because, when we calculate the
oom badness in oom_badness(), we always ignore negative points and
convert all of them to 1.  Since oom_score_adj is -998 for every process
in this targeted memcg, the points of these processes are all negative.
As a result, the first scanned process will be killed.

The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
Guaranteed pod, which has a higher priority so that it is protected from
being killed by the system oom.

To fix this issue, we should make the calculation of oom points more
accurate.  We can achieve this by converting chosen_point from 'unsigned
long' to 'long'.
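
The effect of the clamping can be illustrated with a small userspace
sketch.  This is a simplified model, not the kernel code: struct task,
badness_signed(), badness_clamped() and the two select helpers are
hypothetical names, and the 8388608-page limit used in the usage note is
an assumed value.  The rss and page table figures for pause and data_sim
are taken from the dump above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified task record; not the kernel's task_struct. */
struct task {
	const char *name;
	long rss_pages;		/* resident pages */
	long pgtable_pages;	/* page table pages */
	long oom_score_adj;	/* -1000..1000 */
};

/* Raw signed score: memory footprint plus the scaled adjustment. */
static long badness_signed(const struct task *t, long totalpages)
{
	long points = t->rss_pages + t->pgtable_pages;

	points += t->oom_score_adj * (totalpages / 1000);
	return points;
}

/* Behaviour described above: every non-positive score collapses to 1. */
static unsigned long badness_clamped(const struct task *t, long totalpages)
{
	long points = badness_signed(t, totalpages);

	return points > 0 ? (unsigned long)points : 1;
}

/* Highest clamped score wins; on a tie the first scanned task is kept. */
static const struct task *select_clamped(const struct task *tasks, size_t n,
					 long totalpages)
{
	const struct task *chosen = NULL;
	unsigned long chosen_points = 0;

	assert(tasks != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long points = badness_clamped(&tasks[i], totalpages);

		if (points > chosen_points) {
			chosen = &tasks[i];
			chosen_points = points;
		}
	}
	return chosen;
}

/* Same scan with a signed score, so negative scores still rank tasks. */
static const struct task *select_signed(const struct task *tasks, size_t n,
					long totalpages)
{
	const struct task *chosen = NULL;
	long chosen_points = LONG_MIN;

	assert(tasks != NULL && n > 0);
	for (size_t i = 0; i < n; i++) {
		long points = badness_signed(&tasks[i], totalpages);

		if (points > chosen_points) {
			chosen = &tasks[i];
			chosen_points = points;
		}
	}
	return chosen;
}
```

With pause (1 page) scanned before data_sim (3967322 rss pages, 9058 page
table pages) and an assumed limit of 8388608 pages, select_clamped()
returns pause because both scores collapse to 1, while select_signed()
returns data_sim.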

[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Roman Gushchin d42f3245c7 mm: memcg: convert vmstat slab counters to bytes
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes.  This scheme may look weird, but
only for now.  As soon as slab pages are shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages.  However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects into account.
Keeping global and node counters in pages helps to avoid additional
overhead.

The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:24 -07:00
Christoph Hellwig f5678e7f2a kernel: better document the use_mm/unuse_mm API contract
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
  Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
  Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Michel Lespinasse c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse 3e4e28c5a8 mmap locking API: convert mmap_sem API comments
Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Joonsoo Kim 97a225e69a mm/page_alloc: integrate classzone_idx and high_zoneidx
classzone_idx is just a different name for high_zoneidx now.  So, integrate
them and add some comments to struct alloc_context in order to reduce
future confusion about the meaning of this variable.

The accessor, ac_classzone_idx() is also removed since it isn't needed
after integration.

In addition to integration, this patch also renames high_zoneidx to
highest_zoneidx since the new name conveys the meaning more precisely.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ye Xiaolong <xiaolong.ye@intel.com>
Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:44 -07:00
David Rientjes 8a7ff02aca mm, oom: dump stack of victim when reaping failed
When a process cannot be oom reaped, for whatever reason, the list of
locks that are held is currently dumped to the kernel log.

Much more interesting is the stack trace of the victim that cannot be
reaped.  If the stack trace is dumped, we have the ability to find
related occurrences in the same kernel code and hopefully solve the
issue that is making it wedged.

Dump the stack trace when a process fails to be oom reaped.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:38 -08:00
Ilya Dryomov 941f762bcb mm/oom: fix pgtables units mismatch in Killed process message
pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
As everything else is printed in kB, I chose to fix the value rather than
the string.

Before:

[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[   1878]  1000  1878   217253   151144  1269760        0             0 python
...
Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

After:

[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[   1436]  1000  1436   217253   151890  1294336        0             0 python
...
Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
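
The unit fix can be checked by hand with a short sketch.  bytes_to_kb()
and format_pgtables() are hypothetical helper names used only for this
illustration, and the conversion is assumed to be a plain division by
1024, rounded down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Convert a byte count to whole kB (1024-byte units), rounding down. */
static unsigned long bytes_to_kb(unsigned long bytes)
{
	return bytes >> 10;
}

/* Format the pgtables field the way the other kB fields are printed. */
static int format_pgtables(char *buf, size_t len, unsigned long pgtables_bytes)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "pgtables:%lukB", bytes_to_kb(pgtables_bytes));
}
```

For the "After" dump above, bytes_to_kb(1294336) gives 1264, which
matches the pgtables:1264kB value in the Killed process line.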

Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
Fixes: 70cb6d2677 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Edward Chron <echron@arista.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-04 13:55:09 -08:00
Minchan Kim 9c276cc65a mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start.  While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd (low memory killer daemon).  They are likely to be killed by
lmkd if the system has to reclaim memory.  In that sense they are similar
to entries in any other cache.  Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidates for swap.  Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed (we measured performance swapping out vs.
zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state.  Additionally,
it gives the platform many opportunities to use the information it has
to optimize memory efficiency.

To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD, which will deactivate activated pages, and the other is
MADV_PAGEOUT, which will reclaim private pages instantly.  These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space.  MADV_PAGEOUT is similar to
MADV_DONTNEED in that it hints to the kernel that the memory region is
not currently needed and should be reclaimed immediately; MADV_COLD is
similar to MADV_FREE in that it hints to the kernel that the memory
region is not currently needed and should be reclaimed when memory
pressure rises.

This patch (of 5):

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future.  The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU page like MADV_[DONTNEED|FREE].  IOW, it moves

	active file page -> inactive file LRU
	active anon page -> inactive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
file LRU's head because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the pages under memory pressure
because the content of the page is *garbage*, so freeing such pages has
almost zero overhead since we don't need to swap out, and a later access
causes just a minor fault.  Thus, it would make sense to put those
freeable pages in the inactive file LRU to compete with other used-once
pages.  It makes sense from an implementation point of view, too, because
the memory is not swap-backed any longer until it is re-dirtied.  It
could even give a bonus by letting them be reclaimed on a swapless
system.  However, MADV_COLD doesn't mean garbage, so reclaiming them
requires swap-out/in in the end, which is a bigger cost.  Since we have
designed VM LRU aging based on a cost model, anonymous cold pages are
better placed on the inactive anon LRU list, not the file LRU.
Furthermore, it would help to avoid unnecessary scanning if the system
doesn't have a swap device.  Let's start with the simpler way without
adding complexity at this moment.  However, keep in mind the caveat that
workloads with a lot of page cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.  In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
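
A minimal userspace sketch of how a process might issue the hint follows.
mark_range_cold() is a hypothetical wrapper name, and the fallback value
20 for MADV_COLD is taken from the generic Linux uapi headers for the
case where the libc headers do not define it yet.  Kernels without
MADV_COLD support are expected to reject the call, so the wrapper
reports a negative errno instead of aborting.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* generic uapi value; older libc headers may lack it */
#endif

/*
 * Hint that [addr, addr + len) is not expected to be used soon.
 * Returns 0 on success or a negative errno on failure (for example
 * -EINVAL on a kernel that does not know MADV_COLD).
 */
static int mark_range_cold(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (madvise(addr, len, MADV_COLD) != 0)
		return -errno;
	return 0;
}
```

Because the hint is non-destructive, data written to the range before
the call is still readable afterwards, whether or not the kernel accepted
the hint.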

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:41 -07:00
Michal Hocko 1eb41bb07e mm, oom: consider present pages for the node size
constrained_alloc() calculates the size of the oom domain by using
node_spanned_pages which is incorrect because this is the full range of
the physical memory range that the numa node occupies rather than the
memory that backs that range which is represented by node_present_pages.

Sparsely populated nodes (e.g.  after memory hot remove or simply sparse
due to memory layout) can have a really large difference between the two.
This shouldn't really cause any real user observable problems because the
oom calculates a ratio against totalpages and used memory cannot exceed
present pages but it is confusing and wrong from a code point of view.
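
The difference between the two counters can be sketched in userspace.
struct pfn_range, spanned_pages() and present_pages() are hypothetical
names for this illustration only; the model assumes a node described by
non-overlapping populated PFN ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical populated PFN range [start, end) inside one node. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Spanned pages: lowest start to highest end, holes included. */
static unsigned long spanned_pages(const struct pfn_range *r, size_t n)
{
	assert(r != NULL && n > 0);
	unsigned long lo = r[0].start, hi = r[0].end;

	for (size_t i = 1; i < n; i++) {
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
	}
	return hi - lo;
}

/* Present pages: only pages backed by memory (ranges must not overlap). */
static unsigned long present_pages(const struct pfn_range *r, size_t n)
{
	assert(r != NULL && n > 0);
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += r[i].end - r[i].start;
	return total;
}
```

For a node with populated ranges [0, 1000) and [5000, 6000), the spanned
size is 6000 pages while only 2000 pages are present, so sizing the oom
domain from the spanned value would overstate it by a factor of three.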

Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00
Yi Wang f364f06b34 mm/oom_kill.c: fix oom_cpuset_eligible() comment
Commit ac311a14c6 ("oom: decouple mems_allowed from
oom_unkillable_task") changed has_intersects_mems_allowed() to
oom_cpuset_eligible(), but didn't change the comment.

Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00
Edward Chron 70cb6d2677 mm/oom: add oom_score_adj and pgtables to Killed process message
For an OOM event: print oom_score_adj value for the OOM Killed process to
document what the oom score adjust value was at the time the process was
OOM Killed.  The adjustment value can be set by user code and it affects
the resulting oom_score so it is used to influence kill process selection.

When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing
this value is the only documentation of the value for the process being
killed.  Having this value on the Killed process message is useful to
document if a misconfiguration occurred or to confirm that the
oom_score_adj configuration applies as expected.

An example which illustrates both misconfiguration and validation that the
oom_score_adj was applied as expected is:

Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
 shmem-rss:0kB pgtables:22kB oom_score_adj:1000

The systemd-udevd is a critical system application that should have an
oom_score_adj of -1000.  It was misconfigured to have an adjustment of 1000
making it a highly favored OOM kill target process.  The output documents
both the misconfiguration and the fact that the process was correctly
targeted by OOM due to the misconfiguration.  This can be quite helpful for
triage and problem determination.

The addition of the pgtables_bytes shows page table usage by the process
and is a useful measure of the memory size of the process.

Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com
Signed-off-by: Edward Chron <echron@arista.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00
Tetsuo Handa f9c645621a memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
Masoud Sharbiani noticed that commit 29ef680ae7 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path.  It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3ba ("mm, oom:
move GFP_NOFS check to out_of_memory").

Allowing forced charge due to being unable to invoke memcg OOM killer will
lead to global OOM situation.  Also, just returning -ENOMEM will be risky
because OOM path is lost and some paths (e.g.  get_user_pages()) will leak
-ENOMEM.  Therefore, invoking memcg OOM killer (despite GFP_NOFS) is
the only choice we have for now.
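
The intended decision can be modelled with a small sketch.  This is not
the kernel code: SKETCH_GFP_FS is a hypothetical flag bit standing in
for __GFP_FS, and may_invoke_oom_killer() and check_sketch() are
hypothetical names.  It only encodes the rule argued for above, namely
that the missing-__GFP_FS bail-out applies to the global OOM path but
not to memcg OOM.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit for this sketch; not the real __GFP_FS value. */
#define SKETCH_GFP_FS 0x1u

/*
 * Returns true when the OOM killer may pick a victim.
 * Global OOM: skipped when the allocation cannot enter the filesystem.
 * Memcg OOM: always allowed, so the charge path can make progress.
 */
static bool may_invoke_oom_killer(unsigned int gfp_mask, bool is_memcg_oom)
{
	if (!(gfp_mask & SKETCH_GFP_FS) && !is_memcg_oom)
		return false;
	return true;
}

/* Usage example covering the three cases discussed above. */
static void check_sketch(void)
{
	assert(may_invoke_oom_killer(SKETCH_GFP_FS, false));	/* GFP_KERNEL-like, global */
	assert(!may_invoke_oom_killer(0, false));		/* GFP_NOFS-like, global */
	assert(may_invoke_oom_killer(0, true));			/* GFP_NOFS-like, memcg */
}
```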

Until 29ef680ae7, we were able to invoke memcg OOM killer when
GFP_KERNEL reclaim failed [1].  But since 29ef680ae7, we need to
invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
premature memcg OOM reports due to this patch.

[1]

 leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 158965
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[2]

 leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
 R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 7221
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[3]

 leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
 leaker cpuset=/ mems_allowed=0
 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
 Call Trace:
  [<ffffffffaf364147>] dump_stack+0x19/0x1b
  [<ffffffffaf35eb6a>] dump_header+0x90/0x229
  [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
  [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
  [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
  [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
  [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
  [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
  [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
  [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
  [<ffffffffaf371925>] do_page_fault+0x35/0x90
  [<ffffffffaf36d768>] page_fault+0x28/0x30
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 20628
 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
 Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
 Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB

Bisected by Masoud Sharbiani.

Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
Fixes: 3da88fb3ba ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680ae7]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Masoud Sharbiani <msharbiani@apple.com>
Tested-by: Masoud Sharbiani <msharbiani@apple.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00