Commit Graph

395 Commits

Author SHA1 Message Date
Nico Pache 9481a51b97 mm: update validate_mm() to use vma iterator
Conflicts:
       mm/internal.h
       mm/mmap.c

commit b50e195ff436625b26dcc9839bc52cc7c5bf1a54
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Thu May 18 10:55:26 2023 -0400

    mm: update validate_mm() to use vma iterator

    Use the vma iterator in the validation code and combine the code to check
    the maple tree into the main validate_mm() function.

    Introduce a new function vma_iter_dump_tree() to dump the maple tree in
    hex layout.

    Replace all calls to validate_mm_mt() with validate_mm().

    [Liam.Howlett@oracle.com: update validate_mm() to use vma iterator CONFIG flag]
      Link: https://lkml.kernel.org/r/20230606183538.588190-1-Liam.Howlett@oracle.com
    Link: https://lkml.kernel.org/r/20230518145544.1722059-18-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: David Binderman <dcb314@hotmail.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Vernon Yang <vernon2gm@gmail.com>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:25 -06:00
Chris von Recklinghausen 7b0c486c6c mm: apply __must_check to vmap_pages_range_noflush()
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d905ae2b0f7eaf8fb37febfe4833ccf3f8c1c27a
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Apr 13 15:12:23 2023 +0200

    mm: apply __must_check to vmap_pages_range_noflush()

    To prevent errors when vmap_pages_range_noflush() or
    __vmap_pages_range_noflush() silently fail (see the link below for an
    example), annotate them with __must_check so that the callers do not
    unconditionally assume the mapping succeeded.

    Link: https://lkml.kernel.org/r/20230413131223.4135168-4-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:58 -04:00
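
For readers unfamiliar with the annotation above: in the kernel, __must_check expands to the compiler's warn_unused_result attribute, so ignoring the return value of an annotated function produces a compile-time warning. Below is a minimal, self-contained userspace sketch of that effect; map_pages() is a made-up stand-in, not the kernel's vmap_pages_range_noflush().

#include <stdio.h>

#define __must_check __attribute__((warn_unused_result))

/* Stand-in for a mapping routine that can fail and must be checked. */
static __must_check int map_pages(int nr_pages)
{
	return nr_pages > 0 ? 0 : -1;	/* pretend -1 means the mapping failed */
}

int main(void)
{
	/* map_pages(0); */		/* ignoring the result would trigger a -Wunused-result warning */
	if (map_pages(4) != 0) {	/* correct: the caller checks the result */
		fprintf(stderr, "mapping failed\n");
		return 1;
	}
	puts("mapping succeeded");
	return 0;
}
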
Chris von Recklinghausen a7a5bd61d2 mm: move free_area_empty() to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 62f31bd4dcedffe3c919deb76ed65bf62c3cf80b
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Sun Mar 26 19:02:15 2023 +0300

    mm: move free_area_empty() to mm/internal.h

    The free_area_empty() helper is only used inside mm/ so move it there to
    reduce noise in include/linux/mmzone.h

    Link: https://lkml.kernel.org/r/20230326160215.2674531-1-rppt@kernel.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:50 -04:00
Chris von Recklinghausen 8828368eaf mm: conditionally write-lock VMA in free_pgtables
Conflicts: mm/mmap.c - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 98e51a2239d9d419d819cd61a2e720ebf19a8b0a
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:18 2023 -0800

    mm: conditionally write-lock VMA in free_pgtables

    Normally free_pgtables needs to lock affected VMAs except for the case
    when VMAs were isolated under VMA write-lock.  munmap() does just that,
    isolating while holding appropriate locks and then downgrading mmap_lock
    and dropping per-VMA locks before freeing page tables.  Add a parameter to
    free_pgtables for such a scenario.

    Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:41 -04:00
Chris von Recklinghausen 5c33375540 mm: move vmalloc_init() declaration to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit b671491199acd3609f87754d268f2f5cb10de18d
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Tue Mar 21 19:05:12 2023 +0200

    mm: move vmalloc_init() declaration to mm/internal.h

    vmalloc_init() is called only from mm_core_init(), so there is no need to
    declare it in include/linux/vmalloc.h

    Move vmalloc_init() declaration to mm/internal.h

    Link: https://lkml.kernel.org/r/20230321170513.2401534-14-rppt@kernel.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Doug Berger <opendmb@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:34 -04:00
Chris von Recklinghausen 5b7e1e7745 mm: move mem_init_print_info() to mm_init.c
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit eb8589b4f8c107c346421881963c0ee0b8367c2c
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Tue Mar 21 19:05:10 2023 +0200

    mm: move mem_init_print_info() to mm_init.c

    mem_init_print_info() is only called from mm_core_init().

    Move it close to the caller and make it static.

    Link: https://lkml.kernel.org/r/20230321170513.2401534-12-rppt@kernel.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Doug Berger <opendmb@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:33 -04:00
Chris von Recklinghausen 1e880e2e96 mm: move init_mem_debugging_and_hardening() to mm/mm_init.c
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit f2fc4b44ec2bb94c51c7ae1af9b1177d72705992
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Tue Mar 21 19:05:08 2023 +0200

    mm: move init_mem_debugging_and_hardening() to mm/mm_init.c

    init_mem_debugging_and_hardening() is only called from mm_core_init().

    Move it close to the caller, make it static and rename it to
    mem_debugging_and_hardening_init() for consistency with surrounding
    convention.

    Link: https://lkml.kernel.org/r/20230321170513.2401534-10-rppt@kernel.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Doug Berger <opendmb@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:33 -04:00
Chris von Recklinghausen 1845b92dcf mm: move most of core MM initialization to mm/mm_init.c
Conflicts: mm/page_alloc.c, mm/mm_init.c - conflicts due to
	3f6dac0fd1b8 ("mm/page_alloc: make deferred page init free pages in MAX_ORDER blocks")
	and
	87a7ae75d738 ("mm/vmemmap/devdax: fix kernel crash when probing devdax devices")

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 9420f89db2dd611c5b436a13e13f74d65ecc3a6a
Author: Mike Rapoport (IBM) <rppt@kernel.org>
Date:   Tue Mar 21 19:05:02 2023 +0200

    mm: move most of core MM initialization to mm/mm_init.c

    The bulk of memory management initialization code is spread all over
    mm/page_alloc.c and makes navigating through page allocator functionality
    difficult.

    Move most of the functions marked __init and __meminit to mm/mm_init.c to
    make it better localized and allow some more spare room before
    mm/page_alloc.c reaches 10k lines.

    No functional changes.

    Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Doug Berger <opendmb@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:31 -04:00
Chris von Recklinghausen ea5a8c928e mm, printk: introduce new format %pGt for page_type
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4c85c0be3d7a9a7ffe48bfe0954eacc0ba9d3c75
Author: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Date:   Mon Jan 30 13:25:13 2023 +0900

    mm, printk: introduce new format %pGt for page_type

    %pGp format is used to display 'flags' field of a struct page.  However,
    some page flags (i.e.  PG_buddy, see page-flags.h for more details) are
    stored in page_type field.  To display human-readable output of page_type,
    introduce %pGt format.

    It is important to note that the meaning of the bits is different in
    page_type: if page_type is 0xffffffff, no flags are set.  Setting the
    PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f.
    Clearing a bit actually means setting a flag.  Bits in page_type are
    inverted when displaying type names.

    Only values for which page_type_has_type() returns true are considered
    page_type, to avoid confusion with mapcount values.  If it returns false,
    only the raw value is displayed, not page type names.

    Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
    Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>     [vsprintf part]
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Joe Perches <joe@perches.com>
    Cc: John Ogness <john.ogness@linutronix.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:05 -04:00
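
The inverted-bit convention described in the commit above can be surprising, so here is a small, self-contained sketch of it. The PG_buddy value comes from the commit text; page_type_has_type() and the 0xf0000000 threshold are simplified assumptions standing in for the kernel's definitions.

#include <stdio.h>
#include <stdint.h>

#define PAGE_TYPE_BASE	0xf0000000u	/* assumed threshold; values at or above it are a page_type */
#define PG_BUDDY	0x00000080u	/* value quoted in the commit message */

static int page_type_has_type(uint32_t page_type)
{
	return page_type >= PAGE_TYPE_BASE;	/* smaller values are mapcounts, not types */
}

static int page_is_buddy_type(uint32_t page_type)
{
	/* inverted convention: a cleared bit means the flag is set */
	return page_type_has_type(page_type) && !(page_type & PG_BUDDY);
}

int main(void)
{
	uint32_t none = 0xffffffffu;		/* no type flags set */
	uint32_t buddy = none & ~PG_BUDDY;	/* "setting" PG_buddy clears its bit */

	printf("page_type 0x%08x: buddy? %d\n", none, page_is_buddy_type(none));   /* prints 0 */
	printf("page_type 0x%08x: buddy? %d\n", buddy, page_is_buddy_type(buddy)); /* 0xffffff7f -> 1 */
	return 0;
}
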
Aristeu Rozanski 9819f337e2 splice: Add a func to do a splice from a buffered file without ITER_PIPE
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 07073eb01c5f630344bc1c3e56b0e0d94aedf919
Author: David Howells <dhowells@redhat.com>
Date:   Tue Feb 14 15:01:42 2023 +0000

    splice: Add a func to do a splice from a buffered file without ITER_PIPE

    Provide a function to do splice read from a buffered file, pulling the
    folios out of the pagecache directly by calling filemap_get_pages() to do
    any required reading and then pasting the returned folios into the pipe.

    A helper function is provided to do the actual folio pasting and will
    handle multipage folios by splicing as many of the relevant subpages as
    will fit into the pipe.

    The code is loosely based on filemap_read() and might belong in
    mm/filemap.c with that as it needs to use filemap_get_pages().

    Signed-off-by: David Howells <dhowells@redhat.com>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    cc: Christoph Hellwig <hch@lst.de>
    cc: Al Viro <viro@zeniv.linux.org.uk>
    cc: David Hildenbrand <david@redhat.com>
    cc: John Hubbard <jhubbard@nvidia.com>
    cc: linux-mm@kvack.org
    cc: linux-block@vger.kernel.org
    cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Steve French <stfrench@microsoft.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 4c96f5154f mm: change to return bool for isolate_lru_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:35 2023 +0800

    mm: change to return bool for isolate_lru_page()

    isolate_lru_page() can only return 0 or -EBUSY, and most users do not
    care about the specific negative error, except one user in
    add_page_for_migration().  So we can convert isolate_lru_page() to
    return a boolean value, which helps make the code clearer when
    checking its return value.

    Also convert all users' logic for checking the isolation state.

    No functional changes intended.

    Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski d1230addeb mm: change to return bool for folio_isolate_lru()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit be2d57563822b7e00b2b16d9354637c4b6d6d5cc
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:34 2023 +0800

    mm: change to return bool for folio_isolate_lru()

    Patch series "Change the return value for page isolation functions", v3.

    Currently the page isolation functions do not return a boolean to
    indicate success; instead they return a negative error when they fail
    to isolate a page.  However, the code below, used in most places, reads
    like a boolean success/failure test, which can confuse people about
    whether the isolation succeeded.

    if (folio_isolate_lru(folio))
            continue;

    Moreover, the page isolation functions only return 0 or -EBUSY, and few
    callers care about the specific negative error, so we can convert all
    page isolation functions to return a boolean value, which removes the
    confusion and makes the code clearer.

    No functional changes intended in this patch series.

    This patch (of 4):

    Currently folio_isolate_lru() does not return a boolean value to indicate
    isolation success, yet code checking its return value, such as the snippet
    below, can make people think it is a boolean success/failure test, which
    makes mistakes easy (see the fix patch[1]).

    if (folio_isolate_lru(folio))
            continue;

    Thus it is better to explicitly check the negative error value returned by
    folio_isolate_lru(), which makes the code clearer per Linus's
    suggestion[2].  Moreover, Matthew suggested converting the isolation
    functions to return a boolean[3], since most users do not care about the
    negative error value, and this also removes the confusion around checking
    the return value.

    So this patch converts folio_isolate_lru() to return a boolean value:
    'true' indicates the folio isolation succeeded and 'false' indicates a
    failure to isolate.  All users' logic for checking the isolation state is
    changed accordingly.

    No functional changes intended.

    [1] https://lore.kernel.org/all/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com/T/#u
    [2] https://lore.kernel.org/all/CAHk-=wiBrY+O-4=2mrbVyxR+hOqfdJ=Do6xoucfJ9_5az01L4Q@mail.gmail.com/
    [3] https://lore.kernel.org/all/Y+sTFqwMNAjDvxw3@casper.infradead.org/

    Link: https://lkml.kernel.org/r/cover.1676424378.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/8a4e3679ed4196168efadf7ea36c038f2f7d5aa9.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
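
As a hedged illustration of the calling-convention change described in the series above (not the kernel code itself), the sketch below contrasts the old 0/-EBUSY contract with the new boolean one, using stand-in functions.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Old contract: 0 on success, -EBUSY on failure. */
static int old_isolate(int busy)
{
	return busy ? -EBUSY : 0;
}

/* New contract: true on success, false on failure. */
static bool new_isolate(int busy)
{
	return !busy;
}

int main(void)
{
	int busy = 1;

	/* Old callers had to remember that nonzero meant failure ... */
	if (old_isolate(busy) != 0)
		puts("old API: isolation failed");

	/* ... while the boolean form reads as the success test it looks like. */
	if (!new_isolate(busy))
		puts("new API: isolation failed");
	return 0;
}
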
Aristeu Rozanski 73bcab273d mm/gup: move private gup FOLL_ flags to internal.h
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2c2241081f7dec878331fdc3a3f2361e99556bca
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Jan 24 16:34:34 2023 -0400

    mm/gup: move private gup FOLL_ flags to internal.h

    Move the flags that should not/are not used outside gup.c and related into
    mm/internal.h to discourage driver abuse.

    To make this more maintainable going forward, compact the two FOLL ranges
    with new bit numbers from 0 to 11 and 16 to 21, using shifts so the
    positions are explicit.

    Switch to an enum so the whole thing is easier to read.

    Link: https://lkml.kernel.org/r/13-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:19 -04:00
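
The following is an illustrative sketch of the layout the commit above describes: public FOLL_* flags packed from bit 0 and gup-internal flags starting at bit 16, written as shifts inside an enum so the bit positions are explicit. Names and positions here are examples, not a copy of the kernel's definitions.

#include <stdio.h>

enum {
	/* flags visible outside gup.c: bits 0..11 (example values) */
	FOLL_WRITE	= 1 << 0,
	FOLL_GET	= 1 << 1,
	FOLL_FORCE	= 1 << 2,

	/* flags private to mm/gup.c and friends: bits 16..21 (example values) */
	FOLL_PRIV_TOUCH	= 1 << 16,
	FOLL_PRIV_TRIED	= 1 << 17,
};

int main(void)
{
	unsigned int flags = FOLL_WRITE | FOLL_PRIV_TOUCH;

	/* the mask picks out the private range, so leaked internal bits are easy to spot */
	printf("flags = 0x%x, private bits = 0x%x\n", flags, flags & 0xffff0000u);
	return 0;
}
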
Aristeu Rozanski 53c4db0f94 mm/gup: move gup_must_unshare() to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 63b605128655f2e3968d99e30b293c7e7eaa2fc2
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Jan 24 16:34:33 2023 -0400

    mm/gup: move gup_must_unshare() to mm/internal.h

    This function is only used in gup.c and closely related.  It touches
    FOLL_PIN so it must be moved before the next patch.

    Link: https://lkml.kernel.org/r/12-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:19 -04:00
Aristeu Rozanski 71bad82afd mm/gup: move try_grab_page() to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7ce154fe6917e7db94d63bc4d6c73b678ad1c581
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Jan 24 16:34:25 2023 -0400

    mm/gup: move try_grab_page() to mm/internal.h

    This is part of the internal function of gup.c and is only non-static so
    that the parts of gup.c in the huge_memory.c and hugetlb.c can call it.

    Put it in internal.h beside the similarly purposed try_grab_folio()

    Link: https://lkml.kernel.org/r/4-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:19 -04:00
Aristeu Rozanski 7095a3e0ec mm/mmap: refactor locking out of __vma_adjust()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 440703e082b9c79c3d4fffcca8c2dffd621e6dc5
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:41 2023 -0500

    mm/mmap: refactor locking out of __vma_adjust()

    Move the locking into vma_prepare() and vma_complete() for use elsewhere

    Link: https://lkml.kernel.org/r/20230120162650.984577-41-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:16 -04:00
Aristeu Rozanski fca6a0a285 mm: expand vma iterator interface
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit b62b633e048bbddef90b2e55d2e33823187b425f
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:08 2023 -0500

    mm: expand vma iterator interface

    Add wrappers for the maple tree to the vma iterator.  This will provide
    type safety at compile time.

    Link: https://lkml.kernel.org/r/20230120162650.984577-8-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:13 -04:00
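
To illustrate the wrapper idea above (typed inline helpers over a generic tree iterator, giving compile-time type safety), here is a userspace model. The structures and functions are toys standing in for the kernel's vma_iter_*/mas_* API, not the real interface.

#include <stdio.h>
#include <stddef.h>

struct vm_area_struct { unsigned long vm_start, vm_end; };

/* Toy "maple tree state": just a cursor over a NULL-terminated array. */
struct ma_state { void **entries; size_t next; };

static void *mas_find(struct ma_state *mas)
{
	return mas->entries[mas->next] ? mas->entries[mas->next++] : NULL;
}

/* The typed wrapper layer: callers only ever see VMA pointers. */
struct vma_iterator { struct ma_state mas; };

static struct vm_area_struct *vma_next(struct vma_iterator *vmi)
{
	return mas_find(&vmi->mas);	/* the generic void * becomes a typed pointer in one place */
}

int main(void)
{
	struct vm_area_struct a = { 0x1000, 0x2000 }, b = { 0x3000, 0x5000 };
	void *entries[] = { &a, &b, NULL };
	struct vma_iterator vmi = { { entries, 0 } };
	struct vm_area_struct *vma;

	while ((vma = vma_next(&vmi)) != NULL)
		printf("vma %lx-%lx\n", vma->vm_start, vma->vm_end);
	return 0;
}
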
Aristeu Rozanski 175b35ee46 mm: remove munlock_vma_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 672aa27d0bd241759376e62b78abb8aae1792479
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:26 2023 +0000

    mm: remove munlock_vma_page()

    All callers now have a folio and can call munlock_vma_folio().  Update the
    documentation to refer to munlock_vma_folio().

    Link: https://lkml.kernel.org/r/20230116192827.2146732-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 96cb17f8b1 mm: remove mlock_vma_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 7efecffb8e7968c4a6c53177b0053ca4765fe233
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:25 2023 +0000

    mm: remove mlock_vma_page()

    All callers now have a folio and can call mlock_vma_folio().  Update the
    documentation to refer to mlock_vma_folio().

    Link: https://lkml.kernel.org/r/20230116192827.2146732-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 7df948ec95 mm: remove page_evictable()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 90c9d13a47d45f2f16530c4d62af2fa4d74dfd16
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:24 2023 +0000

    mm: remove page_evictable()

    Patch series "Remove leftover mlock/munlock page wrappers".

    We no longer need the various mlock page functions as all callers have
    folios.

    This patch (of 4):

    This function now has no users.  Also update the unevictable-lru
    documentation to discuss folios instead of pages (mostly).

    [akpm@linux-foundation.org: fix Documentation/mm/unevictable-lru.rst underlining]
      Link: https://lkml.kernel.org/r/20230117145106.585b277b@canb.auug.org.au
    Link: https://lkml.kernel.org/r/20230116192827.2146732-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20230116192827.2146732-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 647a91cfab mm: discard __GFP_ATOMIC
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2973d8229b78d3f148e0c45916a1e8b237dc6167
Author: NeilBrown <neilb@suse.de>
Date:   Fri Jan 13 11:12:17 2023 +0000

    mm: discard __GFP_ATOMIC

    __GFP_ATOMIC serves little purpose.  Its main effect is to set
    ALLOC_HARDER which adds a few little boosts to increase the chance of an
    allocation succeeding, one of which is to lower the water-mark at which it
    will succeed.

    It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
    adjusts this watermark.  It is probable that other users of __GFP_HIGH
    should benefit from the other little bonuses that __GFP_ATOMIC gets.

    __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
    There is little point to this.  We already get a might_sleep() warning if
    __GFP_DIRECT_RECLAIM is set.

    __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
    probable that testing ALLOC_HARDER is a better fit here.

    __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
    sleep.  This should test __GFP_DIRECT_RECLAIM instead.

    This patch:
     - removes __GFP_ATOMIC
     - allows __GFP_HIGH allocations to ignore watermark boosting as well
       as GFP_ATOMIC requests.
     - makes other adjustments as suggested by the above.

    The net result is no change to GFP_ATOMIC allocations.  Other
    allocations that use __GFP_HIGH will benefit from a few different extra
    privileges.  This affects:
      xen, dm, md, ntfs3
      the vermillion frame buffer
      hibernation
      ksm
      swap
    all of which likely produce more benefit than cost if these selected
    allocations are more likely to succeed quickly.

    [mgorman: Minor adjustments to rework on top of a series]
    Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
    Link: https://lkml.kernel.org/r/20230113111217.14134-7-mgorman@techsingularity.net
    Signed-off-by: NeilBrown <neilb@suse.de>
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thierry Reding <thierry.reding@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski 1be39681b6 mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 1ebbb21811b76c3b932959787f37985af36f62fa
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jan 13 11:12:16 2023 +0000

    mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves

    GFP_ATOMIC allocations get flagged ALLOC_HARDER, which is a vague
    description.  In preparation for the removal of __GFP_ATOMIC, redefine
    __GFP_ATOMIC to simply mean non-blocking and rename ALLOC_HARDER to
    ALLOC_NON_BLOCK accordingly.
    reserves but non-blocking is granted more access.  For example, GFP_NOWAIT
    is non-blocking but has no special access to reserves.  A __GFP_NOFAIL
    blocking allocation is granted access similar to __GFP_HIGH if the only
    alternative is an OOM kill.

    Link: https://lkml.kernel.org/r/20230113111217.14134-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Thierry Reding <thierry.reding@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski f4cfca77a5 mm/page_alloc: explicitly define what alloc flags deplete min reserves
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit ab3508854353793cd35e348fde89a5c09b2fd8b5
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jan 13 11:12:15 2023 +0000

    mm/page_alloc: explicitly define what alloc flags deplete min reserves

    As there are more ALLOC_ flags that affect reserves, define what flags
    affect reserves and clarify the effect of each flag.

    Link: https://lkml.kernel.org/r/20230113111217.14134-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Thierry Reding <thierry.reding@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski 4911b063db mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit eb2e2b425c6984ca8034448a3f2c680622bd3d4d
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jan 13 11:12:14 2023 +0000

    mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags

    A high-order ALLOC_HARDER allocation is assumed to be atomic.  While that
    is accurate, it changes later in the series.  In preparation, explicitly
    record high-order atomic allocations in gfp_to_alloc_flags().

    Link: https://lkml.kernel.org/r/20230113111217.14134-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Thierry Reding <thierry.reding@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
Aristeu Rozanski 6094729a56 mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 524c48072e5673f4511f1ad81493e2485863fd65
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Jan 13 11:12:12 2023 +0000

    mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE

    Patch series "Discard __GFP_ATOMIC", v3.

    Neil's patch has been residing in mm-unstable as commit 2fafb4fe8f7a ("mm:
    discard __GFP_ATOMIC") for a long time and was recently brought up again.
    Most recently, I was worried that __GFP_HIGH allocations could use
    high-order atomic reserves, which is unintentional, but there was no
    response, so let's revisit -- this series reworks how min reserves are
    used, protects high-order reserves and then finishes with Neil's patch
    with very minor modifications so it fits on top.

    There was a review discussion on renaming __GFP_DIRECT_RECLAIM to
    __GFP_ALLOW_BLOCKING but I didn't think it was that big an issue and is
    orthogonal to the removal of __GFP_ATOMIC.

    There were some concerns about how the gfp flags affect the min reserves
    but it never reached a solid conclusion so I made my own attempt.

    The series tries to iron out some of the details on how reserves are used.
    ALLOC_HIGH becomes ALLOC_MIN_RESERVE and ALLOC_HARDER becomes
    ALLOC_NON_BLOCK and documents how the reserves are affected.  For example,
    ALLOC_NON_BLOCK (no direct reclaim) on its own allows 25% of the min
    reserve.  ALLOC_MIN_RESERVE (__GFP_HIGH) allows 50% and both combined
    allows deeper access again.  ALLOC_OOM allows access to 75%.

    High-order atomic allocations are explicitly handled with the caveat that
    no __GFP_ATOMIC flag means that any high-order allocation that specifies
    __GFP_HIGH and cannot enter direct reclaim will be treated as if it was
    GFP_ATOMIC.

    This patch (of 6):

    __GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint what it
    means.  As ALLOC_HIGH is internal to the allocator, rename it to
    ALLOC_MIN_RESERVE to document that the min reserves can be depleted.

    Link: https://lkml.kernel.org/r/20230113111217.14134-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20230113111217.14134-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Thierry Reding <thierry.reding@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:09 -04:00
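
The 25%/50% figures in the cover letter above correspond to how each ALLOC_ flag lowers the effective min watermark. Below is a worked-arithmetic sketch loosely modelled on that check; the flag values and the exact formula are illustrative assumptions, not the kernel's code.

#include <stdio.h>

#define ALLOC_MIN_RESERVE	0x1	/* __GFP_HIGH */
#define ALLOC_NON_BLOCK		0x2	/* allocation cannot enter direct reclaim */

static long effective_min(long min_wmark, unsigned int alloc_flags)
{
	long min = min_wmark;

	if (alloc_flags & ALLOC_MIN_RESERVE)
		min -= min / 2;		/* may dip into 50% of the min reserve */
	if (alloc_flags & ALLOC_NON_BLOCK)
		min -= min / 4;		/* a further 25% of what remains */
	return min;
}

int main(void)
{
	long min_wmark = 1000;	/* pages, arbitrary example value */

	printf("plain:        %ld\n", effective_min(min_wmark, 0));
	printf("MIN_RESERVE:  %ld\n", effective_min(min_wmark, ALLOC_MIN_RESERVE));
	printf("NON_BLOCK:    %ld\n", effective_min(min_wmark, ALLOC_NON_BLOCK));
	printf("both:         %ld\n",
	       effective_min(min_wmark, ALLOC_MIN_RESERVE | ALLOC_NON_BLOCK));
	return 0;
}
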
Aristeu Rozanski 067fb10657 mm: mlock: update the interface to use folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 96f97c438f61ddba94117dcd1a1eb0aaafa22309
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Thu Jan 12 12:39:31 2023 +0000

    mm: mlock: update the interface to use folios

    Update the mlock interface to accept folios rather than pages, bringing
    the interface in line with the internal implementation.

    munlock_vma_page() still requires a page_folio() conversion, however this
    is consistent with the existent mlock_vma_page() implementation and a
    product of rmap still dealing in pages rather than folios.

    Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:08 -04:00
Audra Mitchell fb208bc6ad filemap: find_get_entries() now updates start offset
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 9fb6beea79c6e7c959adf4fb7b94cf9a6028b941
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Oct 17 09:18:00 2022 -0700

    filemap: find_get_entries() now updates start offset

    Initially, find_get_entries() was being passed in the start offset as a
    value.  That left the calculation of the offset to the callers.  This led
    to complexity in the callers trying to keep track of the index.

    Now find_get_entries() takes in a pointer to the start offset and updates
    the value to be directly after the last entry found.  If no entry is
    found, the offset is not changed.  This gets rid of multiple hacky
    calculations that kept track of the start offset.

    Link: https://lkml.kernel.org/r/20221017161800.2003-3-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
Audra Mitchell 765c2fd97b filemap: find_lock_entries() now updates start offset
JIRA: https://issues.redhat.com/browse/RHEL-27739
Conflicts:
    Context conflict due to out of order backport:
    9efa394ef3 ("tmpfs: fix data loss from failed fallocate")

This patch is a backport of the following upstream commit:
commit 3392ca121872dd8c33015c7703d4981c78819be3
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Mon Oct 17 09:17:59 2022 -0700

    filemap: find_lock_entries() now updates start offset

    Patch series "Rework find_get_entries() and find_lock_entries()", v3.

    Originally the callers of find_get_entries() and find_lock_entries() were
    keeping track of the start index themselves as they traverse the search
    range.

    This resulted in hacky code such as in shmem_undo_range():

                            index = folio->index + folio_nr_pages(folio) - 1;

    where the - 1 is only present to stay in the right spot after incrementing
    index later.  This sort of calculation was also being done on every folio
    despite not even using index later within that function.

    These patches change find_get_entries() and find_lock_entries() to
    calculate the new index instead of leaving it to the callers so we can
    avoid all these complications.

    This patch (of 2):

    Initially, find_lock_entries() was being passed in the start offset as a
    value.  That left the calculation of the offset to the callers.  This led
    to complexity in the callers trying to keep track of the index.

    Now find_lock_entries() takes in a pointer to the start offset and updates
    the value to be directly after the last entry found.  If no entry is
    found, the offset is not changed.  This gets rid of multiple hacky
    calculations that kept track of the start offset.

    Link: https://lkml.kernel.org/r/20221017161800.2003-1-vishal.moola@gmail.com
    Link: https://lkml.kernel.org/r/20221017161800.2003-2-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:50 -04:00
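
As a hedged sketch of the interface change described above (toy types and names, not the pagecache API): the start index is passed by pointer and advanced past the last entry returned, so the caller's loop no longer needs the hacky index arithmetic quoted in the commit message.

#include <stdio.h>
#include <stddef.h>

/* Pretend these are cached entries at sparse indices. */
static const unsigned long entries[] = { 3, 4, 9, 17 };
#define NR_ENTRIES (sizeof(entries) / sizeof(entries[0]))

/* Returns how many entries fall in [*start, end] (capped at batch_size) and
 * advances *start to one past the last entry returned. */
static size_t find_entries(unsigned long *start, unsigned long end,
			   unsigned long *batch, size_t batch_size)
{
	size_t found = 0;

	for (size_t i = 0; i < NR_ENTRIES && found < batch_size; i++) {
		if (entries[i] >= *start && entries[i] <= end)
			batch[found++] = entries[i];
	}
	if (found)
		*start = batch[found - 1] + 1;	/* caller simply resumes here */
	return found;
}

int main(void)
{
	unsigned long start = 0, batch[2];
	size_t n;

	while ((n = find_entries(&start, 20, batch, 2)) != 0)
		printf("got %zu entries, next start = %lu\n", n, start);
	return 0;
}
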
Chris von Recklinghausen 26437a89ef mm: remove the vma linked list
Conflicts:
	include/linux/mm.h - We already have
		21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
		so keep declaration for zap_page_range_single
	kernel/fork.c - We already have
		f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
		so keep declaration of i
	mm/mmap.c - We already have
		a1e8cb93bf ("mm: drop oom code from exit_mmap")
		and
		db3644c677 ("mm: delete unused MMF_OOM_VICTIM flag")
		so keep setting MMF_OOM_SKIP in mm->flags

JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 763ecb035029f500d7e6dc99acd1ad299b7726a1
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:49:06 2022 +0000

    mm: remove the vma linked list

    Replace any vm_next use with vma_find().

    Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
    maple tree.

    Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
    the same time, alter the loop to be more compact.

    Now that free_pgtables() and unmap_vmas() take a maple tree as an
    argument, rearrange do_mas_align_munmap() to use the new tree to hold the
    vmas to remove.

    Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
    used to update the linked list.

    Drop linked list update from __insert_vm_struct().

    Rework validation of tree as it was depending on the linked list.

    [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
      Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:57 -04:00
Chris von Recklinghausen d384489054 mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit eec20426d48bd7b63c69969a793943ed1a99b731
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jan 11 14:28:48 2023 +0000

    mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()

    Calling this 'mapcount' is confusing since mapcount is usually the number
    of times something is mapped; instead this is the number of mapped pages.
    It's also better to enforce that this is a folio rather than a head page.

    Move folio_nr_pages_mapped() into mm/internal.h since this is not
    something we want device drivers or filesystems poking at.  Get rid of
    folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly.

    Link: https://lkml.kernel.org/r/20230111142915.1001531-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:52 -04:00
Chris von Recklinghausen 8e0969ab45 mm: move folio_set_compound_order() to mm/internal.h
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04a42e72d77a93a166b79c34b7bc862f55a53967
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Dec 14 22:17:57 2022 -0800

    mm: move folio_set_compound_order() to mm/internal.h

    folio_set_compound_order() is moved to an mm-internal location so external
    folio users cannot misuse this function.  Change the name of the function
    to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON.  Also,
    handle the case if a non-large folio is passed and add clarifying comments
    to the function.

    Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
    Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
    Fixes: 9fd330582b2f ("mm: add folio dtor and order setter functions")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Muchun Song <songmuchun@bytedance.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:43 -04:00
Chris von Recklinghausen 891dbb5790 mm/hwpoison: introduce per-memory_block hwpoison counter
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5033091de814ab4b5623faed2755f3064e19e2d2
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:12 2022 +0900

    mm/hwpoison: introduce per-memory_block hwpoison counter

    Currently PageHWPoison flag does not behave well when experiencing memory
    hotremove/hotplug.  Any data field in struct page is unreliable when the
    associated memory is offlined, and the current mechanism can't tell
    whether a memory block is onlined because a new memory device is
    installed or because previous failed offline operations are undone.
    Especially if there's hwpoisoned memory, it's unclear what the best
    option is.

    So introduce a new mechanism to make struct memory_block remember that a
    memory block has hwpoisoned memory inside it.  And make any online event
    fail if the onlining memory block contains hwpoison.  struct memory_block
    is freed and reallocated over ACPI-based hotremove/hotplug, but not over
    sysfs-based hotremove/hotplug.  So the new counter can distinguish these
    cases.

    Link: https://lkml.kernel.org/r/20221024062012.1520887-5-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:19 -04:00
Chris von Recklinghausen 3f65ff56dd mm/page_alloc: make boot_nodestats static
Conflicts: mm/internal.h - We already have
	27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page refcount")
	so we already have the declaration for free_zone_device_page
	We already have
	b05a79d4377f ("mm/gup: migrate device coherent pages when pinning instead of failing")
	so we have the declaration of migrate_device_coherent_page
	We already have
	76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()")
	which makes patch think we don't have a declaration for
	mirrored_kernelcore

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6dc2c87a5a8878b657d08e34ca0e757d31273e12
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:52 2022 +0800

    mm/page_alloc: make boot_nodestats static

    It's only used in mm/page_alloc.c now.  Make it static.

    Link: https://lkml.kernel.org/r/20220916072257.9639-12-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:46 -04:00
Chris von Recklinghausen 271a98f55e mm: kmsan: maintain KMSAN metadata for page operations
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b073d7f8aee4ebf05d10e3380df377b73120cf16
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:48 2022 +0200

    mm: kmsan: maintain KMSAN metadata for page operations

    Insert KMSAN hooks that make the necessary bookkeeping changes:
     - poison page shadow and origins in alloc_pages()/free_page();
     - clear page shadow and origins in clear_page(), copy_user_highpage();
     - copy page metadata in copy_highpage(), wp_page_copy();
     - handle vmap()/vunmap()/iounmap();

    Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:35 -04:00
Chris von Recklinghausen 23d981c266 mm: multi-gen LRU: exploit locality in rmap
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 018ee47f14893d500131dfca2ff9f3ff8ebd4ed2
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:04 2022 -0600

    mm: multi-gen LRU: exploit locality in rmap

    Searching the rmap for PTEs mapping each page on an LRU list (to test and
    clear the accessed bit) can be expensive because pages from different VMAs
    (PA space) are not cache friendly to the rmap (VA space).  For workloads
    mostly using mapped pages, searching the rmap can incur the highest CPU
    cost in the reclaim path.

    This patch exploits spatial locality to reduce the trips into the rmap.
    When shrink_page_list() walks the rmap and finds a young PTE, a new
    function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
    PTEs.  On finding another young PTE, it clears the accessed bit and
    updates the gen counter of the page mapped by this PTE to
    (max_seq%MAX_NR_GENS)+1.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[3, 5]%
                    Ops/sec      KB/sec
          patch1-6: 1106168.46   43025.04
          patch1-7: 1147696.57   44640.29

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-6
          39.03%  lzo1x_1_do_compress (real work)
          18.47%  page_vma_mapped_walk (overhead)
           6.74%  _raw_spin_unlock_irq
           3.97%  do_raw_spin_lock
           2.49%  ptep_clear_flush
           2.48%  anon_vma_interval_tree_iter_first
           1.92%  folio_referenced_one
           1.88%  __zram_bvec_write
           1.48%  memmove
           1.31%  vma_interval_tree_iter_next

        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

      Configurations:
        no change

    Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Barry Song <baohua@kernel.org>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:45 -04:00
Chris von Recklinghausen b2cb33b2e5 mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 50722804423680488b8063f6cc9a451333bf6f9b
Author: Zach O'Keefe <zokeefe@google.com>
Date:   Wed Jul 6 16:59:26 2022 -0700

    mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage

    When scanning an anon pmd to see if it's eligible for collapse, return
    SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
    SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
    file-collapse path, since the latter might identify pte-mapped compound
    pages.  This is required by MADV_COLLAPSE which necessarily needs to know
    what hugepage-aligned/sized regions are already pmd-mapped.

    In order to determine if a pmd already maps a hugepage, refactor
    mm_find_pmd():

    Return mm_find_pmd() to its pre-commit f72e7dcdd2 ("mm: let mm_find_pmd
    fix buggy race with THP fault") behavior.  ksm was the only caller that
    explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
    there (pmd_present() and pmd_trans_huge() checks).

    Undo revert change in commit f72e7dcdd2 ("mm: let mm_find_pmd fix buggy
    race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
    and use mm_find_pmd() instead.

    Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
    Signed-off-by: Zach O'Keefe <zokeefe@google.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Alex Shi <alex.shi@linux.alibaba.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Chris Kennelly <ckennelly@google.com>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Pavel Begunkov <asml.silence@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:22 -04:00
Dave Wysochanski 01efd07365 mm, netfs, fscache: stop read optimisation when folio removed from pagecache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2209756

Fscache has an optimisation by which reads from the cache are skipped
until we know that (a) there's data there to be read and (b) that data
isn't entirely covered by pages resident in the netfs pagecache.  This is
done with two flags manipulated by fscache_note_page_release():

	if (...
	    test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
	    test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
		clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);

where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to
indicate that netfslib should download from the server or clear the page
instead.

The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to
the cache, so sometimes we miss clearing the optimisation.

Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.

Note that this would require folio_test_private() and page_has_private() to
become more complicated.  To avoid that, in the places[*] where these are
used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed and those functions are just
called unconditionally, with the test performed inside them.

[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser.  I've added a function, folio_needs_release()
to wrap all the checks for that.

AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.

Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.
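
A rough sketch of the resulting check (close to, but not presented as, the
exact upstream helper; mapping_release_always() reads the new
AS_RELEASE_ALWAYS bit):

        static bool folio_needs_release(struct folio *folio)
        {
                struct address_space *mapping = folio_mapping(folio);

                /* Release if private data is attached, or if the mapping
                 * has opted in via AS_RELEASE_ALWAYS. */
                return folio_has_private(folio) ||
                       (mapping && mapping_release_always(mapping));
        }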

[dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()]
  Link: 902c990e31
Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com
Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page")
Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: Daire Byrne <daire.byrne@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit b4fa966f03b7401ceacd4ffd7227197afb2b8376)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
2023-09-13 18:20:12 -04:00
Dave Wysochanski 924daddc03 mm: merge folio_has_private()/filemap_release_folio() call pairs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2209756

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache.  The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix.  Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
add a check for this.

This patch (of 2):

Make filemap_release_folio() check folio_has_private().  Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vmscan.c where this can't be done so
easily.  In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory.  This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.
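
Sketching the call-site simplification (illustrative only, not a literal
hunk from the patch; it assumes filemap_release_folio() reports success
when there is nothing to release):

        /* before: the caller tests for private data itself */
        if (folio_has_private(folio) &&
            !filemap_release_folio(folio, GFP_KERNEL))
                goto keep;

        /* after: filemap_release_folio() performs the test internally */
        if (!filemap_release_folio(folio, GFP_KERNEL))
                goto keep;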

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
2023-09-13 18:19:41 -04:00
Nico Pache d480a0d335 mm, compaction: rename compact_control->rescan to finish_pageblock
commit 48731c8436c68ce5597dfe72f3836bd6808bedde
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Wed Jan 25 13:44:31 2023 +0000

    mm, compaction: rename compact_control->rescan to finish_pageblock

    Patch series "Fix excessive CPU usage during compaction".

    Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
    fixed a problem where pageblocks found by fast_find_migrateblock() were
    ignored. Unfortunately there were numerous bug reports complaining about high
    CPU usage and massive stalls once 6.1 was released. Due to the severity,
    the patch was reverted by Vlastimil as a short-term fix[1] to -stable.

    The underlying problem for each of the bugs is suspected to be the
    repeated scanning of the same pageblocks.  This series should guarantee
    forward progress even with commit 7efc3b726103.  More information is in
    the changelog for patch 4.

    [1] http://lore.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz

    This patch (of 4):

    The rescan field was not well named albeit accurate at the time.  Rename
    the field to finish_pageblock to indicate that the remainder of the
    pageblock should be scanned regardless of COMPACT_CLUSTER_MAX.  The intent
    is that pageblocks with transient failures get marked for skipping to
    avoid revisiting the same pageblock.

    Link: https://lkml.kernel.org/r/20230125134434.18017-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Maxim Levitsky <mlevitsk@redhat.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Pedro Falcato <pedro.falcato@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
Nico Pache d9678e29b9 mm: use nth_page instead of mem_map_offset mem_map_next
commit 14455eabd8404a503dc8e80cd8ce185e96a94b22
Author: Cheng Li <lic121@chinatelecom.cn>
Date:   Fri Sep 9 07:31:09 2022 +0000

    mm: use nth_page instead of mem_map_offset mem_map_next

    To handle the discontiguous case, mem_map_next() has a parameter named
    `offset`.  As a function caller, one would be confused why "get next
    entry" needs a parameter named "offset".  The other drawback of
    mem_map_next() is that the callers must take care of the map between
    parameter "iter" and "offset", otherwise we may get an hole or duplication
    during iteration.  So we use nth_page instead of mem_map_next.

    And replace mem_map_offset with nth_page() per Matthew's comments.
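
    A minimal sketch of the resulting iteration pattern (the loop itself is
    illustrative; nth_page() is the existing helper):

        /*
         * Walk every subpage of a potentially discontiguous compound page.
         * nth_page() copes with struct pages that are not virtually
         * contiguous, which is what mem_map_next()'s extra "offset"
         * parameter existed for.
         */
        for (i = 0; i < nr_pages; i++) {
                struct page *subpage = nth_page(head, i);

                /* ... operate on subpage ... */
        }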

    Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
    Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
    Fixes: 69d177c2fc ("hugetlbfs: handle pages higher order than MAX_ORDER")
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:00 -06:00
Chris von Recklinghausen 6573caddd5 mm/gup: migrate device coherent pages when pinning instead of failing
Bugzilla: https://bugzilla.redhat.com/2160210

commit b05a79d4377f6dcc30683008ffd1c531ea965393
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Jul 15 10:05:13 2022 -0500

    mm/gup: migrate device coherent pages when pinning instead of failing

    Currently any attempts to pin a device coherent page will fail.  This is
    because device coherent pages need to be managed by a device driver, and
    pinning them would prevent a driver from migrating them off the device.

    However this is no reason to fail pinning of these pages.  These are
    coherent and accessible from the CPU, so they can be migrated just as
    ZONE_MOVABLE pages are when pinned.  So instead of failing all attempts
    to pin them, first try migrating them out of ZONE_DEVICE.

    [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c]
    Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 3fe654e9dd memblock: Disable mirror feature if kernelcore is not specified
Bugzilla: https://bugzilla.redhat.com/2160210

commit 902c2d91582c7ff0cb5f57ffb3766656f9b910c6
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Tue Jun 14 17:21:56 2022 +0800

    memblock: Disable mirror feature if kernelcore is not specified

    If the system has some mirrored memory and the mirrored feature is not
    specified in the boot parameters, the basic mirrored feature will be
    enabled and this will lead to the following situations:

    - memblock memory allocation prefers mirrored region. This may have some
      unexpected influence on numa affinity.

    - contiguous memory will be split into several parts if parts of it
      are mirrored memory, via memblock_mark_mirror().

    To fix this, variable mirrored_kernelcore will be checked in
    memblock_mark_mirror(). Mark mirrored memory with flag MEMBLOCK_MIRROR iff
    kernelcore=mirror is added in the kernel parameters.
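
    A sketch of the guard this describes (mirrored_kernelcore and
    MEMBLOCK_MIRROR exist upstream; treat the helper call as an assumption
    about the surrounding code rather than the exact hunk):

        int memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
        {
                /* only honour mirrored ranges when kernelcore=mirror was given */
                if (!mirrored_kernelcore)
                        return 0;

                return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
        }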

    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Link: https://lore.kernel.org/r/20220614092156.1972846-6-mawupeng1@huawei.com
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
Chris von Recklinghausen f0971a2aaf mm: split free page with properly free memory accounting and without race
Bugzilla: https://bugzilla.redhat.com/2160210

commit 86d28b0709279ccc636ef9ba267b7f3bcef79a4b
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 26 19:15:31 2022 -0400

    mm: split free page with properly free memory accounting and without race

    In isolate_single_pageblock(), free pages are checked without holding zone
    lock, but they can go away in split_free_page() when zone lock is held.
    Check the free page and its order again in split_free_page() when zone lock
    is held. Recheck the page if the free page is gone under zone lock.

    In addition, in split_free_page(), the free page was deleted from the page
    list without changing free page accounting. Add the missing free page
    accounting code.

    Fix the type of order parameter in split_free_page().
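
    A sketch of the recheck-under-lock this adds (illustrative; the real
    function also redoes the freelist accounting):

        spin_lock_irqsave(&zone->lock, flags);

        /*
         * The page may have been allocated while zone->lock was not held,
         * so verify it is still free and still has the expected order.
         */
        if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
                ret = false;
                goto out;
        }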

    Link: https://lore.kernel.org/lkml/20220525103621.987185e2ca0079f7b97b856d@linux-foundation.org/
    Link: https://lkml.kernel.org/r/20220526231531.2404977-2-zi.yan@sent.com
    Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: Doug Berger <opendmb@gmail.com>
      Link: https://lore.kernel.org/linux-mm/c3932a6f-77fe-29f7-0c29-fe6b1c67ab7b@gmail.com/
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Qian Cai <quic_qiancai@quicinc.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Marek Szyprowski <m.szyprowski@samsung.com>
    Cc: Michael Walle <michael@walle.cc>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:12 -04:00
Chris von Recklinghausen a4674660fa mm: fix missing handler for __GFP_NOWARN
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f913fc5f9745613088d3c569778c9813ab9c129
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Thu May 19 14:08:55 2022 -0700

    mm: fix missing handler for __GFP_NOWARN

    We expect no warnings to be issued when we specify __GFP_NOWARN, but
    currently in paths like alloc_pages() and kmalloc(), there are still some
    warnings printed.  Fix it.

    But for some warnings that report usage problems, we don't deal with them.
    If such warnings are printed, then we should fix the usage problems.
    Such as the following case:

            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
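
    In other words, the allocation paths are made to honour the flag along
    these lines (illustrative sketch, not a specific hunk):

        /* allocation-failure style warnings respect __GFP_NOWARN ... */
        if (!(gfp_mask & __GFP_NOWARN))
                pr_warn("allocation failure: order %u\n", order);

        /* ... while usage-bug warnings such as the WARN_ON_ONCE() above
         * stay unconditional, because the caller should be fixed instead. */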

    [zhengqi.arch@bytedance.com: v2]
     Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Akinobu Mita <akinobu.mita@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 65b3725ab2 mm/memory-failure.c: move clear_hwpoisoned_pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 60f272f6b09a8f14156df88cccd21447ab394452
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: move clear_hwpoisoned_pages

    Patch series "memory-failure: fix hwpoison_filter", v2.

    As is well known, the memory failure mechanism handles memory corruption
    events and tries to send SIGBUS to the user process which uses this
    corrupted page.

    For the virtualization case, QEMU catches SIGBUS and tries to inject MCE
    into the guest, and the guest handles memory failure again.  Thus the
    guest gets the minimal effect from hardware memory corruption.

    The further step I'm working on:

    1, try to modify code to decrease poisoned pages in a single place
       (mm/memory-failure.c: simplify num_poisoned_pages_dec in this series).

    2, try to use page_handle_poison() to handle SetPageHWPoison() and
       num_poisoned_pages_inc() together.  It would be best to call
       num_poisoned_pages_inc() in a single place too.

    3, introduce memory failure notifier list in memory-failure.c: notify
       the corrupted PFN to someone who registers this list.  If I can
       complete the [1] and [2] parts, [3] will be quite easy (just call the notifier
       list after increasing poisoned page).

    4, introduce memory recover VQ for memory balloon device, and registers
       memory failure notifier list.  During the guest kernel handles memory
       failure, balloon device gets notified by memory failure notifier list,
       and tells the host to recover the corrupted PFN(GPA) by the new VQ.

    5, host side remaps the corrupted page(HVA), and tells the guest side
       to unpoison the PFN(GPA).  Then the guest fixes the corrupted page(GPA)
       dynamically.

    This patch (of 5):

    clear_hwpoisoned_pages() clears the HWPoison flag and decreases the number
    of poisoned pages; this actually works as part of memory failure handling.

    Move this function from sparse.c to memory-failure.c; finally there is no
    CONFIG_MEMORY_FAILURE in sparse.c.

    Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
    Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 69bfe65709 mm: make alloc_contig_range work at pageblock granularity
Bugzilla: https://bugzilla.redhat.com/2160210

commit b2c9e2fbba32539626522b6aed30d1dde7b7e971
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu May 12 20:22:58 2022 -0700

    mm: make alloc_contig_range work at pageblock granularity

    alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
    merging pageblocks with different migratetypes.  It might unnecessarily
    convert extra pageblocks at the beginning and at the end of the range.
    Change alloc_contig_range() to work at pageblock granularity.

    Special handling is needed for free pages and in-use pages across the
    boundaries of the range specified by alloc_contig_range().  Because these
    partially isolated pages cause free page accounting issues, the free
    pages will be split and freed into separate migratetype lists; the in-use
    pages will be migrated and then the freed pages will be handled in the
    aforementioned way.

    [ziy@nvidia.com: fix deadlock/crash]
      Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
    Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Eric Ren <renzhengeek@gmail.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen b389b02018 mm: compaction: clean up comment for sched contention
Bugzilla: https://bugzilla.redhat.com/2160210

commit d56c15845a5493dbe9e8b77f63418bea117d1221
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:17 2022 -0700

    mm: compaction: clean up comment for sched contention

    Since commit cf66f0700c ("mm, compaction: do not consider a need to
    reschedule as contention"), async compaction won't abort when scheduling
    is needed.  Correct the relevant comment accordingly.

    Link: https://lkml.kernel.org/r/20220418141253.24298-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Charan Teja Kalla <charante@codeaurora.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Pintu Kumar <pintu@codeaurora.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:56 -04:00
Chris von Recklinghausen 6cf53831b0 mm: wrap __find_buddy_pfn() with a necessary buddy page validation
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8170ac4700d26f65a9a4ebc8ae488539158dc5f7
Author: Zi Yan <ziy@nvidia.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm: wrap __find_buddy_pfn() with a necessary buddy page validation

    Whenever the buddy of a page is found from __find_buddy_pfn(),
    page_is_buddy() should be used to check its validity.  Add a helper
    function find_buddy_page_pfn() to find the buddy page and do the check
    together.
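
    A hedged sketch of the combined helper (close to the upstream
    find_buddy_page_pfn(), but shown here only as an illustration):

        static inline struct page *find_buddy_page_pfn(struct page *page,
                        unsigned long pfn, unsigned int order,
                        unsigned long *buddy_pfn)
        {
                unsigned long __buddy_pfn = __find_buddy_pfn(pfn, order);
                struct page *buddy = page + (__buddy_pfn - pfn);

                if (buddy_pfn)
                        *buddy_pfn = __buddy_pfn;

                /* only hand back a buddy that passes validation */
                if (page_is_buddy(page, buddy, order))
                        return buddy;
                return NULL;
        }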

    [ziy@nvidia.com: updates per David]
    Link: https://lkml.kernel.org/r/20220401230804.1658207-2-zi.yan@sent.com
    Link: https://lore.kernel.org/linux-mm/CAHk-=wji_AmYygZMTsPMdJ7XksMt7kOur8oDfDdniBRMjm4VkQ@mail.gmail.com/
    Link: https://lkml.kernel.org/r/7236E7CA-B5F1-4C04-AB85-E86FA3E9A54B@nvidia.com
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Nico Pache 5c785b9d4f mm/page_alloc: make zone_pcp_update() static
commit b89f1735169b8ab54b6a03bf4823657ee4e30073
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Fri Sep 16 15:22:43 2022 +0800

    mm/page_alloc: make zone_pcp_update() static

    Since commit b92ca18e8c ("mm/page_alloc: disassociate the pcp->high from
    pcp->batch"), zone_pcp_update() is only used in mm/page_alloc.c.  Move
    zone_pcp_update() up to avoid forward declaration and then make it static.
    No functional change intended.

    Link: https://lkml.kernel.org/r/20220916072257.9639-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache f738a43292 mm: rmap: introduce pfn_mkclean_range() to cleans PTEs
commit 6a8e0596f00469c15ec556b9f3624acd2e9a04f9
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu Apr 28 23:16:10 2022 -0700

    mm: rmap: introduce pfn_mkclean_range() to cleans PTEs

    page_mkclean_one() is supposed to be used with a pfn that has an
    associated struct page, but not all pfns (e.g.  DAX) have a struct
    page.  Introduce a new function pfn_mkclean_range() to clean the PTEs
    (including PMDs) mapped with a range of pfns which have no struct page
    associated with them.  This helper will be used by DAX device in the next
    patch to make pfns clean.

    Link: https://lkml.kernel.org/r/20220403053957.10770-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ross Zwisler <zwisler@kernel.org>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:36 -07:00
Chris von Recklinghausen 11c81109b4 mm/mprotect: fix soft-dirty check in can_change_pte_writable()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 76aefad628aae152207ee624a7981b9aa1a267d8
Author: Peter Xu <peterx@redhat.com>
Date:   Mon Jul 25 10:20:46 2022 -0400

    mm/mprotect: fix soft-dirty check in can_change_pte_writable()

    Patch series "mm/mprotect: Fix soft-dirty checks", v4.

    This patch (of 3):

    The check wanted to make sure that when soft-dirty tracking is enabled we
    won't grant the write bit by accident, as a page fault is needed for dirty
    tracking.
    The intention is correct but we didn't check it right because
    VM_SOFTDIRTY set actually means soft-dirty tracking disabled.  Fix it.

    Another tricky thing about soft-dirty is that we can't check the
    vma flag !(vma_flags & VM_SOFTDIRTY) directly but only check it after we
    checked CONFIG_MEM_SOFT_DIRTY because otherwise VM_SOFTDIRTY will be
    defined as zero, and !(vma_flags & VM_SOFTDIRTY) will constantly return
    true.  To avoid misuse, introduce a helper for checking whether vma has
    soft-dirty tracking enabled.
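
    The helper ends up along these lines (sketch of the idea described
    above, not necessarily the exact upstream code):

        static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
        {
                /*
                 * Without CONFIG_MEM_SOFT_DIRTY, VM_SOFTDIRTY is defined as
                 * zero and the flag test below would always claim "enabled".
                 */
                if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
                        return false;

                /* VM_SOFTDIRTY set means tracking is currently disabled */
                return !(vma->vm_flags & VM_SOFTDIRTY);
        }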

    We can easily verify this with any exclusive anonymous page, like program
    below:

    =======8<======
      #include <stdio.h>
      #include <unistd.h>
      #include <stdlib.h>
      #include <assert.h>
      #include <inttypes.h>
      #include <stdint.h>
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdbool.h>

      #define BIT_ULL(nr)                   (1ULL << (nr))
      #define PM_SOFT_DIRTY                 BIT_ULL(55)

      unsigned int psize;
      char *page;

      uint64_t pagemap_read_vaddr(int fd, void *vaddr)
      {
          uint64_t value;
          int ret;

          ret = pread(fd, &value, sizeof(uint64_t),
                      ((uint64_t)vaddr >> 12) * sizeof(uint64_t));
          assert(ret == sizeof(uint64_t));

          return value;
      }

      void clear_refs_write(void)
      {
          int fd = open("/proc/self/clear_refs", O_RDWR);

          assert(fd >= 0);
          write(fd, "4", 2);
          close(fd);
      }

      #define  check_soft_dirty(str, expect)  do {                            \
              bool dirty = pagemap_read_vaddr(fd, page) & PM_SOFT_DIRTY;      \
              if (dirty != expect) {                                          \
                  printf("ERROR: %s, soft-dirty=%d (expect: %d)
    ", str, dirty, expect); \
                  exit(-1);                                                   \
              }                                                               \
      } while (0)

      int main(void)
      {
          int fd = open("/proc/self/pagemap", O_RDONLY);

          assert(fd >= 0);
          psize = getpagesize();
          page = mmap(NULL, psize, PROT_READ|PROT_WRITE,
                      MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
          assert(page != MAP_FAILED);

          *page = 1;
          check_soft_dirty("Just faulted in page", 1);
          clear_refs_write();
          check_soft_dirty("Clear_refs written", 0);
          mprotect(page, psize, PROT_READ);
          check_soft_dirty("Marked RO", 0);
          mprotect(page, psize, PROT_READ|PROT_WRITE);
          check_soft_dirty("Marked RW", 0);
          *page = 2;
          check_soft_dirty("Wrote page again", 1);

          munmap(page, psize);
          close(fd);
          printf("Test passed.
    ");

          return 0;
      }
    =======8<======

    Here we attach a Fixes to commit 64fe24a3e05e only for easy tracking, as
    this patch won't apply to a tree before that point.  However the commit
    wasn't the source of problem, but instead 64e455079e.  It's just that
    after 64fe24a3e05e anonymous memory will also suffer from this problem
    with mprotect().

    Link: https://lkml.kernel.org/r/20220725142048.30450-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20220725142048.30450-2-peterx@redhat.com
    Fixes: 64e455079e ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
    Fixes: 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection")
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 179ea3b604 mm/early_ioremap: declare early_memremap_pgprot_adjust()
Bugzilla: https://bugzilla.redhat.com/2120352

commit be4893d92b6b426357978ed955190c0ead23a4b1
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Tue Mar 22 14:47:55 2022 -0700

    mm/early_ioremap: declare early_memremap_pgprot_adjust()

    The mm/ directory can almost fully be built with W=1, which would help
    in local development.  One remaining issue is a missing prototype for
    early_memremap_pgprot_adjust().

    Thus add a declaration for this function.  Use mm/internal.h instead of
    asm/early_ioremap.h to avoid missing type definitions and unnecessary
    exposure.

    Link: https://lkml.kernel.org/r/20220314165724.16071-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:55 -04:00
Chris von Recklinghausen 4220f0548b mm/sparse: make mminit_validate_memmodel_limits() static
Bugzilla: https://bugzilla.redhat.com/2120352

commit c7878534a1b61c5cc2effa3a539099f2cf87cd3a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:42:44 2022 -0700

    mm/sparse: make mminit_validate_memmodel_limits() static

    It's only used in sparse.c now, so we can make it static and further
    clean up the relevant code.

    Link: https://lkml.kernel.org/r/20220127093221.63524-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 4cc30e4e8f mm: remove the extra ZONE_DEVICE struct page refcount
Conflicts: mm/internal.h - We already have
	09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
	ece1ed7bfa12 ("mm/gup: Add try_get_folio() and try_grab_folio()")
	so keep declarations for boot_nodestats and try_grab_folio

Bugzilla: https://bugzilla.redhat.com/2120352
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

commit 27674ef6c73f0c9096a9827dc5d6ba9fc7808422
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:36 2022 +1100

    mm: remove the extra ZONE_DEVICE struct page refcount

    ZONE_DEVICE struct pages have an extra reference count that complicates
    the code for put_page() and several places in the kernel that need to
    check the reference count to see that a page is not being used (gup,
    compaction, migration, etc.). Clean up the code so the reference count
    doesn't need to be treated specially for ZONE_DEVICE pages.

    Note that this excludes the special idle page wakeup for fsdax pages,
    which still happens at refcount 1.  This is a separate issue and will
    be sorted out later.  Given that only fsdax pages require the
    notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
    symbol can go away and be replaced with a FS_DAX check for this hook
    in the put_page fastpath.

    Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen f474c34c0d mm: memcontrol: make cgroup_memory_nokmem static
Bugzilla: https://bugzilla.redhat.com/2120352

commit 17c17367758059930246dde937cc7da9b8f3549e
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Fri Jan 14 14:05:29 2022 -0800

    mm: memcontrol: make cgroup_memory_nokmem static

    Commit 494c1dfe85 ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
    caches") makes cgroup_memory_nokmem global, however, it is unnecessary
    because there is already a function mem_cgroup_kmem_disabled() which
    exports it.

    Just make it static and replace it with mem_cgroup_kmem_disabled() in
    mm/slab_common.c.

    Link: https://lkml.kernel.org/r/20211109065418.21693-1-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Chris Down <chris@chrisdown.name>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 892fa2d7d9 mm/vmscan: centralise timeout values for reclaim_throttle
Conflicts: The presence of
	d1d8a3b4d06d ("mm: Turn isolate_lru_page() into folio_isolate_lru()")
        causes a merge conflict due to differing context. Just remove the
	timeout argument to reclaim_throttle

Bugzilla: https://bugzilla.redhat.com/2120352

commit c3f4a9a2b082c5392fbff17c6d8551154add5fdb
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:42 2021 -0700

    mm/vmscan: centralise timeout values for reclaim_throttle

    Neil Brown raised concerns about callers of reclaim_throttle specifying
    a timeout value.  The original timeout values to congestion_wait() were
    probably pulled out of thin air or copy&pasted from somewhere else.
    This patch centralises the timeout values and selects a timeout based on
    the reason for reclaim throttling.  These figures are also pulled out of
    the same thin air but better values may be derived.

    Running a workload that is throttling for inappropriate periods and
    tracing mm_vmscan_throttled can be used to pick a more appropriate
    value.  Excessive throttling would pick a lower timeout whereas
    excessive CPU usage in reclaim context would select a larger timeout.
    Ideally a large value would always be used and the wakeups would occur
    before a timeout but that requires careful testing.
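
    A sketch of what "centralised" means here (the exact timeout values are
    illustrative assumptions, not the ones chosen upstream):

        switch (reason) {
        case VMSCAN_THROTTLE_WRITEBACK:
                timeout = HZ/10;        /* wait for some writeback to complete */
                break;
        case VMSCAN_THROTTLE_ISOLATED:
                timeout = HZ/50;        /* isolated pages come back quickly */
                break;
        case VMSCAN_THROTTLE_NOPROGRESS:
                timeout = HZ/10;        /* reclaim is making no forward progress */
                break;
        }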

    Link: https://lkml.kernel.org/r/20211022144651.19914-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen d58d20a86f mm/vmscan: throttle reclaim and compaction when too may pages are isolated
Conflicts: mm/internal.h - We already have
	d1d8a3b4d06d ("mm: Turn isolate_lru_page() into folio_isolate_lru()")
	so add declaration for reclaim_throttle just below the declaration for
	folio_putback_lru

Bugzilla: https://bugzilla.redhat.com/2120352

commit d818fca1cac31b1fc9301bda83e195a46fb4ebaa
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:29 2021 -0700

    mm/vmscan: throttle reclaim and compaction when too may pages are isolated

    Page reclaim throttles on congestion if too many parallel reclaim
    instances have isolated too many pages.  This makes no sense; excessive
    parallelisation has nothing to do with writeback or congestion.

    This patch creates an additional workqueue to sleep on when too many
    pages are isolated.  The throttled tasks are woken when the number of
    isolated pages is reduced or a timeout occurs.  There may be some false
    positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
    throttle again if necessary.
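
    Sketch of the sleep/wake pair described above (names follow the
    changelog; the details shown are illustrative):

        /* too many isolated pages: sleep on the per-node waitqueue */
        reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);

        /* ... and when isolated pages are put back or reclaimed: */
        wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
        if (waitqueue_active(wqh))
                wake_up(wqh);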

    [shy828301@gmail.com: Wake up from compaction context]
    [vbabka@suse.cz: Account number of throttled tasks only for writeback]

    Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 809d37d23f mm/vmscan: throttle reclaim until some writeback completes if congested
Conflicts:
	mm/filemap.c - We already have
		4268b48077e5 ("mm/filemap: Add folio_end_writeback()")
		so put the acct_reclaim_writeback call between the
		folio_wake call and the folio_put call and pass it a
		folio
	mm/internal.h - We already have
		646010009d35 ("mm: Add folio_raw_mapping()")
		so keep definition of folio_raw_mapping.
		Squash in changes from merge commit
		512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
		to be compatible with existing folio changes.
	mm/vmscan.c - Squash in changes from merge commit
                512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
                to be compatible with existing folio changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 8cd7c588decf470bf7e14f2be93b709f839a965e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:25 2021 -0700

    mm/vmscan: throttle reclaim until some writeback completes if congested

    Patch series "Remove dependency on congestion_wait in mm/", v5.

    This series removes all calls to congestion_wait in mm/ and deletes
    wait_iff_congested.  It's not a clever implementation but
    congestion_wait has been broken for a long time [1].

    Even if congestion throttling worked, it was never a great idea.  While
    excessive dirty/writeback pages at the tail of the LRU is one
    possibility that reclaim may be slow, there is also the problem of too
    many pages being isolated and reclaim failing for other reasons
    (elevated references, too many pages isolated, excessive LRU contention
    etc).

    This series replaces the "congestion" throttling with 3 different types.

     - If there are too many dirty/writeback pages, sleep until a timeout or
       enough pages get cleaned

     - If too many pages are isolated, sleep until enough isolated pages are
       either reclaimed or put back on the LRU

     - If no progress is being made, direct reclaim tasks sleep until
       another task makes progress with acceptable efficiency.

    This was initially tested with a mix of workloads that used to trigger
    corner cases that no longer work.  A new test case was created called
    "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
    created XFS filesystem.  Note that it may be necessary to increase the
    timeout of ssh if executing remotely as ssh itself can get throttled and
    the connection may timeout.

    stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
    to check the impact as the number of direct reclaimers increase.  It has
    four types of worker.

     - One "anon latency" worker creates small mappings with mmap() and
       times how long it takes to fault the mapping reading it 4K at a time

     - X file writers which is fio randomly writing X files where the total
       size of the files add up to the allowed dirty_ratio. fio is allowed
       to run for a warmup period to allow some file-backed pages to
       accumulate. The duration of the warmup is based on the best-case
       linear write speed of the storage.

     - Y file readers which is fio randomly reading small files

     - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

     - Total estimated WSS = (100+dirty_ratio) percentage of memory

    X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

    The intent is to maximise the total WSS with a mix of file and anon
    memory where some anonymous memory must be swapped and there is a high
    likelihood of dirty/writeback pages reaching the end of the LRU.

    The test can be configured to have no background readers to stress
    dirty/writeback pages.  The results below are based on having zero
    readers.

    The short summary of the results is that the series works and stalls
    until some event occurs but the timeouts may need adjustment.

    The test results are not broken down by patch as the series should be
    treated as one block that replaces a broken throttling mechanism with a
    working one.

    Finally, three machines were tested but I'm reporting the worst set of
    results.  The other two machines had much better latencies for example.

    First, the latency results for the "anon latency" worker

      stutterp
                                    5.15.0-rc1             5.15.0-rc1
                                       vanilla mm-reclaimcongest-v5r4
      Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
      Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
      Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
      Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
      Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
      Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
      Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
      Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
      Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
      Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
      Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
      Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
      Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
      Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
      Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
      Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
      Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
      Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
      Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
      Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
      Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
      Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
      Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
      Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

    For most thread counts, the time to mmap() is unfortunately increased.
    In earlier versions of the series, this was lower but a large number of
    throttling events were reaching their timeout increasing the amount of
    inefficient scanning of the LRU.  There is no prioritisation of reclaim
    tasks making progress based on each task's rate of page allocation versus
    progress of reclaim.  The variance is also impacted for high worker
    counts but in all cases, the differences in latency are not
    statistically significant due to very large maximum outliers.  Max-90
    shows that 90% of the stalls are comparable but the Max results show the
    massive outliers which are increased due to stalling.

    It is expected that this will be very machine dependent.  Due to the
    test design, reclaim is difficult so allocations stall and there are
    variances depending on whether THPs can be allocated or not.  The amount
    of memory will affect exactly how bad the corner cases are and how often
    they trigger.  The warmup period calculation is not ideal as it's based
    on linear writes whereas fio is randomly writing multiple files from
    multiple tasks so the start state of the test is variable.  For example,
    these are the latencies on a single-socket machine that had more memory

      Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
      Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
      Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
      Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
      Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

    The overall system CPU usage and elapsed time is as follows

                        5.15.0-rc3  5.15.0-rc3
                           vanilla mm-reclaimcongest-v5r4
      Duration User        6989.03      983.42
      Duration System      7308.12      799.68
      Duration Elapsed     2277.67     2092.98

    The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
    stalling.

    The high-level /proc/vmstats show

                                           5.15.0-rc1     5.15.0-rc1
                                              vanilla mm-reclaimcongest-v5r2
      Ops Direct pages scanned          1056608451.00   503594991.00
      Ops Kswapd pages scanned           109795048.00   147289810.00
      Ops Kswapd pages reclaimed          63269243.00    31036005.00
      Ops Direct pages reclaimed          10803973.00     6328887.00
      Ops Kswapd efficiency %                   57.62          21.07
      Ops Kswapd velocity                    48204.98       57572.86
      Ops Direct efficiency %                    1.02           1.26
      Ops Direct velocity                   463898.83      196845.97

    Kswapd scanned fewer pages but the detailed pattern is different.  The
    vanilla kernel scans slowly over time whereas the patches exhibit
    burst patterns of scan activity.  Direct reclaim scanning is reduced by
    52% due to stalling.

    The pattern for stealing pages is also slightly different.  Both kernels
    exhibit spikes but the vanilla kernel when reclaiming shows pages being
    reclaimed over a period of time whereas the patches tend to reclaim in
    spikes.  The difference is that vanilla is not throttling and is instead
    constantly scanning, finding some pages over time, whereas the patched
    kernel throttles and reclaims in spikes.

      Ops Percentage direct scans               90.59          77.37

    For direct reclaim, vanilla scanned 90.59% of pages whereas with the
    patches, 77.37% were direct reclaim due to throttling.

      Ops Page writes by reclaim           2613590.00     1687131.00

    Page writes from reclaim context are reduced.

      Ops Page writes anon                 2932752.00     1917048.00

    And there is less swapping.

      Ops Page reclaim immediate         996248528.00   107664764.00

    The number of pages encountered at the tail of the LRU tagged for
    immediate reclaim but still dirty/writeback is reduced by 89%.

      Ops Slabs scanned                     164284.00      153608.00

    Slab scan activity is similar.

    ftrace was used to gather stall activity

      Vanilla
      -------
          1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
          2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
          8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
         29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
      82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

    The vast majority of wait_iff_congested calls do not stall at all.  What
    is likely happening is that cond_resched() reschedules the task for a
    short period when the BDI is not registering congestion (which it never
    will in this test setup).

          1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
          2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
          4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
        380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
        778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

    congestion_wait, if called, always exceeds the timeout as there is no
    trigger to wake it up.

    Bottom line: Vanilla will throttle but it's not effective.

    Patch series
    ------------

    Kswapd throttle activity was always due to scanning pages tagged for
    immediate reclaim at the tail of the LRU

          1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority of events did not stall or stalled for a short period.
    Roughly 16% of stalls reached the timeout before expiry.  For direct
    reclaim, the number of times stalled for each reason were

       6624 reason=VMSCAN_THROTTLE_ISOLATED
      93246 reason=VMSCAN_THROTTLE_NOPROGRESS
      96934 reason=VMSCAN_THROTTLE_WRITEBACK

    The most common reason to stall was due to excessive pages tagged for
    immediate reclaim at the tail of the LRU, followed by a failure to make
    forward progress.  A relatively small number were due to too many pages
    isolated from the LRU by parallel threads.

    For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

          9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
         12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
         83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
       6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

    Most did not stall at all.  A small number reached the timeout.

    For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
    the map

          1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
          6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
         11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
         16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
         18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
         21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
         26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
         27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
         28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
         29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
         31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
         90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
         94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
        146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
        266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
        964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPRO
GRESS
       2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPR
OGRESS
       2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROG
RESS
       7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROG
RESS
      22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRES
S
      51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPR
OGRESS

    The full timeout is often hit, but a large number also do not stall at
    all.  The remainder slept a little, allowing other reclaim tasks to make
    progress.

    While this timeout could be further increased, it could also negatively
    impact worst-case behaviour when there is no prioritisation of what task
    should make progress.

    For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

          1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
          2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
          3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
         12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
         16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
         24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
         28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
         32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
         42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
         77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
         99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
        137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
        190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
        339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
        518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
        852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
       3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
       7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
      83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority hit the timeout in direct reclaim context although a
    sizable number did not stall at all.  This is very different to kswapd
    where only a tiny percentage of stalls due to writeback reached the
    timeout.

    Bottom line, the throttling appears to work and the wakeup events may
    limit worst case stalls.  There might be some grounds for adjusting
    timeouts but it's likely futile as the worst-case scenarios depend on
    the workload, memory size and the speed of the storage.  A better
    approach to improve the series further would be to prioritise tasks
    based on their rate of allocation with the caveat that it may be very
    expensive to track.

    This patch (of 5):

    Page reclaim throttles on wait_iff_congested under the following
    conditions:

     - kswapd is encountering pages under writeback and marked for immediate
       reclaim implying that pages are cycling through the LRU faster than
       pages can be cleaned.

     - Direct reclaim will stall if all dirty pages are backed by congested
       inodes.

    wait_iff_congested is almost completely broken with few exceptions.
    This patch adds a new node-based waitqueue and tracks the number of
    throttled tasks and pages written back since throttling started.  If
    enough pages belonging to the node are written back then the throttled
    tasks will wake early.  If not, the throttled tasks sleep until the
    timeout expires.
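
    As a rough, uncompiled sketch of the mechanism described above (the
    structure, field and function names here are illustrative assumptions,
    not the upstream code):

      #include <linux/wait.h>
      #include <linux/sched.h>
      #include <linux/atomic.h>

      /* Illustrative only: per-node throttling state along the lines
       * described above. */
      struct node_reclaim_throttle_sketch {
              wait_queue_head_t reclaim_wait;         /* throttled tasks sleep here  */
              atomic_t nr_throttled;                  /* tasks currently throttled   */
              unsigned long nr_written_start;         /* writeback count at throttle */
      };

      /* Reclaim calls something like this when it decides to throttle. */
      static void reclaim_throttle_sketch(struct node_reclaim_throttle_sketch *t,
                                          long timeout)
      {
              DEFINE_WAIT(wait);

              atomic_inc(&t->nr_throttled);
              prepare_to_wait(&t->reclaim_wait, &wait, TASK_UNINTERRUPTIBLE);
              schedule_timeout(timeout);              /* an early wake_up() ends this */
              finish_wait(&t->reclaim_wait, &wait);
              atomic_dec(&t->nr_throttled);
      }

      /* End-of-writeback accounting for the node wakes sleepers early once
       * enough pages have been cleaned since throttling started. */
      static void acct_reclaim_writeback_sketch(struct node_reclaim_throttle_sketch *t,
                                                unsigned long nr_written,
                                                unsigned long threshold)
      {
              if (atomic_read(&t->nr_throttled) &&
                  nr_written - t->nr_written_start > threshold)
                      wake_up_all(&t->reclaim_wait);
      }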

    [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
    [hdanton@sina.com: Avoid race when reclaim starts]
    [vbabka@suse.cz: vmstat irq-safe api, clarifications]

    Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
    Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: NeilBrown <neilb@suse.de>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 207adffc8b mm: introduce pmd_install() helper
Bugzilla: https://bugzilla.redhat.com/2120352

commit 03c4f20454e0231d2cdec4373841a3a25cf4efed
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Nov 5 13:38:38 2021 -0700

    mm: introduce pmd_install() helper

    Patch series "Do some code cleanups related to mm", v3.

    This patch (of 2):

    Currently the same few lines are repeated three times in the code.
    Deduplicate them with the newly introduced pmd_install() helper.
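
    A minimal sketch of what such a helper can look like, reconstructed
    from the description above rather than quoted from the patch:

      #include <linux/mm.h>
      #include <asm/pgalloc.h>

      /* Sketch: populate a pmd with a pre-allocated page-table page once,
       * under the pmd lock, so concurrent faulters do not double-populate. */
      static void pmd_install_sketch(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
      {
              spinlock_t *ptl = pmd_lock(mm, pmd);

              if (likely(pmd_none(*pmd))) {           /* nobody beat us to it       */
                      mm_inc_nr_ptes(mm);
                      pmd_populate(mm, pmd, *pte);
                      *pte = NULL;                    /* table now owned by the pmd */
              }
              spin_unlock(ptl);
      }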

    Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Mika Penttila <mika.penttila@nextfour.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:08 -04:00
Patrick Talbert f3e3b472bd Merge: mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1184

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109001

commit 704687deaae768a818d7da0584ee021793a97684
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri, 14 Jan 2022 14:07:11 -0800

    mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware

    sl?b and vmalloc allocators reduce the given gfp mask for their internal
    needs.  For that they use GFP_RECLAIM_MASK to preserve the reclaim
    behavior and constraints.

    __GFP_NOLOCKDEP is not a part of that mask because it doesn't really
    control the reclaim behavior strictly speaking.  On the other hand it
    tells the underlying page allocator to disable reclaim recursion
    detection so arguably it should be part of the mask.

    Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
    form so this change is safe pretty much by definition.  It also adds
    support for this flag to SL?B and vmalloc allocators which will in turn
    allow its use for kvmalloc as well.  A lack of the support has been
    noticed recently in

      http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area
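
    A hedged sketch of the kind of change being described; the exact set of
    flags in the mask below is assumed for illustration, the relevant point
    being only that __GFP_NOLOCKDEP becomes part of what the allocators
    preserve when trimming a caller's gfp flags:

      #include <linux/gfp.h>

      /* Illustrative, not a verbatim copy of GFP_RECLAIM_MASK. */
      #define GFP_RECLAIM_MASK_SKETCH (__GFP_RECLAIM | __GFP_HIGH | __GFP_IO |   \
                                       __GFP_FS | __GFP_NOWARN | __GFP_NOFAIL |  \
                                       __GFP_NORETRY | __GFP_RETRY_MAYFAIL |     \
                                       __GFP_MEMALLOC | __GFP_NOMEMALLOC |       \
                                       __GFP_NOLOCKDEP)

      /* sl?b/vmalloc then trim caller flags roughly like this: */
      static inline gfp_t trim_allocator_flags_sketch(gfp_t flags)
      {
              return flags & GFP_RECLAIM_MASK_SKETCH;
      }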

    Link: https://lkml.kernel.org/r/YZ9XtLY4AEjVuiEI@dhcp22.suse.cz
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Dave Chinner <dchinner@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>

Approved-by: Jerry Snitselaar <jsnitsel@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-08-03 11:54:34 -04:00
Waiman Long 0678f7e31b mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109001

commit 704687deaae768a818d7da0584ee021793a97684
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri, 14 Jan 2022 14:07:11 -0800

    mm: make slab and vmalloc allocators __GFP_NOLOCKDEP aware

    sl?b and vmalloc allocators reduce the given gfp mask for their internal
    needs.  For that they use GFP_RECLAIM_MASK to preserve the reclaim
    behavior and constraints.

    __GFP_NOLOCKDEP is not a part of that mask because it doesn't really
    control the reclaim behavior strictly speaking.  On the other hand it
    tells the underlying page allocator to disable reclaim recursion
    detection so arguably it should be part of the mask.

    Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
    form so this change is safe pretty much by definition.  It also adds
    support for this flag to SL?B and vmalloc allocators which will in turn
    allow its use for kvmalloc as well.  A lack of the support has been
    noticed recently in

      http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area

    Link: https://lkml.kernel.org/r/YZ9XtLY4AEjVuiEI@dhcp22.suse.cz
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Dave Chinner <dchinner@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-27 15:45:09 -04:00
Waiman Long e462accf60 mm/munlock: protect the per-CPU pagevec by a local_lock_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
Conflicts: A minor fuzz in mm/migrate.c due to missing upstream commit
	   1eba86c096e3 ("mm: change page type prior to adding page
	   table entry"). Pulling it, however, will require taking in
	   a number of additional patches. So it is not done here.

commit adb11e78c5dc5e26774acb05f983da36447f7911
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri, 1 Apr 2022 11:28:33 -0700

    mm/munlock: protect the per-CPU pagevec by a local_lock_t

    The access to mlock_pvec is protected by disabling preemption via
    get_cpu_var() or implicitly by having preemption disabled by the caller
    (in mlock_page_drain() case).  This breaks on PREEMPT_RT since
    folio_lruvec_lock_irq() acquires a sleeping lock in this section.

    Create struct mlock_pvec which consists of the local_lock_t and the
    pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
    Replace mlock_page_drain() with a _local() version which is invoked on
    the local CPU and acquires the local_lock_t, and a _remote() version
    which uses the pagevec from a remote CPU which is offline.
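
    A minimal sketch of the shape being described, with names and details
    assumed rather than copied from the patch:

      #include <linux/local_lock.h>
      #include <linux/pagevec.h>
      #include <linux/percpu.h>

      /* Sketch: the pagevec and the local_lock_t protecting it travel together. */
      struct mlock_pvec_sketch {
              local_lock_t lock;
              struct pagevec vec;
      };

      static DEFINE_PER_CPU(struct mlock_pvec_sketch, mlock_pvec_sketch) = {
              .lock = INIT_LOCAL_LOCK(lock),
      };

      /* _local() flavour: runs on the owning CPU and takes the local lock,
       * which on PREEMPT_RT is a sleeping lock rather than preempt-disable. */
      static void mlock_page_drain_local_sketch(void)
      {
              struct pagevec *pvec;

              local_lock(&mlock_pvec_sketch.lock);
              pvec = this_cpu_ptr(&mlock_pvec_sketch.vec);
              if (pagevec_count(pvec))
                      pagevec_release(pvec);          /* stand-in for the real drain */
              local_unlock(&mlock_pvec_sketch.lock);
      }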

    Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
Aristeu Rozanski 031883f992 mm/readahead: Switch to page_cache_ra_order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 56a4d67c264e37014b8392cba9869c7fe904ed1e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jul 24 23:26:14 2021 -0400

    mm/readahead: Switch to page_cache_ra_order

    do_page_cache_ra() was being exposed for the benefit of
    do_sync_mmap_readahead().  Switch it over to page_cache_ra_order()
    partly because it's a better interface but mostly for the benefit of
    the next patch.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:20 -04:00
Aristeu Rozanski 51ad4bb4db mm: Turn page_anon_vma() into folio_anon_vma()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit e05b34539d008ab819388f699b25eae962ba24ac
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jan 29 11:52:52 2022 -0500

    mm: Turn page_anon_vma() into folio_anon_vma()

    Move the prototype from mm.h to mm/internal.h and convert all callers
    to pass a folio.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:19 -04:00
Aristeu Rozanski 99cfd73d88 mm/mlock: Add mlock_vma_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit dcc5d337c5e62761ee71f2e25c7aa890b1aa41a2
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 15 13:33:59 2022 -0500

    mm/mlock: Add mlock_vma_folio()

    Convert mlock_page() into mlock_folio() and convert the callers.  Keep
    mlock_vma_page() as a wrapper.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 880848c57e mm: Convert page_vma_mapped_walk to work on PFNs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2aff7a4755bed2870ee23b75bc88cdc8d76cdd03
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Feb 3 11:40:17 2022 -0500

    mm: Convert page_vma_mapped_walk to work on PFNs

    page_mapped_in_vma() really just wants to walk one page, but as the
    code stands, if passed the head page of a compound page, it will
    walk every page in the compound page.  Extract pfn/nr_pages/pgoff
    from the struct page early, so they can be overridden by
    page_mapped_in_vma().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 528faa0405 mm/truncate: Combine invalidate_mapping_pagevec() and __invalidate_mapping_pages()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit c56109dd35c9204cd6c49d2116ef36e5044ef867
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Feb 13 17:22:10 2022 -0500

    mm/truncate: Combine invalidate_mapping_pagevec() and __invalidate_mapping_pages()

    We can save a function call by combining these two functions, which
    are identical except for the return value.  Also move the prototype
    to mm/internal.h.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:17 -04:00
Aristeu Rozanski b86ef54bdd mm: Turn deactivate_file_page() into deactivate_file_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 261b6840ed10419ac2f554e515592d59dd5c82cf
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Feb 13 16:40:24 2022 -0500

    mm: Turn deactivate_file_page() into deactivate_file_folio()

    This function has one caller which already has a reference to the
    page, so we don't need to use get_page_unless_zero().  Also move the
    prototype to mm/internal.h.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:17 -04:00
Aristeu Rozanski 5a8509634f mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit d6c75dc22c755c567838f12f12a16f2a323ebd4e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Feb 13 15:22:28 2022 -0500

    mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()

    Some of the callers already have the address_space and can avoid calling
    folio_mapping() and checking if the folio was already truncated.  Also
    add kernel-doc and fix the return type (in case we ever support folios
    larger than 4TB).

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:17 -04:00
Aristeu Rozanski 1ed6a6a101 mm: Turn putback_lru_page() into folio_putback_lru()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due lack of c3f4a9a2b082c

commit ca6d60f3f18b78d37b7a93262108ade0727d1441
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 21 08:41:46 2022 -0500

    mm: Turn putback_lru_page() into folio_putback_lru()

    Add a putback_lru_page() wrapper.  Removes a couple of compound_head()
    calls.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:16 -04:00
Aristeu Rozanski 3a42202d5f mm: Turn isolate_lru_page() into folio_isolate_lru()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due lack of d818fca1cac3

commit d1d8a3b4d06d8c9188f2b9b89ef053db0bf899de
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Dec 24 13:26:22 2021 -0500

    mm: Turn isolate_lru_page() into folio_isolate_lru()

    Add isolate_lru_page() as a wrapper around isolate_lru_folio().
    TestClearPageLRU() would have always failed on a tail page, so
    returning -EBUSY is the same behaviour.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:16 -04:00
Aristeu Rozanski 63cdd4209a mm/gup: Add try_get_folio() and try_grab_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing 27674ef6c73f

commit ece1ed7bfa1208b527b3dc90bb45c55e0d139a88
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Feb 4 10:27:40 2022 -0500

    mm/gup: Add try_get_folio() and try_grab_folio()

    Convert try_get_compound_head() into try_get_folio() and convert
    try_grab_compound_head() into try_grab_folio().  Add a temporary
    try_grab_compound_head() wrapper around try_grab_folio() to let us
    convert callers individually.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:15 -04:00
Aristeu Rozanski 7752954e07 mm/munlock: mlock_vma_page() check against VM_SPECIAL
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit c8263bd605009355edf781f2dd711de633998475
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Mar 2 17:35:30 2022 -0800

    mm/munlock: mlock_vma_page() check against VM_SPECIAL

    Although mmap_region() and mlock_fixup() take care that VM_LOCKED
    is never left set on a VM_SPECIAL vma, there is an interval while
    file->f_op->mmap() is using vm_insert_page(s), when VM_LOCKED may
    still be set while VM_SPECIAL bits are added: so mlock_vma_page()
    should ignore VM_LOCKED while any VM_SPECIAL bits are set.

    This showed up as a "Bad page" still mlocked, when vfree()ing pages
    which had been vm_inserted by remap_vmalloc_range_partial(): while
    release_pages() and __page_cache_release(), and so put_page(), catch
    pages still mlocked when freeing (and clear_page_mlock() caught them
    when unmapping), the vfree() path is unprepared for them: fix it?
    but these pages should not have been mlocked in the first place.

    I assume that an mlockall(MCL_FUTURE) had been done in the past; or
    maybe the user got to specify MAP_LOCKED on a vmalloc'ing driver mmap.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:13 -04:00
Aristeu Rozanski cd396ce107 mm/munlock: mlock_page() munlock_page() batch by pagevec
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2fbb0c10d1e8222604132b3a3f81bfd8345a44b6
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:37:29 2022 -0800

    mm/munlock: mlock_page() munlock_page() batch by pagevec

    A weakness of the page->mlock_count approach is the need for lruvec lock
    while holding page table lock.  That is not an overhead we would allow on
    normal pages, but I think acceptable just for pages in an mlocked area.
    But let's try to amortize the extra cost by gathering on per-cpu pagevec
    before acquiring the lruvec lock.

    I have an unverified conjecture that the mlock pagevec might work out
    well for delaying the mlock processing of new file pages until they have
    got off lru_cache_add()'s pagevec and on to LRU.

    The initialization of page->mlock_count is subject to races and awkward:
    0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
    this commit, which just widens the window?  I haven't gone back to think
    it through.  Maybe someone can point out a better way to initialize it.

    Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
    into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
    rather than lru_cache_add()'s pagevec.

    Experimented with various orderings: the right thing seems to be for
    mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
    pagevec, but munlock_page() to leave TestClearPageMlocked to the later
    pagevec processing.

    Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
    their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS()
    for that.

    This still leaves acquiring lruvec locks under page table lock each time
    the pagevec fills (or a THP is added): which I suppose is rather silly,
    since they sit on pagevec waiting to be processed long after page table
    lock has been dropped; but I'm disinclined to uglify the calling sequence
    until some load shows an actual problem with it (nothing wrong with
    taking lruvec lock under page table lock, just "nicer" to do it less).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 7d43d2ba0b mm/munlock: mlock_pte_range() when mlocking or munlocking
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 34b6792380ce4f4b41018351cd67c9c26f4a7a0d
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:31:48 2022 -0800

    mm/munlock: mlock_pte_range() when mlocking or munlocking

    Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
    required to lower the mlock_counts when munlocking without munmapping;
    and its complement, implementation of mlock_vma_pages_range(), required
    to raise the mlock_counts on pages already there when a range is mlocked.

    Combine them into just the one function mlock_vma_pages_range(), using
    walk_page_range() to run mlock_pte_range().  This approach fixes the
    "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
    https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/
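
    A rough sketch of the walk_page_range() arrangement described above;
    the ops structure, function names and callback body are assumptions
    for illustration only:

      #include <linux/pagewalk.h>
      #include <linux/mm.h>

      /* The real callback would map the pte range and raise or lower the
       * mlock accounting according to vma->vm_flags & VM_LOCKED. */
      static int mlock_pte_range_sketch(pmd_t *pmd, unsigned long addr,
                                        unsigned long end, struct mm_walk *walk)
      {
              return 0;       /* placeholder body */
      }

      static void mlock_vma_pages_range_sketch(struct vm_area_struct *vma,
                                               unsigned long start,
                                               unsigned long end,
                                               vm_flags_t newflags)
      {
              static const struct mm_walk_ops mlock_walk_ops = {
                      .pmd_entry = mlock_pte_range_sketch,
              };

              /* Publish the new VM_LOCKED state, then fix up the ptes. */
              WRITE_ONCE(vma->vm_flags, newflags);
              walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
      }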

    Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
    pte first, it will not try to munlock the page: leaving release_pages()
    to correct it when the last reference to the page is gone - that's okay,
    a page is not evictable anyway while it is held by an extra reference.

    Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
    a racing remove_migration_pte() or try_to_unmap_one() (depending on
    i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
    then mlock_pte_range() mlock it a second time.  This is harder to
    reproduce, but a more serious race because it could leave the page
    unevictable indefinitely though the area is munlocked afterwards.
    Guard against it by setting the (inappropriate) VM_IO flag,
    and modifying mlock_vma_page() to decline such vmas.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 3856990130 mm/munlock: replace clear_page_mlock() by final clearance
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit b109b87050df5438ee745b2bddfa3587970025bb
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:28:05 2022 -0800

    mm/munlock: replace clear_page_mlock() by final clearance

    Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
    of the munlocking to clear_page_mlock(), since PageMlocked is typically
    still set when mapcount has fallen to 0.  That is not what we want: we
    want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
    on the integrity of the mlock/munlock protocol - small numbers are
    not surprising, but big numbers mean the protocol is not working.

    That could be easily fixed by placing munlock_vma_page() at the start of
    page_remove_rmap(); but later in the series we shall want to batch the
    munlocking, and that too would tend to leave PageMlocked still set at
    the point when it is checked.

    So delete clear_page_mlock() now: leave it instead to release_pages()
    (and __page_cache_release()) to do this backstop clearing of Mlocked,
    when page refcount has fallen to 0.  If a pinned page occasionally gets
    counted as Mlocked and Unevictable until it is unpinned, that's okay.

    A slightly regrettable side-effect of this change is that, since
    release_pages() and __page_cache_release() may be called at interrupt
    time, those places which update NR_MLOCK with interrupts enabled
    had better use mod_zone_page_state() than __mod_zone_page_state()
    (but holding the lruvec lock always has interrupts disabled).

    This change, forcing Mlocked off when refcount 0 instead of earlier
    when mapcount 0, is not fundamental: it can be reversed if performance
    or something else is found to suffer; but this is the easiest way to
    separate the stats - let's not complicate that without good reason.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4b2aa38f6e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due lack of f4c4a3f484 and differences due RHEL-only 44740bc20b

commit cea86fe246b694a191804b47378eb9d77aefabec
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:26:39 2022 -0800

    mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.

    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.
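
    A minimal sketch of the inline wrapper described above; the exact
    condition is an assumption based on the description:

      #include <linux/mm.h>
      #include <linux/page-flags.h>

      void mlock_page(struct page *page);     /* the real work, in mm/mlock.c */

      /* Sketch: cheap flag test here, heavy lifting only for pages in a
       * VM_LOCKED vma; pte maps of THPs are ignored so that only pmd maps
       * get counted. */
      static inline void mlock_vma_page_sketch(struct page *page,
                                               struct vm_area_struct *vma,
                                               bool compound)
      {
              if (unlikely(vma->vm_flags & VM_LOCKED) &&
                  (compound || !PageTransCompound(page)))
                      mlock_page(page);
      }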

    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).

    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.

    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, that mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4ba8fd7ec7 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing prototype for pmd_install()

commit a213e5cf71cbcea4b23caedcb8fe6629a333b275
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:23:29 2022 -0800

    mm/munlock: delete munlock_vma_pages_all(), allow oomreap

    munlock_vma_pages_range() will still be required, when munlocking but
    not munmapping a set of pages; but when unmapping a pte, the mlock count
    will be maintained in much the same way as it will be maintained when
    mapping in the pte.  Which removes the need for munlock_vma_pages_all()
    on mlocked vmas when munmapping or exiting: eliminating the catastrophic
    contention on i_mmap_rwsem, and the need for page lock on the pages.

    There is still a need to update locked_vm accounting according to the
    munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
    exit_mmap() does not need locked_vm updates, so delete unlock_range().

    And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
    because of the uncertainty in blocking on all those page locks?
    No fear of that now, so permit the OOM reaper on mlocked vmas.

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski f7bbad4076 mm/munlock: delete page_mlock() and all its works
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit ebcbc6ea7d8a604ad8504dae70a6ac1b1e64a0b7
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:20:24 2022 -0800

    mm/munlock: delete page_mlock() and all its works

    We have recommended some applications to mlock their userspace, but that
    turns out to be counter-productive: when many processes mlock the same
    file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
    is needed for write, to remove any vma mapping that file from rmap's tree;
    but hogged for read by those with mlocks calling page_mlock() (formerly
    known as try_to_munlock()) on *each* page mapped from the file (the
    purpose being to find out whether another process has the page mlocked,
    so therefore it should not be unmlocked yet).

    Several optimizations have been made in the past: one is to skip
    page_mlock() when mapcount tells that nothing else has this page
    mapped; but that doesn't help at all when others do have it mapped.
    This time around, I initially intended to add a preliminary search
    of the rmap tree for overlapping VM_LOCKED ranges; but that gets
    messy with locking order, when in doubt whether a page is actually
    present; and risks adding even more contention on the i_mmap_rwsem.

    A solution would be much easier, if only there were space in struct page
    for an mlock_count... but actually, most of the time, there is space for
    it - an mlocked page spends most of its life on an unevictable LRU, but
    since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
    been redundant.  Let's try to reuse its page->lru.

    But leave that until a later patch: in this patch, clear the ground by
    removing page_mlock(), and all the infrastructure that has gathered
    around it - which mostly hinders understanding, and will make reviewing
    new additions harder.  Don't mind those old comments about THPs, they
    date from before 4.5's refcounting rework: splitting is not a risk here.

    Just keep a minimal version of munlock_vma_page(), as reminder of what it
    should attend to (in particular, the odd way PGSTRANDED is counted out of
    PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
    unchanged __mlock_posix_error_return() out of the way, down to above its
    caller: this series then makes no further change after mlock_fixup().

    After this and each following commit, the kernel builds, boots and runs;
    but with deficiencies which may show up in testing of mlock and munlock.
    The system calls succeed or fail as before, and mlock remains effective
    in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
    may be shown too low after mlock, grow, then stay too high after munlock:
    with previously mlocked pages remaining unevictable for too long, until
    finally unmapped and freed and counts corrected. Normal service will be
    resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4a1432e31d truncate,shmem: Handle truncates that split large folios
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit b9a8a4195c7d3a51235a4fc974a46ad4e9689ffd
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed May 27 17:59:22 2020 -0400

    truncate,shmem: Handle truncates that split large folios

    Handle folio splitting in the parts of the truncation functions which
    already handle partial pages.  Factor all that code out into a new
    function called truncate_inode_partial_folio().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:11 -04:00
Aristeu Rozanski 59e51bcf24 mm: Convert find_lock_entries() to use a folio_batch
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing 51b8c1fe250d1bd70c17

commit 51dcbdac28d4dde915f78adf08bb3fac87f516e9
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Dec 7 14:15:07 2021 -0500

    mm: Convert find_lock_entries() to use a folio_batch

    find_lock_entries() already only returned the head page of folios, so
    convert it to return a folio_batch instead of a pagevec.  That cascades
    through converting truncate_inode_pages_range() to
    delete_from_page_cache_batch() and page_cache_delete_batch().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:11 -04:00
Aristeu Rozanski 433ab58b7a filemap: Return only folios from find_get_entries()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 0e499ed3d7a216706e02eeded562627d3e69dcfd
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 1 23:17:50 2020 -0400

    filemap: Return only folios from find_get_entries()

    The callers have all been converted to work on folios, so convert
    find_get_entries() to return a batch of folios instead of pages.
    We also now return multiple large folios in a single call.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:10 -04:00
Aristeu Rozanski 2e4f1700b5 truncate: Add invalidate_complete_folio2()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing 51b8c1fe250d1bd70c17

commit 78f426608f21c997975adb96641b7ac82d4d15b1
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Jul 28 15:52:34 2021 -0400

    truncate: Add invalidate_complete_folio2()

    Convert invalidate_complete_page2() to invalidate_complete_folio2().
    Use filemap_free_folio() to free the page instead of calling ->freepage
    manually.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:10 -04:00
Aristeu Rozanski 0a91f1bd33 truncate,shmem: Add truncate_inode_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 1e84a3d997b74c33491899e31d48774f252213ab
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Dec 2 16:01:55 2021 -0500

    truncate,shmem: Add truncate_inode_folio()

    Convert all callers of truncate_inode_page() to call
    truncate_inode_folio() instead, and move the declaration to mm/internal.h.
    Move the assertion that the caller is not passing in a tail page to
    generic_error_remove_page().  We can't entirely remove the struct page
    from the callers yet because the page pointer in the pvec might be a
    shadow/dax/swap entry instead of actually a page.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:10 -04:00
Aristeu Rozanski 4f77d61363 mm: Add unmap_mapping_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 3506659e18a61ae525f3b9b4f5af23b4b149d4db
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Nov 28 14:53:35 2021 -0500

    mm: Add unmap_mapping_folio()

    Convert both callers of unmap_mapping_page() to call unmap_mapping_folio()
    instead.  Also move zap_details from linux/mm.h to mm/memory.c

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:10 -04:00
Patrick Talbert 407ad35116 Merge: mm: backport folio support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/678

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests with a stock kernel test run for comparison

This backport includes the base folio patches *without* touching any subsystems.
Patches are mostly straightforward, converting functions to use folios.

v2: merge conflict; dropped 78525c74d9e7d1a6ce69bd4388f045f6e474a20b as it contradicts the fact that we are not converting subsystems in this MR

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-03 10:59:25 +02:00
Aristeu Rozanski 280cf64c9e mm: Add folio_evictable()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 3eed3ef55c83ec718fae676fd59699816223215f
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri May 14 15:04:28 2021 -0400

    mm: Add folio_evictable()

    This is the folio equivalent of page_evictable().  Unfortunately, it's
    different from !folio_test_unevictable(), but I think it's used in places
    where you have to be a VM expert and can reasonably be expected to know
    the difference.
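
    A hedged sketch of what the helper can look like, illustrating why it
    is not simply !folio_test_unevictable() (body assumed, not quoted from
    the patch):

      #include <linux/pagemap.h>
      #include <linux/rcupdate.h>

      /* Sketch: evictability depends on the backing mapping as well as the
       * per-folio flag. */
      static inline bool folio_evictable_sketch(struct folio *folio)
      {
              bool ret;

              rcu_read_lock();        /* keeps the address_space stable */
              ret = !mapping_unevictable(folio_mapping(folio)) &&
                    !folio_test_unevictable(folio);
              rcu_read_unlock();
              return ret;
      }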

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:31 -04:00
Aristeu Rozanski 95f4e0a7c1 mm/writeback: Add __folio_end_writeback()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 269ccca3899f6bce49e004f50f623e0b161fb027
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 15 23:34:16 2021 -0500

    mm/writeback: Add __folio_end_writeback()

    test_clear_page_writeback() is actually an mm-internal function, although
    it's named as if it's a pagecache function.  Move it to mm/internal.h,
    rename it to __folio_end_writeback() and change the return type to bool.

    The conversion from page to folio is mostly about accounting the number
    of pages being written back, although it does eliminate a couple of
    calls to compound_head().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:30 -04:00
Aristeu Rozanski 785ffd41c1 mm: Add folio_raw_mapping()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 646010009d3541b8cb4f803dcb4b8d0da2f22579
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri May 7 11:17:34 2021 -0400

    mm: Add folio_raw_mapping()

    Convert __page_rmapping to folio_raw_mapping and move it to mm/internal.h.
    It's only a couple of instructions (load and mask), so it's definitely
    going to be cheaper to inline it than call it.  Leave page_rmapping
    out of line.  Change page_anon_vma() to not call folio_raw_mapping() --
    it's more efficient to do the subtraction than the mask.
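
    A minimal sketch of the "load and mask" helper described above (the
    body is assumed for illustration):

      #include <linux/mm_types.h>
      #include <linux/page-flags.h>

      static inline void *folio_raw_mapping_sketch(struct folio *folio)
      {
              unsigned long mapping = (unsigned long)folio->mapping;

              /* Strip the anon/movable tag bits to expose the raw pointer. */
              return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
      }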

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:29 -04:00
Aristeu Rozanski 7b198ad135 mm/swap: Add folio_rotate_reclaimable()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 575ced1c8b0d3b578b933a68ce67ddaff3df9506
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Dec 8 01:25:39 2020 -0500

    mm/swap: Add folio_rotate_reclaimable()

    Convert rotate_reclaimable_page() to folio_rotate_reclaimable().  This
    eliminates all five of the calls to compound_head() in this function,
    saving 75 bytes at the cost of adding 15 bytes to its one caller,
    end_page_writeback().  We also save 36 bytes from pagevec_move_tail_fn()
    due to using folios there.  Net 96 bytes savings.

    Also move its declaration to mm/internal.h as it's only used by filemap.c.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Howells <dhowells@redhat.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:27 -04:00
Nico Pache 8e3254a841 mm: handle uninitialized numa nodes gracefully
commit 09f49dca570a917a8c6bccd7e8c61f5141534e3a
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Mar 22 14:46:54 2022 -0700

    mm: handle uninitialized numa nodes gracefully

    We have had several reports [1][2][3] that the page allocator blows up when
    an allocation from a possible node is requested.  The underlying reason is
    that NODE_DATA for the specific node is not allocated.

    NUMA specific initialization is arch specific and it can vary a lot.  E.g.
    x86 tries to initialize all nodes that have some cpu affinity (see
    init_cpu_to_node) but this can be insufficient because the node might be
    cpuless for example.

    One way to address this problem would be to check for !node_online nodes
    when trying to get a zonelist and silently fall back to another node.
    That is unfortunately adding a branch into allocator hot path and it
    doesn't handle any other potential NODE_DATA users.

    This patch takes a different approach (following the lead of [3]) and it
    pre-allocates pgdat for all possible nodes in arch-independent code -
    free_area_init.  All uninitialized nodes are treated as memoryless nodes.
    node_state of the node is not changed because that would lead to other
    side effects - e.g.  sysfs representation of such a node and from past
    discussions [4] it is known that some tools might have problems digesting
    that.

    Newly allocated pgdat only gets a minimal initialization and the rest of
    the work is expected to be done by the memory hotplug - hotadd_new_pgdat
    (renamed to hotadd_init_pgdat).

    generic_alloc_nodedata is changed to use the memblock allocator because
    neither page nor slab allocators are available at the stage when all
    pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
    use the early boot allocator.  The only arch specific implementation is
    ia64 and that is changed to use the early allocator as well.
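
    A heavily simplified sketch of the idea; the helper name and the
    NODE_DATA hookup below are assumptions, not the upstream code:

      #include <linux/init.h>
      #include <linux/memblock.h>
      #include <linux/nodemask.h>
      #include <linux/mmzone.h>

      /* Sketch: give every possible-but-offline node a minimal, memoryless
       * pgdat at boot, allocated from memblock since neither the page nor
       * the slab allocator exists yet. */
      static void __init prealloc_offline_pgdats_sketch(void)
      {
              int nid;

              for_each_node(nid) {
                      pg_data_t *pgdat;

                      if (node_online(nid))
                              continue;       /* arch code set this one up */

                      pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
                      if (!pgdat)
                              continue;       /* real code would warn here */

                      pgdat->node_id = nid;
                      node_data[nid] = pgdat; /* stand-in for the arch NODE_DATA hookup */
              }
      }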

    [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
    [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
    [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
    [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com

    [akpm@linux-foundation.org: replace comment, per Mike]

    Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
    Reported-by: Alexey Makhalov <amakhalov@vmware.com>
    Tested-by: Alexey Makhalov <amakhalov@vmware.com>
    Reported-by: Nico Pache <npache@redhat.com>
    Acked-by: Rafael Aquini <raquini@redhat.com>
    Tested-by: Rafael Aquini <raquini@redhat.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Eric Dumazet <eric.dumazet@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2024054
Signed-off-by: Nico Pache <npache@redhat.com>
2022-03-28 12:41:38 -06:00
Rafael Aquini 489bee842d mm/numa: automatically generate node migration order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 79c28a41672278283fa72e03d0bf80e6644d4ac4
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Thu Sep 2 14:59:06 2021 -0700

    mm/numa: automatically generate node migration order

    Patch series "Migrate Pages in lieu of discard", v11.

    We're starting to see systems with more and more kinds of memory such as
    Intel's implementation of persistent memory.

    Let's say you have a system with some DRAM and some persistent memory.
    Today, once DRAM fills up, reclaim will start and some of the DRAM
    contents will be thrown out.  Allocations will, at some point, start
    falling over to the slower persistent memory.

    That has two nasty properties.  First, the newer allocations can end up in
    the slower persistent memory.  Second, reclaimed data in DRAM are just
    discarded even if there are gobs of space in persistent memory that could
    be used.

    This patchset implements a solution to these problems.  At the end of the
    reclaim process in shrink_page_list() just before the last page refcount
    is dropped, the page is migrated to persistent memory instead of being
    dropped.

    While I've talked about a DRAM/PMEM pairing, this approach would function
    in any environment where memory tiers exist.

    This is not perfect.  It "strands" pages in slower memory and never brings
    them back to fast DRAM.  Huang Ying has follow-on work which repurposes
    NUMA balancing to promote hot pages back to DRAM.

    This is also all based on an upstream mechanism that allows persistent
    memory to be onlined and used as if it were volatile:

            http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

    With that, the DRAM and PMEM in each socket will be represented as 2
    separate NUMA nodes, with the CPUs sitting in the DRAM node.  So the
    general inter-NUMA demotion mechanism introduced in the patchset can
    migrate the cold DRAM pages to the PMEM node.

    We have tested the patchset with postgresql and pgbench.  On a
    2-socket server machine with DRAM and PMEM, the kernel with the patchset
    can improve the pgbench score by up to 22.1% compared with that of the
    DRAM only + disk case.  This comes from the reduced disk read throughput
    (which drops by up to 70.8%).

    == Open Issues ==

     * Memory policies and cpusets that, for instance, restrict allocations
       to DRAM can be demoted to PMEM whenever they opt in to this
       new mechanism.  A cgroup-level API to opt-in or opt-out of
       these migrations will likely be required as a follow-on.
     * Could be more aggressive about where anon LRU scanning occurs
       since it no longer necessarily involves I/O.  get_scan_count()
       for instance says: "If we have no swap space, do not bother
       scanning anon pages"

    This patch (of 9):

    Prepare for the kernel to auto-migrate pages to other memory nodes with a
    node migration table.  This allows creating a single migration target for
    each NUMA node to enable the kernel to do NUMA page migrations instead of
    simply discarding colder pages.  A node with no target is a "terminal
    node", so reclaim acts normally there.  The migration target does not
    fundamentally _need_ to be a single node, but this implementation starts
    there to limit complexity.

    When memory fills up on a node, memory contents can be automatically
    migrated to another node.  The biggest problems are knowing when to
    migrate and to where the migration should be targeted.

    The most straightforward way to generate the "to where" list would be to
    follow the page allocator fallback lists.  Those lists already tell us,
    if memory is full, where to look next.  It would also be logical to move
    memory in that order.

    But, the allocator fallback lists have a fatal flaw: most nodes appear in
    all the lists.  This would potentially lead to migration cycles (A->B,
    B->A, A->B, ...).

    Instead of using the allocator fallback lists directly, keep a separate
    node migration ordering.  But, reuse the same data used to generate page
    allocator fallback in the first place: find_next_best_node().

    This means that the firmware data used to populate node distances
    essentially dictates the ordering for now.  It should also be
    architecture-neutral since all NUMA architectures have a working
    find_next_best_node().

    RCU is used to allow lock-less reads of node_demotion[] and to prevent
    demotion cycles from being observed.  If multiple reads of node_demotion[]
    are performed, a single rcu_read_lock() must be held over all of them to
    ensure no cycle is observed.  Details are as follows.

    === What does RCU provide? ===

    Imagine a simple loop which walks down the demotion path looking
    for the last node:

            terminal_node = start_node;
            while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                    terminal_node = node_demotion[terminal_node];
            }

    The initial values are:

            node_demotion[0] = 1;
            node_demotion[1] = NUMA_NO_NODE;

    and are updated to:

            node_demotion[0] = NUMA_NO_NODE;
            node_demotion[1] = 0;

    What guarantees that the cycle (a mix of the old and new values):

            node_demotion[0] = 1;
            node_demotion[1] = 0;

    is never observed, which would make the loop above spin forever?

    With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed around
    the loop.  Since the write side does a synchronize_rcu(), any loop that
    observed the old contents is known to have completed before the
    synchronize_rcu() returns.

    RCU, combined with disable_all_migrate_targets(), ensures that the old
    migration state is not visible by the time __set_migration_target_nodes()
    is called.

    === What does READ_ONCE() provide? ===

    READ_ONCE() forbids the compiler from merging or reordering successive
    reads of node_demotion[].  This ensures that any updates are *eventually*
    observed.

    Consider the above loop again.  The compiler could theoretically read the
    entirety of node_demotion[] into local storage (registers) and never go
    back to memory, and *permanently* observe bad values for node_demotion[].

    Note: RCU does not provide any universal compiler-ordering
    guarantees:

            https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
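
    A minimal sketch of the resulting read-side pattern (it assumes the
    node_demotion[] array and NUMA_NO_NODE sentinel described above are
    visible to the caller):

            #include <linux/numa.h>
            #include <linux/rcupdate.h>

            extern int node_demotion[];     /* the table described above */

            /*
             * Walk the demotion chain from start_node to its terminal node.
             * A single RCU read-side critical section covers every read, and
             * READ_ONCE() keeps the compiler from caching or reordering the
             * array accesses, so a concurrent rewrite cannot expose a cycle.
             */
            static int last_demotion_node(int start_node)
            {
                    int node = start_node, next;

                    rcu_read_lock();
                    while ((next = READ_ONCE(node_demotion[node])) != NUMA_NO_NODE)
                            node = next;
                    rcu_read_unlock();

                    return node;
            }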

    This code is unused for now.  It will be called later in the
    series.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:06 -05:00
Rafael Aquini f1e6a8f806 mm: introduce memmap_alloc() to unify memory map allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c803b3c8b3b70f306ee6300bf8acdd70ffd1441a
Author: Mike Rapoport <rppt@kernel.org>
Date:   Thu Sep 2 14:58:02 2021 -0700

    mm: introduce memmap_alloc() to unify memory map allocation

    There are several places that allocate memory for the memory map:
    alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
    __populate_section_memmap() for SPARSEMEM.

    The memory allocated in the FLATMEM case is zeroed and it is never
    poisoned, regardless of CONFIG_PAGE_POISON setting.

    The memory allocated in the SPARSEMEM cases is not zeroed and it is
    implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.

    Introduce a memmap_alloc() wrapper for the memblock allocators that will be
    used in both the FLATMEM and SPARSEMEM cases and will make memory map
    zeroing and poisoning consistent across the different memory models.
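
    A minimal sketch of what such a wrapper might look like (an assumed shape,
    not the exact upstream signature; it relies on the raw memblock allocator
    plus page_init_poison()):

            #include <linux/memblock.h>
            #include <linux/mm.h>

            /*
             * Sketch: allocate the memory map without memblock's implicit
             * zeroing, then apply the poisoning explicitly, so FLATMEM and
             * SPARSEMEM end up with the same zeroing/poisoning behaviour.
             */
            static void * __init memmap_alloc_sketch(phys_addr_t size,
                                                     phys_addr_t align, int nid)
            {
                    void *ptr = memblock_alloc_try_nid_raw(size, align, 0,
                                                    MEMBLOCK_ALLOC_ACCESSIBLE,
                                                    nid);

                    if (ptr)
                            page_init_poison(ptr, size);

                    return ptr;
            }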

    Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Michal Simek <monstr@monstr.eu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:51 -05:00
Mike Rapoport 6aeb25425d mmap: make mlock_future_check() global
Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20.

This is an implementation of "secret" mappings backed by a file
descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call.  The desired protection mode for the
memory is configured using the flags parameter of the system call.  The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping.  The pages in that mapping will be marked as not present
in the direct map and will be present only in the page table of the owning
mm.
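
A minimal userspace sketch of the intended flow (assumptions: the syscall
number is exposed as SYS_memfd_secret by the headers, a flags value of 0, and
a 4 KiB page size):

        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                /* Hypothetical usage; SYS_memfd_secret needs recent headers. */
                int fd = syscall(SYS_memfd_secret, 0);
                if (fd < 0)
                        return 1;
                if (ftruncate(fd, 4096) != 0)
                        return 1;

                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;

                /* The data is absent from the direct map; only this mm maps it. */
                memcpy(p, "key material", 13);

                munmap(p, 4096);
                close(fd);
                return 0;
        }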

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.

It's designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Secretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization, which
  make finding gadgets really hard, the absence of any in-kernel primitive
  for accessing secret memory means the one-gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secretmem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged userspace process to perform
  secrets exfiltration using ptrace.

In the future the secret mappings may be used as a means to protect guest
memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: first, it acts as a preloader for openssl, redirecting
all OPENSSL_malloc calls to secret memory so that any secret keys are
automatically protected this way; second, it exposes the API to users who
need it.  We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of the
page cache for tracking pages allocated for the "secret" mappings as well
as using address_space_operations for e.g.  page migration callbacks.

The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native"
mm ABIs in the future.

Removing pages from the direct map may cause fragmentation of the direct map
on architectures that use large pages to map physical memory, which affects
system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e057
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

This patch (of 7):

It will be used by the upcoming secret memory implementation.

Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:20 -07:00
Mel Gorman ffd8f251f1 mm/page_alloc: move prototype for find_suitable_fallback
make W=1 generates the following warning in page_alloc.c for allnoconfig

  mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes]
   int find_suitable_fallback(struct free_area *area, unsigned int order,
       ^~~~~~~~~~~~~~~~~~~~~~

find_suitable_fallback is only shared outside of page_alloc.c for
CONFIG_COMPACTION but, to suppress the warning, move the prototype outside of
CONFIG_COMPACTION.  It is not worth the effort at this time to find a
clever way of allowing compaction.c to share the code or avoid the use
entirely as the function is called on relatively slow paths.

Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
David Hildenbrand 4ca9b3859d mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region.  Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators.  In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does
not succeed because we are out of backend memory (which can happen easily
with file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings.  In addition, we never
actually report errors during the final populate phase - it is best-effort
only.

fallocate() can be used to preallocate file-based memory and fail in a
safe way.  However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics.  In addition, fallocate()
does not actually populate page tables, so we still always get pagefaults
on first access - which is sometimes undesired (e.g., real-time workloads)
and requires real prefaulting of page tables, not just a preallocation of
backend storage.  There might be interesting use cases for sparse memory
regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., reading+writing one byte to not
overwrite existing data) all individual pages.

However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
   each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
   threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
   of hugetlbfs/shmem/file-backed memory. For example, this is
   problematic in hypervisors like QEMU where SIGBUS handlers might already
   be used by other subsystems concurrently to e.g., handle hardware errors.
   "Simply" doing preallocation concurrently from another thread is not that
   easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
   "might be a good idea to read some pages" vs. "Definitely populate/
   preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
   don't want populate/prealloc semantics. They treat this rather as a hint
   to give a little performance boost without too much overhead - and don't
   expect that a lot of memory might get consumed or a lot of time
   might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like
   manually reading each individual page. This will not break any COW
   mappings. The shared zero page might get mapped and no backend storage
   might get preallocated -- allocation might be deferred to
   write-fault time. Especially shared file mappings require an explicit
   fallocate() upfront to actually preallocate backend memory (blocks in
   the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
   (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
   prefault page tables just like manually writing (or
   reading+writing) each individual page. This will break any COW
   mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
   (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
   mappings marked with VM_PFNMAP and VM_IO. Also, proper access
   permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
   mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
   might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
   when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
   cannot protect from the OOM (Out Of Memory) handler killing the
   process.
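
A minimal userspace sketch of the intended usage (a hypothetical helper; it
assumes MADV_POPULATE_WRITE is defined by the headers and falls back to the
old touch-every-page approach with an assumed 4 KiB page size):

        #include <errno.h>
        #include <stddef.h>
        #include <sys/mman.h>

        /* Prefault/preallocate [addr, addr + len) writably, reporting errors. */
        static int prefault_writable(void *addr, size_t len)
        {
        #ifdef MADV_POPULATE_WRITE
                if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
                        return 0;
                if (errno != EINVAL)            /* real failure, e.g. ENOMEM */
                        return -1;
                /* EINVAL: kernel without MADV_POPULATE_WRITE, fall through. */
        #endif
                volatile char *p = addr;

                for (size_t off = 0; off < len; off += 4096)
                        p[off] = p[off];        /* read+write one byte per page */
                return 0;
        }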

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time.  While not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.

MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back.  As discussed above, shared file mappings
might require an explicit fallocate() upfront to achieve
preallocation+prepopulation.

Although sparse memory mappings are the primary use case, this will also
be useful for other preallocate/prefault use cases where MAP_POPULATE is
not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back then was performance improvements --
which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times.  The results
correspond to the shortest execution time.  In general, the performance
benefit for huge pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest.  Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which results in short population times.

The fastest way to allocate backend storage (here: swap or huge pages) and
prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest, however, doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes.
POPULATE_READ is faster than simple reads.

Without an fd, the fastest way to allocate backend storage and prefault
page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; the one
exception is actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.

The fastest way to allocate backend storage and prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.

V.3: Detailed results

==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :     0.119 ms
Anon 4 KiB     : Write                    :     0.222 ms
Anon 4 KiB     : Read/Write               :     0.380 ms
Anon 4 KiB     : POPULATE_READ            :     0.060 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
Memfd 4 KiB    : Read                     :     0.034 ms
Memfd 4 KiB    : Write                    :     0.310 ms
Memfd 4 KiB    : Read/Write               :     0.362 ms
Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
tmpfs          : Read                     :     0.033 ms
tmpfs          : Write                    :     0.313 ms
tmpfs          : Read/Write               :     0.406 ms
tmpfs          : POPULATE_READ            :     0.039 ms
tmpfs          : POPULATE_WRITE           :     0.285 ms
file           : Read                     :     0.033 ms
file           : Write                    :     0.351 ms
file           : Read/Write               :     0.408 ms
file           : POPULATE_READ            :     0.039 ms
file           : POPULATE_WRITE           :     0.290 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :   237.940 ms
Anon 4 KiB     : Write                    :   708.409 ms
Anon 4 KiB     : Read/Write               :  1054.041 ms
Anon 4 KiB     : POPULATE_READ            :   124.310 ms
Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
Memfd 4 KiB    : Read                     :   136.928 ms
Memfd 4 KiB    : Write                    :   963.898 ms
Memfd 4 KiB    : Read/Write               :  1106.561 ms
Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
Memfd 2 MiB    : Read                     :   357.116 ms
Memfd 2 MiB    : Write                    :   357.210 ms
Memfd 2 MiB    : Read/Write               :   357.606 ms
Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
tmpfs          : Read                     :   137.536 ms
tmpfs          : Write                    :   954.362 ms
tmpfs          : Read/Write               :  1105.954 ms
tmpfs          : POPULATE_READ            :    80.289 ms
tmpfs          : POPULATE_WRITE           :   822.826 ms
file           : Read                     :   137.874 ms
file           : Write                    :   987.025 ms
file           : Read/Write               :  1107.439 ms
file           : POPULATE_READ            :    80.413 ms
file           : POPULATE_WRITE           :   857.622 ms
hugetlbfs      : Read                     :   355.607 ms
hugetlbfs      : Write                    :   355.729 ms
hugetlbfs      : Read/Write               :   356.127 ms
hugetlbfs      : POPULATE_READ            :   354.585 ms
hugetlbfs      : POPULATE_WRITE           :   355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :     0.394 ms
Anon 4 KiB     : Write                    :     0.348 ms
Anon 4 KiB     : Read/Write               :     0.400 ms
Anon 4 KiB     : POPULATE_READ            :     0.326 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
Anon 2 MiB     : Read                     :     0.030 ms
Anon 2 MiB     : Write                    :     0.030 ms
Anon 2 MiB     : Read/Write               :     0.030 ms
Anon 2 MiB     : POPULATE_READ            :     0.030 ms
Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
Memfd 4 KiB    : Read                     :     0.412 ms
Memfd 4 KiB    : Write                    :     0.372 ms
Memfd 4 KiB    : Read/Write               :     0.419 ms
Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
Memfd 4 KiB    : FALLOCATE                :     0.137 ms
Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
Memfd 2 MiB    : FALLOCATE                :     0.030 ms
Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
tmpfs          : Read                     :     0.416 ms
tmpfs          : Write                    :     0.369 ms
tmpfs          : Read/Write               :     0.425 ms
tmpfs          : POPULATE_READ            :     0.346 ms
tmpfs          : POPULATE_WRITE           :     0.295 ms
tmpfs          : FALLOCATE                :     0.139 ms
tmpfs          : FALLOCATE+Read           :     0.447 ms
tmpfs          : FALLOCATE+Write          :     0.333 ms
tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
file           : Read                     :     0.191 ms
file           : Write                    :     0.511 ms
file           : Read/Write               :     0.524 ms
file           : POPULATE_READ            :     0.196 ms
file           : POPULATE_WRITE           :     0.434 ms
file           : FALLOCATE                :     0.004 ms
file           : FALLOCATE+Read           :     0.197 ms
file           : FALLOCATE+Write          :     0.554 ms
file           : FALLOCATE+Read/Write     :     0.480 ms
file           : FALLOCATE+POPULATE_READ  :     0.201 ms
file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
hugetlbfs      : FALLOCATE                :     0.030 ms
hugetlbfs      : FALLOCATE+Read           :     0.031 ms
hugetlbfs      : FALLOCATE+Write          :     0.031 ms
hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :  1053.090 ms
Anon 4 KiB     : Write                    :   913.642 ms
Anon 4 KiB     : Read/Write               :  1060.350 ms
Anon 4 KiB     : POPULATE_READ            :   893.691 ms
Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
Anon 2 MiB     : Read                     :   358.553 ms
Anon 2 MiB     : Write                    :   358.419 ms
Anon 2 MiB     : Read/Write               :   357.992 ms
Anon 2 MiB     : POPULATE_READ            :   357.533 ms
Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
Memfd 4 KiB    : Read                     :  1078.144 ms
Memfd 4 KiB    : Write                    :   942.036 ms
Memfd 4 KiB    : Read/Write               :  1100.391 ms
Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
Memfd 4 KiB    : FALLOCATE                :   304.632 ms
Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
Memfd 2 MiB    : Read                     :   358.131 ms
Memfd 2 MiB    : Write                    :   358.099 ms
Memfd 2 MiB    : Read/Write               :   358.250 ms
Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
Memfd 2 MiB    : FALLOCATE                :   356.735 ms
Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
tmpfs          : Read                     :  1087.265 ms
tmpfs          : Write                    :   950.840 ms
tmpfs          : Read/Write               :  1107.567 ms
tmpfs          : POPULATE_READ            :   922.605 ms
tmpfs          : POPULATE_WRITE           :   810.094 ms
tmpfs          : FALLOCATE                :   306.320 ms
tmpfs          : FALLOCATE+Read           :  1169.796 ms
tmpfs          : FALLOCATE+Write          :   933.730 ms
tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
file           : Read                     :   654.101 ms
file           : Write                    :  1259.142 ms
file           : Read/Write               :  1289.509 ms
file           : POPULATE_READ            :   661.642 ms
file           : POPULATE_WRITE           :  1106.816 ms
file           : FALLOCATE                :     1.864 ms
file           : FALLOCATE+Read           :   656.328 ms
file           : FALLOCATE+Write          :  1153.300 ms
file           : FALLOCATE+Read/Write     :  1180.613 ms
file           : FALLOCATE+POPULATE_READ  :   668.347 ms
file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
hugetlbfs      : Read                     :   357.245 ms
hugetlbfs      : Write                    :   357.413 ms
hugetlbfs      : Read/Write               :   357.120 ms
hugetlbfs      : POPULATE_READ            :   356.321 ms
hugetlbfs      : POPULATE_WRITE           :   356.693 ms
hugetlbfs      : FALLOCATE                :   355.927 ms
hugetlbfs      : FALLOCATE+Read           :   357.074 ms
hugetlbfs      : FALLOCATE+Write          :   357.120 ms
hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
**************************************************

[1] https://lkml.org/lkml/2013/6/27/698

[akpm@linux-foundation.org: coding style fixes]

Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
David Hildenbrand a78f1ccd37 mm: make variable names for populate_vma_page_range() consistent
Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2.

Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2.

This patch (of 5):

Let's make the variable names in the function declaration match the
variable names used in the definition.

Link: https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210419135443.12822-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi c5b5a3dd2c mm: thp: refactor NUMA fault handling
When the THP NUMA fault support was added THP migration was not supported
yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate misplaced pages.  There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails just restore the old normal PMD.

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but the THP code has been reworked a lot since then and
it seems the anon_vma lock is no longer required to avoid the race.

The page refcount elevation while holding the ptl should prevent THP
split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
and remove all the dead and duplicate code.

[dan.carpenter@oracle.com: fix a double unlock bug]
  Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda

Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi f4c0d8367e mm: memory: make numa_migrate_prep() non-static
The numa_migrate_prep() will be used by huge NUMA fault as well in the
following patch, make it non-static.

Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Mel Gorman 44042b4498 mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
The per-cpu page allocator (PCP) only stores order-0 pages.  This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock.  This patch extends the PCP allocator to store THP and
"cheap" high-order pages.  Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.
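
A hedged sketch of the idea (not the upstream struct layout; the field names
and the array sizing here are illustrative only):

        #include <linux/list.h>
        #include <linux/mmzone.h>

        /*
         * Sketch: instead of a single order-0 free list, keep one per-cpu
         * free list per cached order so that "cheap" high-order frees and
         * allocations can also bypass zone->lock.
         */
        struct pcp_cache_sketch {
                int count;      /* base pages currently cached on this CPU */
                int high;       /* drain back to the buddy lists above this */
                int batch;      /* chunk size used when refilling/draining */
                struct list_head lists[PAGE_ALLOC_COSTLY_ORDER + 1];
        };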

Note that this is not necessarily a universal performance win because of
how it is implemented.  High-order pages can cause pcp->high to be
exceeded prematurely for lower orders so, for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.

That said, basic performance testing passed.  The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.

netperf-udp
                                 5.13.0-rc2             5.13.0-rc2
                           mm-pcpburst-v3r4   mm-pcphighorder-v1r7
Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*

Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations.  The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.

Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00