Commit Graph

524 Commits

Author SHA1 Message Date
Aristeu Rozanski 78a1a48897 mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 14687619e1122d71b2ed70e1afa6bc352e629e85
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Dec 22 20:00:22 2022 +0100

    mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE()

    Currently the vm_unmap_ram() function triggers a BUG() if an area is not
    found.  Replace it with a WARN_ON_ONCE() error message and keep the
    machine alive instead of stopping it.

    The worst case is a memory leak.

    Link: https://lkml.kernel.org/r/20221222190022.134380-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:02 -04:00
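A minimal sketch of the pattern described in the commit above, assuming the lookup helpers used by vm_unmap_ram() (find_vmap_area(), free_unmap_vmap_area()); illustrative only, not the exact upstream diff:

    struct vmap_area *va;

    /* before: a missing area halted the machine */
    va = find_vmap_area(addr);
    BUG_ON(!va);

    /* after: warn once and keep the machine alive; worst case is a leak */
    va = find_vmap_area(addr);
    if (WARN_ON_ONCE(!va))
        return;
    free_unmap_vmap_area(va);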
Aristeu Rozanski d99765067c mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit edd898181e2f6f0969c08e1dfe2b7cdf902b9b33
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Dec 22 20:00:20 2022 +0100

    mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap()

    Currently the __vunmap() path calls __find_vmap_area() twice: once on
    entry to check that the area exists, and then inside the remove_vm_area()
    function, which performs a new search for the VA.

    In order to improve it from a performance point of view, we split
    remove_vm_area() into two new parts:
      - find_unlink_vmap_area() that does a search and unlink from tree;
      - __remove_vm_area() that removes without searching.

    In this case there is no functional change for remove_vm_area()
    whereas vm_remove_mappings(), where a second search happens, switches to
    the __remove_vm_area() variant where the already detached VA is passed as
    a parameter, so there is no need to find it again.

    Performance-wise, I used test_vmalloc.sh with 32 threads doing alloc/free
    on a 64-CPU x86_64 box:

    perf without this patch:
    -   31.41%     0.50%  vmalloc_test/10  [kernel.vmlinux]    [k] __vunmap
       - 30.92% __vunmap
          - 17.67% _raw_spin_lock
               native_queued_spin_lock_slowpath
          - 12.33% remove_vm_area
             - 11.79% free_vmap_area_noflush
                - 11.18% _raw_spin_lock
                     native_queued_spin_lock_slowpath
            0.76% free_unref_page

    perf with this patch:
    -   11.35%     0.13%  vmalloc_test/14  [kernel.vmlinux]    [k] __vunmap
       - 11.23% __vunmap
          - 8.28% find_unlink_vmap_area
             - 7.95% _raw_spin_lock
                  7.44% native_queued_spin_lock_slowpath
          - 1.93% free_vmap_area_noflush
             - 0.56% _raw_spin_lock
                  0.53% native_queued_spin_lock_slowpath
            0.60% __vunmap_range_noflush

    __vunmap() consumes around 20% fewer CPU cycles on this test.

    Also, switch from find_vmap_area() to find_unlink_vmap_area() to avoid
    taking the vmap_area_lock twice: once for finding the area and a second
    time for unlinking it from the tree.

    [urezki@gmail.com: switch to find_unlink_vmap_area() in vm_unmap_ram()]
      Link: https://lkml.kernel.org/r/20221222190022.134380-2-urezki@gmail.com
    Link: https://lkml.kernel.org/r/20221222190022.134380-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reported-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:02 -04:00
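A rough sketch of the split described above; the signatures are abridged from the commit description and may not match the upstream patch exactly:

    /* one pass under vmap_area_lock both finds and unlinks the VA */
    struct vmap_area *va = find_unlink_vmap_area((unsigned long)addr);

    if (!va)
        return;

    /* the already-detached VA is handed over, so no second tree search */
    __remove_vm_area(va);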
Aristeu Rozanski d58c35d7c6 mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 80b1d8fdfad1f3084450afa6e2efcdcce867d4af
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Mon Dec 19 12:36:59 2022 +0000

    mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()

    This function sets __GFP_NOWARN in the gfp_mask rendering the warn_alloc()
    invocations no-ops.  Remove this and instead rely on this flag being set
    only for the vm_area_alloc_pages() function, ensuring it is cleared for
    each of the warn_alloc() calls.

    Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:01 -04:00
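An abridged sketch of the idea: keep __GFP_NOWARN out of the function-wide gfp_mask so that warn_alloc() still fires, and apply it only to the internal page allocation.  Argument names are borrowed loosely from __vmalloc_area_node() and may differ from the exact upstream code:

    /* before: every later warn_alloc(gfp_mask, ...) became a no-op */
    gfp_mask |= __GFP_NOWARN;

    /* after: suppress allocator warnings only where failure is handled */
    area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN, node,
                                         page_order, nr_small_pages,
                                         area->pages);
    if (area->nr_pages != nr_small_pages)
        warn_alloc(gfp_mask, NULL, "vmalloc error: page allocation failure");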
Audra Mitchell b93f81bdc1 mm: vmalloc: use trace_free_vmap_area_noflush event
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 8c4196fe810a6717a8f9e528083911703f6a5a51
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Oct 18 20:10:52 2022 +0200

    mm: vmalloc: use trace_free_vmap_area_noflush event

    It is for debug purposes and is called when a vmap area gets freed.  This
    event gives some indication about:

    - the start address of the released area;
    - the current number of outstanding pages;
    - the maximum number of allowed outstanding pages.

    Link: https://lkml.kernel.org/r/20221018181053.434508-7-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:52 -04:00
Audra Mitchell ecc16fd211 mm: vmalloc: use trace_purge_vmap_area_lazy event
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 6030fd5fd1f7baaac3661a5301cc7838d4e3b7f6
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Oct 18 20:10:51 2022 +0200

    mm: vmalloc: use trace_purge_vmap_area_lazy event

    This is for debug purposes and is called when all outstanding areas are
    removed back to the vmap space.  It gives some extra information about:

    - the start:end range where a set of vmap areas was freed;
    - the number of purged areas which were backed off.

    [urezki@gmail.com: simplify return boolean expression]
      Link: https://lkml.kernel.org/r/20221020125247.5053-1-urezki@gmail.com
    Link: https://lkml.kernel.org/r/20221018181053.434508-6-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
Audra Mitchell 58a47c268a mm: vmalloc: use trace_alloc_vmap_area event
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit cf243da6ab3987b65b95357194926a31415095b8
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Oct 18 20:10:50 2022 +0200

    mm: vmalloc: use trace_alloc_vmap_area event

    This is for debug purposes and is called when an allocation attempt
    occurs.  This event gives some information about:

    - the start address of the allocated area;
    - the size that is requested;
    - the alignment that is required;
    - the vstart/vend restriction;
    - whether the allocation failed.

    Link: https://lkml.kernel.org/r/20221018181053.434508-5-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:51 -04:00
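A rough sketch of how the event might be emitted from alloc_vmap_area(); the placement and the failure test are assumptions based on the description above, not a verbatim copy of the patch:

    addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
                             size, align, vstart, vend);

    /* the last argument records whether the attempt failed */
    trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);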
Chris von Recklinghausen caeb1e1d2e mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0d1c81edc61e553ed7a5db18fb8074c8b78e1538
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:21:41 2023 -0700

    mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()

    vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
    but it's intended for userspace page tables: prefer pte_offset_kernel().

    Link: https://lkml.kernel.org/r/696386a-84f8-b33c-82e5-f865ed6eb39@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:15 -04:00
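A minimal sketch of the substitution in vmalloc_to_page(), abridged from the description above:

    pte_t *ptep, pte;

    /* before: map/unmap pair meant for userspace page tables */
    ptep = pte_offset_map(pmd, addr);
    pte = *ptep;
    pte_unmap(ptep);

    /* after: kernel page tables are always mapped, no unmap needed */
    ptep = pte_offset_kernel(pmd, addr);
    pte = *ptep;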
Chris von Recklinghausen 275851fbc1 mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit fdea03e12aa2a44a7bb34144208be97fc25dfd90
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Apr 13 15:12:21 2023 +0200

    mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()

    Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
    must also properly handle allocation/mapping failures.  In the case of
    such, it must clean up the already created metadata mappings and return an
    error code, so that the error can be propagated to ioremap_page_range().
    Without doing so, KMSAN may silently fail to bring the metadata for the
    page range into a consistent state, which will result in user-visible
    crashes when trying to access them.

    Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com
    Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:00 -04:00
Chris von Recklinghausen 3adc0f09d5 mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 47ebd0310e89c087f56e58c103c44b72a2f6b216
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Apr 13 15:12:20 2023 +0200

    mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()

    As reported by Dipanjan Das, when KMSAN is used together with kernel fault
    injection (or, generally, even without the latter), calls to kcalloc() or
    __vmap_pages_range_noflush() may fail, leaving the metadata mappings for
    the virtual mapping in an inconsistent state.  When these metadata
    mappings are accessed later, the kernel crashes.

    To address the problem, we return a non-zero error code from
    kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
    failure inside it, and make vmap_pages_range_noflush() return an error if
    KMSAN fails to allocate the metadata.

    This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
    as these allocation failures are not fatal anymore.

    Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
    Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:00 -04:00
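A sketch of the resulting error propagation in vmap_pages_range_noflush(), reconstructed from the description above (abridged):

    int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                                 pgprot_t prot, struct page **pages,
                                 unsigned int page_shift)
    {
        int ret = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
                                                 page_shift);
        if (ret)
            return ret;     /* KMSAN metadata allocation/mapping failed */

        return __vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
    }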
Chris von Recklinghausen 271a98f55e mm: kmsan: maintain KMSAN metadata for page operations
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit b073d7f8aee4ebf05d10e3380df377b73120cf16
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Sep 15 17:03:48 2022 +0200

    mm: kmsan: maintain KMSAN metadata for page operations

    Insert KMSAN hooks that make the necessary bookkeeping changes:
     - poison page shadow and origins in alloc_pages()/free_page();
     - clear page shadow and origins in clear_page(), copy_user_highpage();
     - copy page metadata in copy_highpage(), wp_page_copy();
     - handle vmap()/vunmap()/iounmap();

    Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@google.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Eric Biggers <ebiggers@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ilya Leoshkevich <iii@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michael S. Tsirkin <mst@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vegard Nossum <vegard.nossum@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:35 -04:00
Nico Pache e3a473e864 mm, vmalloc: fix high order __GFP_NOFAIL allocations
commit e9c3cda4d86e56bf7fe403729f38c4f0f65d3860
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Mar 6 09:15:17 2023 +0100

    mm, vmalloc: fix high order __GFP_NOFAIL allocations

    Gao Xiang has reported that the page allocator complains about high order
    __GFP_NOFAIL request coming from the vmalloc core:

     __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
     alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
     vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
     __vmalloc_area_node mm/vmalloc.c:3057 [inline]
     __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
     kvmalloc_node+0x156/0x1a0 mm/util.c:606
     kvmalloc include/linux/slab.h:737 [inline]
     kvmalloc_array include/linux/slab.h:755 [inline]
     kvcalloc include/linux/slab.h:760 [inline]

    It seems that I completely missed the case of high-order allocations
    backing vmalloc areas when implementing __GFP_NOFAIL support.  This means
    that [k]vmalloc et al. can allocate higher-order allocations with
    __GFP_NOFAIL, which can easily trigger the OOM killer for non-costly
    orders or cause a lot of reclaim/compaction activity if those requests
    cannot be satisfied.

    Fix the issue by falling back to zero order allocations for __GFP_NOFAIL
    requests if the high order request fails.

    Link: https://lkml.kernel.org/r/ZAXynvdNqcI0f6Us@dhcp22.suse.cz
    Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
    Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Link: https://lkml.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:03 -06:00
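A rough sketch of the fallback inside the page-allocation loop of vm_area_alloc_pages(); the local names (alloc_gfp, nofail) are illustrative and abridged from the upstream fix:

    page = alloc_pages_node(nid, alloc_gfp, order);
    if (unlikely(!page)) {
        if (!nofail)
            break;

        /* __GFP_NOFAIL must not fail: retry with order-0 pages instead
         * of hammering the allocator with high-order nofail requests */
        alloc_gfp |= __GFP_NOFAIL;
        order = 0;
        continue;
    }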
Chris von Recklinghausen f3aeca244c mm/vmalloc: extend find_vmap_lowest_match_check with extra arguments
Bugzilla: https://bugzilla.redhat.com/2160210

commit bd1264c37c15a75a3164852740ad0c9529907d83
Author: Song Liu <song@kernel.org>
Date:   Tue Aug 30 22:27:34 2022 -0700

    mm/vmalloc: extend find_vmap_lowest_match_check with extra arguments

    find_vmap_lowest_match() is now able to handle different roots.  With
    DEBUG_AUGMENT_LOWEST_MATCH_CHECK enabled as:

    : --- a/mm/vmalloc.c
    : +++ b/mm/vmalloc.c
    : @@ -713,7 +713,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
    : /*** Global kva allocator ***/
    :
    : -#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
    : +#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 1

    compilation failed as:

    mm/vmalloc.c: In function 'find_vmap_lowest_match_check':
    mm/vmalloc.c:1328:32: warning: passing argument 1 of 'find_vmap_lowest_match' makes pointer from integer without a cast [-Wint-conversion]
    1328 |  va_1 = find_vmap_lowest_match(size, align, vstart, false);
         |                                ^~~~
         |                                |
         |                                long unsigned int
    mm/vmalloc.c:1236:40: note: expected 'struct rb_root *' but argument is of type 'long unsigned int'
    1236 | find_vmap_lowest_match(struct rb_root *root, unsigned long size,
         |                        ~~~~~~~~~~~~~~~~^~~~
    mm/vmalloc.c:1328:9: error: too few arguments to function 'find_vmap_lowest_match'
    1328 |  va_1 = find_vmap_lowest_match(size, align, vstart, false);
         |         ^~~~~~~~~~~~~~~~~~~~~~
    mm/vmalloc.c:1236:1: note: declared here
    1236 | find_vmap_lowest_match(struct rb_root *root, unsigned long size,
         | ^~~~~~~~~~~~~~~~~~~~~~

    Extend find_vmap_lowest_match_check() and find_vmap_lowest_linear_match()
    with extra arguments to fix this.

    Link: https://lkml.kernel.org/r/20220906060548.1127396-1-song@kernel.org
    Link: https://lkml.kernel.org/r/20220831052734.3423079-1-song@kernel.org
    Fixes: f9863be49312 ("mm/vmalloc: extend __alloc_vmap_area() with extra arguments")
    Signed-off-by: Song Liu <song@kernel.org>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen 8026e6e74d mm/vmalloc: extend __find_vmap_area() with one more argument
Bugzilla: https://bugzilla.redhat.com/2160210

commit 899c6efe58dbe8cb9768057ffc206d03e5a89ce8
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jun 7 11:34:48 2022 +0200

    mm/vmalloc: extend __find_vmap_area() with one more argument

    __find_vmap_area() finds a "vmap_area" based on the passed address.  It
    scans the specific "vmap_area_root" rb-tree.  Extend the function with one
    extra argument, so that any tree in which the search has to be done can be
    specified.

    There is no functional change as a result of this patch.

    Link: https://lkml.kernel.org/r/20220607093449.3100-5-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
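The change is confined to the signature; a sketch of the before/after prototypes, abridged:

    /* before: always walks the global vmap_area_root */
    static struct vmap_area *__find_vmap_area(unsigned long addr);

    /* after: the caller chooses which tree to search */
    static struct vmap_area *__find_vmap_area(unsigned long addr,
                                              struct rb_root *root);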
Chris von Recklinghausen fa692525d6 mm/vmalloc: initialize VA's list node after unlink
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5d7a7c54d3d7ff2f54725881dc7e06a7f5c94dc2
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jun 7 11:34:47 2022 +0200

    mm/vmalloc: initialize VA's list node after unlink

    A vmap_area can travel between different places, for example being
    attached to or detached from different rb-trees.  In order to prevent
    fancy bugs, initialize a VA's list node after it is removed from the
    list, so that it pairs with the VA's rb_node, which is also initialized.

    There is no functional change as a result of this patch.

    Link: https://lkml.kernel.org/r/20220607093449.3100-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:18 -04:00
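A sketch of the unlink path after this change: the list node is re-initialized so it pairs with the cleared rb_node (abridged):

    rb_erase(&va->rb_node, root);
    list_del_init(&va->list);   /* was list_del(); leaves the node
                                   self-pointing instead of poisoned */
    RB_CLEAR_NODE(&va->rb_node);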
Chris von Recklinghausen 903bea615d mm/vmalloc: extend __alloc_vmap_area() with extra arguments
Bugzilla: https://bugzilla.redhat.com/2160210

commit f9863be49312aa1f566dca12603e33487965e6a4
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jun 7 11:34:46 2022 +0200

    mm/vmalloc: extend __alloc_vmap_area() with extra arguments

    Currently it is implied that __alloc_vmap_area() allocates only from the
    global vmap space; therefore the list head and rb-tree that represent the
    free vmap space are not passed as parameters to this function but are
    accessed directly from it.

    Extend __alloc_vmap_area() and its dependent functions with the
    possibility to allocate from different trees, making the interface common
    rather than specific.

    There is no functional change as a result of this patch.

    Link: https://lkml.kernel.org/r/20220607093449.3100-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen cb6343367f mm/vmalloc: make link_va()/unlink_va() common to different rb_root
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8eb510db2125ab471967819d1f8749162588bba9
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jun 7 11:34:45 2022 +0200

    mm/vmalloc: make link_va()/unlink_va() common to different rb_root

    Patch series "Reduce a vmalloc internal lock contention preparation work".

    This small series is preparation work to implement per-cpu vmalloc
    allocation in order to reduce high internal lock contention.  This
    series does not introduce any functional changes; it is only about
    preparation.

    This patch (of 5):

    Currently link_va() and unlink_va(), in order to figure out the tree type,
    compare a passed root value with the global free_vmap_area_root variable
    to distinguish the augmented rb-tree from a regular one.  This is hard
    coded, since such functions can manipulate only the specific
    "free_vmap_area_root" tree that represents the global free vmap space.

    Make it common by introducing "_augment" versions of both internal
    functions, so it is possible to deal with different trees.

    There is no functional change as a result of this patch.

    Link: https://lkml.kernel.org/r/20220607093449.3100-1-urezki@gmail.com
    Link: https://lkml.kernel.org/r/20220607093449.3100-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen 81a0337bf2 mm/vmalloc: add code comment for find_vmap_area_exceed_addr()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 153090f2c6d595c9636c582ed4b6c4dac1739a41
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jun 7 18:59:58 2022 +0800

    mm/vmalloc: add code comment for find_vmap_area_exceed_addr()

    Its behaviour is like find_vma(), which finds an area above the specified
    address; add a comment to make it easier to understand.

    Also fix two grammar mistakes/typos.

    Link: https://lkml.kernel.org/r/20220607105958.382076-5-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen 46f526f0da mm/vmalloc: fix typo in local variable name
Bugzilla: https://bugzilla.redhat.com/2160210

commit baa468a648b489e35475c8de9dd1d77f0a687b4d
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jun 7 18:59:57 2022 +0800

    mm/vmalloc: fix typo in local variable name

    In __purge_vmap_area_lazy(), rename local_pure_list to local_purge_list.

    Link: https://lkml.kernel.org/r/20220607105958.382076-4-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen ec12a169cc mm/vmalloc: remove the redundant boundary check
Bugzilla: https://bugzilla.redhat.com/2160210

commit 753df96be5d3a21cd70d8ab4f7464a868e1d2cb4
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jun 7 18:59:56 2022 +0800

    mm/vmalloc: remove the redundant boundary check

    In find_va_links(), when traversing the vmap_area tree, the comparison
    that checks whether the passed-in 'va' is above or below 'tmp_va' is
    redundant, assuming both 'va' and 'tmp_va' have ->va_start <= ->va_end.

    Simplify the check accordingly.

    Link: https://lkml.kernel.org/r/20220607105958.382076-3-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
Chris von Recklinghausen c2f147f788 mm/vmalloc: invoke classify_va_fit_type() in adjust_va_to_fit_type()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1b23ff80b399ae4561bbfd45f7c9c98f62797304
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jun 7 18:59:55 2022 +0800

    mm/vmalloc: invoke classify_va_fit_type() in adjust_va_to_fit_type()

    Patch series "Cleanup patches of vmalloc", v2.

    Some cleanup patches found when reading vmalloc code.

    This patch (of 4):

    adjust_va_to_fit_type() checks all values of passed in fit type, including
    NOTHING_FIT in the else branch.  However, the check of NOTHING_FIT has
    been done inside adjust_va_to_fit_type() and before it's called in all
    call sites.

    In fact, both of these functions are coupled tightly, since
    classify_va_fit_type() is doing the preparation work for
    adjust_va_to_fit_type().  So putting invocation of classify_va_fit_type()
    inside adjust_va_to_fit_type() can simplify code logic and the redundant
    check of NOTHING_FIT issue will go away.

    Link: https://lkml.kernel.org/r/20220607105958.382076-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20220607105958.382076-2-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:14 -04:00
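A sketch of the resulting shape of adjust_va_to_fit_type(); arguments are abridged, but the point is that the classification and the NOTHING_FIT check now live inside the function rather than at every call site:

    static int adjust_va_to_fit_type(struct vmap_area *va,
                                     unsigned long nva_start_addr,
                                     unsigned long size)
    {
        enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);

        if (WARN_ON_ONCE(type == NOTHING_FIT))
            return -1;

        /* ... adjust the free VA according to FL_FIT_TYPE, LE_FIT_TYPE,
         * RE_FIT_TYPE or NE_FIT_TYPE ... */
    }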
Chris von Recklinghausen 51420b6878 usercopy: Handle vm_map_ram() areas
Bugzilla: https://bugzilla.redhat.com/2160210

commit 993d0b287e2ef7bee2e8b13b0ce4d2b5066f278e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Jun 12 22:32:25 2022 +0100

    usercopy: Handle vm_map_ram() areas

    vmalloc does not allocate a vm_struct for vm_map_ram() areas.  That causes
    us to deny usercopies from those areas.  This affects XFS which uses
    vm_map_ram() for its directories.

    Fix this by calling find_vmap_area() instead of find_vm_area().

    Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Tested-by: Zorro Lang <zlang@redhat.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:13 -04:00
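An abridged sketch of the resulting check, assuming the surrounding check_heap_object() logic in mm/usercopy.c; the bounds handling is simplified here:

    if (is_vmalloc_addr(ptr)) {
        struct vmap_area *area = find_vmap_area(addr);

        /* vm_map_ram() areas have a vmap_area but no vm_struct */
        if (!area)
            usercopy_abort("vmalloc", "no area", to_user, 0, n);

        if (n > area->va_end - addr)
            usercopy_abort("vmalloc", NULL, to_user,
                           addr - area->va_start, n);
        return;
    }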
Chris von Recklinghausen d3bf7e482f vmap(): don't allow invalid pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 4fcdcc12915c70761ae6adf25e3a295a75a7431d
Author: Yury Norov <yury.norov@gmail.com>
Date:   Thu Apr 28 23:16:00 2022 -0700

    vmap(): don't allow invalid pages

    vmap() takes struct page *pages as one of its arguments, and a user may
    provide an invalid pointer, which may lead to a corrupted translation
    table.

    An example of such behaviour is erroneous usage of virt_to_page():

            vaddr1 = dma_alloc_coherent()
            page = virt_to_page()   // Wrong here
            ...
            vaddr2 = vmap(page)
            memset(vaddr2)          // Faulting here

    virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel
    address.  The problem is that vmap() successfully populates the pte with
    the bad pfn, which is much harder to debug at memory access time.  This
    case should be caught by DEBUG_VIRTUAL, were it enabled, but it's not
    enabled in popular distros.

    The kernel already checks the pages against NULL.  In the case mentioned
    above, however, the address is not NULL, and it's big enough that the
    hardware generated an Address Size Abort on arm64:

            [  665.484101] Unhandled fault at 0xffff8000252cd000
            [  665.488807] Mem abort info:
            [  665.491617]   ESR = 0x96000043
            [  665.494675]   EC = 0x25: DABT (current EL), IL = 32 bits
            [  665.499985]   SET = 0, FnV = 0
            [  665.503039]   EA = 0, S1PTW = 0
            [  665.506167] Data abort info:
            [  665.509047]   ISV = 0, ISS = 0x00000043
            [  665.512882]   CM = 0, WnR = 1
            [  665.515851] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000818cb000
            [  665.522550] [ffff8000252cd000] pgd=000000affcfff003, pud=000000affcffe003, pmd=0000008fad8c3003, pte=00688000a5217713
            [  665.533160] Internal error: level 3 address size fault: 96000043 [#1] SMP
            [  665.539936] Modules linked in: [...]
            [  665.616212] CPU: 178 PID: 13199 Comm: test Tainted: P           OE 5.4.0-84-generic #94~18.04.1-Ubuntu
            [  665.626806] Hardware name: HPE Apollo 70             /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
            [  665.636618] pstate: 80400009 (Nzcv daif +PAN -UAO)
            [  665.641407] pc : __memset+0x38/0x188
            [  665.645146] lr : test+0xcc/0x3f8
            [  665.650184] sp : ffff8000359bb840
            [  665.653486] x29: ffff8000359bb840 x28: 0000000000000000
            [  665.658785] x27: 0000000000000000 x26: 0000000000231000
            [  665.664083] x25: ffff00ae660f6110 x24: ffff00ae668cb800
            [  665.669382] x23: 0000000000000001 x22: ffff00af533e5000
            [  665.674680] x21: 0000000000001000 x20: 0000000000000000
            [  665.679978] x19: ffff00ae66950000 x18: ffffffffffffffff
            [  665.685276] x17: 00000000588636a5 x16: 0000000000000013
            [  665.690574] x15: ffffffffffffffff x14: 000000000007ffff
            [  665.695872] x13: 0000000080000000 x12: 0140000000000000
            [  665.701170] x11: 0000000000000041 x10: ffff8000652cd000
            [  665.706468] x9 : ffff8000252cf000 x8 : ffff8000252cd000
            [  665.711767] x7 : 0303030303030303 x6 : 0000000000001000
            [  665.717065] x5 : ffff8000252cd000 x4 : 0000000000000000
            [  665.722363] x3 : ffff8000252cdfff x2 : 0000000000000001
            [  665.727661] x1 : 0000000000000003 x0 : ffff8000252cd000
            [  665.732960] Call trace:
            [  665.735395]  __memset+0x38/0x188
            [...]

    Interestingly, this abort happens even if copy_from_kernel_nofault() is
    used, which is quite inconvenient for debugging purposes.

    This patch adds a pfn_valid() check into the vmap() path so that an
    invalid mapping will not be created; WARN_ON() is used to let client code
    know that something went wrong and that this is not a regular EINVAL
    situation.

    Link: https://lkml.kernel.org/r/20220422220410.1308706-1-yury.norov@gmail.com
    Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
    Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Alexey Klimov <aklimov@redhat.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Ding Tianhong <dingtianhong@huawei.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Robin Murphy <robin.murphy@arm.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
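A sketch of the added sanity check in the PTE-installing loop of the vmap() path, with the pre-existing checks shown for context (abridged):

    struct page *page = pages[*nr];

    if (WARN_ON(!pte_none(*pte)))
        return -EBUSY;
    if (WARN_ON(!page))
        return -ENOMEM;
    /* new: refuse to install a PTE for a pfn that does not exist */
    if (WARN_ON(!pfn_valid(page_to_pfn(page))))
        return -EINVAL;

    set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));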
Chris von Recklinghausen cf14b00963 mm/vmalloc: fix a comment
Bugzilla: https://bugzilla.redhat.com/2160210

commit 98af39d52e336b2d7d7be67ac405f978d81f65b8
Author: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Date:   Thu Apr 28 23:16:00 2022 -0700

    mm/vmalloc: fix a comment

    The sentence
    "but the mempolcy want to alloc memory by interleaving"
    should be rephrased with
    "but the mempolicy wants to alloc memory by interleaving"
    where "mempolicy" is a struct name.

    This work is coauthored by
    Yinan Zhang
    Jiajian Ye
    Shenghong Han
    Chongxi Zhao
    Yuhong Feng
    Yongqiang Liu

    Link: https://lkml.kernel.org/r/20220401064543.4447-1-caoyixuan2019@email.szu.edu.cn
    Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Juri Lelli ed7281aaf3 mm/vmalloc: use raw_cpu_ptr() for vmap_block_queue access
Bugzilla: https://bugzilla.redhat.com/2171995

commit 3f80492001aa64ac585016050ace8680611c2e20
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu May 12 20:23:08 2022 -0700

    mm/vmalloc: use raw_cpu_ptr() for vmap_block_queue access

    The per-CPU resource vmap_block_queue is accessed via get_cpu_var().  That
    macro disables preemption and then loads the pointer from the current CPU.

    This doesn't work on PREEMPT_RT because a spinlock_t is later accessed
    within the preempt-disable section.

    There is no need to disable preemption while accessing the per-CPU struct
    vmap_block_queue because the list is protected with a spinlock_t.  The
    per-CPU struct is also accessed cross-CPU in purge_fragmented_blocks().

    It is possible that by using raw_cpu_ptr() the code migrates to another
    CPU and uses the struct from another CPU.  This is fine because the list
    is locked and the locked section is very short.

    Use raw_cpu_ptr() to access vmap_block_queue.

    Link: https://lkml.kernel.org/r/YnKx3duAB53P7ojN@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
2023-02-27 13:46:06 +01:00
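A minimal sketch of the access-pattern change; the vb_alloc()/new_vmap_block() call sites are abridged:

    /* before: get_cpu_var() disables preemption, which breaks on
     * PREEMPT_RT once the spinlock_t inside the struct is taken */
    vbq = &get_cpu_var(vmap_block_queue);
    /* ... take vbq->lock, walk the list ... */
    put_cpu_var(vmap_block_queue);

    /* after: no preemption disabling; the list is spinlock-protected,
     * so occasionally using another CPU's struct is harmless */
    vbq = raw_cpu_ptr(&vmap_block_queue);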
Nico Pache 24fcd8d160 mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush()
commit 08262ac50a7e4d70ee92b34746ea54a0ba51739a
Author: Matthew Wilcox <willy@infradead.org>
Date:   Thu Aug 18 22:07:41 2022 +0100

    mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush()

    If the pages being mapped are in HIGHMEM, page_address() returns NULL.
    This probably wasn't noticed before because there aren't currently any
    architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to
    call page_to_phys() and futureproofs us against such configurations
    existing.

    Link: https://lkml.kernel.org/r/Yv6qHc6e+m7TMWhi@casper.infradead.org
    Fixes: 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
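A sketch of the substitution in the huge-mapping path, abridged from the description above (the surrounding loop and error handling are omitted):

    /* before: page_address() is NULL for HIGHMEM pages */
    err = vmap_range_noflush(addr, addr + (1UL << page_shift),
                             __pa(page_address(pages[i])), prot, page_shift);

    /* after: page_to_phys() yields the physical address regardless of
     * whether the page has a permanent kernel mapping */
    err = vmap_range_noflush(addr, addr + (1UL << page_shift),
                             page_to_phys(pages[i]), prot, page_shift);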
Chris von Recklinghausen c27592720d kasan: fix zeroing vmalloc memory with HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c2f761dad7851d8088b91063ccaea3c970efe78
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Jun 9 20:18:47 2022 +0200

    kasan: fix zeroing vmalloc memory with HW_TAGS

    HW_TAGS KASAN skips zeroing page_alloc allocations backing vmalloc
    mappings via __GFP_SKIP_ZERO.  Instead, these pages are zeroed via
    kasan_unpoison_vmalloc() by passing the KASAN_VMALLOC_INIT flag.

    The problem is that __kasan_unpoison_vmalloc() does not zero pages when
    either kasan_vmalloc_enabled() or is_vmalloc_or_module_addr() fail.

    Thus:

    1. Change __vmalloc_node_range() to only set KASAN_VMALLOC_INIT when
       __GFP_SKIP_ZERO is set.

    2. Change __kasan_unpoison_vmalloc() to always zero pages when the
       KASAN_VMALLOC_INIT flag is set.

    3. Add WARN_ON() asserts to check that KASAN_VMALLOC_INIT cannot be set
       in other early return paths of __kasan_unpoison_vmalloc().

    Also clean up the comment in __kasan_unpoison_vmalloc.

    Link: https://lkml.kernel.org/r/4bc503537efdc539ffc3f461c1b70162eea31cf6.1654798516.git.andreyknvl@google.com
    Fixes: 23689e91fb22 ("kasan, vmalloc: add vmalloc tagging for HW_TAGS")
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Cc: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 733d1c207f mm/vmalloc: huge vmalloc backing pages should be split rather than compound
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3b8000ae185cb068adbda5f966a3835053c85fd4
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Apr 22 16:01:05 2022 +1000

    mm/vmalloc: huge vmalloc backing pages should be split rather than compound

    Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
    in order to allow the sub-pages to be refcounted by callers such as
    "remap_vmalloc_page [sic]" (remap_vmalloc_range).

    However a similar problem exists for other struct page fields callers
    use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
    not only refcounts it but uses ->lru, ->mapping, ->index.

    This is not compatible with compound sub-pages, and can cause bad page
    state issues like

      BUG: Bad page state in process swapper/0  pfn:00743
      page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
      flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
      raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: corrupted mapping in tail page
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
      Call Trace:
        dump_stack_lvl+0x74/0xa8 (unreliable)
        bad_page+0x12c/0x170
        free_tail_pages_check+0xe8/0x190
        free_pcp_prepare+0x31c/0x4e0
        free_unref_page+0x40/0x1b0
        __vunmap+0x1d8/0x420
        ...

    The correct approach is to use split high-order pages for the huge
    vmalloc backing. These allow callers to treat them in exactly the same
    way as individually-allocated order-0 pages.

    Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Cc: Paul Menzel <pmenzel@molgen.mpg.de>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
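A sketch of the resulting allocation in vm_area_alloc_pages(): backing pages are no longer compound, and high-order allocations are split so every sub-page can be treated like an individually allocated order-0 page (abridged):

    page = alloc_pages_node(nid, gfp, order);   /* no __GFP_COMP */
    if (page && order)
        split_page(page, order);    /* each sub-page gets its own refcount
                                       and usable ->mapping, ->lru, ->index */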
Chris von Recklinghausen b7b5b2dab2 vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP
Bugzilla: https://bugzilla.redhat.com/2120352

commit 559089e0a93d44280ec3ab478830af319c56dbe3
Author: Song Liu <song@kernel.org>
Date:   Fri Apr 15 09:44:10 2022 -0700

    vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP

    Huge page backed vmalloc memory could benefit performance in many cases.
    However, some users of vmalloc may not be ready to handle huge pages for
    various reasons: hardware constraints, potential pages split, etc.
    VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt out of huge
    pages.  However, it is not easy to track down all the users that require
    the opt-out, as the allocations are passed down different stacks and may
    cause issues in different layers.

    To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
    VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages can ask
    for them specifically.

    Also, remove vmalloc_no_huge() and add opt-in helper vmalloc_huge().

    Fixes: fac54e2bfb5b ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
    Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/"
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Song Liu <song@kernel.org>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
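A sketch of the new opt-in interface described above: callers that can cope with huge backing pages ask for them explicitly; everything else stays on order-0 pages:

    /* opt-in: may be backed by huge pages where the arch supports it */
    buf = vmalloc_huge(size, GFP_KERNEL);

    /* default: plain vmalloc() no longer needs VM_NO_HUGE_VMAP to stay
     * on order-0 backing pages */
    buf = vmalloc(size);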
Chris von Recklinghausen 38699432de mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Bugzilla: https://bugzilla.redhat.com/2120352

commit c12cd77cb028255663810e6c4528f0325facff66
Author: Omar Sandoval <osandov@fb.com>
Date:   Thu Apr 14 19:14:01 2022 -0700

    mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore

    Commit 3ee48b6af4 ("mm, x86: Saving vmcore with non-lazy freeing of
    vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
    lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
    purge the vmap areas instead of doing it lazily.

    Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
    context") moved the purging from the vunmap() caller to a worker thread.
    Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
    (possibly forever).  For example, consider the following scenario:

     1. Thread reads from /proc/vmcore. This eventually calls
        __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
        vmap_lazy_nr to lazy_max_pages() + 1.

     2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
        pages (one page plus the guard page) to the purge list and
        vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
        drain_vmap_work is scheduled.

     3. Thread returns from the kernel and is scheduled out.

     4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
        frees the 2 pages on the purge list. vmap_lazy_nr is now
        lazy_max_pages() + 1.

     5. This is still over the threshold, so it tries to purge areas again,
        but doesn't find anything.

     6. Repeat 5.

    If the system is running with only one CPU (which is typical for kdump)
    and preemption is disabled, then this will never make forward progress:
    there aren't any more pages to purge, so it hangs.  If there is more
    than one CPU or preemption is enabled, then the worker thread will spin
    forever in the background.  (Note that if there were already pages to be
    purged at the time that set_iounmap_nonlazy() was called, this bug is
    avoided.)

    This can be reproduced with anything that reads from /proc/vmcore
    multiple times.  E.g., vmcore-dmesg /proc/vmcore.

    It turns out that improvements to vmap() over the years have obsoleted
    the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
    of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
    The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
    5.18-rc1 with set_iounmap_nonlazy() removed entirely:

        |5.17  |5.18+fix|5.18+removal
      4k|40.86s|  40.09s|      26.73s
      1M|24.47s|  23.98s|      21.84s

    The removal was the fastest (by a wide margin with 4k reads).  This
    patch removes set_iounmap_nonlazy().

    Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
    Fixes: 690467c81b1a  ("mm/vmalloc: Move draining areas out of caller context")
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Acked-by: Chris Down <chris@chrisdown.name>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen 06b1f97a56 kasan, vmalloc: only tag normal vmalloc allocations
Bugzilla: https://bugzilla.redhat.com/2120352

commit f6e39794f4b6da7ca9b77f2f9ad11fd6f0ac83e5
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:35 2022 -0700

    kasan, vmalloc: only tag normal vmalloc allocations

    The kernel can use vmalloc to allocate executable memory.  The only supported
    way to do that is via __vmalloc_node_range() with the executable bit set
    in the prot argument.  (vmap() resets the bit via pgprot_nx()).

    Once tag-based KASAN modes start tagging vmalloc allocations, executing
    code from such allocations will lead to the PC register getting a tag,
    which is not tolerated by the kernel.

    Only tag the allocations for normal kernel pages.

    [andreyknvl@google.com: pass KASAN_VMALLOC_PROT_NORMAL to kasan_unpoison_vmalloc()]
      Link: https://lkml.kernel.org/r/9230ca3d3e40ffca041c133a524191fd71969a8d.1646233925.git.andreyknvl@google.com
    [andreyknvl@google.com: support tagged vmalloc mappings]
      Link: https://lkml.kernel.org/r/2f6605e3a358cf64d73a05710cb3da356886ad29.1646233925.git.andreyknvl@google.com
    [andreyknvl@google.com: don't unintentionally disabled poisoning]
      Link: https://lkml.kernel.org/r/de4587d6a719232e83c760113e46ed2d4d8da61e.1646757322.git.andreyknvl@google.com

    Link: https://lkml.kernel.org/r/fbfd9939a4dc375923c9a5c6b9e7ab05c26b8c6b.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen 748e63b743 kasan, vmalloc: add vmalloc tagging for HW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 23689e91fb22c15b84ac6c22ad9942039792f3af
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:32 2022 -0700

    kasan, vmalloc: add vmalloc tagging for HW_TAGS

    Add vmalloc tagging support to HW_TAGS KASAN.

    The key difference between HW_TAGS and the other two KASAN modes when it
    comes to vmalloc: HW_TAGS KASAN can only assign tags to physical memory.
    The other two modes have shadow memory covering every mapped virtual
    memory region.

    Make __kasan_unpoison_vmalloc() for HW_TAGS KASAN:

     - Skip non-VM_ALLOC mappings as HW_TAGS KASAN can only tag a single
       mapping of normal physical memory; see the comment in the function.

     - Generate a random tag, tag the returned pointer and the allocation,
       and initialize the allocation at the same time.

     - Propagate the tag into the page structs to allow accesses through
       page_address(vmalloc_to_page()).

    The rest of vmalloc-related KASAN hooks are not needed:

     - The shadow-related ones are fully skipped.

     - __kasan_poison_vmalloc() is kept as a no-op with a comment.

    Poisoning and zeroing of physical pages that are backing vmalloc()
    allocations are skipped via __GFP_SKIP_KASAN_UNPOISON and
    __GFP_SKIP_ZERO: __kasan_unpoison_vmalloc() does that instead.

    Enabling CONFIG_KASAN_VMALLOC with HW_TAGS is not yet allowed.

    Link: https://lkml.kernel.org/r/d19b2e9e59a9abc59d05b72dea8429dcaea739c6.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:00 -04:00
Chris von Recklinghausen 4eb112573d kasan, vmalloc: unpoison VM_ALLOC pages after mapping
Bugzilla: https://bugzilla.redhat.com/2120352

commit 19f1c3acf8f4431982355b2e8a78e42b884dd788
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:20 2022 -0700

    kasan, vmalloc: unpoison VM_ALLOC pages after mapping

    Make KASAN unpoison vmalloc mappings after they have been mapped in when
    it's possible: for vmalloc() (indentified via VM_ALLOC) and vm_map_ram().

    The reasons for this are:

     - For vmalloc() and vm_map_ram(): pages don't get unpoisoned in case
       mapping them fails.

     - For vmalloc(): HW_TAGS KASAN needs pages to be mapped to set tags via
       kasan_unpoison_vmalloc().

    As a part of these changes, the return value of __vmalloc_node_range() is
    changed to area->addr.  This is a non-functional change, as
    __vmalloc_area_node() returns area->addr anyway.

    Link: https://lkml.kernel.org/r/fcb98980e6fcd3c4be6acdcb5d6110898ef28548.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen c4a2bdcb91 kasan, vmalloc, arm64: mark vmalloc mappings as pgprot_tagged
Bugzilla: https://bugzilla.redhat.com/2120352

commit 01d92c7f358ce892279ca830cf6ccf2862a17d1c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:16 2022 -0700

    kasan, vmalloc, arm64: mark vmalloc mappings as pgprot_tagged

    HW_TAGS KASAN relies on ARM Memory Tagging Extension (MTE).  With MTE, a
    memory region must be mapped as MT_NORMAL_TAGGED to allow setting memory
    tags via MTE-specific instructions.

    Add proper protection bits to vmalloc() allocations.  These allocations
    are always backed by page_alloc pages, so the tags will actually be
    getting set on the corresponding physical memory.

    Link: https://lkml.kernel.org/r/983fc33542db2f6b1e77b34ca23448d4640bbb9e.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
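
For illustration, the protection change amounts to something like the snippet
below when choosing the page protections for a normal vmalloc mapping;
PAGE_KERNEL and pgprot_tagged() are existing kernel symbols (with a plain
pass-through fallback assumed on architectures without MTE), while the
surrounding context is simplified.

    pgprot_t prot = PAGE_KERNEL;

    /*
     * With HW_TAGS KASAN, normal vmalloc memory must be mapped as
     * MT_NORMAL_TAGGED so that MTE instructions can set tags on it.
     */
    if (IS_ENABLED(CONFIG_KASAN_HW_TAGS))
            prot = pgprot_tagged(prot);
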
Chris von Recklinghausen 4322150003 kasan, vmalloc: add vmalloc tagging for SW_TAGS
Bugzilla: https://bugzilla.redhat.com/2120352

commit 1d96320f8d5320423737da61e6e6937f6b475b5c
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:13 2022 -0700

    kasan, vmalloc: add vmalloc tagging for SW_TAGS

    Add vmalloc tagging support to SW_TAGS KASAN.

     - __kasan_unpoison_vmalloc() now assigns a random pointer tag, poisons
       the virtual mapping accordingly, and embeds the tag into the returned
       pointer.

     - __get_vm_area_node() (used by vmalloc() and vmap()) and
       pcpu_get_vm_areas() save the tagged pointer into vm_struct->addr
       (note: not into vmap_area->addr).

       This requires putting kasan_unpoison_vmalloc() after
       setup_vmalloc_vm[_locked](); otherwise the latter will overwrite the
       tagged pointer. The tagged pointer is then naturally propagated to
       vmalloc() and vmap().

     - vm_map_ram() returns the tagged pointer directly.

    As a result of this change, vm_struct->addr is now tagged.

    Enabling KASAN_VMALLOC with SW_TAGS is not yet allowed.

    Link: https://lkml.kernel.org/r/4a78f3c064ce905e9070c29733aca1dd254a74f1.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
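
As a self-contained model (not kernel code) of what "embedding the tag into
the returned pointer" means: a software tag can live in the otherwise unused
top byte of a 64-bit address. The helper names below are illustrative only.

    #include <stdint.h>

    #define TAG_SHIFT 56

    /* Store an 8-bit tag in the top byte of an address. */
    static inline void *set_tag(void *addr, uint8_t tag)
    {
            uintptr_t p = (uintptr_t)addr & ~((uintptr_t)0xff << TAG_SHIFT);

            return (void *)(p | ((uintptr_t)tag << TAG_SHIFT));
    }

    /* Recover the tag from a tagged pointer, e.g. for a later access check. */
    static inline uint8_t get_tag(const void *addr)
    {
            return (uintptr_t)addr >> TAG_SHIFT;
    }

With such a scheme the tagged value stored in vm_struct->addr still refers to
the same virtual range; only the top byte differs from the untagged address.
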
Chris von Recklinghausen 421a7d2055 kasan, vmalloc: reset tags in vmalloc functions
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4aff1dc4fb3a5a3be96b6ad8db8348d3e877d7d0
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:11:04 2022 -0700

    kasan, vmalloc: reset tags in vmalloc functions

    In preparation for adding vmalloc support to SW/HW_TAGS KASAN, reset
    pointer tags in functions that use pointer values in range checks.

    vread() is a special case here.  Despite the untagging of the addr pointer
    in its prologue, the accesses performed by vread() are checked.

    Instead of accessing the virtual mappings through addr directly, vread()
    recovers the physical address via page_address(vmalloc_to_page()) and
    accesses that.  And as page_address() recovers the pointer tag, the
    accesses get checked.

    Link: https://lkml.kernel.org/r/046003c5f683cacb0ba18e1079e9688bb3dca943.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen 53e03d4eb7 kasan, x86, arm64, s390: rename functions for modules shadow
Bugzilla: https://bugzilla.redhat.com/2120352

commit 63840de296472f3914bb933b11ba2b764590755e
Author: Andrey Konovalov <andreyknvl@gmail.com>
Date:   Thu Mar 24 18:10:52 2022 -0700

    kasan, x86, arm64, s390: rename functions for modules shadow

    Rename kasan_free_shadow to kasan_free_module_shadow and
    kasan_module_alloc to kasan_alloc_module_shadow.

    These functions are used to allocate/free shadow memory for kernel modules
    when KASAN_VMALLOC is not enabled.  The new names better reflect their
    purpose.

    Also reword the comment next to their declaration to improve clarity.

    Link: https://lkml.kernel.org/r/36db32bde765d5d0b856f77d2d806e838513fe84.1643047180.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Evgenii Stepanov <eugenis@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:59 -04:00
Chris von Recklinghausen ebc96d65f4 mm/vmalloc.c: fix "unused function" warning
Bugzilla: https://bugzilla.redhat.com/2120352

commit c3385e845824b8d435f1f323ebd38031fdec4590
Author: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Date:   Tue Mar 22 14:42:59 2022 -0700

    mm/vmalloc.c: fix "unused function" warning

    compute_subtree_max_size() is unused when building with
    DEBUG_AUGMENT_PROPAGATE_CHECK=y.

      mm/vmalloc.c:785:1: warning: unused function 'compute_subtree_max_size' [-Wunused-function].

    Link: https://lkml.kernel.org/r/20220129034652.75359-1-jiapeng.chong@linux.alibaba.com
    Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 1c410d1405 mm/vmalloc: eliminate an extra orig_gfp_mask
Bugzilla: https://bugzilla.redhat.com/2120352

commit c3d77172dfc04c8443c327e8acb83e683f8c0193
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Mar 22 14:42:56 2022 -0700

    mm/vmalloc: eliminate an extra orig_gfp_mask

    That extra variable was introduced just to keep the originally passed
    gfp_mask, because gfp_mask is updated with __GFP_NOWARN on entry and,
    as a result, the error handling messages were broken.

    Instead we can keep the original gfp_mask unmodified and pass an extra
    __GFP_NOWARN flag together with gfp_mask as a parameter to the
    vm_area_alloc_pages() function.  This makes the code less confusing.

    Link: https://lkml.kernel.org/r/20220119143540.601149-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen 748e5a48d8 mm/vmalloc: add adjust_search_size parameter
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9333fe98d0a61a590cc076bcc21711f59ed8d972
Author: Uladzislau Rezki <uladzislau.rezki@sony.com>
Date:   Tue Mar 22 14:42:53 2022 -0700

    mm/vmalloc: add adjust_search_size parameter

    Extend the find_vmap_lowest_match() function with one more parameter, an
    "adjust_search_size" boolean, making it possible to control the accuracy
    of the search block when a specific alignment is required.

    With this patch, the search size is always adjusted in order to serve a
    request as fast as possible, for performance reasons.

    There is one exception though: short ranges, where the requested size
    corresponds exactly to the passed vstart/vend restriction together with a
    specific alignment request.  In such a scenario an adjustment would not
    lead to a successful allocation.

    Link: https://lkml.kernel.org/r/20220119143540.601149-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
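
In effect the new parameter toggles whether the alignment overhead is folded
into the search length. A minimal sketch of that decision (the surrounding
function is omitted):

    unsigned long length;

    if (adjust_search_size)
            length = size + align - 1;      /* guarantees a fit after alignment */
    else
            length = size;                  /* exact-size search for tight vstart/vend ranges */
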
Chris von Recklinghausen c84501c47c mm/vmalloc: Move draining areas out of caller context
Bugzilla: https://bugzilla.redhat.com/2120352

commit 690467c81b1a49de38a4b89eedc0ae85015f4c79
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Mar 22 14:42:50 2022 -0700

    mm/vmalloc: Move draining areas out of caller context

    A caller initiates the drain process from its context once the
    drain threshold is reached or passed. There are at least two
    drawbacks of doing so:

    a) a caller can be a high-prio or RT task. In that case it can
       get stuck doing the actual drain of all lazily freed areas.
       This is not optimal because such tasks are usually latency
       sensitive and control should be returned back as soon as
       possible in order to drive such workloads in time. See
       96e2db4561 ("mm/vmalloc: rework the drain logic")

    b) It is not safe to call vfree() while holding a spinlock due
       to the vmap_purge_lock mutex. There was a report about this from
       Zeal Robot <zealci@zte.com.cn> here:
       https://lore.kernel.org/all/20211222081026.484058-1-chi.minghao@zte.com.cn

    Moving the drain to the separate work context addresses those
    issues.

    v1->v2:
       - Added prefix "_work" to the drain worker function.
    v2->v3:
       - Remove the drain_vmap_work_in_progress. Extra queuing
         is expected under heavy load but can be disregarded
         because the work will bail out if there is nothing to be done.

    Link: https://lkml.kernel.org/r/20220131144058.35608-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
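
A minimal sketch of deferring the drain to a workqueue, as described above.
DECLARE_WORK() and schedule_work() are the standard kernel API; the
purge_lazy_areas() and lazy_pages() helpers are hypothetical stand-ins for the
real purge logic and threshold accounting.

    #include <linux/workqueue.h>

    static void drain_vmap_area_work(struct work_struct *work)
    {
            /* The possibly slow purge of lazily freed areas runs here. */
            purge_lazy_areas();                     /* hypothetical helper */
    }
    static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

    static void note_lazily_freed_area(void)
    {
            /* Fast path for the caller: queue the work and return immediately. */
            if (lazy_pages() > lazy_max_pages())    /* threshold check, simplified */
                    schedule_work(&drain_vmap_work);
    }
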
Chris von Recklinghausen 8732b9c84b mm/vmalloc: remove unneeded function forward declaration
Bugzilla: https://bugzilla.redhat.com/2120352

commit 651d55ce096543c52f7e589d04dfa7393f90ff47
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:42:47 2022 -0700

    mm/vmalloc: remove unneeded function forward declaration

    The forward declaration for lazy_max_pages() is unnecessary.  Remove it.

    Link: https://lkml.kernel.org/r/20220124133752.60663-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:51 -04:00
Chris von Recklinghausen c31ce51018 mm/vmalloc: be more explicit about supported gfp flags.
Bugzilla: https://bugzilla.redhat.com/2120352

commit 30d3f01191d305c99e8b3f8b1b328fc852270c95
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jan 14 14:07:04 2022 -0800

    mm/vmalloc: be more explicit about supported gfp flags.

    Commit b7d90e7a5ea8 ("mm/vmalloc: be more explicit about supported gfp
    flags") has been merged prematurely without the rest of the series and
    without addressed review feedback from Neil.  Fix that up now.  Only
    wording is changed slightly.

    Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
Chris von Recklinghausen c21ecd74b4 mm/vmalloc: add support for __GFP_NOFAIL
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9376130c390a76fac2788a5d6e1a149017b4ab50
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jan 14 14:07:01 2022 -0800

    mm/vmalloc: add support for __GFP_NOFAIL

    Dave Chinner has mentioned that some of the xfs code would benefit from
    kvmalloc support for __GFP_NOFAIL because they have allocations that
    cannot fail and they do not fit into a single page.

    The large part of the vmalloc implementation already complies with the
    given gfp flags, so there is no work to be done for those.  The area and
    page table allocations are an exception to that.  Implement a retry loop
    for those.

    Add a short sleep before retrying.  1 jiffy is a completely random
    timeout.  Ideally the retry would wait for an explicit event - e.g.  a
    change to the vmalloc space if the failure was caused by space
    fragmentation or depletion.  But there are multiple different
    reasons to retry and this could become much more complex.  Keep the
    retry simple for now and just sleep to prevent from hogging CPUs.

    Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
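
A hedged sketch of the retry loop for the area and page-table allocations;
try_alloc_area() is a hypothetical stand-in for the failing step, while
__GFP_NOFAIL and schedule_timeout_uninterruptible() are the real flag and API
the commit refers to.

    #include <linux/gfp.h>
    #include <linux/sched.h>

    static void *alloc_area_maybe_nofail(unsigned long size, gfp_t gfp)
    {
            void *area;

            do {
                    area = try_alloc_area(size, gfp);       /* hypothetical helper */
                    if (area || !(gfp & __GFP_NOFAIL))
                            break;
                    /* Back off for one jiffy before retrying. */
                    schedule_timeout_uninterruptible(1);
            } while (!area);

            return area;
    }
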
Chris von Recklinghausen b5ebcd3661 mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
Bugzilla: https://bugzilla.redhat.com/2120352

commit 451769ebb7e792c3404db53b3c2a422990de654e
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jan 14 14:06:57 2022 -0800

    mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

    Patch series "extend vmalloc support for constrained allocations", v2.

    Based on a recent discussion with Dave and Neil [1] I have tried to
    implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of
    kvmalloc users easier.

    A requirement for NOFAIL support for kvmalloc was new to me but this
    seems to be really needed by the xfs code.

    NOFS/NOIO was a known, long-term problem that was hoped to be handled by
    the scope API.  Those scopes should have been used at the reclaim
    recursion boundaries, both to document them and to remove the necessity
    of NOFS/NOIO constraints for all allocations within that scope.  Instead,
    workarounds were developed to wrap a single allocation (like
    ceph_kvmalloc).

    First patch implements NOFS/NOIO support for vmalloc.  The second one
    adds NOFAIL support and the third one bundles all together into kvmalloc
    and drops ceph_kvmalloc which can use kvmalloc directly now.

    [1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown

    This patch (of 4):

    vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
    page table allocations do not support an externally provided gfp mask
    and perform GFP_KERNEL-like allocations.

    For a few years now we have had scope APIs
    (memalloc_no{fs,io}_{save,restore}) to enforce NOFS and NOIO constraints
    implicitly on all allocators within the scope.  There was a hope that
    those scopes would be defined at a higher level, where the reclaim
    recursion boundary starts/stops (e.g.  when a lock that is also required
    during memory reclaim is taken, etc.).  It seems that not all NOFS/NOIO
    users have adopted this approach; instead they have taken a workaround
    approach and wrap a single [k]vmalloc allocation with a scope API.

    These workarounds do not serve the purpose of better documenting reclaim
    recursion or of reducing explicit GFP_NO{FS,IO} usage, so let's just
    provide the semantics they are asking for without a need for
    workarounds.

    Add support for GFP_NOFS and GFP_NOIO to vmalloc directly.  All internal
    allocations already comply with the given gfp_mask.  The only current
    exception is vmap_pages_range which maps kernel page tables.  Infer the
    proper scope API based on the given gfp mask.

    [sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h]
     Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au

    Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org
    Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:39 -04:00
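
A hedged sketch of inferring the proper scope API from the gfp mask around the
page-table mapping step. The memalloc_no{fs,io}_{save,restore}() calls are the
real scope API; map_area_page_tables() is a hypothetical stand-in for the part
that ignores the caller's gfp mask.

    #include <linux/sched/mm.h>

    static int map_with_gfp_scope(struct vm_struct *area, gfp_t gfp_mask)
    {
            bool nofs = (gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO;
            bool noio = (gfp_mask & (__GFP_FS | __GFP_IO)) == 0;
            unsigned int flags = 0;
            int ret;

            if (nofs)
                    flags = memalloc_nofs_save();
            else if (noio)
                    flags = memalloc_noio_save();

            ret = map_area_page_tables(area);       /* hypothetical helper */

            if (nofs)
                    memalloc_nofs_restore(flags);
            else if (noio)
                    memalloc_noio_restore(flags);

            return ret;
    }
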
Chris von Recklinghausen 3ef0634acd mm: defer kmemleak object creation of module_alloc()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 60115fa54ad7b913b7cb5844e6b7ffeb842d55f2
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Jan 14 14:04:11 2022 -0800

    mm: defer kmemleak object creation of module_alloc()

    Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN
    enabled(without KASAN_VMALLOC) on x86[1].

    When the module area allocates memory, its kmemleak_object is created
    successfully, but the KASAN shadow memory for the module allocation is
    not ready yet, so when kmemleak scans the module's pointer it panics
    because the KASAN check finds no shadow memory.

      module_alloc
        __vmalloc_node_range
          kmemleak_vmalloc
                                    kmemleak_scan
                                      update_checksum
        kasan_module_alloc
          kmemleak_ignore

    Note that there is no problem if KASAN_VMALLOC is enabled, because the
    entire shadow memory of the modules area is preallocated.  Thus, the bug
    only exists on architectures that support dynamic allocation of the
    module area per module load; for now, only x86/arm64/s390 are involved.

    Add a VM_DEFER_KMEMLEAK flag and defer the kmemleak registration of the
    vmalloc'ed object in module_alloc() to fix this issue.

    [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/

    [wangkefeng.wang@huawei.com: fix build]
      Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
    [akpm@linux-foundation.org: simplify ifdefs, per Andrey]
      Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
    Fixes: 793213a82d ("s390/kasan: dynamic shadow mem allocation for modules")
    Fixes: 39d114ddc6 ("arm64: add KASAN support")
    Fixes: bebf56a1b1 ("kasan: enable instrumentation of global variables")
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:37 -04:00
Chris von Recklinghausen cd6ef9fec1 mm: functions may simplify the use of return values
Bugzilla: https://bugzilla.redhat.com/2120352

commit c8db8c2628afc7088a43de3f7cfbcc2ef1f182f7
Author: Li kunyu <kunyu@nfschina.com>
Date:   Thu May 12 20:23:07 2022 -0700

    mm: functions may simplify the use of return values

    p4d_clear_huge() can be given a void return type, which simplifies its
    use; the vunmap_p4d_range() function saves a few steps here.

    Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com
    Signed-off-by: Li kunyu <kunyu@nfschina.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:20 -04:00
Chris von Recklinghausen 6cf178092c mm: merge pte_mkhuge() call into arch_make_huge_pte()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 16785bd7743104d57257a455001172b75afa7614
Author: Anshuman Khandual <anshuman.khandual@arm.com>
Date:   Tue Mar 22 14:41:47 2022 -0700

    mm: merge pte_mkhuge() call into arch_make_huge_pte()

    Each call into pte_mkhuge() is invariably followed by
    arch_make_huge_pte().  Instead arch_make_huge_pte() can accommodate
    pte_mkhuge() at the beginning.  This updates generic fallback stub for
    arch_make_huge_pte() and available platforms definitions.  This makes huge
    pte creation much cleaner and easier to follow.

    Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:16 -04:00
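
Presumably the generic fallback ends up looking roughly like the stub below,
with pte_mkhuge() folded in; the exact signature and header placement are
assumptions, not a quote of the patch.

    #ifndef arch_make_huge_pte
    static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                           vm_flags_t flags)
    {
            return pte_mkhuge(entry);       /* callers no longer call pte_mkhuge() themselves */
    }
    #endif
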
Patrick Talbert e3c61b6d3d Merge: arm64: Update core arch code to upstream v5.16
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/735

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076088

Update arch/arm64 bits to upstream v5.16 level.

Signed-off-by: Mark Salter <msalter@redhat.com>

Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Michael Petlan <mpetlan@redhat.com>
Approved-by: Mark Langsdorf <mlangsdo@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-29 09:52:39 +02:00
Mark Salter 969b90b574 kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2076088

commit 3252b1d8309ea42bc6329d9341072ecf1c9505c0
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Fri, 5 Nov 2021 13:39:47 -0700

    With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:

      Unable to handle kernel paging request at virtual address ffff7000028f2000
      ...
      swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
      [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
      Internal error: Oops: 96000007 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
      Hardware name: linux,dummy-virt (DT)
      pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
      pc : kasan_check_range+0x90/0x1a0
      lr : memcpy+0x88/0xf4
      sp : ffff80001378fe20
      ...
      Call trace:
       kasan_check_range+0x90/0x1a0
       pcpu_page_first_chunk+0x3f0/0x568
       setup_per_cpu_areas+0xb8/0x184
       start_kernel+0x8c/0x328

    The vm area used in vm_area_register_early() has no KASAN shadow memory.
    Let's add a new kasan_populate_early_vm_area_shadow() function to
    populate the vm area's shadow memory and fix the issue.

    [wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
      Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com

    Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Marco Elver <elver@google.com>		[KASAN]
    Acked-by: Andrey Konovalov <andreyknvl@gmail.com>	[KASAN]
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Mark Salter <msalter@redhat.com>
2022-04-18 10:05:58 -04:00
Waiman Long 77a2f836ee memcg: add per-memcg vmalloc stat
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2013413

commit 4e5aa1f4c2b489bc6f3ab5ca54747b18a847289d
Author: Shakeel Butt <shakeelb@google.com>
Date:   Fri, 14 Jan 2022 14:05:45 -0800

    memcg: add per-memcg vmalloc stat

    The kvmalloc* allocation functions can fall back to vmalloc allocations,
    and do so more often on long-running machines.  In addition the kernel
    does have __GFP_ACCOUNT kvmalloc* calls.  So, often on long-running
    machines, memory.stat does not give the complete picture of which type
    of memory is charged to the memcg.  So add a per-memcg vmalloc stat.

    [shakeelb@google.com: page_memcg() within rcu lock, per Muchun]
      Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com
    [akpm@linux-foundation.org: remove cast, per Muchun]
    [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal]
      Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com

    Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-04-07 14:11:11 -04:00
Rafael Aquini ac8bc7fa95 mm/vmalloc: be more explicit about supported gfp flags
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit b7d90e7a5ea8d64e668d5685925900d33d3884d5
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Nov 5 13:39:50 2021 -0700

    mm/vmalloc: be more explicit about supported gfp flags

    The core of the vmalloc allocator __vmalloc_area_node doesn't say
    anything about gfp mask argument.  Not all gfp flags are supported
    though.  Be more explicit about constraints.

    Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:16 -04:00
Rafael Aquini 2bd7e68dcb vmalloc: choose a better start address in vm_area_register_early()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 0eb68437a7f9dfef9c218873310c66c714f2fa99
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Nov 5 13:39:41 2021 -0700

    vmalloc: choose a better start address in vm_area_register_early()

    The percpu embedded first chunk allocator is the first option, but it
    can fail on ARM64, e.g.,

      percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
      percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
      percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000

    then we could get to

      WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838

    and the system cannot boot successfully.

    Let's implement page mapping percpu first chunk allocator as a fallback
    to the embedding allocator to increase the robustness of the system.

    Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and
    KASAN_VMALLOC enabled.

    Tested on ARM64 qemu with cmdline "percpu_alloc=page".

    This patch (of 3):

    There are some fixed locations in the vmalloc area that are reserved on
    ARM (see iotable_init()) and ARM64 (see map_kernel()), but
    pcpu_page_first_chunk() calls vm_area_register_early() and chooses
    VMALLOC_START as the start address of the vmap area, which can conflict
    with the addresses above and then trigger a BUG_ON in
    vm_area_add_early().

    Let's choose a suitable start address by traversing the vmlist.

    Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:15 -04:00
Rafael Aquini a64a27240c vmalloc: back off when the current task is OOM-killed
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit dd544141b9eb8f3c58dedc9c1dcc4803de0eed45
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Fri Nov 5 13:39:37 2021 -0700

    vmalloc: back off when the current task is OOM-killed

    A huge vmalloc allocation on a heavily loaded node can lead to a global
    memory shortage.  The task that called vmalloc can have the worst
    badness and be selected by the OOM-killer; however, the fatal signal it
    takes does not interrupt the allocation cycle.  Vmalloc repeats page
    allocations again and again, exacerbating the crisis and consuming the
    memory freed up by other killed tasks.

    After successful completion of the allocation procedure, the fatal
    signal will be processed and the task will finally be destroyed.
    However, this may not release the consumed memory, since the allocated
    object may have a lifetime unrelated to the completed task.  In the
    worst case, the host can panic due to "Out of memory and no killable
    processes..."

    This patch allows OOM-killer to break vmalloc cycle, makes OOM more
    effective and avoid host panic.  It does not check oom condition
    directly, however, and breaks page allocation cycle when fatal signal
    was received.

    This may trigger some hidden problems when a caller does not handle
    vmalloc failures, or when a rollback after a failed vmalloc performs its
    own vmallocs inside.  However, all of these scenarios are incorrect:
    vmalloc does not guarantee successful allocation, it has never been
    called with __GFP_NOFAIL, and therefore it either should not be used for
    any rollbacks or should handle such errors correctly and not lead to
    critical failures.

    Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:15 -04:00
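
A hedged sketch of breaking the page-allocation cycle when a fatal signal is
pending. fatal_signal_pending(current) and alloc_page() are the real APIs; the
loop and its caller contract are simplified for illustration.

    #include <linux/gfp.h>
    #include <linux/sched/signal.h>

    static unsigned int alloc_area_pages_sketch(struct page **pages,
                                                unsigned int nr_pages, gfp_t gfp)
    {
            unsigned int allocated = 0;

            while (allocated < nr_pages) {
                    /* Let the OOM-killer's fatal signal interrupt the cycle. */
                    if (fatal_signal_pending(current))
                            break;

                    pages[allocated] = alloc_page(gfp);
                    if (!pages[allocated])
                            break;
                    allocated++;
            }

            return allocated;       /* the caller must cope with a partial result */
    }
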
Rafael Aquini 0aca3ddb48 mm/vmalloc: check various alignments when debugging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 066fed59d8a1bab4f3213e8fe413c54e4a76b77a
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri Nov 5 13:39:34 2021 -0700

    mm/vmalloc: check various alignments when debugging

    Before, we did not guarantee a free block with the lowest start address
    for allocations with alignment >= PAGE_SIZE, because an alignment
    overhead was included in the search length like below:

         length = size + align - 1;

    Doing so made sure that a bigger block would fit after applying an
    alignment adjustment.  Now there is no such limitation, i.e.  any
    alignment that the user wants to apply will result in the lowest address
    of the returned free area.

    Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Ping Fang <pifang@redhat.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:14 -04:00
Rafael Aquini 965227f88b mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 7cc7913e8e61ac436497d01a64963770d1600f5d
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 5 13:39:28 2021 -0700

    mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo

    If the last va found in vmap_area_list does not have a vm pointer,
    vmallocinfo's s_show() returns 0, and show_purge_info() is not called as
    it should be.

    Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com
    Fixes: dd3b8353ba ("mm/vmalloc: do not keep unpurged areas in the busy tree")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Pengfei Li <lpf.vector@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:13 -04:00
Rafael Aquini 6b4c5778d3 mm/vmalloc: make show_numa_info() aware of hugepage mappings
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 51e50b3a22937ab7b350f05af7e3b79b7ff73dd3
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 5 13:39:25 2021 -0700

    mm/vmalloc: make show_numa_info() aware of hugepage mappings

    show_numa_info() can be made slightly faster by skipping over hugepages
    directly.

    Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:12 -04:00
Rafael Aquini 4223b2018f mm/vmalloc: don't allow VM_NO_GUARD on vmap()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit bd1a8fb2d43f7c293383f76691d7a55f7f89d9da
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Nov 5 13:39:22 2021 -0700

    mm/vmalloc: don't allow VM_NO_GUARD on vmap()

    The vmalloc guard pages are added on top of each allocation, thereby
    isolating any two allocations from one another.  The top guard of the
    lower allocation is the bottom guard of the higher allocation, etc.

    Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
    isolating separate allocations.

    There are only two in-tree users of this flag, neither of which use it
    through the exported interface.  Ensure it stays this way.

    Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Will Deacon <will@kernel.org>
    Acked-by: Kees Cook <keescook@chromium.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:12 -04:00
Rafael Aquini aec21d65d2 mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 228f778e973035185232ae745be0e3bc57dacea6
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Fri Nov 5 13:39:19 2021 -0700

    mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()

    Commit f255935b97 ("mm: cleanup the gfp_mask handling in
    __vmalloc_area_node") added __GFP_NOWARN to gfp_mask unconditionally
    however it disabled all output inside warn_alloc() call.  This patch
    saves original gfp_mask and provides it to all warn_alloc() calls.

    Link: https://lkml.kernel.org/r/f4f3187b-9684-e426-565d-827c2a9bbb0e@virtuozzo.com
    Fixes: f255935b97 ("mm: cleanup the gfp_mask handling in __vmalloc_area_node")
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:11 -04:00
David Hildenbrand a4ae119139 mm/vmalloc: do not adjust the search size for alignment overhead
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2029493

commit 9f531973dff39c671219352573ad5df6d0d9a58c
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri Nov 5 13:39:31 2021 -0700

    mm/vmalloc: do not adjust the search size for alignment overhead

    We used to include an alignment overhead in the search length; in that
    case we guarantee that a found area will definitely fit after applying
    the specific alignment that the user specifies.  On the other hand, we
    do not guarantee that an area has the lowest address if the alignment is
    >= PAGE_SIZE.

    This means that when a user specifies a special alignment together with
    a range that corresponds to the exact requested size, the allocation
    will fail.  This is what happens to KASAN: it wants a free block that
    exactly matches a specified range while onlining memory banks:

        [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
        [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
        [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
        [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
        vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
        bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
        CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8e/0xd0
         warn_alloc.cold.90+0x8a/0x1b2
         ? zone_watermark_ok_safe+0x300/0x300
         ? slab_free_freelist_hook+0x85/0x1a0
         ? __get_vm_area_node+0x240/0x2c0
         ? kfree+0xdd/0x570
         ? kmem_cache_alloc_node_trace+0x157/0x230
         ? notifier_call_chain+0x90/0x160
         __vmalloc_node_range+0x465/0x840
         ? mark_held_locks+0xb7/0x120

    Fix it by making sure that find_vmap_lowest_match() returns the lowest
    start address for any given alignment value, i.e.  for alignments bigger
    than PAGE_SIZE the algorithm rolls back toward parent nodes, checking
    right sub-trees if the leftmost free block did not fit due to alignment
    overhead.

    Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com
    Fixes: 68ad4a3304 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reported-by: Ping Fang <pifang@redhat.com>
    Tested-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Hildenbrand <david@redhat.com>
2022-01-03 14:39:57 +01:00
Rafael Aquini dd4bf080b3 mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c00b6b9610991c042ff4c3153daaa3ea8522c210
Author: Chen Wandun <chenwandun@huawei.com>
Date:   Fri Nov 5 13:39:53 2021 -0700

    mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

    Commit ffb29b1c255a ("mm/vmalloc: fix numa spreading for large hash
    tables") can cause significant performance regressions in some
    situations, as Andrew mentioned in [1].  The main situation is vmalloc:
    by default vmalloc allocates pages with NUMA_NO_NODE, which results in
    allocating pages one by one;

    In order to solve this, __alloc_pages_bulk and mempolicy should be
    considered at the same time.

    1) If a node is specified in the memory allocation request, allocate all
       pages with __alloc_pages_bulk.

    2) If memory is allocated with interleaving, calculate how many pages
       should be allocated on each node, and use __alloc_pages_bulk to
       allocate pages on each node.

    [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: make two functions static]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]

    Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
    Signed-off-by: Chen Wandun <chenwandun@huawei.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Hanjun Guo <guohanjun@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:22 -05:00
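
A hedged sketch of case 1) above, where a NUMA node is specified and all pages
are requested from the bulk allocator on that node. The return-value semantics
assumed here (number of entries populated in the passed array) are an
assumption; the interleave case splits nr_pages across nodes and calls the
same helper per node.

    static unsigned long bulk_alloc_on_node(int nid, gfp_t gfp,
                                            unsigned long nr_pages,
                                            struct page **pages)
    {
            if (nid == NUMA_NO_NODE)
                    return 0;       /* interleave path handled separately */

            return alloc_pages_bulk_array_node(gfp, nid, nr_pages, pages);
    }
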
Rafael Aquini 8b62d76dce mm/vmalloc: fix numa spreading for large hash tables
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit ffb29b1c255ab48cb0062a3d11c101501e3e9b3f
Author: Chen Wandun <chenwandun@huawei.com>
Date:   Thu Oct 28 14:36:24 2021 -0700

    mm/vmalloc: fix numa spreading for large hash tables

    Eric Dumazet reported a strange numa spreading info in [1], and found
    commit 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings") introduced
    this issue [2].

    Digging into the difference before and after this patch, page allocation
    differs as follows:

    before:
      alloc_large_system_hash
        __vmalloc
          __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
              __vmalloc_area_node
                alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
                  alloc_pages_current
                    alloc_page_interleave /* can be proved by print policy mode */

    after:
      alloc_large_system_hash
        __vmalloc
          __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
              __vmalloc_area_node
                alloc_pages_node /* choose nid by nuam_mem_id() */
                  __alloc_pages_node(nid, ....)

    So after commit 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings"),
    it will allocate memory on the current node instead of interleaving the
    allocations.

    Link: https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/ [1]
    Link: https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/ [2]
    Link: https://lkml.kernel.org/r/20211021080744.874701-2-chenwandun@huawei.com
    Fixes: 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Chen Wandun <chenwandun@huawei.com>
    Reported-by: Eric Dumazet <edumazet@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Hanjun Guo <guohanjun@huawei.com>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:16 -05:00
Rafael Aquini 1088a2d59e mm: don't allow executable ioremap mappings
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 8491502f787c4a902bd4f223b578ef47d3490264
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Sep 7 19:56:04 2021 -0700

    mm: don't allow executable ioremap mappings

    There is no need to execute from iomem (and most platforms it is
    impossible anyway), so add the pgprot_nx() call similar to vmap.

    Link: https://lkml.kernel.org/r/20210824091259.1324527-3-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:15 -05:00
Rafael Aquini 8128f89709 mm: move ioremap_page_range to vmalloc.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 82a70ce0426dd7c4099516175019dccbd18cebf9
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Sep 7 19:56:01 2021 -0700

    mm: move ioremap_page_range to vmalloc.c

    Patch series "small ioremap cleanups".

    The first patch moves a little code around the vmalloc/ioremap boundary
    following a bigger move by Nick earlier.  The second enforces
    non-executable mapping on ioremap just like we do for vmap.  No driver
    currently uses executable mappings anyway, as they should.

    This patch (of 2):

    This keeps it together with the implementation, and to remove the
    vmap_range wrapper.

    Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:15 -05:00
Rafael Aquini ee65143c26 mm/vmalloc: fix wrong behavior in vread
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit f181234a5a21fd0a86b793330016b92c7b3ed8ee
Author: Chen Wandun <chenwandun@huawei.com>
Date:   Thu Sep 2 14:57:26 2021 -0700

    mm/vmalloc: fix wrong behavior in vread

    commit f608788cd2 ("mm/vmalloc: use rb_tree instead of list for vread()
    lookups") used an rb_tree instead of a list to speed up the lookup, but
    __find_vmap_area() tries to find a vmap_area that includes the target
    address; if the target address is smaller than the leftmost node in
    vmap_area_root, it returns NULL and vread then reads nothing.  This
    behavior is different from the original semantics.

    The correct way is to find the first vmap_area that is bigger than the
    target addr, which is what the find_vmap_area_exceed_addr function does.

    Link: https://lkml.kernel.org/r/20210714015959.3204871-1-chenwandun@huawei.com
    Fixes: f608788cd2 ("mm/vmalloc: use rb_tree instead of list for vread() lookups")
    Signed-off-by: Chen Wandun <chenwandun@huawei.com>
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Cc: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Wei Yongjun <weiyongjun1@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:44 -05:00
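
A hedged sketch of the lookup semantics the fix needs: instead of requiring an
area that contains the address, return the first vmap_area whose end lies
above it. The rb-tree walk and the va_start/va_end fields follow the kernel's
vmap_area layout, but the function body is simplified.

    #include <linux/rbtree.h>

    static struct vmap_area *find_area_exceed_addr(unsigned long addr,
                                                   struct rb_root *root)
    {
            struct vmap_area *va = NULL;
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct vmap_area *tmp = rb_entry(n, struct vmap_area, rb_node);

                    if (tmp->va_end > addr) {
                            va = tmp;               /* candidate; keep searching left */
                            if (tmp->va_start <= addr)
                                    break;          /* exact containment, done */
                            n = n->rb_left;
                    } else {
                            n = n->rb_right;
                    }
            }

            return va;
    }
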
Rafael Aquini 5743959d52 mm/vmalloc: remove gfpflags_allow_blocking() check
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 12e376a6f859a000308b6c7cf4a2493eda2bb026
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Sep 2 14:57:19 2021 -0700

    mm/vmalloc: remove gfpflags_allow_blocking() check

    Get rid of the gfpflags_allow_blocking() check in the vmalloc() path, as
    it is supposed to be sleepable anyway.  Thus remove it from
    alloc_vmap_area() as well as from vm_area_alloc_pages().

    Link: https://lkml.kernel.org/r/20210707182639.31282-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:42 -05:00
Rafael Aquini fe25a2d50f mm/vmalloc: use batched page requests in bulk-allocator
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 343ab8178f318b6006d54865972ff9c433b29e10
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Sep 2 14:57:16 2021 -0700

    mm/vmalloc: use batched page requests in bulk-allocator

    In the case of simultaneous vmalloc allocations, for example 1GB
    allocations on 12 CPUs, my system is able to hit "BUG: soft lockup" on a
    !CONFIG_PREEMPT kernel.

      RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
      Call Trace:
       __vmalloc_node_range+0x11c/0x2d0
       __vmalloc_node+0x4b/0x70
       fix_size_alloc_test+0x44/0x60 [test_vmalloc]
       test_func+0xe7/0x1f0 [test_vmalloc]
       kthread+0x11a/0x140
       ret_from_fork+0x22/0x30

    To address this issue, invoke the bulk allocator multiple times until
    all pages are obtained, i.e. do batched page requests, adding
    cond_resched() in between so the CPU can reschedule.  The batch size is
    hard-coded at 100 pages per call.
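
    A rough sketch of the batching loop described above (illustrative only;
    the constant and variable names are not the exact upstream code, but
    alloc_pages_bulk_array_node() and cond_resched() are the real interfaces
    being discussed):

      /* assumes: unsigned int nr_pages, nr_allocated = 0;
       *          struct page **pages; gfp_t gfp; int nid; */
      #define BULK_BATCH 100U          /* pages per bulk request, per the text above */

      while (nr_allocated < nr_pages) {
              unsigned int nr, nr_pages_request;

              nr_pages_request = min(BULK_BATCH, nr_pages - nr_allocated);

              /* Fill the next slice of the page array in one bulk call. */
              nr = alloc_pages_bulk_array_node(gfp, nid, nr_pages_request,
                                               pages + nr_allocated);
              nr_allocated += nr;

              cond_resched();          /* let other tasks run between batches */

              /* Stop early if the bulk allocator could not satisfy the batch. */
              if (nr != nr_pages_request)
                      break;
      }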

    Link: https://lkml.kernel.org/r/20210707182639.31282-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:41 -05:00
Mel Gorman 5da96bdd93 mm/vmalloc: include header for prototype of set_iounmap_nonlazy
make W=1 generates the following warning for mm/vmalloc.c

  mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
   void set_iounmap_nonlazy(void)
        ^~~~~~~~~~~~~~~~~~~

This is an arch-generic function only used by x86.  On other arches, it's
dead code.  Include the header with the definition and make it x86-64
specific.

Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:02 -07:00
Christophe Leroy 3382bbee04 mm/vmalloc: enable mapping of huge pages at pte level in vmalloc
On some architectures like powerpc, there are huge pages that are mapped
at pte level.

Enable it in vmalloc.

For that, architectures can provide arch_vmap_pte_supported_shift() that
returns the shift for pages to map at pte level.
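
For illustration, a hypothetical implementation for an architecture that can
map 512k and 16k pages at pte level might look like this (the sizes and
shift values are assumptions, used only to make the shape of the hook
concrete):

  static inline unsigned long arch_vmap_pte_supported_shift(unsigned long size)
  {
          if (size >= SZ_512K)
                  return 19;              /* 512k mappings at pte level */
          else if (size >= SZ_16K)
                  return 14;              /* 16k mappings at pte level */
          else
                  return PAGE_SHIFT;      /* fall back to regular PAGE_SIZE ptes */
  }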

Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Christophe Leroy f7ee1f13d6 mm/vmalloc: enable mapping of huge pages at pte level in vmap
On some architectures like powerpc, there are huge pages that are mapped
at pte level.

Enable it in vmap.

For that, architectures can provide arch_vmap_pte_range_map_size() that
returns the size of pages to map at pte level.

Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Rafael Aquini a850e932df mm: vmalloc: add cond_resched() in __vunmap()
On non-preemptible kernel builds the watchdog can complain about soft
lockups when vfree() is called against large vmalloc areas:

[  210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
[  238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
[  238.662716] Modules linked in: kvmalloc_test(OE-) ...
[  238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S         OE     5.13.0-rc7+ #1
[  238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
[  238.792383] RIP: 0010:free_unref_page+0x52/0x60
[  238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
[  238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
[  238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
[  238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
[  238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
[  238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
[  238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
[  238.864059] FS:  00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
[  238.873089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
[  238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  238.903397] PKRU: 55555554
[  238.906417] Call Trace:
[  238.909149]  __vunmap+0x17c/0x220
[  238.912851]  __x64_sys_delete_module+0x13a/0x250
[  238.918008]  ? syscall_trace_enter.isra.20+0x13c/0x1b0
[  238.923746]  do_syscall_64+0x39/0x80
[  238.927740]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Like in other range-zapping routines that iterate over a large list, let's
just add cond_resched() within __vunmap()'s page-releasing loop in order to
avoid the watchdog splats.
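
As a minimal sketch, the page-releasing loop with such a resched point looks
roughly like this (illustrative, not the exact hunk):

  for (i = 0; i < area->nr_pages; i++) {
          struct page *page = area->pages[i];

          BUG_ON(!page);
          __free_pages(page, 0);
          cond_resched();         /* avoid soft lockups on very large areas */
  }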

Link: https://lkml.kernel.org/r/20210622225030.478384-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Uladzislau Rezki 12b9f873a5 mm/vmalloc: fallback to a single page allocator
Currently, for order-0 pages we use the bulk-page allocator to get a set of
pages.  On the other hand, it may not allocate all of the requested pages.
In that case we should fall back to the single-page allocator to get the
missing pages, because it is more permissive (direct reclaim, etc).

Introduce a vm_area_alloc_pages() function where the described logic is
implemented.
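
A rough sketch of that logic for the order-0 case (simplified and
illustrative; the function name here is hypothetical and the high-order path
is omitted):

  static unsigned int vm_area_alloc_pages_sketch(gfp_t gfp, int nid,
                                  unsigned int nr_pages, struct page **pages)
  {
          /* First try the bulk allocator: one call for many pages. */
          unsigned int nr_allocated =
                  alloc_pages_bulk_array_node(gfp, nid, nr_pages, pages);

          /*
           * Fall back to the single-page allocator for whatever is still
           * missing; it is more permissive (direct reclaim, etc).
           */
          while (nr_allocated < nr_pages) {
                  struct page *page = alloc_pages_node(nid, gfp, 0);

                  if (!page)
                          break;
                  pages[nr_allocated++] = page;
          }

          return nr_allocated;
  }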

Link: https://lkml.kernel.org/r/20210521130718.GA17882@pc638.lan
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Uladzislau Rezki (Sony) f4bdfeaf18 mm/vmalloc: remove quoted strings split across lines
The checkpatch.pl script complains about splitting a quoted string across
lines, because a user who wants to find the entire string will not succeed.

<snip>
WARNING: quoted string split across lines
+               "vmalloc size %lu allocation failure: "
+               "page order %u allocation failed",

total: 0 errors, 1 warnings, 10 lines checked
<snip>

Link: https://lkml.kernel.org/r/20210521204359.19943-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Uladzislau Rezki (Sony) cd61413baa mm/vmalloc: print a warning message first on failure
When a memory allocation for the array of pages does not succeed, emit a
warning message as a first step and then perform the cleanup.

The reason it should be done in this order is that the cleanup function,
free_vm_area(), can potentially also follow its error paths, which can lead
to confusion about what was broken first.

Link: https://lkml.kernel.org/r/20210516202056.2120-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Uladzislau Rezki (Sony) 5c1f4e690e mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()
A page bulk allocator has recently been introduced for users that need to
get a number of pages in one call.

For order-0 pages, switch to alloc_pages_bulk_array_node() instead of
alloc_pages_node(); the reason is that the latter is not capable of
allocating a set of pages, so one call gets only one page.

Second, according to my tests the bulk allocator uses fewer cycles even in
scenarios where only one page is requested.  Running "perf" on the same
test case shows the difference below:

<default>
  - 45.18% __vmalloc_node
     - __vmalloc_node_range
        - 35.60% __alloc_pages
           - get_page_from_freelist
                3.36% __list_del_entry_valid
                3.00% check_preemption_disabled
                1.42% prep_new_page
<default>

<patch>
  - 31.00% __vmalloc_node
     - __vmalloc_node_range
        - 14.48% __alloc_pages_bulk
             3.22% __list_del_entry_valid
           - 0.83% __alloc_pages
                get_page_from_freelist
<patch>

The "test_vmalloc.sh" also shows performance improvements:

fix_size_alloc_test_4MB   loops: 1000000 avg: 89105095 usec
fix_size_alloc_test       loops: 1000000 avg: 513672   usec
full_fit_alloc_test       loops: 1000000 avg: 748900   usec
long_busy_list_alloc_test loops: 1000000 avg: 8043038  usec
random_size_alloc_test    loops: 1000000 avg: 4028582  usec
fix_align_alloc_test      loops: 1000000 avg: 1457671  usec

fix_size_alloc_test_4MB   loops: 1000000 avg: 62083711 usec
fix_size_alloc_test       loops: 1000000 avg: 449207   usec
full_fit_alloc_test       loops: 1000000 avg: 735985   usec
long_busy_list_alloc_test loops: 1000000 avg: 5176052  usec
random_size_alloc_test    loops: 1000000 avg: 2589252  usec
fix_align_alloc_test      loops: 1000000 avg: 1365009  usec

For example, the 4MB allocation case illustrates a ~30% gain; all the rest
are also better.

Link: https://lkml.kernel.org/r/20210516202056.2120-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Daniel Axtens 7ca3027b72 mm/vmalloc: unbreak kasan vmalloc support
In commit 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings"),
__vmalloc_node_range was changed such that __get_vm_area_node was no
longer called with the requested/real size of the vmalloc allocation,
but rather with a rounded-up size.

This means that __get_vm_area_node called kasan_unpoison_vmalloc() with a
rounded-up size rather than the real size.  This led to it allowing access
to too much memory, and so missing vmalloc OOBs and failing the kasan kunit
tests.

Pass the real size and the desired shift into __get_vm_area_node.  This
allows it to round up the size for the underlying allocators while still
unpoisoning the correct quantity of shadow memory.

Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.

Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
Fixes: 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Claudio Imbrenda 15a64f5a88 mm/vmalloc: add vmalloc_no_huge
Patch series "mm: add vmalloc_no_huge and use it", v4.

Add vmalloc_no_huge() and export it, so modules can allocate memory with
small pages.

Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
hardware limitation.

This patch (of 2):

Commit 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings") added support
for hugepage vmalloc mappings; it also added the VM_NO_HUGE_VMAP flag for
__vmalloc_node_range to request that the allocation be performed with
order-0, non-huge pages.

This flag is not accessible when calling vmalloc; the only option is to
call __vmalloc_node_range directly, which is not exported.

This means that a module can't vmalloc memory with small pages.

Case in point: KVM on s390x needs to vmalloc a large area, and it needs
to be mapped with non-huge pages, because of a hardware limitation.

This patch adds the function vmalloc_no_huge, which works like vmalloc but
is guaranteed to always back the mapping with small pages.  The new
function is exported, so it is usable by modules.
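
A minimal sketch of such a wrapper (the exact argument list of
__vmalloc_node_range is assumed here for illustration):

  void *vmalloc_no_huge(unsigned long size)
  {
          return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
                                      GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
                                      NUMA_NO_NODE, __builtin_return_address(0));
  }
  EXPORT_SYMBOL(vmalloc_no_huge);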

[akpm@linux-foundation.org: whitespace fixes, per Christoph]

Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com
Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com
Fixes: 121e6f3258 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:53 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
David Hildenbrand f7c8ce44eb mm/vmalloc: remove vwrite()
The last user (/dev/kmem) is gone. Let's drop it.

Link: https://lkml.kernel.org/r/20210324102351.6932-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: huang ying <huang.ying.caritas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:34 -07:00
David Hildenbrand bbcd53c960 drivers/char: remove /dev/kmem for good
Patch series "drivers/char: remove /dev/kmem for good".

Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
memory ballooning, I started questioning the existence of /dev/kmem.

Comparing it with the /proc/kcore implementation, it does not seem to be
able to deal with things like

a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
  -> kern_addr_valid(). virt_addr_valid() is not sufficient.

b) Special cases like gart aperture memory that is not to be touched
  -> mem_pfn_is_ram()

Unless I am missing something, it's at least broken in some cases and might
fault/crash the machine.

Looks like its existence has been questioned before, in 2005 and 2010 [1];
after ~11 additional years, it might make sense to revive the discussion.

CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
mistake?).  All distributions disable it: in Ubuntu it has been disabled
for more than 10 years, in Debian since 2.6.31, in Fedora at least
starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
15sp2, and OpenSUSE has it disabled as well.

1) /dev/kmem was popular for rootkits [2] before it got disabled
   basically everywhere. Ubuntu documents [3] "There is no modern user of
   /dev/kmem any more beyond attackers using it to load kernel rootkits.".
   RHEL documents in a BZ [5] "it served no practical purpose other than to
   serve as a potential security problem or to enable binary module drivers
   to access structures/functions they shouldn't be touching"

2) /proc/kcore is a decent interface to have a controlled way to read
   kernel memory for debugging purposes. (will need some extensions to
   deal with memory offlining/unplug, memory ballooning, and poisoned
   pages, though)

3) It might be useful for corner case debugging [1]. KDB/KGDB might be a
   better fit, especially for writing random memory; it is harder to shoot
   yourself in the foot.

4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
   to be incompatible with 64bit [1]. For educational purposes,
   /proc/kcore might be used to monitor value updates -- or older
   kernels can be used.

5) It's broken on arm64, and therefore, completely disabled there.

Looks like it's essentially unused and has been replaced by better
suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
just remove it.

[1] https://lwn.net/Articles/147901/
[2] https://www.linuxjournal.com/article/10505
[3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
[4] https://sourceforge.net/projects/kme/
[5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Gregory Clement <gregory.clement@bootlin.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: James Troup <james.troup@canonical.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <kasong@redhat.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: openrisc@lists.librecores.org
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Pavel Machek (CIP)" <pavel@denx.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: sparclinux@vger.kernel.org
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Theodore Dubois <tblodt@icloud.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: William Cohen <wcohen@redhat.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:34 -07:00
Zhiyuan Dai 68d68ff6eb mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Uladzislau Rezki (Sony) 299420ba35 mm/vmalloc: remove an empty line
Link: https://lkml.kernel.org/r/20210402202237.20334-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Uladzislau Rezki (Sony) 187f8cc456 mm/vmalloc: refactor the preloading loagic
Instead of keeping an open-coded style, move the code related to preloading
into a separate function.  Therefore introduce the preload_this_cpu_lock()
routine that preloads the current CPU with one extra vmap_area object.

There is no functional change as a result of this patch.
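
As a rough sketch, such a preload helper can look like the following (the
per-CPU variable and cache names are internal vmalloc.c symbols assumed here
for illustration):

  static void preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
  {
          struct vmap_area *va = NULL;

          /* Preload one extra vmap_area object for this CPU, if not done yet. */
          if (!this_cpu_read(ne_fit_preload_node))
                  va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);

          spin_lock(lock);

          /* Install it; drop the object if another one got installed meanwhile. */
          if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va))
                  kmem_cache_free(vmap_area_cachep, va);
  }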

Link: https://lkml.kernel.org/r/20210402202237.20334-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Vijayanand Jitta ad216c0316 mm: vmalloc: prevent use after free in _vm_unmap_aliases
A potential use after free can occur in _vm_unmap_aliases, where an already
freed vmap_area could be accessed.  Consider the following scenario:

Process 1						Process 2

__vm_unmap_aliases					__vm_unmap_aliases
	purge_fragmented_blocks_allcpus				rcu_read_lock()
		rcu_read_lock()
			list_del_rcu(&vb->free_list)
									list_for_each_entry_rcu(vb .. )
	__purge_vmap_area_lazy
		kmem_cache_free(va)
										va_start = vb->va->va_start

Here Process 1 is in the purge path and does list_del_rcu() on the
vmap_block and later frees the vmap_area.  Since Process 2 was holding the
RCU lock at this time, the vmap_block will still be present on the list, so
Process 2 accesses it and thereby tries to access the vmap_area of that
vmap_block, which was already freed by Process 1; this results in a use
after free.

Fix this by adding a check for vb->dirty before accessing the vmap_area
structure; since vb->dirty will be set to VMAP_BBMAP_BITS in the purge
path, checking for this will prevent the use after free.
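
A minimal sketch of the guarded access (illustrative; the dirty-range math
is elided and only the check described above is shown):

  rcu_read_lock();
  list_for_each_entry_rcu(vb, &vbq->free, free_list) {
          spin_lock(&vb->lock);
          /*
           * vb->dirty == VMAP_BBMAP_BITS means the block was already handed
           * to the purge path, so vb->va may have been freed and must not
           * be dereferenced here.
           */
          if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
                  unsigned long va_start = vb->va->va_start;

                  /* ... accumulate the dirty range starting at va_start ... */
          }
          spin_unlock(&vb->lock);
  }
  rcu_read_unlock();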

Link: https://lkml.kernel.org/r/1616062105-23263-1-git-send-email-vjitta@codeaurora.org
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin d70bec8cc9 mm/vmalloc: improve allocation failure error messages
There are several reasons why a vmalloc can fail: virtual space exhausted,
page array allocation failure, page allocation failure, and kernel page
table allocation failure.

Add distinct warning messages for the main causes of failure, with some
added information like page order or allocation size where applicable.
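
For instance, a warning of the kind described can be emitted with
warn_alloc(); the exact message text below is only an illustration, not the
wording added by the patch:

  warn_alloc(gfp_mask, NULL,
             "vmalloc error: size %lu, page order %u, failed to allocate pages",
             area->nr_pages * PAGE_SIZE, page_order);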

[urezki@gmail.com: print correct vmalloc allocation size]
  Link: https://lkml.kernel.org/r/20210329193214.GA28602@pc638.lan

Link: https://lkml.kernel.org/r/20210322021806.892164-6-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin 4ad0ae8c64 mm/vmalloc: remove unmap_kernel_range
This is a shim around vunmap_range(); get rid of it.

Move the main API comment from the _noflush variant to the normal
variant, and make _noflush internal to mm/.

[npiggin@gmail.com: fix nommu builds and a comment bug per sfr]
  Link: https://lkml.kernel.org/r/1617292598.m6g0knx24s.astroid@bobo.none
[akpm@linux-foundation.org: move vunmap_range_noflush() stub inside !CONFIG_MMU, not !CONFIG_NUMA]
[npiggin@gmail.com: fix nommu builds]
  Link: https://lkml.kernel.org/r/1617292497.o1uhq5ipxp.astroid@bobo.none

Link: https://lkml.kernel.org/r/20210322021806.892164-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Cédric Le Goater <clg@kaod.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin b67177ecd9 mm/vmalloc: remove map_kernel_range
Patch series "mm/vmalloc: cleanup after hugepage series", v2.

Christoph pointed out some overdue cleanups required after the huge
vmalloc series, and I had another failure error message improvement as
well.

This patch (of 5):

This is a shim around vmap_pages_range(); get rid of it.

Move the main API comment from the _noflush variant to the normal variant,
and make _noflush internal to mm/.

Link: https://lkml.kernel.org/r/20210322021806.892164-1-npiggin@gmail.com
Link: https://lkml.kernel.org/r/20210322021806.892164-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin 121e6f3258 mm/vmalloc: hugepage vmalloc mappings
Support huge page vmalloc mappings.  Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
supports PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
larger, and fall back to small pages if that was unsuccessful.

Architectures must ensure that any arch specific vmalloc allocations that
require PAGE_SIZE mappings (e.g., module allocations vs strict module rwx)
use the VM_NOHUGE flag to inhibit larger mappings.

This can result in more internal fragmentation and memory overhead for a
given allocation; a boot option, nohugevmalloc, is added to disable it.

[colin.king@canonical.com: fix read of uninitialized pointer area]
  Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com

Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin 5d87510de1 mm/vmalloc: add vmap_range_noflush variant
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the
other callers in this file.

Link: https://lkml.kernel.org/r/20210317062402.533919-13-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin 5e9e3d777b mm: move vmap_range from mm/ioremap.c to mm/vmalloc.c
This is a generic kernel virtual memory mapper, not specific to ioremap.

Code is unchanged other than making vmap_range non-static.

Link: https://lkml.kernel.org/r/20210317062402.533919-12-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:40 -07:00
Nicholas Piggin 0a26488404 mm/vmalloc: rename vmap_*_range vmap_pages_*_range
The vmalloc mapper operates on a struct page * array rather than a linear
physical address; re-name it to make this distinction clear.

Link: https://lkml.kernel.org/r/20210317062402.533919-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Nicholas Piggin c0eb315ad9 mm/vmalloc: fix HUGE_VMAP regression by enabling huge pages in vmalloc_to_page
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected to
know.  Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns the
struct page that corresponds to the offset within the large page.  This
makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095 ("mm/vmalloc.c: huge-vmap:
    fail gracefully on unexpected huge vmap mappings")

[npiggin@gmail.com: sparc32: add stub pud_page define for walking huge vmalloc page tables]
  Link: https://lkml.kernel.org/r/20210324232825.1157363-1-npiggin@gmail.com

Link: https://lkml.kernel.org/r/20210317062402.533919-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Serapheim Dimitropoulos f608788cd2 mm/vmalloc: use rb_tree instead of list for vread() lookups
vread() has been linearly searching vmap_area_list to look up the vmalloc
areas to read from.  These same areas are also tracked by an rb_tree
(vmap_area_root), which offers logarithmic lookup.

This patch modifies vread() to use the rb_tree structure instead of the
list and the speedup for heavy /proc/kcore readers can be pretty
significant.  Below are the wall clock measurements of a Python
application that leverages the drgn debugging library to read and
interpret data read from /proc/kcore.

Before the patch:
-----
  $ time sudo sdb -e 'dbuf | head 3000 | wc'
  (unsigned long)3000

  real	0m22.446s
  user	0m2.321s
  sys	0m20.690s
-----

With the patch:
-----
  $ time sudo sdb -e 'dbuf | head 3000 | wc'
  (unsigned long)3000

  real	0m2.104s
  user	0m2.043s
  sys	0m0.921s
-----

Link: https://lkml.kernel.org/r/20210209190253.108763-1-serapheim@delphix.com
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Christoph Hellwig 0f71d7e14c mm: unexport remap_vmalloc_range_partial
remap_vmalloc_range_partial is only used to implement remap_vmalloc_range
and by procfs.  Unexport it.

Link: https://lkml.kernel.org/r/20210301082235.932968-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Paul E. McKenney 5bb1bb353c mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels
The mem_dump_obj() functionality adds a few hundred bytes, which is a small
price to pay, except on kernels built with CONFIG_PRINTK=n, where
mem_dump_obj() messages will be suppressed.  This commit therefore makes
mem_dump_obj() a static inline empty function on kernels built with
CONFIG_PRINTK=n and excludes all of its support functions as well.  This
avoids kernel bloat on systems that cannot use mem_dump_obj().

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <linux-mm@kvack.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:46 -08:00
Ingo Molnar 85e853c5ec Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Documentation updates.

- Miscellaneous fixes.

- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
  addresses to more easily locate bugs.  This has a couple of RCU-related commits,
  but is mostly MM.  Was pulled in with akpm's agreement.

- Per-callback-batch tracking of numbers of callbacks,
  which enables better debugging information and smarter
  reactions to large numbers of callbacks.

- The first round of changes to allow CPUs to be runtime switched from and to
  callback-offloaded state.

- CONFIG_PREEMPT_RT-related changes.

- RCU CPU stall warning updates.

- Addition of polling grace-period APIs for SRCU.

- Torture-test and torture-test scripting updates, including a "torture everything"
  script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
  Plus does an allmodconfig build.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 12:56:55 +01:00
Paul E. McKenney bd34dcd412 mm: Make mem_obj_dump() vmalloc() dumps include start and length
This commit adds the starting address and number of pages to the vmalloc()
information dumped by way of vmalloc_dump_obj().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-22 15:24:10 -08:00
Paul E. McKenney 98f180837a mm: Make mem_dump_obj() handle vmalloc() memory
This commit adds vmalloc() support to mem_dump_obj().  Note that the
vmalloc_dump_obj() function combines the checking and dumping, in
contrast with the split between kmem_valid_obj() and kmem_dump_obj().
The reason for the difference is that the checking in the vmalloc()
case involves acquiring a global lock, and redundant acquisitions of
global locks should be avoided, even on not-so-fast paths.

Note that this change causes on-stack variables to be reported as
vmalloc() storage from kernel_clone() or similar, depending on the degree
of inlining that your compiler does.  This is likely more helpful than
the earlier "non-paged (local) memory".

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-22 15:24:04 -08:00
Miaohe Lin c22ee5284c mm/vmalloc.c: fix potential memory leak
In the VM_MAP_PUT_PAGES case, we should put the pages and free the array in
vfree().  But we missed setting area->nr_pages in vmap(), so we would fail
to put the pages in __vunmap() because area->nr_pages = 0.

Link: https://lkml.kernel.org/r/20210107123541.39206-1-linmiaohe@huawei.com
Fixes: b944afc9d6 ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12 18:12:54 -08:00
Vincenzo Frascino c041098c69 mm/vmalloc.c: fix kasan shadow poisoning size
The size of a vm area can be affected by the presence or absence of the
guard page.  In particular, when VM_NO_GUARD is present, the actual
accessible size has to be considered as the real size minus the guard page.

Currently kasan does not take this information into account during the
poison operation and in particular tries to poison the guard page as well.

This approach, even if incorrect, does not cause an issue because the tags
for the guard page are written in the shadow memory.  With the future
introduction of the Tag-Based KASAN, the guard page being inaccessible by
nature, the write tag operation on this page triggers a fault.

Fix the kasan shadow poisoning size by invoking get_vm_area_size() instead
of accessing the field in the data structure directly, in order to use the
correct value.

Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
Fixes: d98c9e83b5 ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:42 -08:00
Waiman Long 0a7dd4e901 mm/vmalloc: Fix unlock order in s_stop()
When multiple locks are acquired, they should be released in reverse
order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
case.

  s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
  s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);

This unlock sequence, though allowed, is not optimal. If a waiter is
present, mutex_unlock() will need to go through the slowpath of waking
up the waiter with preemption disabled. Fix that by releasing the
spinlock first before the mutex.
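
A minimal sketch of the reordered unlock path (illustrative):

  static void s_stop(struct seq_file *m, void *p)
  {
          spin_unlock(&vmap_area_lock);   /* release the inner spinlock first */
          mutex_unlock(&vmap_purge_lock); /* then the outer mutex */
  }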

Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
Fixes: e36176be1c ("mm/vmalloc: rework vmap_area_lock")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:42 -08:00