Commit Graph

491 Commits

Aristeu Rozanski 0718d86d0c mm: memory-failure: bump memory failure stats to pglist_data
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 18f41fa616ee4d66c67033eb46b951bf6e1b4a12
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Fri Jan 20 03:46:21 2023 +0000

    mm: memory-failure: bump memory failure stats to pglist_data

    Right before memory_failure finishes its handling, accumulate poisoned
    page's resolution counters to pglist_data's memory_failure_stats, so as to
    update the corresponding sysfs entries.

    Tested:
    1) Start an application to allocate memory buffer chunks
    2) Convert random memory buffer addresses to physical addresses
    3) Inject memory errors using EINJ at chosen physical addresses
    4) Access poisoned memory buffer and recover from SIGBUS
    5) Check counter values under
       /sys/devices/system/node/node*/memory_failure/*
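
    For reference, a minimal user-space sketch of test step 4 above: install a
    SIGBUS handler and jump past the poisoned access instead of crashing.  This
    is illustrative only; it does not perform the EINJ injection, and the
    buffer size and access pattern are arbitrary.

    ```
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static sigjmp_buf recover;

    /* SIGBUS is delivered with si_code BUS_MCEERR_AR for an action-required
     * machine check on the touched address; jump back to a safe point. */
    static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
    {
            siglongjmp(recover, 1);
    }

    int main(void)
    {
            struct sigaction sa;
            volatile char *buf = malloc(1 << 20);   /* region to poison via EINJ */

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = sigbus_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);

            if (sigsetjmp(recover, 1) == 0)
                    buf[0] = 1;     /* faults if this page was poisoned */
            else
                    puts("recovered from SIGBUS on poisoned page");

            return 0;
    }
    ```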

    Link: https://lkml.kernel.org/r/20230120034622.2698268-3-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:11 -04:00
Aristeu Rozanski 274a3f2005 mm: memory-failure: add memory failure stats to sysfs
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 44b8f8bf2438bfee3aceae4d647a7460213ff340
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Fri Jan 20 03:46:20 2023 +0000

    mm: memory-failure: add memory failure stats to sysfs

    Patch series "Introduce per NUMA node memory error statistics", v2.

    Background
    ==========

    In the RFC for Kernel Support of Memory Error Detection [1], one advantage
    of software-based scanning over hardware patrol scrubber is the ability to
    make statistics visible to system administrators.  The statistics include
    2 categories:

    * Memory error statistics, for example, how many memory errors are
      encountered and how many of them are recovered by the kernel.  Note these
      memory errors are non-fatal to the kernel: during machine check
      exception (MCE) handling, the kernel has already classified the MCE's
      severity as not requiring a panic (it is either action required or
      action optional).

    * Scanner statistics, for example, how many times the scanner has fully
      scanned a NUMA node, and how many errors are first detected by the scanner.

    The memory error statistics are useful to userspace, are not actually
    specific to scanner-detected memory errors, and are the focus of this
    patchset.

    Motivation
    ==========

    Memory error stats are important to userspace but insufficient in kernel
    today.  Datacenter administrators can better monitor a machine's memory
    health with the visible stats.  For example, while memory errors are
    inevitable on servers with 10+ TB memory, starting server maintenance when
    there are only 1~2 recovered memory errors could be overreacting; in cloud
    production environment maintenance usually means live migrate all the
    workload running on the server and this usually causes nontrivial
    disruption to the customer.  Providing insight into the scope of memory
    errors on a system helps to determine the appropriate follow-up action.
    In addition, the kernel's existing memory error stats need to be
    standardized so that userspace can reliably count on their usefulness.

    Today the kernel provides the following memory error info to userspace, but
    each source is either insufficient or has disadvantages:
    * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
      not per NUMA node stats though
    * ras:memory_failure_event: only available after explicitly enabled
    * /dev/mcelog provides a lot of useful info about the MCEs, but doesn't
      capture how memory_failure recovered memory MCEs
    * kernel logs: userspace needs to process log text

    Exposing memory error stats is also a good start for the in-kernel memory
    error detector.  Today the data sources of memory error stats are either
    direct memory error consumption or hardware patrol scrubber detection
    (signaled as either UCNA or SRAO).  Once the in-kernel memory scanner is
    implemented, it will be the main source, as it is usually configured to
    scan memory DIMMs constantly and faster than the hardware patrol scrubber.

    How Implemented
    ===============

    As Naoya pointed out [2], exposing memory error statistics to userspace is
    useful independent of software or hardware scanner.  Therefore we
    implement the memory error statistics independent of the in-kernel memory
    error detector.  It exposes the following per NUMA node memory error
    counters:

      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed

    These counters describe how many raw pages are poisoned and after the
    attempted recoveries by the kernel, their resolutions: how many are
    recovered, ignored, failed, or delayed respectively.  This approach can be
    easier to extend for future use cases than /proc/meminfo, trace event, and
    log.  The following math holds for the statistics:

    * total = recovered + ignored + failed + delayed

    These memory error stats are reset during machine boot.
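
    A small user-space sketch that reads the counters listed above for one node
    and checks the documented invariant; node0 and the sysfs layout shown above
    are assumed.

    ```
    #include <stdio.h>

    /* Read one memory_failure counter for node 0; returns -1 on error. */
    static long read_counter(const char *name)
    {
            char path[128];
            long val = -1;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node0/memory_failure/%s", name);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (fscanf(f, "%ld", &val) != 1)
                    val = -1;
            fclose(f);
            return val;
    }

    int main(void)
    {
            long total = read_counter("total");
            long sum = read_counter("recovered") + read_counter("ignored") +
                       read_counter("failed") + read_counter("delayed");

            printf("total=%ld, recovered+ignored+failed+delayed=%ld%s\n",
                   total, sum, total == sum ? "" : " (mismatch)");
            return 0;
    }
    ```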

    The 1st commit introduces these sysfs entries.  The 2nd commit populates
    memory error stats every time memory_failure attempts memory error
    recovery.  The 3rd commit adds documentation for the introduced stats.

    [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
    [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

    This patch (of 3):

    Today the kernel provides the following memory error info to userspace, but
    each has its own disadvantage:

    * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
      not per NUMA node stats though

    * ras:memory_failure_event: only available after explicitly enabled

    * /dev/mcelog provides a lot of useful info about the MCEs, but
      doesn't capture how memory_failure recovered memory MCEs

    * kernel logs: userspace needs to process log text

    Exposes per NUMA node memory error stats as sysfs entries:

      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed

    These counters describe how many raw pages are poisoned and after the
    attempted recoveries by the kernel, their resolutions: how many are
    recovered, ignored, failed, or delayed respectively.  The following math
    holds for the statistics:

    * total = recovered + ignored + failed + delayed

    Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
    Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:11 -04:00
Aristeu Rozanski db7d9d8a0e mm/hugetlb: convert get_hwpoison_huge_page() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 04bac040bc71b4b37550eed5854f34ca161756f9
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Wed Jan 18 09:40:39 2023 -0800

    mm/hugetlb: convert get_hwpoison_huge_page() to folios

    Straightforward conversion of get_hwpoison_huge_page() to
    get_hwpoison_hugetlb_folio().  Reduces two references to a head page in
    memory-failure.c

    [arnd@arndb.de: fix get_hwpoison_hugetlb_folio() stub]
      Link: https://lkml.kernel.org/r/20230119111920.635260-1-arnd@kernel.org
    Link: https://lkml.kernel.org/r/20230118174039.14247-1-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:11 -04:00
Aristeu Rozanski 9c0236cf23 mm: clean up mlock_page / munlock_page references in comments
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit e0650a41f7d024b72669a2a2db846ef70281abd8
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:28:27 2023 +0000

    mm: clean up mlock_page / munlock_page references in comments

    Change documentation and comments that refer to now-renamed functions.

    Link: https://lkml.kernel.org/r/20230116192827.2146732-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 2685ef8f1b mm/memory-failure: convert unpoison_memory() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: reverting duplicated 6bbabd041dfd4 backport applied on the wrong spot

commit a6fddef49eef2cf68c23e91d73d6a6d5e2cd448f
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:08 2023 -0600

    mm/memory-failure: convert unpoison_memory() to folios

    Use a folio inside unpoison_memory which replaces a compound_head() call
    with a call to page_folio().

    Link: https://lkml.kernel.org/r/20230112204608.80136-9-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski 5bd96c04d8 mm/memory-failure: convert hugetlb_set_page_hwpoison() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 595dd8185cf1db248b2be4c65ec8936de6ac87c1
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:07 2023 -0600

    mm/memory-failure: convert hugetlb_set_page_hwpoison() to folios

    Change hugetlb_set_page_hwpoison() to folio_set_hugetlb_hwpoison() and use
    a folio internally.

    Link: https://lkml.kernel.org/r/20230112204608.80136-8-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski edb86f25cd mm/memory-failure: convert __free_raw_hwp_pages() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 0858b5eb3aab2de0662a40901699162519628f6e
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:06 2023 -0600

    mm/memory-failure: convert __free_raw_hwp_pages() to folios

    Change __free_raw_hwp_pages() to __folio_free_raw_hwp() and modify its
    callers to pass in a folio.

    Link: https://lkml.kernel.org/r/20230112204608.80136-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski ef57592ae8 mm/memory-failure: convert raw_hwp_list_head() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit b02e7582ef245e9694fff6aee8e95fd1764cc5ee
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:05 2023 -0600

    mm/memory-failure: convert raw_hwp_list_head() to folios

    Change raw_hwp_list_head() to take in a folio and modify its callers to
    pass in a folio.  Also convert two users of hugetlb-specific page macros
    to their folio equivalents.

    Link: https://lkml.kernel.org/r/20230112204608.80136-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski ef1686d43b mm/memory-failure: convert free_raw_hwp_pages() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9637d7dfb19ce934f81cd56cde23573759c73afb
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:04 2023 -0600

    mm/memory-failure: convert free_raw_hwp_pages() to folios

    Change free_raw_hwp_pages() to folio_free_raw_hwp() and convert two users
    of hugetlb-specific page macros to their folio equivalents.

    Link: https://lkml.kernel.org/r/20230112204608.80136-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski 2ca0a475ff mm/memory-failure: convert hugetlb_clear_page_hwpoison to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2ff6cecee669bf0fc63eadebac8cfc81f74b9a4c
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:03 2023 -0600

    mm/memory-failure: convert hugetlb_clear_page_hwpoison to folios

    Change hugetlb_clear_page_hwpoison() to folio_clear_hugetlb_hwpoison() by
    changing the function to take in a folio.  This converts one use of
    ClearPageHWPoison and HPageRawHwpUnreliable to their folio equivalents.

    Link: https://lkml.kernel.org/r/20230112204608.80136-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski a794f60450 mm/memory-failure: convert try_memory_failure_hugetlb() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit bc1cfde194675215857755b75e5fe90f6a654843
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:02 2023 -0600

    mm/memory-failure: convert try_memory_failure_hugetlb() to folios

    Use a struct folio rather than a head page in try_memory_failure_hugetlb.
    This converts one user of SetHPageMigratable to the folio equivalent.

    Link: https://lkml.kernel.org/r/20230112204608.80136-3-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski 43a259d6ab mm/memory-failure: convert __get_huge_page_for_hwpoison() to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 4c110ec98e39944732ec31bf0415f22632bae2b7
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Thu Jan 12 14:46:01 2023 -0600

    mm/memory-failure: convert __get_huge_page_for_hwpoison() to folios

    Patch series "convert hugepage memory failure functions to folios".

    This series contains a 1:1 straightforward page to folio conversion for
    memory failure functions which deal with huge pages.  I renamed a few
    functions to fit with how other folio operating functions are named.
    These include:

    hugetlb_clear_page_hwpoison -> folio_clear_hugetlb_hwpoison
    free_raw_hwp_pages -> folio_free_raw_hwp
    __free_raw_hwp_pages -> __folio_free_raw_hwp
    hugetlb_set_page_hwpoison -> folio_set_hugetlb_hwpoison

    The goal of this series is to reduce the users of the hugetlb-specific
    page flag macros that take in a page, so that the compiler can ensure
    callers are operating on a head page.

    This patch (of 8):

    Use a folio throughout the function rather than using a head page.  This
    also reduces the users of the page version of hugetlb specific page flags.
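
    To illustrate the pattern (the example_* helpers below are hypothetical,
    not functions from this series): a helper that takes a struct page accepts
    any subpage, while its folio counterpart can only be handed a folio, so the
    head-page assumption is enforced at compile time.

    ```
    /* Before: any subpage may be passed; the helper has to assume (or check)
     * that it was really given the head page of the huge page. */
    static void example_clear_page_hwpoison(struct page *hpage)
    {
            ClearPageHWPoison(hpage);
    }

    /* After: a struct folio never refers to a tail page, so the assumption
     * is enforced by the type system rather than by convention. */
    static void example_clear_folio_hwpoison(struct folio *folio)
    {
            folio_clear_hwpoison(folio);
    }
    ```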

    Link: https://lkml.kernel.org/r/20230112204608.80136-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:07 -04:00
Aristeu Rozanski 2325b48b0c tools/vm: rename tools/vm to tools/mm
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: MAINTAINERS change dropped. already done; update spec file for the rename

commit 799fb82aa132fa3a3886b7872997a5a84e820062
Author: SeongJae Park <sj@kernel.org>
Date:   Tue Jan 3 18:07:52 2023 +0000

    tools/vm: rename tools/vm to tools/mm

    Rename tools/vm to tools/mm to be more consistent with the code and
    documentation directories, and to avoid confusion with virtual machines.

    Link: https://lkml.kernel.org/r/20230103180754.129637-4-sj@kernel.org
    Signed-off-by: SeongJae Park <sj@kernel.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:03 -04:00
Lucas Zampieri 6f794c0e0b Merge: MM update to v6.2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3738

JIRA: https://issues.redhat.com/browse/RHEL-27739

Depends: !3662

Dropped Patches and the reason they were dropped:

Needs to be evaluated by the FS team:
138060ba92b3 ("fs: pass dentry to set acl method")
3b4c7bc01727 ("xattr: use rbtree for simple_xattrs")

Needs to be evaluated by the NVME team:
4003f107fa2e ("mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages")

Needs to be evaluated by the ZRAM team:
7c2af309abd2 ("zram: add size class equals check into recompression")

Signed-off-by: Audra Mitchell <audra@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: David Arcari <darcari@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-17 10:14:56 -03:00
Audra Mitchell 89f083976d mm/memory-failure.c: cleanup in unpoison_memory
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit e0ff428042335c7b62785b3cf911c427a618bc86
Author: Ma Wupeng <mawupeng1@huawei.com>
Date:   Fri Nov 25 14:54:44 2022 +0800

    mm/memory-failure.c: cleanup in unpoison_memory

    If freeit is true, the value of ret must be zero, so there is no need to
    check the value of freeit after the unlock_mutex label.

    We can drop variable freeit to do this cleanup.

    Link: https://lkml.kernel.org/r/20221125065444.3462681-1-mawupeng1@huawei.com
    Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: zhenwei pi <pizhenwei@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:00 -04:00
Bill O'Donnell d7fddd5eaa mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
JIRA: https://issues.redhat.com/browse/RHEL-12888

Conflicts: difference from upstream mm/memory-failure.c

commit fa422b353d212373fb2b2857a5ea5a6fa4876f9c
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Mon Oct 23 15:20:46 2023 +0800

    mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind

    Now, if we suddenly remove a PMEM device (by calling unbind) that
    contains FSDAX while programs are still accessing data on this device,
    e.g.:
    ```
     $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
     # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
     echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
    ```
    it could come into an unacceptable state:
      1. device has gone but mount point still exists, and umount will fail
           with "target is busy"
      2. programs will hang and cannot be killed
      3. may crash with NULL pointer dereference

    To fix this, we introduce an MF_MEM_PRE_REMOVE flag to let the filesystem
    know that we are going to remove the whole device, and make sure all
    related processes are notified so that they can exit gracefully.

    This patch is inspired by Dan's "mm, dax, pmem: Introduce
    dev_pagemap_failure()"[1].  With the help of dax_holder and
    ->notify_failure() mechanism, the pmem driver is able to ask filesystem
    on it to unmap all files in use, and notify processes who are using
    those files.

    Call trace:
    trigger unbind
     -> unbind_store()
      -> ... (skip)
       -> devres_release_all()
        -> kill_dax()
         -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
          -> xfs_dax_notify_failure()
          `-> freeze_super()             // freeze (kernel call)
          `-> do xfs rmap
          ` -> mf_dax_kill_procs()
          `  -> collect_procs_fsdax()    // all associated processes
          `  -> unmap_and_kill()
          ` -> invalidate_inode_pages2_range() // drop file's cache
          `-> thaw_super()               // thaw (both kernel & user call)

    Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
    event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
    new dax mappings from being created.  Do not shut down the filesystem
    directly if the configuration is not supported, or if the failure range
    includes the metadata area.  Make sure all files and processes (not only
    the current process) are handled correctly.  Also drop the cache of
    associated files before the pmem is removed.
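
    A hedged sketch of how a dax holder's ->notify_failure() handler can
    distinguish the new pre-remove event; example_notify_failure() is
    hypothetical, and the real handling lives in xfs_dax_notify_failure() as
    shown in the call trace above.

    ```
    static int example_notify_failure(struct dax_device *dax_dev,
                                      u64 offset, u64 len, int mf_flags)
    {
            if (mf_flags & MF_MEM_PRE_REMOVE) {
                    /* Whole device is going away: freeze the filesystem,
                     * unmap every file backed by it, drop caches, thaw.
                     * Do not shut the filesystem down for this event. */
                    return 0;
            }

            /* Ordinary poison: reverse-map offset/len to file ranges and
             * notify the affected processes, e.g. via mf_dax_kill_procs(). */
            return -EOPNOTSUPP;
    }
    ```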

    [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
    [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-04-05 11:58:49 -05:00
Chris von Recklinghausen 89634cbdb8 mm/various: give up if pte_offset_map[_lock]() fails
Conflicts:
	mm/gup.c - We don't have
		f7355e99d9f7 ("mm/gup: remove FOLL_MIGRATION")
		so don't remove retry label.
	mm/ksm.c - We don't have
		d7c0e68dab98 ("mm/ksm: convert break_ksm() to use walk_page_range_vma()")
		so don't add definition of break_ksm_pmd_entry or break_ksm_ops

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 04dee9e85cf50a2f24738e456d66b88de109b806
Author: Hugh Dickins <hughd@google.com>
Date:   Thu Jun 8 18:29:22 2023 -0700

    mm/various: give up if pte_offset_map[_lock]() fails

    Following the examples of nearby code, various functions can just give up
    if pte_offset_map() or pte_offset_map_lock() fails.  And there's no need
    for a preliminary pmd_trans_unstable() or other such check, since such
    cases are now safely handled inside.

    Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <song@kernel.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:16:16 -04:00
Chris von Recklinghausen 82a2c876fa mm,hugetlb: use folio fields in second tail page
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit dad6a5eb55564845aa17b8b20fa834af21e46c48
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Nov 2 18:48:45 2022 -0700

    mm,hugetlb: use folio fields in second tail page

    Patch series "mm,huge,rmap: unify and speed up compound mapcounts".

    This patch (of 3):

    We want to declare one more int in the first tail of a compound page: that
    first tail page being valuable property, since every compound page has a
    first tail, but perhaps no more than that.

    No problem on 64-bit: there is already space for it.  No problem with
    32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always
    available") kindly cleared the space for it, apparently not realizing that
    only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail
    page->private might conflict) - but make sure of that in its Kconfig.

    But hugetlb pages use tail page->private of the first tail page for a
    subpool pointer, which will conflict; and they also use page->private of
    the 2nd, 3rd and 4th tails.

    Undo "mm: add private field of first tail to struct page and struct
    folio"'s recent addition of private_1 to the folio tail: instead add
    hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to
    a second tail page of the folio: THP has long been using several fields of
    that tail, so make better use of it for hugetlb too.  This is not how a
    generic folio should be declared in future, but it is an effective
    transitional way to make use of it.

    Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.

    [hughd@google.com: prefix folio's page_1 and page_2 with double underscore,
      give folio's _flags_2 and _head_2 a line documentation each]
      Link: https://lkml.kernel.org/r/9e2cb6b-5b58-d3f2-b5ee-5f8a14e8f10@google.com
    Link: https://lkml.kernel.org/r/5f52de70-975-e94f-f141-543765736181@google.com
    Link: https://lkml.kernel.org/r/3818cc9a-9999-d064-d778-9c94c5911e6@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:25 -04:00
Chris von Recklinghausen 891dbb5790 mm/hwpoison: introduce per-memory_block hwpoison counter
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 5033091de814ab4b5623faed2755f3064e19e2d2
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:12 2022 +0900

    mm/hwpoison: introduce per-memory_block hwpoison counter

    Currently PageHWPoison flag does not behave well when experiencing memory
    hotremove/hotplug.  Any data field in struct page is unreliable when the
    associated memory is offlined, and the current mechanism can't tell
    whether a memory block is onlined because a new memory devices is
    installed or because previous failed offline operations are undone.
    Especially if there's a hwpoisoned memory, it's unclear what the best
    option is.

    So introduce a new mechanism to make struct memory_block remember that a
    memory block has hwpoisoned memory inside it.  And make any online event
    fail if the onlining memory block contains hwpoison.  struct memory_block
    is freed and reallocated over ACPI-based hotremove/hotplug, but not over
    sysfs-based hotremove/hotplug.  So the new counter can distinguish these
    cases.

    Link: https://lkml.kernel.org/r/20221024062012.1520887-5-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:19 -04:00
Chris von Recklinghausen 9546f93bc5 mm/hwpoison: pass pfn to num_poisoned_pages_*()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a46c9304b4bbf1b164154976cbb7e648980c7b5b
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:11 2022 +0900

    mm/hwpoison: pass pfn to num_poisoned_pages_*()

    No functional change.

    Link: https://lkml.kernel.org/r/20221024062012.1520887-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:19 -04:00
Chris von Recklinghausen f7a3d4396f mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit d027122d8363e58cd8bc2fa6a16917f7f69b85bb
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:10 2022 +0900

    mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c

    These interfaces will be used by drivers/base/memory.c by later patch, so
    as a preparatory work move them to more common header file visible to the
    file.

    Link: https://lkml.kernel.org/r/20221024062012.1520887-3-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
Chris von Recklinghausen 0e8d7c85ff mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e591ef7d96d6ea249916f351dc26a636e565c635
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Mon Oct 24 15:20:09 2022 +0900

    mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage

    Patch series "mm, hwpoison: improve handling workload related to hugetlb
    and memory_hotplug", v7.

    This patchset tries to solve the issue among memory_hotplug, hugetlb and hwpoison.
    In this patchset, memory hotplug handles hwpoison pages like below:

      - hwpoison pages should not prevent memory hotremove,
      - memory block with hwpoison pages should not be onlined.

    This patch (of 4):

    HWPoisoned page is not supposed to be accessed once marked, but currently
    such accesses can happen during memory hotremove because
    do_migrate_range() can be called before dissolve_free_huge_pages() is
    called.

    Clear HPageMigratable for hwpoisoned hugepages to prevent them from being
    migrated.  This should be done in hugetlb_lock to avoid race against
    isolate_hugetlb().

    get_hwpoison_huge_page() needs to have a flag to show it's called from
    unpoison to take refcount of hwpoisoned hugepages, so add it.

    [naoya.horiguchi@linux.dev: remove TestClearHPageMigratable and reduce to test and clear separately]
      Link: https://lkml.kernel.org/r/20221025053559.GA2104800@ik1-406-35019.vs.sakura.ne.jp
    Link: https://lkml.kernel.org/r/20221024062012.1520887-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20221024062012.1520887-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:18 -04:00
Chris von Recklinghausen 3b309fdf96 rmap: remove page_unlock_anon_vma_read()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 0c826c0b6a176b9ed5ace7106fd1770bb48f1898
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Sep 2 20:46:51 2022 +0100

    rmap: remove page_unlock_anon_vma_read()

    This was simply an alias for anon_vma_unlock_read() since 2011.

    Link: https://lkml.kernel.org/r/20220902194653.1739778-56-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:14:05 -04:00
Chris von Recklinghausen 9b7af11315 mm, hwpoison: fix page refcnt leaking in unpoison_memory()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 6bbabd041dfd4823c752940286656d404621bf38
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:12 2022 +0800

    mm, hwpoison: fix page refcnt leaking in unpoison_memory()

    When free_raw_hwp_pages() fails its work, the refcnt of the hugetlb page
    would have been incremented if ret > 0.  Use put_page() to fix the refcnt
    leak in this case.

    Link: https://lkml.kernel.org/r/20220818130016.45313-3-linmiaohe@huawei.com
    Fixes: debb6b9c3fdd ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:32 -04:00
Dave Wysochanski 924daddc03 mm: merge folio_has_private()/filemap_release_folio() call pairs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2209756

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache.  The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix.  Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
add a check for this.

This patch (of 2):

Make filemap_release_folio() check folio_has_private().  Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vmscan.c where this can't so easily be
done.  In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory.  This will acquire
additional parts to the condition in a future patch.
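
A sketch of that wrapper, assuming the mapping_release_always() /
AS_RELEASE_ALWAYS check added by the second patch (the exact upstream form may
differ):

```
static inline bool folio_needs_release(struct folio *folio)
{
        struct address_space *mapping = folio_mapping(folio);

        /* Release if the folio carries private data, or if the mapping has
         * asked to always see releases (AS_RELEASE_ALWAYS). */
        return folio_has_private(folio) ||
               (mapping && mapping_release_always(mapping));
}
```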

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
2023-09-13 18:19:41 -04:00
Dave Wysochanski 62e9490d1b memory-failure: convert truncate_error_page() to use folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2209756

Replace try_to_release_page() with filemap_release_folio().  This change
is in preparation for the removal of the try_to_release_page() wrapper.

Link: https://lkml.kernel.org/r/20221118073055.55694-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ac5efa782041670b63a05c36d92d02a80e50bb63)
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
2023-09-13 18:17:54 -04:00
Bill O'Donnell 3124cb8bb9 mm/memory-failure: fall back to vma_address() when ->notify_failure() fails
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

Conflicts: line numbering diffs from previous out of order mm patch

commit ac87ca0ea0102ec5dfd17c95eb9fa87a8bf54d61
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Fri Aug 26 10:18:14 2022 -0700

    mm/memory-failure: fall back to vma_address() when ->notify_failure() fails

    In the case where a filesystem is polled to take over the memory failure
    and receives -EOPNOTSUPP, it indicates that page->index and page->mapping
    are valid for reverse mapping the failure address.  Introduce
    FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from
    mf_dax_kill_procs() by a filesystem vs the typical memory_failure() path.

    Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which
    then trips this failing signature:

     kernel BUG at mm/memory-failure.c:319!
     invalid opcode: 0000 [#1] PREEMPT SMP PTI
     CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G           OE    N 6.0.0-rc2+ #62
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
     RIP: 0010:add_to_kill.cold+0x19d/0x209
     [..]
     Call Trace:
      <TASK>
      collect_procs.part.0+0x2c4/0x460
      memory_failure+0x71b/0xba0
      ? _printk+0x58/0x73
      do_madvise.part.0.cold+0xaf/0xc5

    Link: https://lkml.kernel.org/r/166153429427.2758201.14605968329933175594.stgit@dwillia2-xfh.jf.intel.com
    Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-16 10:35:48 -05:00
Bill O'Donnell 77fdf9407a mm/memory-failure: fix detection of memory_failure() handlers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

commit 65d3440e8d2e9e024ef2357f8ff021611b068c99
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Fri Aug 26 10:18:07 2022 -0700

    mm/memory-failure: fix detection of memory_failure() handlers

    Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax), do not even
    have pagemap ops, which results in crash signatures like this:

      BUG: kernel NULL pointer dereference, address: 0000000000000010
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 22 PID: 4535 Comm: device-dax Tainted: G           OE    N 6.0.0-rc2+ #59
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:memory_failure+0x667/0xba0
     [..]
      Call Trace:
       <TASK>
       ? _printk+0x58/0x73
       do_madvise.part.0.cold+0xaf/0xc5

    Check for ops before checking if the ops have a memory_failure()
    handler.

    Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com
    Fixes: 33a8f7f2b3a3 ("pagemap,pmem: introduce ->memory_failure()")
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-16 10:35:48 -05:00
Bill O'Donnell 76df626f11 mm: introduce mf_dax_kill_procs() for fsdax case
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

Conflicts: line numbering diffs due to previous out of order mm patches

commit c36e2024957120566efd99395b5c8cc95b5175c1
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:29 2022 +0800

    mm: introduce mf_dax_kill_procs() for fsdax case

    This new function is a variant of mf_generic_kill_procs that accepts a
    file, offset pair instead of a struct to support multiple files sharing a
    DAX mapping.  It is intended to be called by the file systems as part of
    the memory_failure handler after the file system performed a reverse
    mapping from the storage address to the file and file offset.

    Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-06 13:41:03 -05:00
Bill O'Donnell df39916052 pagemap,pmem: introduce ->memory_failure()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

commit 33a8f7f2b3a3437d016d1b4047a4fd37eb6951b3
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:27 2022 +0800

    pagemap,pmem: introduce ->memory_failure()

    When a memory failure occurs, we call this function, which is implemented
    by each kind of device.  For the fsdax case, the pmem device driver
    implements it.  The pmem device driver will find the filesystem in which
    the corrupted page is located.

    With dax_holder notify support, we are able to notify the memory failure
    from the pmem driver to the upper layers.  If something is not supported in
    the notify routine, memory_failure will fall back to the generic handler.
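
    For reference, the shape of the new callback, plus a hedged sketch
    approximating the pmem implementation described above (treat the body as
    illustrative; field names such as phys_addr and data_offset follow the
    pmem driver).

    ```
    /* New hook in struct dev_pagemap_ops (as introduced by this patch):
     *
     *     int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
     *                           unsigned long nr_pages, int mf_flags);
     */

    /* Sketch: recover the pmem device from the pagemap, translate the pfn
     * range to a device offset, and notify the dax holder (the filesystem). */
    static int example_pmem_memory_failure(struct dev_pagemap *pgmap,
                    unsigned long pfn, unsigned long nr_pages, int mf_flags)
    {
            struct pmem_device *pmem =
                    container_of(pgmap, struct pmem_device, pgmap);
            u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
            u64 len = nr_pages << PAGE_SHIFT;

            return dax_holder_notify_failure(pmem->dax_dev, offset, len,
                                             mf_flags);
    }
    ```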

    Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-06 13:41:02 -05:00
Aristeu Rozanski d735cfc43a mm: memory-failure: make action_result() return int
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit b66d00dfebe79ebd0d5a0ec4ee4e26583432c381
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Oct 21 16:46:11 2022 +0800

    mm: memory-failure: make action_result() return int

    Check mf_result in action_result(): only return 0 when MF_RECOVERED,
    otherwise return -EBUSY.  This simplifies the code a bit.

    [wangkefeng.wang@huawei.com: v2]
      Link: https://lkml.kernel.org/r/20221024035138.99119-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20221021084611.53765-3-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:49 -04:00
Aristeu Rozanski a2d518b237 mm: memory-failure: avoid pfn_valid() twice in soft_offline_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 183a7c5d15d3c56f49955662d3edd0092141df78
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Oct 21 16:46:10 2022 +0800

    mm: memory-failure: avoid pfn_valid() twice in soft_offline_page()

    Simplify WARN_ON_ONCE(flags & MF_COUNT_INCREASED) under !pfn_valid().

    Link: https://lkml.kernel.org/r/20221021084611.53765-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:49 -04:00
Aristeu Rozanski 48a9e0c8e8 mm: memory-failure: make put_ref_page() more useful
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit b5f1fc98c62b6b75e9f7499e7519dc67684affd3
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Oct 21 16:46:09 2022 +0800

    mm: memory-failure: make put_ref_page() more useful

    Pass pfn/flags to put_ref_page(), then check MF_COUNT_INCREASED and drop
    refcount to make the code look cleaner.

    Link: https://lkml.kernel.org/r/20221021084611.53765-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:49 -04:00
Aristeu Rozanski ec94d66baa hugetlbfs: don't delete error page from pagecache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests
Conflicts: different context due missing fa27759af4a6d7494c986c44695b13bcd6eaf46b. We still use remove_huge_page() since we didn't backport 7e1813d48dd30e6c6f235f6661d1bc108fcab528 which renames it as hugetlb_delete_from_page_cache()

commit 8625147cafaa9ba74713d682f5185eb62cb2aedb
Author: James Houghton <jthoughton@google.com>
Date:   Tue Oct 18 20:01:25 2022 +0000

    hugetlbfs: don't delete error page from pagecache

    This change is very similar to the change that was made for shmem [1], and
    it solves the same problem but for HugeTLBFS instead.

    Currently, when poison is found in a HugeTLB page, the page is removed
    from the page cache.  That means that attempting to map or read that
    hugepage in the future will result in a new hugepage being allocated
    instead of notifying the user that the page was poisoned.  As [1] states,
    this is effectively memory corruption.

    The fix is to leave the page in the page cache.  If the user attempts to
    use a poisoned HugeTLB page with a syscall, the syscall will fail with
    EIO, the same error code that shmem uses.  For attempts to map the page,
    the thread will get a BUS_MCEERR_AR SIGBUS.

    [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

    Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
    Signed-off-by: James Houghton <jthoughton@google.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski afd17c90c6 mm, hwpoison: cleanup some obsolete comments
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 9cf2819159d5a35311fc39c328ebeca5ce50d7c0
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:36:04 2022 +0800

    mm, hwpoison: cleanup some obsolete comments

    1. Remove the meaningless comment in kill_proc(); it doesn't tell anything.
    2. Fix the wrong function name get_hwpoison_unless_zero(); it should be
       get_page_unless_zero().
    3. The gate keeper for free hwpoison pages has moved to check_new_page();
       update the corresponding comment.

    Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski f2fdfb9510 mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit b680dae9a881bbf80dc53a79a59e4f1386b7da5e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:36:03 2022 +0800

    mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings()

    PageTable can't be handled by memory_failure(). Filter it out explicitly in
    hwpoison_user_mappings(). This will also make code more consistent with the
    relevant check in unpoison_memory().

    Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski 9c45fd19d6 mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests
Conflicts: context difference due lack of ac87ca0ea0102ec5dfd17c95eb9fa87a8bf54d61. It's part of a series of dax patches not directly related to this one

commit 36537a67d3561bfe2b3654161d6c9008fff84d43
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:36:02 2022 +0800

    mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon()

    If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as
    add_to_kill() won't be called in this case.  Move the mm check up to
    avoid possibly unneeded calls to page_mapped_in_vma().

    Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski ef81b4b92e mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 21c9e90ab9a4c991d21dd15cc5163c99a885d4a8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:36:01 2022 +0800

    mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages

    Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
    num_poisoned_pages_dec() can be killed as there's no caller now.
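
    As an aside (illustrative only, not kernel code): the benefit of folding
    several decrements into a single subtraction can be modeled with C11
    atomics, where one fetch-sub replaces N separate atomic decrements.

        #include <stdatomic.h>
        #include <stdio.h>

        static atomic_long num_poisoned;

        /* One atomic op instead of calling a "dec" helper nr times. */
        static void poisoned_pages_sub(long nr)
        {
            atomic_fetch_sub_explicit(&num_poisoned, nr, memory_order_relaxed);
        }

        int main(void)
        {
            atomic_store(&num_poisoned, 8);
            poisoned_pages_sub(3);
            printf("remaining: %ld\n", atomic_load(&num_poisoned));
            return 0;
        }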

    Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski bcb50715fd mm, hwpoison: use __PageMovable() to detect non-lru movable pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit da29499124cd2221539b235c1f93c7d93faf6565
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:36:00 2022 +0800

    mm, hwpoison: use __PageMovable() to detect non-lru movable pages

    It's preferable to use __PageMovable() to detect non-lru movable pages.
    We can then avoid bumping the page refcnt via isolate_movable_page() for
    the isolated lru pages.  Also, if pages become PageLRU just after they're
    checked but before we try to isolate them, isolate_lru_page() will be
    called to do the right work.

    [linmiaohe@huawei.com: fixes per Naoya Horiguchi]
      Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
    Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski 7506ee2589 mm, hwpoison: use ClearPageHWPoison() in memory_failure()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 2fe62e222680e1d6ff7112cad5bcccdc858d020d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 30 20:35:59 2022 +0800

    mm, hwpoison: use ClearPageHWPoison() in memory_failure()

    Patch series "A few cleanup patches for memory-failure".

    This series contains a few cleanup patches to use __PageMovable() to detect
    non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple
    atomic ops overheads and so on.  More details can be found in the
    respective changelogs.

    This patch (of 6):

    Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
    hwpoison flags to avoid unneeded full memory barrier overhead.
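
    The barrier point generalizes beyond page flags: a read-modify-write that
    must report the old value is fully ordered, while a plain clear can be
    relaxed.  A hedged C11 sketch of the distinction (not the kernel's
    page-flag implementation):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define PG_HWPOISON (1u << 0)

        static atomic_uint page_flags;

        /* Like TestClearPageHWPoison(): returns the old state, fully ordered. */
        static bool test_clear_poison(void)
        {
            return atomic_fetch_and(&page_flags, ~PG_HWPOISON) & PG_HWPOISON;
        }

        /* Like ClearPageHWPoison(): no old value needed, so relaxed ordering
         * (and hence no full memory barrier) is enough. */
        static void clear_poison(void)
        {
            atomic_fetch_and_explicit(&page_flags, ~PG_HWPOISON,
                                      memory_order_relaxed);
        }

        int main(void)
        {
            atomic_store(&page_flags, PG_HWPOISON);
            clear_poison();
            printf("still poisoned? %d\n", test_clear_poison());
            return 0;
        }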

    Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski cfddcb8fa8 mm: memory-failure: kill __soft_offline_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 48309e1f6f7b15b041b39b7f15e7adc1c7e2de95
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Aug 19 11:34:02 2022 +0800

    mm: memory-failure: kill __soft_offline_page()

    Squash the __soft_offline_page() into soft_offline_in_use_page() and kill
    __soft_offline_page().

    [wangkefeng.wang@huawei.com: update hpage when try_to_split_thp_page() succeeds]
      Link: https://lkml.kernel.org/r/20220830104654.28234-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20220819033402.156519-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski d667a7b752 mm: memory-failure: kill soft_offline_free_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 7adb45887c8af88985c335b53d253654e9d2dd16
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Fri Aug 19 11:34:01 2022 +0800

    mm: memory-failure: kill soft_offline_free_page()

    Open-code the page_handle_poison() into soft_offline_page() and kill
    unneeded soft_offline_free_page().

    Link: https://lkml.kernel.org/r/20220819033402.156519-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski b5794fd691 mm, hwpoison: avoid trying to unpoison reserved page
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit e9ff3ba7ff10490a92792faf1d3573a24fc6e5c9
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:16 2022 +0800

    mm, hwpoison: avoid trying to unpoison reserved page

    For reserved pages, the HWPoison flag will be set without increasing the
    page refcnt.  So we shouldn't even try to unpoison these pages and thus
    decrease the page refcnt unexpectedly.  Add a PageReserved() check to filter
    this case out and remove the below unneeded zero page (zero page is
    reserved) check.

    Link: https://lkml.kernel.org/r/20220818130016.45313-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski a00cfa9e9d mm, hwpoison: kill procs if unmap fails
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 0792a4a6195a6d67a4ead2554da393cbd8dcdf5a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:15 2022 +0800

    mm, hwpoison: kill procs if unmap fails

    If try_to_unmap() fails, the hwpoisoned page still resides in the address
    space of some processes.  We should kill these processes or the hwpoisoned
    page might be consumed later.  collect_procs() is always called to collect
    relevant processes now so they can be killed later if unmap fails.

    [linmiaohe@huawei.com: v2]
      Link: https://lkml.kernel.org/r/20220823032346.4260-6-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220818130016.45313-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski e2568b519f mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 54f9555d4031d4aeb10651aa9dcb5335f6a05865
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Aug 23 11:23:44 2022 +0800

    mm, hwpoison: fix possible use-after-free in mf_dax_kill_procs()

    After kill_procs(), tk will be freed without being removed from the
    to_kill list.  In the next iteration, the freed list entry in the to_kill
    list will be accessed, thus leading to use-after-free issue.  Adding
    list_del() in kill_procs() to fix the issue.
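
    The underlying rule, unlink an entry before freeing it so the list never
    points at freed memory, can be shown with a tiny self-contained C model
    (an illustration, not the kernel's to_kill handling):

        #include <stdio.h>
        #include <stdlib.h>

        struct to_kill {
            struct to_kill *next;
            int pid;
        };

        /* Process and free every entry, unlinking first ("list_del") so a
         * later walk of the list cannot touch freed memory. */
        static void kill_procs(struct to_kill **head)
        {
            while (*head) {
                struct to_kill *tk = *head;

                *head = tk->next;               /* unlink before free */
                printf("sending SIGBUS to pid %d\n", tk->pid);
                free(tk);
            }
        }

        int main(void)
        {
            struct to_kill *head = NULL;

            for (int pid = 100; pid < 103; pid++) {
                struct to_kill *tk = malloc(sizeof(*tk));

                if (!tk)
                    break;
                tk->pid = pid;
                tk->next = head;
                head = tk;
            }
            kill_procs(&head);
            return 0;   /* calling kill_procs(&head) again is a no-op */
        }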

    Link: https://lkml.kernel.org/r/20220823032346.4260-5-linmiaohe@huawei.com
    Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:48 -04:00
Aristeu Rozanski ad97a4a927 mm, hwpoison: fix page refcnt leaking in unpoison_memory()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 6bbabd041dfd4823c752940286656d404621bf38
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:12 2022 +0800

    mm, hwpoison: fix page refcnt leaking in unpoison_memory()

    When free_raw_hwp_pages() fails its work, the refcnt of the hugetlb page
    would have been incremented if ret > 0.  Use put_page() to fix the refcnt
    leak in this case.

    Link: https://lkml.kernel.org/r/20220818130016.45313-3-linmiaohe@huawei.com
    Fixes: debb6b9c3fdd ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:47 -04:00
Aristeu Rozanski b2214bac44 mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit f36a5543a74883c21a59b8082b403a13c7654769
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:11 2022 +0800

    mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()

    Patch series "A few fixup patches for memory-failure", v2.

    This series contains a few fixup patches to fix incorrect update of page
    refcnt, fix possible use-after-free issue and so on.  More details can be
    found in the respective changelogs.

    This patch (of 6):

    When hwpoison_filter() refuses to hwpoison a hugetlb page, the refcnt of
    the page would have been incremented if res == 1.  Use put_page() to fix
    the refcnt leak in this case.

    Link: https://lkml.kernel.org/r/20220823032346.4260-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220818130016.45313-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220818130016.45313-2-linmiaohe@huawei.com
    Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:47 -04:00
Aristeu Rozanski 026c3fe92c mm: memory-failure: cleanup try_to_split_thp_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2184858
Tested: with reproducer and generic tests

commit 2ace36f0f55777be8a871c370832527e1cd54b15
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue Aug 9 19:18:13 2022 +0800

    mm: memory-failure: cleanup try_to_split_thp_page()

    Since commit 5d1fd5dc87 ("mm,hwpoison: introduce MF_MSG_UNSPLIT_THP"),
    action_result(..., MF_MSG_UNSPLIT_THP, ...) is called to report the memory
    error event in memory_failure(), so the pr_info() in
    try_to_split_thp_page() is only needed in soft_offline_in_use_page().

    Meanwhile this could also fix the unexpected prefix for "thp split failed"
    due to commit 96f96763de26 ("mm: memory-failure: convert to pr_fmt()").

    Link: https://lkml.kernel.org/r/20220809111813.139690-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2023-04-18 10:53:47 -04:00
Rafael Aquini d1056730f0 mm/swap: add swp_offset_pfn() to fetch PFN from swap entry
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168392

This patch is a backport of the following upstream commit:
commit 0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Aug 11 12:13:27 2022 -0400

    mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

    We've got a bunch of special swap entries that store a PFN inside the
    swap offset field.  To fetch the PFN, normally the user just calls
    swp_offset() assuming that'll be the PFN.

    Add a helper swp_offset_pfn() to fetch the PFN instead.  It fetches only
    the maximum possible number of PFN bits on the host, and a BUILD_BUG_ON()
    in is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS that the swap
    offset can always store a PFN properly.

    One reason to do so is that we never tried to verify whether the swap
    offset can really fit a PFN.  At the same time, this patch also prepares
    for the future possibility of storing more information inside the swp
    offset field, so the assumption that "swp_offset(entry)" is the PFN will
    not stand any more very soon.

    Replace many of the swp_offset() callers with swp_offset_pfn() where
    appropriate.  Note that many of the existing users are not candidates for
    the replacement, e.g.:

      (1) When the swap entry is not a pfn swap entry at all, or,
      (2) when we want to keep the whole swp_offset but only change the swp type.

    For the latter, it can happen when fork() is triggered on a
    write-migration swap entry pte: we may want to change only the migration
    type from write->read but keep the rest, so it's not "fetching the PFN"
    but "changing the swap type only".  Those callers are left aside so that
    when there's more information within the swp offset it will be carried
    over naturally in those cases.

    While at it, drop hwpoison_entry_to_pfn() because that's exactly what
    the new swp_offset_pfn() does.
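
    The idea of carving a PFN field out of the swap offset, with a
    build-time check that it always fits, can be sketched with plain C bit
    operations and _Static_assert.  The field widths below are made up for
    illustration and are not the kernel's actual layout.

        #include <stdint.h>
        #include <stdio.h>

        #define MAX_PHYSMEM_BITS  46
        #define PAGE_SHIFT        12
        #define SWP_PFN_BITS      (MAX_PHYSMEM_BITS - PAGE_SHIFT)
        #define SWP_PFN_MASK      ((UINT64_C(1) << SWP_PFN_BITS) - 1)
        #define SWP_OFFSET_BITS   58    /* hypothetical width of the offset */

        /* BUILD_BUG_ON()-style guarantee: a PFN always fits in the offset. */
        _Static_assert(SWP_PFN_BITS <= SWP_OFFSET_BITS,
                       "swap offset too narrow to store a PFN");

        /* Fetch only the PFN bits, ignoring any extra info stored above. */
        static uint64_t swp_offset_pfn(uint64_t offset)
        {
            return offset & SWP_PFN_MASK;
        }

        int main(void)
        {
            /* A bit above the PFN field models "more information". */
            uint64_t offset = (UINT64_C(1) << SWP_PFN_BITS) | 0x1234;

            printf("pfn = 0x%llx\n",
                   (unsigned long long)swp_offset_pfn(offset));
            return 0;
        }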

    Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andi Kleen <andi.kleen@intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-03 10:16:24 -04:00
Chris von Recklinghausen ba08386a56 mm,hwpoison: check mm when killing accessing process
Bugzilla: https://bugzilla.redhat.com/2160210

commit 77677cdbc2aa4b5d5d839562793d3d126201d18d
Author: Shuai Xue <xueshuai@linux.alibaba.com>
Date:   Wed Sep 14 14:49:35 2022 +0800

    mm,hwpoison: check mm when killing accessing process

    The GHES code calls memory_failure_queue() from IRQ context to queue work
    into workqueue and schedule it on the current CPU.  Then the work is
    processed in memory_failure_work_func() by kworker and calls
    memory_failure().

    When a page is already poisoned, commit a3f5d80ea4 ("mm,hwpoison: send
    SIGBUS with error virutal address") makes memory_failure() call
    kill_accessing_process() that:

        - holds mmap locking of current->mm
        - does pagetable walk to find the error virtual address
        - and sends SIGBUS to the current process with error info.

    However, the mm of kworker is not valid, resulting in a null-pointer
    dereference.  So check mm when killing the accessing process.
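
    A simplified model of the guard (plain C, not the actual
    kill_accessing_process() code): bail out before walking an address space
    that the current context does not have.

        #include <errno.h>
        #include <stdio.h>

        struct mm_struct { int dummy; };
        struct task { struct mm_struct *mm; };  /* kworkers have mm == NULL */

        /* Refuse to walk a non-existent mm instead of dereferencing NULL. */
        static int kill_accessing_process_model(struct task *current_task)
        {
            if (!current_task->mm)
                return -EFAULT; /* kernel-thread context, nothing to walk */

            /* ... take the mmap lock, walk page tables, send SIGBUS ... */
            return 0;
        }

        int main(void)
        {
            struct task kworker = { .mm = NULL };

            printf("kworker path: %d\n",
                   kill_accessing_process_model(&kworker));
            return 0;
        }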

    [akpm@linux-foundation.org: remove unrelated whitespace alteration]
    Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
    Fixes: a3f5d80ea4 ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen 1ac3b99189 mm, hwpoison: fix extra put_page() in soft_offline_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 12f1dbcf8f144c0b8dde7a62fea766f88cb79fc8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:13 2022 +0800

    mm, hwpoison: fix extra put_page() in soft_offline_page()

    When hwpoison_filter() refuses to soft offline a page, the page refcnt
    incremented previously by MF_COUNT_INCREASED would have been consumed via
    get_hwpoison_page() if ret <= 0.  So the put_ref_page() here drops the
    refcount one extra time.  Remove it to fix the issue.

    Link: https://lkml.kernel.org/r/20220818130016.45313-4-linmiaohe@huawei.com
    Fixes: 9113eaf331bf ("mm/memory-failure.c: add hwpoison_filter for soft offline")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen 7228fc6140 mm, hwpoison: enable memory error handling on 1GB hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6f4614886baa59b6ae014093300482c1da4d3c93
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:20 2022 +0900

    mm, hwpoison: enable memory error handling on 1GB hugepage

    Now error handling code is prepared, so remove the blocking code and
    enable memory error handling on 1GB hugepage.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen f6b4b74d69 mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit ceaf8fbea79a854373b9fc03c9fde98eb8712725
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:19 2022 +0900

    mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage

    Currently if memory_failure() (modified to remove blocking code with
    subsequent patch) is called on a page in some 1GB hugepage, memory error
    handling fails and the raw error page gets into a leaked state.  The impact
    is small in production systems (just a single leaked 4kB page), but this
    limits testability because unpoison doesn't work for it.  We can no
    longer create a 1GB hugepage on the 1GB physical address range containing
    such leaked pages, which is not useful when testing on small systems.

    When a hwpoison page in a 1GB hugepage is handled, it's caught by the
    PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
    broken down into raw error pages before coming to this point:

            if (unlikely(PageHWPoison(page)) && !order) {
                    ...
                    return false;
            }

    Then, the page is not sent to buddy and the page refcount is left 0.

    Originally this check is supposed to work when the error page is freed
    from page_handle_poison() (that is called from soft-offline), but now we
    are opening another path to call it, so the callers of
    __page_handle_poison() need to handle the case by considering the return
    value 0 as success.  Then page refcount for hwpoison is properly
    incremented so unpoison works.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 787dc58044 mm, hwpoison: make __page_handle_poison returns int
Bugzilla: https://bugzilla.redhat.com/2160210

commit 7453bf621cfaf01a61f0e9180390ac6abc414894
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:18 2022 +0900

    mm, hwpoison: make __page_handle_poison returns int

    __page_handle_poison() currently returns a bool that shows whether
    take_page_off_buddy() has succeeded.  But we will want to distinguish
    another case, "dissolve has passed but taking off failed", by its return
    value.  So change the type of the return value.  No functional change.
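
    A hedged sketch of the calling convention this enables (negative errno,
    zero, or positive), using made-up names and semantics rather than the
    real __page_handle_poison():

        #include <errno.h>
        #include <stdio.h>

        /* Illustrative tri-state return:
         *   < 0  hard failure (errno-style),
         *     0  dissolve worked but taking the page off buddy did not,
         *   > 0  everything succeeded. */
        static int handle_poison_model(int dissolve_ok, int takeoff_ok)
        {
            if (!dissolve_ok)
                return -EBUSY;
            return takeoff_ok ? 1 : 0;
        }

        int main(void)
        {
            printf("%d %d %d\n", handle_poison_model(0, 0),
                   handle_poison_model(1, 0), handle_poison_model(1, 1));
            return 0;
        }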

    Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 69469d9dc6 mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit ac5fcde0a96a18773f06b7c00c5ea081bbdc64b3
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:16 2022 +0900

    mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage

    The raw error info list needs to be removed when a hwpoisoned hugetlb
    page is unpoisoned.  And the unpoison handler needs to know how many
    errors there are in the target hugepage.  So add them.

    HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage) pages
    sometimes can't be unpoisoned, so skip them.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 16a4b1211c mm, hwpoison, hugetlb: support saving mechanism of raw error pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 161df60e9e89651c9aa3ae0edc9aae3a8a2d21e7
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:15 2022 +0900

    mm, hwpoison, hugetlb: support saving mechanism of raw error pages

    When handling memory error on a hugetlb page, the error handler tries to
    dissolve and turn it into 4kB pages.  If it's successfully dissolved,
    PageHWPoison flag is moved to the raw error page, so that's all right.
    However, dissolve sometimes fails; then the error page is left as a
    hwpoisoned hugepage.  It would be useful if we could retry dissolving it
    to save healthy pages, but that's not possible now because the
    information about where the raw error pages are is lost.

    Use the private field of a few tail pages to keep that information.  The
    code path of shrinking hugepage pool uses this info to try delayed
    dissolve.  In order to remember multiple errors in a hugepage, a
    singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
    constructed.  Only simple operations (adding an entry or clearing all) are
    required and the list is assumed not to be very long, so this simple data
    structure should be enough.

    If we fail to save the raw error info, the hwpoison hugepage has errors
    on an unknown subpage and this new saving mechanism no longer works, so
    disable saving new raw error info and freeing hwpoison hugepages.
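
    A hedged C model of that bookkeeping: a singly-linked list supporting
    only "add one entry" and "clear all", which is all the mechanism above
    needs.  Names and layout here are illustrative, not the kernel's.

        #include <stdio.h>
        #include <stdlib.h>

        /* One recorded raw error page inside a hugepage. */
        struct raw_hwp_page {
            struct raw_hwp_page *next;
            unsigned long pfn;
        };

        static int raw_hwp_add(struct raw_hwp_page **head, unsigned long pfn)
        {
            struct raw_hwp_page *p = malloc(sizeof(*p));

            if (!p)
                return -1;  /* saving failed: caller disables the mechanism */
            p->pfn = pfn;
            p->next = *head;
            *head = p;
            return 0;
        }

        static void raw_hwp_clear_all(struct raw_hwp_page **head)
        {
            while (*head) {
                struct raw_hwp_page *p = *head;

                *head = p->next;
                free(p);
            }
        }

        int main(void)
        {
            struct raw_hwp_page *head = NULL;

            raw_hwp_add(&head, 0x1000);
            raw_hwp_add(&head, 0x1001);
            for (struct raw_hwp_page *p = head; p; p = p->next)
                printf("raw error pfn 0x%lx\n", p->pfn);
            raw_hwp_clear_all(&head);
            return 0;
        }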

    Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 59b7858be4 mm: memory-failure: convert to pr_fmt()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 96f96763de26d6ee333d5b2446d1b04a4e6bc75b
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue Jul 26 16:10:46 2022 +0800

    mm: memory-failure: convert to pr_fmt()

    Use pr_fmt to prefix all pr_<level> output.  However, unpoison_memory()
    and soft_offline_page() are used by error injection and have their own
    prefixes like "Unpoison:" and "soft offline:"; meanwhile,
    soft_offline_page() could also be used by memory hotremove, so reset
    pr_fmt before the unpoison_pr_info definition to keep the original output
    for them.
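
    The pr_fmt() trick is just compile-time prefixing of format strings; a
    rough userspace approximation (printf standing in for printk, and
    ##__VA_ARGS__ being the usual gcc/clang extension):

        #include <stdio.h>

        /* Every pr_info() picks up the prefix because pr_fmt() wraps the
         * format string before the call is expanded. */
        #define pr_fmt(fmt)       "Memory failure: " fmt
        #define pr_info(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)

        int main(void)
        {
            pr_info("%#lx: recovery action: %s\n", 0x12345UL, "Recovered");
            return 0;
        }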

    [wangkefeng.wang@huawei.com: v3]
      Link: https://lkml.kernel.org/r/20220729031919.72331-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20220726081046.10742-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen fdd93778ad mm: factor helpers for memory_failure_dev_pagemap
Bugzilla: https://bugzilla.redhat.com/2160210

commit 00cc790e00369387f6ab80c5724550c2c6340334
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:26 2022 +0800

    mm: factor helpers for memory_failure_dev_pagemap

    The memory_failure_dev_pagemap code is a bit complex before introducing
    the RMAP feature for fsdax.  So factor out some helper functions to
    simplify this code.

    [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build]
    [zhengbin13@huawei.com: fix redefinition of mf_generic_kill_procs]
      Link: https://lkml.kernel.org/r/20220628112143.1170473-1-zhengbin13@huawei.com
    Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 279ba99c6b mm: add zone device coherent type memory support
Bugzilla: https://bugzilla.redhat.com/2160210

commit f25cbb7a95a24ff9a2a3bebd308e303942ae6b2c
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:10 2022 -0500

    mm: add zone device coherent type memory support

    Device memory that is cache coherent from device and CPU point of view.
    This is used on platforms that have an advanced system bus (like CAPI or
    CXL).  Any page of a process can be migrated to such memory.  However, no
    one should be allowed to pin such memory so that it can always be evicted.

    [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page]
    Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 6811b8d5d5 mm/swap: convert delete_from_swap_cache() to take a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 75fa68a5d89871a35246aa2759c95d6dfaf1b582
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:19 2022 +0100

    mm/swap: convert delete_from_swap_cache() to take a folio

    All but one caller already has a folio, so convert it to use a folio.

    Link: https://lkml.kernel.org/r/20220617175020.717127-22-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen c991a31fce mm: Remove __delete_from_page_cache()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6ffcd825e7d0416d78fd41cd5b7856a78122cc8c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jun 28 20:41:40 2022 -0400

    mm: Remove __delete_from_page_cache()

    This wrapper is no longer used.  Remove it and all references to it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen 7b1db0833d mm: don't be stuck to rmap lock on reclaim path
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d4675e601357834dadd2ba1d803f6484596015c
Author: Minchan Kim <minchan@kernel.org>
Date:   Thu May 19 14:08:54 2022 -0700

    mm: don't be stuck to rmap lock on reclaim path

    The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) could be contended
    under memory pressure if processes keep working on their vmas (e.g., fork,
    mmap, munmap).  This makes the reclaim path get stuck.  In our real
    workload traces, we see kswapd waiting on the lock for 300ms+ (worst case,
    a second), and this pushes other processes into direct reclaim, where they
    also get stuck on the lock.

    This patch makes the lru aging path use try_lock mode, like
    shrink_page_list, so the reclaim context will keep working on the next lru
    pages without being stuck.  If it finds the rmap lock contended, it
    rotates the page back to the head of the lru in both the active and
    inactive lrus to keep their behavior consistent, which is a basic starting
    point rather than adding more heuristics.

    Since this patch introduces a new "contended" field as an out-param along
    with the try_lock in-param in rmap_walk_control, the structure is no
    longer immutable if try_lock is set, so remove the const keywords on
    rmap-related functions.  Since rmap walking is already an expensive
    operation, I doubt the const would provide a sizable benefit (and we
    didn't have it until 5.17).
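
    The try-lock-with-contended-flag shape can be modeled with pthreads (a
    hedged sketch of the control flow only, not the rmap walk itself; link
    with -lpthread):

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct walk_control {
            bool try_lock;   /* in: do not sleep on the lock */
            bool contended;  /* out: lock was busy, rotate the page instead */
        };

        static pthread_mutex_t rmap_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Returns true if the lock was taken; in try_lock mode a busy lock
         * is reported via ->contended instead of blocking reclaim. */
        static bool rmap_walk_trylock(struct walk_control *wc)
        {
            if (wc->try_lock) {
                if (pthread_mutex_trylock(&rmap_lock) != 0) {
                    wc->contended = true;
                    return false;
                }
                return true;
            }
            pthread_mutex_lock(&rmap_lock);
            return true;
        }

        int main(void)
        {
            struct walk_control wc = { .try_lock = true, .contended = false };

            if (rmap_walk_trylock(&wc)) {
                /* ... walk the reverse mappings ... */
                pthread_mutex_unlock(&rmap_lock);
            } else if (wc.contended) {
                printf("lock contended, rotate page back to LRU head\n");
            }
            return 0;
        }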

    In a heavy app workload on Android, the trace shows the following
    statistics.  It almost removes rmap lock contention from the reclaim path.

    Martin Liu reported:

    Before:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
             1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
              601            0             601   145.544681        28817    198  rmap_walk_file

    After:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
              NaN          NaN              NaN          NaN          NaN    0.0             NaN
                0            0                0     0.127645            1     12  rmap_walk_file

    [minchan@kernel.org: add comment, per Matthew]
      Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
    Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: John Dias <joaodias@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Martin Liu <liumartin@google.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
Chris von Recklinghausen 91c4a45404 mm/memory-failure.c: simplify num_poisoned_pages_inc/dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit e240ac52f7da5986f9dcbe29d423b7b2f141b41b
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:10 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_inc/dec

    Originally, num_poisoned_pages_inc() is done in the memory failure
    routine, and num_poisoned_pages_dec() is used to roll back the number if
    the event is filtered or cancelled.

    As suggested by Naoya, do num_poisoned_pages_inc() only in
    action_result(); this makes things clear and simple.

    Link: https://lkml.kernel.org/r/20220509105641.491313-6-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 7e2e3250a4 mm/memory-failure.c: add hwpoison_filter for soft offline
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9113eaf331bf44579882c001867773cf1b3364fd
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:10 2022 -0700

    mm/memory-failure.c: add hwpoison_filter for soft offline

    hwpoison_filter is missing in the soft offline path, which leads to an
    issue: after enabling the corrupt filter, the user process still has a
    chance to inject a hwpoison fault via madvise(addr, len,
    MADV_SOFT_OFFLINE) at a PFN which is expected to be rejected.

    Also make a minor change to a comment of memory_failure().
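
    For reference, the injection path mentioned above is a plain madvise()
    call; a minimal sketch is below.  MADV_SOFT_OFFLINE needs
    CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN, and the constant may have to be
    defined by hand on older C libraries.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef MADV_SOFT_OFFLINE
        #define MADV_SOFT_OFFLINE 101   /* from <asm-generic/mman-common.h> */
        #endif

        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);
            void *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                return 1;
            *(volatile char *)buf = 1;  /* make sure the page is populated */

            /* Ask the kernel to soft-offline the backing page; with this
             * patch, hwpoison_filter can veto the request as well. */
            if (madvise(buf, page, MADV_SOFT_OFFLINE) != 0)
                perror("madvise(MADV_SOFT_OFFLINE)");
            return 0;
        }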

    Link: https://lkml.kernel.org/r/20220509105641.491313-4-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 32443a7ae5 mm/memory-failure.c: simplify num_poisoned_pages_dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit c8bd84f73fd6215d5b8d0b3cfc914a3671b16d1c
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_dec

    Don't decrease the number of poisoned pages in page_alloc.c; let
    memory-failure.c alone do the inc/dec of poisoned pages.

    Also simplify unpoison_memory(): only decrease the number of
    poisoned pages when:
     - TestClearPageHWPoison() succeeds
     - put_page_back_buddy() succeeds

    After decreasing, print the necessary log.

    Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

    Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 65b3725ab2 mm/memory-failure.c: move clear_hwpoisoned_pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 60f272f6b09a8f14156df88cccd21447ab394452
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: move clear_hwpoisoned_pages

    Patch series "memory-failure: fix hwpoison_filter", v2.

    As is well known, the memory failure mechanism handles memory corruption
    events and tries to send SIGBUS to the user process which uses the
    corrupted page.

    In the virtualization case, QEMU catches the SIGBUS and tries to inject
    an MCE into the guest, and the guest handles the memory failure again.
    Thus the guest sees only a minimal effect from hardware memory corruption.

    The further step I'm working on:

    1, try to modify code to decrease poisoned pages in a single place
       (mm/memory-failure.c: simplify num_poisoned_pages_dec in this series).

    2, try to use page_handle_poison() to handle SetPageHWPoison() and
       num_poisoned_pages_inc() together.  It would be best to call
       num_poisoned_pages_inc() in a single place too.

    3, introduce memory failure notifier list in memory-failure.c: notify
       the corrupted PFN to someone who registers this list.  If I can
       complete parts [1] and [2], [3] will be quite easy (just call the
       notifier list after increasing the poisoned page count).

    4, introduce memory recover VQ for memory balloon device, and registers
       memory failure notifier list.  During the guest kernel handles memory
       failure, balloon device gets notified by memory failure notifier list,
       and tells the host to recover the corrupted PFN(GPA) by the new VQ.

    5, host side remaps the corrupted page(HVA), and tells the guest side
       to unpoison the PFN(GPA).  Then the guest fixes the corrupted page(GPA)
       dynamically.

    This patch (of 5):

    clear_hwpoisoned_pages() clears the HWPoison flag and decreases the number
    of poisoned pages; this actually works as part of memory failure handling.

    Move this function from sparse.c to memory-failure.c, so that finally
    there is no CONFIG_MEMORY_FAILURE in sparse.c.

    Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
    Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen f942ace7a2 mm: create new mm/swap.h header file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 014bb1de4fc17d54907d54418126a9a9736f4aff
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:47 2022 -0700

    mm: create new mm/swap.h header file

    Patch series "MM changes to improve swap-over-NFS support".

    Assorted improvements for swap-via-filesystem.

    This is a resend of these patches, rebased on current HEAD.  The only
    substantial changes is that swap_dirty_folio has replaced
    swap_set_page_dirty.

    Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
    has previously worked for NFS but that broke a few releases back.  This
    series changes to use a new ->swap_rw rather than ->readpage and
    ->direct_IO.  It also makes other improvements.

    There is a companion series already in linux-next which fixes various
    issues with NFS.  Once both series land, a final patch is needed which
    changes NFS over to use ->swap_rw.

    This patch (of 10):

    Many functions declared in include/linux/swap.h are only used within mm/.

    Create a new "mm/swap.h" and move some of these declarations there.
    Remove the redundant 'extern' from the function declarations.

    [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
    Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
    Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen f4ca3e9bff mm, hugetlb, hwpoison: separate branch for free and in-use hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit b283d983a7a6ffe3939ff26f06d151331a7c1071
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm, hugetlb, hwpoison: separate branch for free and in-use hugepage

    We know that HPageFreed pages should have page refcount 0, so
    get_page_unless_zero() always fails and returns 0.  So explicitly separate
    the branch based on page state for minor optimization and better
    readability.
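
    The get_page_unless_zero() behaviour relied on here, take a reference
    only if the count is already non-zero, is the classic compare-and-swap
    loop; a hedged C11 model:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        static atomic_int refcount;     /* 0 models a freed (HPageFreed) page */

        /* Increment only if the current count is non-zero; returns false for
         * a freed page, which is why that branch can skip the call entirely. */
        static bool get_page_unless_zero_model(void)
        {
            int old = atomic_load(&refcount);

            while (old != 0) {
                if (atomic_compare_exchange_weak(&refcount, &old, old + 1))
                    return true;
                /* old has been reloaded by the failed CAS; retry */
            }
            return false;
        }

        int main(void)
        {
            printf("free page: %d\n", get_page_unless_zero_model());
            atomic_store(&refcount, 1);
            printf("in-use page: %d\n", get_page_unless_zero_model());
            return 0;
        }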

    Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 569cbb051f mm/memory-failure.c: dissolve truncated hugetlb page
Bugzilla: https://bugzilla.redhat.com/2160210

commit ef526b17bc3399b8df25d574aa11fc36f89da80a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/memory-failure.c: dissolve truncated hugetlb page

    If me_huge_page() meets a truncated but not yet freed hugepage, it won't
    be dissolved even if we hold the last refcnt.  This is because the
    hugepage has a NULL page_mapping while not being an anonymous hugepage
    either.  Thus we lose the last chance to dissolve it into the buddy
    allocator to save healthy subpages.  Remove the PageAnon check to handle
    these hugepages too.

    Link: https://lkml.kernel.org/r/20220414114941.11223-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 7e822751d3 mm/memory-failure.c: minor cleanup for HWPoisonHandlable
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f871370686ddf3c72207321eef8f6672ae957e4
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/memory-failure.c: minor cleanup for HWPoisonHandlable

    Patch series "A few fixup and cleanup patches for memory failure", v2.

    This series contains a patch to clean up the HWPoisonHandlable and another
    one to dissolve truncated hugetlb page.  More details can be found in the
    respective changelogs.

    This patch (of 2):

    The local variable movable can be removed by returning true directly. Also
    fix typo 'mirgate'. No functional change intended.

    Link: https://lkml.kernel.org/r/20220414114941.11223-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220414114941.11223-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen f9a62953cd mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED
Bugzilla: https://bugzilla.redhat.com/2160210

commit f361e2462e8cccdd9231aa3274690705a2ea35a2
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED

    In the already-hwpoisoned case, memory_failure() is supposed to release
    the page refcount taken for error handling before returning.  But
    currently the refcount is not released when called with
    MF_COUNT_INCREASED, which makes the page refcount inconsistent.  This
    should be rare and non-critical, but it might be inconvenient in testing
    (unpoison doesn't work).

    Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 50055f3e44 mm/memory-failure.c: remove unnecessary (void*) conversions
Bugzilla: https://bugzilla.redhat.com/2160210

commit f142e70750a1ea36ba60fb4f24bc37713e921f73
Author: liqiong <liqiong@nfschina.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm/memory-failure.c: remove unnecessary (void*) conversions

    No need to cast (void *) to (struct hwp_walk *).

    Link: https://lkml.kernel.org/r/20220322142826.25939-1-liqiong@nfschina.com
    Signed-off-by: liqiong <liqiong@nfschina.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Nico Pache f60056bb70 mm,hwpoison: check mm when killing accessing process
commit 77677cdbc2aa4b5d5d839562793d3d126201d18d
Author: Shuai Xue <xueshuai@linux.alibaba.com>
Date:   Wed Sep 14 14:49:35 2022 +0800

    mm,hwpoison: check mm when killing accessing process

    The GHES code calls memory_failure_queue() from IRQ context to queue work
    into workqueue and schedule it on the current CPU.  Then the work is
    processed in memory_failure_work_func() by kworker and calls
    memory_failure().

    When a page is already poisoned, commit a3f5d80ea4 ("mm,hwpoison: send
    SIGBUS with error virutal address") makes memory_failure() call
    kill_accessing_process() that:

        - holds mmap locking of current->mm
        - does pagetable walk to find the error virtual address
        - and sends SIGBUS to the current process with error info.

    However, the mm of kworker is not valid, resulting in a null-pointer
    dereference.  So check mm when killing the accessing process.

    [akpm@linux-foundation.org: remove unrelated whitespace alteration]
    Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
    Fixes: a3f5d80ea4 ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
Nico Pache aa28eb0c17 mm/migration: return errno when isolate_huge_page failed
commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:15 2022 +0800

    mm/migration: return errno when isolate_huge_page failed

    We might fail to isolate a huge page because, e.g., the page is under
    migration, which cleared HPageMigratable.  We should return an errno in
    this case rather than always returning 1, which could confuse the user,
    i.e. the caller might think all of the memory was migrated while the
    hugetlb page was left behind.  We make the prototype of isolate_huge_page
    consistent with isolate_lru_page as suggested by Huang Ying and rename
    isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve
    readability.

    Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
    Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Huang Ying <ying.huang@intel.com>
    Reported-by: kernel test robot <lkp@intel.com> (build error)
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Chris von Recklinghausen da6b17e5e9 mm, hwpoison: set PG_hwpoison for busy hugetlb pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 38f6d29397ccb9c191c4c91103e8123f518fdc10
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:17 2022 +0900

    mm, hwpoison: set PG_hwpoison for busy hugetlb pages

    If memory_failure() fails to grab the page refcount on a hugetlb page
    because it's busy, it returns without setting PG_hwpoison on it.  This
    not only loses a chance of error containment, but breaks the rule that
    action_result() should be called only when memory_failure() does any
    handling work (even if that's just setting PG_hwpoison).  This
    inconsistency could harm code maintainability.

    So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev
    Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 18b123b391 mm/memory-failure: disable unpoison once hw error happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit 67f22ba7750f940bcd7e1b12720896c505c2d63f
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Wed Jun 15 17:32:09 2022 +0800

    mm/memory-failure: disable unpoison once hw error happens

    Currently unpoison_memory(unsigned long pfn) is designed for soft
    poison (hwpoison-inject) only.  Since 17fae1294a, the KPTE gets cleared
    on an x86 platform once hardware memory corruption occurs.

    Unpoisoning a hardware-corrupted page only puts the page back into the
    buddy allocator, so the kernel has a chance to access the page through a
    *NOT PRESENT* KPTE.  This leads to a BUG when accessing via the corrupted
    KPTE.

    As suggested by David & Naoya, disable the unpoison mechanism when a real
    HW error happens, to avoid a BUG like this:

     Unpoison: Software-unpoisoned page 0x61234
     BUG: unable to handle page fault for address: ffff888061234000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
     Oops: 0002 [#1] PREEMPT SMP NOPTI
     CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0.bm.1-amd64 #7
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
     RIP: 0010:clear_page_erms+0x7/0x10
     Code: ...
     RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
     RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
     RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
     RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
     R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
     R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
     FS:  00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     PKRU: 55555554
     Call Trace:
      <TASK>
      prep_new_page+0x151/0x170
      get_page_from_freelist+0xca0/0xe20
      ? sysvec_apic_timer_interrupt+0xab/0xc0
      ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
      __alloc_pages+0x17e/0x340
      __folio_alloc+0x17/0x40
      vma_alloc_folio+0x84/0x280
      __handle_mm_fault+0x8d4/0xeb0
      handle_mm_fault+0xd5/0x2a0
      do_user_addr_fault+0x1d0/0x680
      ? kvm_read_and_reset_apf_flags+0x3b/0x50
      exc_page_fault+0x78/0x170
      asm_exc_page_fault+0x27/0x30

    Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
    Fixes: 847ce401df ("HWPOISON: Add unpoisoning support")
    Fixes: 17fae1294a ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>    [5.8+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 848ae57ee1 Revert "mm/memory-failure.c: fix race with changing page compound again"
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    Revert "mm/memory-failure.c: fix race with changing page compound again"

    Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
    page compound again") because now we fetch the page refcount under
    hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
    longer necessary.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen 0e870434e4 Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
Bugzilla: https://bugzilla.redhat.com/2120352

commit b4e61fc031b11dd807dffc46cebbf0e25966d3d1
Author: Xu Yu <xuyu@linux.alibaba.com>
Date:   Thu Apr 28 23:14:43 2022 -0700

    Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"

    Patch series "mm/memory-failure: rework fix on huge_zero_page splitting".

    This patch (of 2):

    This reverts commit d173d5417fb67411e623d394aab986d847e47dad.

    The commit d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in
    memory_failure()") explicitly skips huge_zero_page in memory_failure(), in
    order to avoid triggering VM_BUG_ON_PAGE on huge_zero_page in
    split_huge_page_to_list().

    This works, but Yang Shi thinks that,

        Raising a BUG is overkill for splitting huge_zero_page. The
        huge_zero_page can't be met from normal paths other than memory
        failure, but memory failure is a valid caller. So I tend to replace
        the BUG to WARN + returning -EBUSY. If we don't care about the
        reason code in memory failure, we don't have to touch memory
        failure.

    And for the issue that huge_zero_page will have PG_has_hwpoisoned set,
    Yang Shi comments that,

        The anonymous page fault doesn't check if the page is poisoned or
        not since it typically gets a fresh allocated page and assumes the
        poisoned page (isolated successfully) can't be reallocated again.
        But huge zero page and base zero page are reused every time. So no
        matter what fix we pick, the issue is always there.

    Finally, Yang, David, Anshuman and Naoya all agree to fix the bug, i.e.,
    to split huge_zero_page, in split_huge_page_to_list().
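
    A minimal sketch of the proposal quoted above (not the exact follow-up
    diff): warn and return -EBUSY from split_huge_page_to_list() instead of
    hitting the BUG.

        if (is_huge_zero_page(head)) {
                VM_WARN_ON_ONCE(1);     /* warn instead of BUG */
                return -EBUSY;          /* let the caller cope with the failure */
        }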

    This reverts the commit d173d5417fb6 ("mm/memory-failure.c: skip
    huge_zero_page in memory_failure()"), and the original bug will be fixed
    by the next patch.

    Link: https://lkml.kernel.org/r/872cefb182ba1dd686b0e7db1e6b2ebe5a4fff87.1651039624.git.xuyu@linux.alibaba.com
    Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
    Fixes: 6a46079cf5 ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
    Suggested-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen 8ca43f4c30 mm/memory-failure.c: skip huge_zero_page in memory_failure()
Bugzilla: https://bugzilla.redhat.com/2120352

commit d173d5417fb67411e623d394aab986d847e47dad
Author: Xu Yu <xuyu@linux.alibaba.com>
Date:   Thu Apr 21 16:35:37 2022 -0700

    mm/memory-failure.c: skip huge_zero_page in memory_failure()

    The kernel panics when injecting memory_failure for the global
    huge_zero_page with CONFIG_DEBUG_VM enabled, as follows.

      Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
      page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
      head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
      flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
      raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
      page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
      ------------[ cut here ]------------
      kernel BUG at mm/huge_memory.c:2499!
      invalid opcode: 0000 [#1] PREEMPT SMP PTI
      CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
      Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      RIP: 0010:split_huge_page_to_list+0x66a/0x880
      Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
      RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
      RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
      RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
      R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
      R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
      FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       try_to_split_thp_page+0x3a/0x130
       memory_failure+0x128/0x800
       madvise_inject_error.cold+0x8b/0xa1
       __x64_sys_madvise+0x54/0x60
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7fc3754f8bf9
      Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
      RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
      RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
      RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
      R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

    This makes memory_failure() bail out explicitly on huge_zero_page before
    splitting, so the panic above won't happen again.
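
    A rough sketch of such an early bail-out (the exact placement and return
    value in the patch may differ):

        if (is_huge_zero_page(hpage))
                return -EBUSY;  /* the global huge zero page cannot be split */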

    Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
    Fixes: 6a46079cf5 ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
    Reported-by: Abaci <abaci@linux.alibaba.com>
    Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen a805faea7e mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 405ce051236cc65b30bbfe490b28ce60ae6aed85
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 21 16:35:33 2022 -0700

    mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

    There is a race condition between memory_failure_hugetlb() and hugetlb
    free/demotion, which can set the PageHWPoison flag on the wrong page.
    One mild consequence is that the wrong processes can be killed, but a
    more serious one is that the actual error is left unhandled, so nothing
    prevents later access to it, which might lead to worse outcomes such as
    consuming corrupted data.

    Think about the below race window:

      CPU 1                                   CPU 2
      memory_failure_hugetlb
      struct page *head = compound_head(p);
                                              hugetlb page might be freed to
                                              buddy, or even changed to another
                                              compound page.

      get_hwpoison_page -- page is not what we want now...

    The current code first does rough prechecks and then reconfirms after
    taking the refcount, but this turned out to make the code overly
    complicated, so move the prechecks into a single hugetlb_lock range.

    A newly introduced function, try_memory_failure_hugetlb(), always takes
    hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
    memory_failure() should be rare in principle, so it should not be a big
    problem.
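
    A conceptual sketch of the single-lock precheck (the real
    try_memory_failure_hugetlb() is more involved):

        spin_lock_irq(&hugetlb_lock);
        if (PageHuge(p) && head == compound_head(p)) {
                /* still the hugetlb page we expected: pin it while locked */
                if (!get_page_unless_zero(head))
                        head = NULL;    /* already freed, nothing to pin */
        }
        spin_unlock_irq(&hugetlb_lock);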

    Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
    Fixes: 761ad8d7c7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen b2c2b8a92b mm/memory-failure.c: make non-LRU movable pages unhandlable
Bugzilla: https://bugzilla.redhat.com/2120352

commit bf6445bc8f778590ac754b06a8fe82ce5a9f818a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:50 2022 -0700

    mm/memory-failure.c: make non-LRU movable pages unhandlable

    We cannot really handle non-LRU movable pages in memory failure;
    typically they are balloon, zsmalloc, etc.

    If we run into a base (4K) non-LRU movable page, we could reach as far
    as identify_page_state(), where it should not fall into any category
    except me_unknown.

    Non-LRU compound movable pages could be mistaken for transhuge pages,
    but it is unexpected to split non-LRU movable pages with
    split_huge_page_to_list() in memory_failure().  So simply make non-LRU
    movable pages unhandlable to avoid these possible nasty cases.
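
    A conceptual guard (the exact check and its location in the patch may
    differ):

        /* Non-LRU movable pages (balloon, zsmalloc, ...) are not handlable. */
        if (__PageMovable(p) && !PageLRU(p))
                return -EBUSY;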

    Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 210c0d8f7c mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 593396b86ef6f79c71e09c183eae28040ccfeedf
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:47 2022 -0700

    mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages

    Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
    into its one caller"), invalidate_inode_page() can invalidate pages in
    the swap cache because the check of page->mapping != mapping has been
    removed.  But invalidate_inode_page() is not expected to deal with pages
    in the swap cache.  Also, non-LRU movable pages can reach here too; they
    are not page cache pages either.  Skip these pages by checking
    PageSwapCache and PageLRU.
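
    A sketch of the guard described above (placement is approximate):

        if (PageSwapCache(p) || !PageLRU(p))
                return 0;               /* not a plain page cache page */
        invalidate_inode_page(p);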

    Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen fba8841df7 mm/memory-failure.c: fix race with changing page compound again
Bugzilla: https://bugzilla.redhat.com/2120352

commit 888af2701db79b9b27c7e37f9ede528a5ca53b76
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:44 2022 -0700

    mm/memory-failure.c: fix race with changing page compound again

    Patch series "A few fixup patches for memory failure", v2.

    This series contains a few patches to fix the race with a page changing
    its compound state, make non-LRU movable pages unhandlable, and so on.
    More details can be found in the respective changelogs.

    There is a race window: after we get the compound_head, the hugetlb page
    could be freed to buddy, or even changed to another compound page, just
    before we try to get the hwpoison page.  Think about the race window
    below:

      CPU 1                                   CPU 2
      memory_failure_hugetlb
      struct page *head = compound_head(p);
                                              hugetlb page might be freed to
                                              buddy, or even changed to another
                                              compound page.

      get_hwpoison_page -- page is not what we want now...

    If this race happens, just bail out.  Also MF_MSG_DIFFERENT_PAGE_SIZE is
    introduced to record this event.
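
    A conceptual re-check (the actual patch wires this into the hugetlb
    handling path):

        if (compound_head(p) != head) {
                /* the page changed under us: record it and bail out */
                action_result(pfn, MF_MSG_DIFFERENT_PAGE_SIZE, MF_IGNORED);
                return -EBUSY;
        }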

    [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

    Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen e624cb473d mm/hwpoison: add in-use hugepage hwpoison filter judgement
Bugzilla: https://bugzilla.redhat.com/2120352

commit a06ad3c0c75297f0b0999b1a981e50224e690ee9
Author: luofei <luofei@unicloud.com>
Date:   Tue Mar 22 14:44:41 2022 -0700

    mm/hwpoison: add in-use hugepage hwpoison filter judgement

    After successfully obtaining a reference on the huge page, it is still
    necessary to call hwpoison_filter() to make a filter judgement;
    otherwise a hugepage that should have been filtered out will be unmapped
    and the related process may be killed.
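
    A sketch of the added judgement (names and the return value are
    approximate):

        if (hwpoison_filter(p)) {
                put_page(p);            /* drop the reference just taken */
                return -EOPNOTSUPP;     /* filtered: take no further action */
        }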

    Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com
    Signed-off-by: luofei <luofei@unicloud.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 281e153b89 mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
Bugzilla: https://bugzilla.redhat.com/2120352

commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c
Author: luofei <luofei@unicloud.com>
Date:   Tue Mar 22 14:44:38 2022 -0700

    mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing by the MCE handler,
    but should return a distinct value; otherwise the MCE handler assumes
    the error page has been identified and isolated, which may lead to
    calling set_mce_nospec() to change page attributes, etc.

    Here memory_failure() returns -EOPNOTSUPP to indicate that the error
    event was filtered; the MCE handler should not take any action in this
    situation, and the hwpoison injector should treat it as handled
    correctly.
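
    A caller-side sketch of the intended contract (not the actual MCE
    handler diff):

        ret = memory_failure(pfn, MF_ACTION_REQUIRED);
        if (ret == -EOPNOTSUPP)
                return;         /* event filtered: change no page attributes */
        /* otherwise the page was really handled (or handling genuinely failed) */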

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei <luofei@unicloud.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen b59a957edd mm/memory-failure.c: remove unnecessary PageTransTail check
Bugzilla: https://bugzilla.redhat.com/2120352

commit b04d3eebebf8372f83924db6c1e4fbdcab7cafc2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:33 2022 -0700

    mm/memory-failure.c: remove unnecessary PageTransTail check

    When we reach here, we're guaranteed to have a non-compound page, as the
    THP has already been split.  Remove this unnecessary PageTransTail check.

    Link: https://lkml.kernel.org/r/20220218090118.1105-9-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen caeba3ae14 mm/memory-failure.c: remove obsolete comment in __soft_offline_page
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2ab916790ff0bbaac557dc1238f08237dd7799cc
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:30 2022 -0700

    mm/memory-failure.c: remove obsolete comment in __soft_offline_page

    Since commit add05cecef ("mm: soft-offline: don't free target page in
    successful page migration"), the set_migratetype_isolate() logic has been
    removed.  Remove this obsolete comment.

    Link: https://lkml.kernel.org/r/20220218090118.1105-8-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen b699715562 mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()

Conflicts: mm/memory-failure.c - We already have
	869f7ee6f647 ("mm/rmap: Convert try_to_unmap() to take a folio")
	so keep calling try_to_unmap with a folio

Bugzilla: https://bugzilla.redhat.com/2120352

commit 357670f79efb7e520461d18bb093342605c7cbed
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:27 2022 -0700

    mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()

    Only for hugetlb pages in shared mappings should try_to_unmap() take the
    semaphore in write mode here.  Rework the code to make this clear.
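
    A rough sketch of the reworked shape (it mirrors the description above,
    not the exact diff; the backport keeps the folio-based try_to_unmap()):

        if (PageHuge(hpage) && !PageAnon(hpage)) {
                /* hugetlb page in a shared mapping: exclude writers */
                i_mmap_lock_write(mapping);
                try_to_unmap(folio, ttu | TTU_RMAP_LOCKED);
                i_mmap_unlock_write(mapping);
        } else {
                try_to_unmap(folio, ttu);
        }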

    Link: https://lkml.kernel.org/r/20220218090118.1105-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 8d8db99aae mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev
Bugzilla: https://bugzilla.redhat.com/2120352

commit 67ff51c6a6d2ef99cf35a937e59269dc9a0c7fc2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:24 2022 -0700

    mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev

    Since commit 03e5ac2fc3 ("mm: fix crash when using XFS on loopback"),
    page_mapping() can handle slab pages.  So remove the unnecessary
    PageSlab check and the obsolete comment.

    Link: https://lkml.kernel.org/r/20220218090118.1105-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen d7ee05e98d mm/memory-failure.c: fix race with changing page more robustly
Bugzilla: https://bugzilla.redhat.com/2120352

commit 75ee64b3c9a9695726056e9ec527e11dbf286500
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:21 2022 -0700

    mm/memory-failure.c: fix race with changing page more robustly

    We only intend to deal with non-compound pages after we split the THP in
    memory_failure().  However, the page could have become a compound page
    again due to a race window.  If this happens, retry once in the hope of
    handling the page in the next round.  Also remove the unneeded
    orig_head; it's always equal to hpage, so we can use hpage directly and
    drop the redundant variable.
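
    An illustrative shape of the retry (not the exact control flow in the
    patch):

        if (PageCompound(p)) {
                if (retry) {
                        retry = false;
                        goto try_again; /* label earlier in memory_failure() */
                }
                return -EBUSY;          /* still racing: give up */
        }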

    Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 798a62dcd4 mm/memory-failure.c: rework the signaling logic in kill_proc
Bugzilla: https://bugzilla.redhat.com/2120352

commit 49775047cf52a92e41444d41a0584180ec2c256b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:18 2022 -0700

    mm/memory-failure.c: rework the signaling logic in kill_proc

    The BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and
    the target is current.  Rework the code to make this clear.
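
    A sketch of the signalling decision (the variables addr, lsb and t stand
    for kill_proc()'s context and are approximate):

        if ((flags & MF_ACTION_REQUIRED) && (t == current))
                ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr, lsb);
        else
                ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr, lsb, t);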

    Link: https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen e6b62fd6dc mm/memory-failure.c: catch unexpected -EFAULT from vma_address()
Bugzilla: https://bugzilla.redhat.com/2120352

commit a994402bc4714cefea5770b2d906cef5b0f4dc5c
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:15 2022 -0700

    mm/memory-failure.c: catch unexpected -EFAULT from vma_address()

    It's unexpected to walk the page table when vma_address() returns
    -EFAULT.  But dev_pagemap_mapping_shift() is called only when the vma
    associated with the error page has already been found in
    collect_procs_{file,anon}, so vma_address() should not return -EFAULT
    except due to some bug, as Naoya pointed out.  We can use VM_BUG_ON_VMA()
    to catch such a bug here.
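
    A sketch of the check:

        unsigned long address = vma_address(page, vma);

        /* the vma was found via collect_procs_*, so -EFAULT means a bug */
        VM_BUG_ON_VMA(address == -EFAULT, vma);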

    Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen a0431d1544 mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap
Bugzilla: https://bugzilla.redhat.com/2120352

commit 577553f4897181dc8960351511c921018892e818
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:12 2022 -0700

    mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap

    Patch series "A few cleanup and fixup patches for memory failure", v3.

    This series contains a few patches to simplify the code logic, remove an
    unneeded variable and remove an obsolete comment.  It also handles the
    race with a changing page more robustly in memory_failure().  More
    details can be found in the respective changelogs.

    This patch (of 8):

    The flags always have MF_ACTION_REQUIRED and MF_MUST_KILL set, so we do
    not need to check these flags again.

    Link: https://lkml.kernel.org/r/20220218090118.1105-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220218090118.1105-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen b9d2f13533 mm/memory-failure.c: remove obsolete comment
Bugzilla: https://bugzilla.redhat.com/2120352

commit ae483c20062695324202d19e5283819b11b83eaa
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Tue Mar 22 14:44:03 2022 -0700

    mm/memory-failure.c: remove obsolete comment

    With the introduction of mf_mutex, most of the memory error handling
    process is mutually exclusive, so the in-line comment about the subtlety
    of double-checking PageHWPoison is no longer correct.  Remove it.

    Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen e5602006ba memory-failure: fetch compound_head after pgmap_pfn_valid()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 61e28cf0543c7d8e6ef88c3c305f727c5a21ba5b
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Sat Jan 29 13:41:01 2022 -0800

    memory-failure: fetch compound_head after pgmap_pfn_valid()

    memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
    dax_lock_page()).  For devmap with compound pages, fetch the
    compound_head in case a tail-page memory failure is being handled.

    Currently this is a nop, but with the advent of compound pages in
    dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

    Without this fix, memory-failure handling (i.e. MCEs on pmem) with
    device-dax configured namespaces will regress (and crash).
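
    A sketch of the idea (surrounding checks omitted):

        struct page *page = pfn_to_page(pfn);

        if (!pgmap_pfn_valid(pgmap, pfn))
                goto out;       /* pfn not covered by this pgmap */

        /* a tail-page error is handled via its compound head */
        page = compound_head(page);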

    Link: https://lkml.kernel.org/r/20211202204422.26777-2-joao.m.martins@oracle.com
    Reported-by: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:43 -04:00
Chris von Recklinghausen 8bdd7409d0 mm: fix some comment errors
Conflicts: mm/swap.c - We already have
	ff042f4a9b05 ("mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu")
	which updated the comment. Keep the changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 0b8f0d870020dbd7037bfacbb73a9b3213470f90
Author: Quanfa Fu <fuqf0919@gmail.com>
Date:   Fri Jan 14 14:09:25 2022 -0800

    mm: fix some comment errors

    Link: https://lkml.kernel.org/r/20211101040208.460810-1-fuqf0919@gmail.com
    Signed-off-by: Quanfa Fu <fuqf0919@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:41 -04:00
Chris von Recklinghausen 5d4d6bace0 mm: shmem: don't truncate page if memory failure happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit a7605426666196c5a460dd3de6f8dac1d3c21f00
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Jan 14 14:05:19 2022 -0800

    mm: shmem: don't truncate page if memory failure happens

    The current behavior of memory failure is to truncate the page cache
    regardless of whether the page is dirty or clean.  If the page is dirty,
    later accesses will get obsolete data from disk without any notification
    to the users, which may cause silent data loss.  It is even worse for
    shmem: since shmem is an in-memory filesystem, truncating the page cache
    means discarding data blocks, and a later read would return all zeroes.

    The right approach is to keep the corrupted page in the page cache; any
    later access returns an error for syscalls or SIGBUS for page faults,
    until the file is truncated, hole punched or removed.  Regular
    storage-backed filesystems would be more complicated, so this patch
    focuses on shmem.  This also unblocks support for soft offlining shmem
    THP.
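
    A minimal sketch of the reader-side effect (illustrative, not the shmem
    diff itself):

        page = find_lock_page(mapping, index);
        if (page && PageHWPoison(page)) {
                unlock_page(page);
                put_page(page);
                return -EIO;    /* poisoned data must not be read back */
        }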

    [akpm@linux-foundation.org: coding style fixes]
    [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
      Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
    [Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
     slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
     and Muchun Song <songmuchun@bytedance.com> proposed and reworked the
     error handling of shmem_write_begin() suggested by Linus]
      Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/

    Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
    Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ajay Garg <ajaygargnsit@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Andy Lavr <andy.lavr@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 5b74ce1a45 mm/memory_failure: constify static mm_walk_ops
Bugzilla: https://bugzilla.redhat.com/2120352

commit ba9eb3cef9e699e259f9ceefdbcd3ee83d3529e2
Author: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Date:   Fri Nov 5 13:41:01 2021 -0700

    mm/memory_failure: constify static mm_walk_ops

    The only usage of hwp_walk_ops is to pass its address to
    walk_page_range() which takes a pointer to const mm_walk_ops as
    argument.

    Make it const to allow the compiler to put it in read-only memory.
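
    A sketch of the change (callback names as used in mm/memory-failure.c):

        static const struct mm_walk_ops hwp_walk_ops = {
                .pmd_entry     = hwpoison_pte_range,
                .hugetlb_entry = hwpoison_hugetlb_range,
        };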

    Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com
    Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Aristeu Rozanski 6ed3b2ca9f mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 1825b93b626e99eb9a0f9f50342c7b2fa201b387
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:14:44 2022 -0700

    mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()

    The following VM_BUG_ON_FOLIO() is triggered when a memory error event
    happens on (THP/folio) pages which are about to be freed:

      [ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
      [ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
      [ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
      [ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000
      [ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      [ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
      [ 1160.251815] ------------[ cut here ]------------
      [ 1160.253438] kernel BUG at include/linux/mm.h:788!
      [ 1160.256162] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G            E     5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ #3
      [ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
      [ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd
      [ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00
      [ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292
      [ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000
      [ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff
      [ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff
      [ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000
      [ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000
      [ 1160.286641] FS:  00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000
      [ 1160.289498] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0
      [ 1160.293031] Call Trace:
      [ 1160.293724]  <TASK>
      [ 1160.294334]  get_hwpoison_page+0x47d/0x570
      [ 1160.295474]  memory_failure+0x106/0xaa0
      [ 1160.296474]  ? security_capable+0x36/0x50
      [ 1160.297524]  hard_offline_page_store+0x43/0x80
      [ 1160.298684]  kernfs_fop_write_iter+0x11c/0x1b0
      [ 1160.299829]  new_sync_write+0xf9/0x160
      [ 1160.300810]  vfs_write+0x209/0x290
      [ 1160.301835]  ksys_write+0x4f/0xc0
      [ 1160.302718]  do_syscall_64+0x3b/0x90
      [ 1160.303664]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 1160.304981] RIP: 0033:0x7eff54b018b7

    As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is
    called from dump_page("hwpoison: unhandlable page") in get_any_page().
    The below explains the mechanism of the race:

      CPU 0                                       CPU 1

        memory_failure
          get_hwpoison_page
            get_any_page
              dump_page
                compound = PageCompound
                                                    free_pages_prepare
                                                      page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP
                folio_entire_mapcount
                  VM_BUG_ON_FOLIO(!folio_test_large(folio))

    So replace dump_page() with the safer pr_err().
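
    A sketch of the replacement: report only the pfn instead of dumping page
    state that may be freed concurrently.

        pr_err("%#lx: unhandlable page.\n", pfn);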

    Link: https://lkml.kernel.org/r/20220427053220.719866-1-naoya.horiguchi@linux.dev
    Fixes: 74e8ee4708a8 ("mm: Turn head_compound_mapcount() into folio_entire_mapcount()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:22 -04:00
Aristeu Rozanski 81b7032292 mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 9595d76942b8714627d670a7e7ae543812c731ae
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 1 23:33:08 2022 -0500

    mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()

    Add back page_lock_anon_vma_read() as a wrapper.  This saves a few calls
    to compound_head().  If any callers were passing a tail page before,
    this would have failed to lock the anon VMA as page->mapping is not
    valid for tail pages.
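
    A sketch of such a wrapper (the signature in this tree may differ
    slightly):

        struct anon_vma *page_lock_anon_vma_read(struct page *page)
        {
                return folio_lock_anon_vma_read(page_folio(page));
        }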

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:19 -04:00