Commit Graph

491 Commits

Chris von Recklinghausen 1ac3b99189 mm, hwpoison: fix extra put_page() in soft_offline_page()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 12f1dbcf8f144c0b8dde7a62fea766f88cb79fc8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Aug 18 21:00:13 2022 +0800

    mm, hwpoison: fix extra put_page() in soft_offline_page()

    When hwpoison_filter() refuses to soft offline a page, the page refcount
    previously incremented via MF_COUNT_INCREASED has already been consumed by
    get_hwpoison_page() when ret <= 0, so the put_ref_page() here drops one
    reference too many.  Remove it to fix the issue.

    Link: https://lkml.kernel.org/r/20220818130016.45313-4-linmiaohe@huawei.com
    Fixes: 9113eaf331bf ("mm/memory-failure.c: add hwpoison_filter for soft offline")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
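The refcount imbalance described above can be modeled in a few lines of user-space C. This is a minimal sketch of the idea only, not the kernel code; the `toy_*` names and the flag-based failure injection are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page with a refcount, modelling the soft_offline_page() flow. */
struct toy_page { int refcnt; };

/* Models get_hwpoison_page(): on failure (ret <= 0) it has already
 * consumed the reference the caller took via MF_COUNT_INCREASED. */
static int toy_get_hwpoison_page(struct toy_page *p, bool fail)
{
        if (fail) {
                p->refcnt--;    /* extra ref consumed here on failure */
                return 0;
        }
        p->refcnt++;
        return 1;
}

/* Fixed flow: when ret <= 0, do NOT drop the reference a second time.
 * The buggy version also did p->refcnt-- on the early-return path. */
static int toy_soft_offline(struct toy_page *p, bool filtered)
{
        p->refcnt++;            /* MF_COUNT_INCREASED */
        int ret = toy_get_hwpoison_page(p, filtered);
        if (ret <= 0)
                return -1;      /* ref already gone: nothing to put */
        p->refcnt -= 2;         /* release both references */
        return 0;
}
```

On both the filtered and the successful path the refcount returns to its starting value, which is the invariant the upstream fix restores.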
Chris von Recklinghausen 7228fc6140 mm, hwpoison: enable memory error handling on 1GB hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6f4614886baa59b6ae014093300482c1da4d3c93
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:20 2022 +0900

    mm, hwpoison: enable memory error handling on 1GB hugepage

    Now that the error handling code is prepared, remove the blocking code and
    enable memory error handling on 1GB hugepages.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen f6b4b74d69 mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit ceaf8fbea79a854373b9fc03c9fde98eb8712725
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:19 2022 +0900

    mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage

    Currently, if memory_failure() (modified to remove the blocking code in a
    subsequent patch) is called on a page in some 1GB hugepage, memory error
    handling fails and the raw error page is left in a leaked state.  The
    impact is small in production systems (just a single leaked 4kB page), but
    it limits testability because unpoison doesn't work for it.  We can no
    longer create a 1GB hugepage on a 1GB physical address range containing
    such leaked pages, which is not useful when testing on small systems.

    When a hwpoison page in a 1GB hugepage is handled, it's caught by the
    PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
    broken down into raw error pages before coming to this point:

            if (unlikely(PageHWPoison(page)) && !order) {
                    ...
                    return false;
            }

    Then, the page is not sent to buddy and the page refcount is left 0.

    Originally this check was supposed to trigger when the error page is freed
    from page_handle_poison() (which is called from soft-offline), but now we
    are opening another path to it, so the callers of __page_handle_poison()
    need to handle the case by treating a return value of 0 as success.  Then
    the page refcount for hwpoison is properly incremented and unpoison works.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 787dc58044 mm, hwpoison: make __page_handle_poison returns int
Bugzilla: https://bugzilla.redhat.com/2160210

commit 7453bf621cfaf01a61f0e9180390ac6abc414894
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:18 2022 +0900

    mm, hwpoison: make __page_handle_poison returns int

    __page_handle_poison() currently returns a bool indicating whether
    take_page_off_buddy() succeeded.  But we will want to distinguish another
    case, "dissolve succeeded but taking the page off the buddy list failed",
    by the return value, so change the type of the return value to int.  No
    functional change.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
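The bool-to-int change above enables a tri-state return, which a small C sketch can make concrete. This is an illustrative model under invented names (`toy_*`) and an arbitrary `-16` error value, not the kernel implementation:

```c
#include <assert.h>

/* Tri-state return modelled after the reworked __page_handle_poison():
 *   < 0 : dissolve failed
 *     0 : dissolve succeeded but taking the page off the buddy list failed
 *   > 0 : fully succeeded
 */
static int toy_dissolve(int dissolve_ok)
{
        return dissolve_ok ? 0 : -16;   /* -16 stands in for an errno */
}

static int toy_take_off_buddy(int take_ok)
{
        return take_ok;                 /* 1 on success, 0 on failure */
}

static int toy_page_handle_poison(int dissolve_ok, int take_ok)
{
        if (toy_dissolve(dissolve_ok) < 0)
                return -16;             /* case a bool could also express */
        return toy_take_off_buddy(take_ok);   /* the new middle case: 0 */
}
```

A bool can only encode two of these outcomes; the int return lets callers treat 0 as "partially done", which the later 1GB hugepage patches rely on.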
Chris von Recklinghausen 69469d9dc6 mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit ac5fcde0a96a18773f06b7c00c5ea081bbdc64b3
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:16 2022 +0900

    mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage

    The raw error info list needs to be removed when a hwpoisoned hugetlb page
    is unpoisoned, and the unpoison handler needs to know how many errors there
    are in the target hugepage.  So add both.

    HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage) pages
    sometimes can't be unpoisoned, so skip them.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
Chris von Recklinghausen 16a4b1211c mm, hwpoison, hugetlb: support saving mechanism of raw error pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 161df60e9e89651c9aa3ae0edc9aae3a8a2d21e7
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:15 2022 +0900

    mm, hwpoison, hugetlb: support saving mechanism of raw error pages

    When handling memory error on a hugetlb page, the error handler tries to
    dissolve and turn it into 4kB pages.  If it's successfully dissolved,
    PageHWPoison flag is moved to the raw error page, so that's all right.
    However, dissolve sometimes fails, and then the error page is left as a
    hwpoisoned hugepage.  It would be useful to retry dissolving it to save
    healthy pages, but that's not possible now because the information about
    where the raw error pages are is lost.

    Use the private field of a few tail pages to keep that information.  The
    code path of shrinking hugepage pool uses this info to try delayed
    dissolve.  In order to remember multiple errors in a hugepage, a
    singly-linked list originating from the SUBPAGE_INDEX_HWPOISON-th tail page
    is constructed.  Only simple operations (adding an entry or clearing all)
    are required and the list is assumed not to be very long, so this simple
    data structure should be enough.

    If we fail to save the raw error info, the hwpoisoned hugepage has an error
    on an unknown subpage and this new saving mechanism no longer works, so
    disable both saving new raw error info and freeing hwpoisoned hugepages.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:31 -04:00
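The two list operations the commit describes (add an entry, clear all) are simple enough to sketch in user-space C. This is a toy model of the idea only: a global pointer stands in for the tail page's private field, and the `hugetlb_set_raw_hwp`/`hugetlb_clear_raw_hwp` names are invented:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the raw error page list hung off a hugepage tail page:
 * a singly-linked list of raw error PFNs, as described in the commit. */
struct raw_hwp_page {
        struct raw_hwp_page *next;
        unsigned long pfn;
};

/* Stands in for the SUBPAGE_INDEX_HWPOISON-th tail page's private field. */
static struct raw_hwp_page *raw_hwp_list;

/* Add one raw error entry; failure here means the hugepage must be
 * marked "raw hwp unreliable" instead. */
static int hugetlb_set_raw_hwp(unsigned long pfn)
{
        struct raw_hwp_page *p = malloc(sizeof(*p));
        if (!p)
                return -1;
        p->pfn = pfn;
        p->next = raw_hwp_list;
        raw_hwp_list = p;
        return 0;
}

/* Clear all entries (used on unpoison / dissolve); returns how many. */
static int hugetlb_clear_raw_hwp(void)
{
        int freed = 0;
        while (raw_hwp_list) {
                struct raw_hwp_page *p = raw_hwp_list;
                raw_hwp_list = p->next;
                free(p);
                freed++;
        }
        return freed;
}
```

Since only push-front and clear-all are needed and the list stays short, a singly-linked list is sufficient, exactly the trade-off the commit message argues for.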
Chris von Recklinghausen 59b7858be4 mm: memory-failure: convert to pr_fmt()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 96f96763de26d6ee333d5b2446d1b04a4e6bc75b
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Tue Jul 26 16:10:46 2022 +0800

    mm: memory-failure: convert to pr_fmt()

    Use pr_fmt to prefix all pr_<level> output.  However, unpoison_memory()
    and soft_offline_page() are used by error injection and have their own
    prefixes, "Unpoison:" and "soft offline:"; moreover, soft_offline_page()
    can also be used by memory hot-remove.  So reset pr_fmt before the
    unpoison_pr_info definition to keep the original output for them.

    [wangkefeng.wang@huawei.com: v3]
      Link: https://lkml.kernel.org/r/20220729031919.72331-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20220726081046.10742-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
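The pr_fmt convention this commit adopts can be shown with a small user-space stand-in. This sketch redirects output into a buffer so it can be checked; the buffer, the `report_pfn` helper, and the exact prefix string are illustrative, not the kernel's:

```c
#include <stdio.h>
#include <string.h>

/* The kernel convention: each pr_<level> call pastes the per-file
 * pr_fmt() prefix in front of its format string at compile time. */
#define pr_fmt(fmt) "Memory failure: " fmt

static char log_buf[256];

/* Toy pr_info that formats into log_buf instead of the kernel log. */
#define pr_info(fmt, ...) \
        snprintf(log_buf, sizeof(log_buf), pr_fmt(fmt), __VA_ARGS__)

static const char *report_pfn(unsigned long pfn)
{
        pr_info("%#lx: recovery action for free buddy page\n", pfn);
        return log_buf;
}
```

Because the prefix is applied by the macro, every message in the file gets it for free; that is also why unpoison/soft-offline, which want their own prefixes, have to `#undef`/redefine `pr_fmt` first.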
Chris von Recklinghausen fdd93778ad mm: factor helpers for memory_failure_dev_pagemap
Bugzilla: https://bugzilla.redhat.com/2160210

commit 00cc790e00369387f6ab80c5724550c2c6340334
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:26 2022 +0800

    mm: factor helpers for memory_failure_dev_pagemap

    The memory_failure_dev_pagemap code is a bit complex before introducing the
    RMAP feature for fsdax, so factor out some helper functions to simplify it.

    [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build]
    [zhengbin13@huawei.com: fix redefinition of mf_generic_kill_procs]
      Link: https://lkml.kernel.org/r/20220628112143.1170473-1-zhengbin13@huawei.com
    Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 279ba99c6b mm: add zone device coherent type memory support
Bugzilla: https://bugzilla.redhat.com/2160210

commit f25cbb7a95a24ff9a2a3bebd308e303942ae6b2c
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:10 2022 -0500

    mm: add zone device coherent type memory support

    Device memory that is cache coherent from device and CPU point of view.
    This is used on platforms that have an advanced system bus (like CAPI or
    CXL).  Any page of a process can be migrated to such memory.  However, no
    one should be allowed to pin such memory so that it can always be evicted.

    [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page]
    Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:23 -04:00
Chris von Recklinghausen 6811b8d5d5 mm/swap: convert delete_from_swap_cache() to take a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 75fa68a5d89871a35246aa2759c95d6dfaf1b582
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 17 18:50:19 2022 +0100

    mm/swap: convert delete_from_swap_cache() to take a folio

    All but one caller already has a folio, so convert it to use a folio.

    Link: https://lkml.kernel.org/r/20220617175020.717127-22-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:20 -04:00
Chris von Recklinghausen c991a31fce mm: Remove __delete_from_page_cache()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6ffcd825e7d0416d78fd41cd5b7856a78122cc8c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jun 28 20:41:40 2022 -0400

    mm: Remove __delete_from_page_cache()

    This wrapper is no longer used.  Remove it and all references to it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen 7b1db0833d mm: don't be stuck to rmap lock on reclaim path
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6d4675e601357834dadd2ba1d803f6484596015c
Author: Minchan Kim <minchan@kernel.org>
Date:   Thu May 19 14:08:54 2022 -0700

    mm: don't be stuck to rmap lock on reclaim path

    The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can become
    contended under memory pressure if processes keep working on their vmas
    (e.g., fork, mmap, munmap).  That stalls the reclaim path.  In our real
    workload traces, we see kswapd waiting on the lock for 300ms+ (in the
    worst case, a second), which pushes other processes into direct reclaim,
    where they also get stuck on the lock.

    This patch makes the LRU aging path use try_lock mode, like
    shrink_page_list, so the reclaim context keeps working on the next LRU
    pages without getting stuck.  If it finds the rmap lock contended, it
    rotates the page back to the head of the LRU in both the active and
    inactive lists for consistent behavior, which is a basic starting point
    rather than adding more heuristics.

    Since this patch introduces a new "contended" out-param field along with
    the try_lock in-param in rmap_walk_control, the control is no longer
    immutable when try_lock is set, so remove the const keywords on the
    rmap-related functions.  Since rmap walking is already an expensive
    operation, I doubt the const provided a sizable benefit (and we didn't
    have it until 5.17).

    In a heavy app workload in Android, trace shows following statistics.  It
    almost removes rmap lock contention from reclaim path.

    Martin Liu reported:

    Before:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
             1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
              601            0             601   145.544681        28817    198  rmap_walk_file

    After:

       max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
              NaN          NaN              NaN          NaN          NaN    0.0             NaN
                0            0                0     0.127645            1     12  rmap_walk_file

    [minchan@kernel.org: add comment, per Matthew]
      Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
    Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: John Dias <joaodias@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Martin Liu <liumartin@google.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:11 -04:00
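The try_lock/contended protocol described above can be sketched without kernel internals. This is a single-threaded toy model: a plain flag stands in for i_mmap_rwsem, and all `toy_*` names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rmap_walk_control after the commit: a try_lock in-param and a
 * contended out-param, so reclaim can skip a contended rmap lock
 * instead of sleeping on it. */
struct toy_rwc {
        bool try_lock;    /* in: don't sleep on the rmap lock */
        bool contended;   /* out: lock was busy, page should be rotated */
};

/* Stands in for i_mmap_rwsem / anon_vma->root->rwsem being held. */
static bool toy_rmap_lock_held;

/* Returns true if the walk ran, false if it was skipped. */
static bool toy_rmap_walk(struct toy_rwc *rwc)
{
        if (toy_rmap_lock_held) {
                if (rwc->try_lock) {
                        /* caller rotates the page back to the LRU head */
                        rwc->contended = true;
                        return false;
                }
                /* non-try_lock callers would sleep here until it's free */
        }
        toy_rmap_lock_held = true;
        /* ... walk the reverse mappings ... */
        toy_rmap_lock_held = false;
        return true;
}
```

The key design point is that failure is reported out-of-band via `contended` rather than by blocking, so kswapd can move on to the next LRU page immediately.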
Chris von Recklinghausen 91c4a45404 mm/memory-failure.c: simplify num_poisoned_pages_inc/dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit e240ac52f7da5986f9dcbe29d423b7b2f141b41b
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:10 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_inc/dec

    Originally, num_poisoned_pages_inc() was done in the memory failure
    routine, and num_poisoned_pages_dec() was used to roll the number back if
    the page was filtered or cancelled.

    As suggested by Naoya, do num_poisoned_pages_inc() only in
    action_result(); this makes things clear and simple.

    Link: https://lkml.kernel.org/r/20220509105641.491313-6-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
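The "single increment site" idea above is easy to demonstrate in miniature. A hedged sketch with invented `toy_*` names, not the kernel code:

```c
#include <assert.h>

/* After the commit, the poisoned-page counter is bumped in exactly one
 * place, action_result(), instead of inc-then-rollback on filtering. */
static long num_poisoned_pages;

enum toy_outcome { TOY_FILTERED, TOY_HANDLED };

static void toy_action_result(void)
{
        num_poisoned_pages++;   /* the single increment site */
}

static void toy_memory_failure(enum toy_outcome out)
{
        if (out == TOY_FILTERED)
                return;         /* never incremented, so no rollback needed */
        toy_action_result();
}
```

Concentrating the increment in one function removes every rollback path, which is what makes the accounting "clear and simple".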
Chris von Recklinghausen 7e2e3250a4 mm/memory-failure.c: add hwpoison_filter for soft offline
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9113eaf331bf44579882c001867773cf1b3364fd
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:10 2022 -0700

    mm/memory-failure.c: add hwpoison_filter for soft offline

    hwpoison_filter is missing in the soft offline path, which leads to an
    issue: after enabling the corrupt filter, the user process still has a
    chance to inject a hwpoison fault by madvise(addr, len, MADV_SOFT_OFFLINE)
    at a PFN which is expected to be rejected.

    Also do a minor change in comment of memory_failure().

    Link: https://lkml.kernel.org/r/20220509105641.491313-4-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 32443a7ae5 mm/memory-failure.c: simplify num_poisoned_pages_dec
Bugzilla: https://bugzilla.redhat.com/2160210

commit c8bd84f73fd6215d5b8d0b3cfc914a3671b16d1c
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: simplify num_poisoned_pages_dec

    Don't decrease the number of poisoned pages in page_alloc.c, let the
    memory-failure.c do inc/dec poisoned pages only.

    Also simplify unpoison_memory(), only decrease the number of
    poisoned pages when:
     - TestClearPageHWPoison() succeed
     - put_page_back_buddy succeed

    After decreasing, print necessary log.

    Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

    Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen 65b3725ab2 mm/memory-failure.c: move clear_hwpoisoned_pages
Bugzilla: https://bugzilla.redhat.com/2160210

commit 60f272f6b09a8f14156df88cccd21447ab394452
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Thu May 12 20:23:09 2022 -0700

    mm/memory-failure.c: move clear_hwpoisoned_pages

    Patch series "memory-failure: fix hwpoison_filter", v2.

    As is well known, the memory failure mechanism handles memory corruption
    events and tries to send SIGBUS to the user process which uses the
    corrupted page.

    For the virtualization case, QEMU catches the SIGBUS and tries to inject
    an MCE into the guest, and the guest handles the memory failure again.
    Thus the guest suffers minimal impact from hardware memory corruption.

    The further step I'm working on:

    1, try to modify code to decrease poisoned pages in a single place
       (mm/memory-failure.c: simplify num_poisoned_pages_dec in this series).

    2, try to use page_handle_poison() to handle SetPageHWPoison() and
       num_poisoned_pages_inc() together.  It would be best to call
       num_poisoned_pages_inc() in a single place too.

    3, introduce a memory failure notifier list in memory-failure.c: notify
       the corrupted PFN to anyone who registers on this list.  If I can
       complete parts [1] and [2], [3] will be quite easy (just call the
       notifier list after increasing the poisoned page count).

    4, introduce memory recover VQ for memory balloon device, and registers
       memory failure notifier list.  During the guest kernel handles memory
       failure, balloon device gets notified by memory failure notifier list,
       and tells the host to recover the corrupted PFN(GPA) by the new VQ.

    5, host side remaps the corrupted page(HVA), and tells the guest side
       to unpoison the PFN(GPA).  Then the guest fixes the corrupted page(GPA)
       dynamically.

    This patch (of 5):

    clear_hwpoisoned_pages() clears the HWPoison flag and decreases the number
    of poisoned pages; this actually works as part of memory failure handling.

    Move this function from sparse.c to memory-failure.c; afterwards there is
    no CONFIG_MEMORY_FAILURE in sparse.c.

    Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
    Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:08 -04:00
Chris von Recklinghausen f942ace7a2 mm: create new mm/swap.h header file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 014bb1de4fc17d54907d54418126a9a9736f4aff
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:47 2022 -0700

    mm: create new mm/swap.h header file

    Patch series "MM changes to improve swap-over-NFS support".

    Assorted improvements for swap-via-filesystem.

    This is a resend of these patches, rebased on current HEAD.  The only
    substantial changes is that swap_dirty_folio has replaced
    swap_set_page_dirty.

    Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
    has previously worked for NFS but that broke a few releases back.  This
    series changes to use a new ->swap_rw rather than ->readpage and
    ->direct_IO.  It also makes other improvements.

    There is a companion series already in linux-next which fixes various
    issues with NFS.  Once both series land, a final patch is needed which
    changes NFS over to use ->swap_rw.

    This patch (of 10):

    Many functions declared in include/linux/swap.h are only used within mm/.

    Create a new "mm/swap.h" and move some of these declarations there.
    Remove the redundant 'extern' from the function declarations.

    [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
    Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
    Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:00 -04:00
Chris von Recklinghausen f4ca3e9bff mm, hugetlb, hwpoison: separate branch for free and in-use hugepage
Bugzilla: https://bugzilla.redhat.com/2160210

commit b283d983a7a6ffe3939ff26f06d151331a7c1071
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm, hugetlb, hwpoison: separate branch for free and in-use hugepage

    We know that HPageFreed pages should have page refcount 0, so
    get_page_unless_zero() always fails and returns 0.  So explicitly separate
    the branch based on page state for minor optimization and better
    readability.

    Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
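The free/in-use branch split above hinges on how get_page_unless_zero() behaves on a zero refcount, which a tiny C model can show. The `toy_*` names and the `freed` flag (standing in for HPageFreed) are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: `freed` stands in for the HPageFreed state, whose pages
 * always have refcount 0. */
struct toy_page { int refcnt; bool freed; };

/* Toy get_page_unless_zero(): fails when the refcount is already 0. */
static bool toy_get_page_unless_zero(struct toy_page *p)
{
        if (p->refcnt == 0)
                return false;
        p->refcnt++;
        return true;
}

/* After the commit: branch on page state instead of calling
 * get_page_unless_zero() on a free hugepage, where it always fails.
 * Returns 0 for a free page, 1 if a ref was taken, -1 on failure. */
static int toy_get_hwpoison_hugepage(struct toy_page *p)
{
        if (p->freed)
                return 0;   /* known refcount 0: skip the doomed attempt */
        return toy_get_page_unless_zero(p) ? 1 : -1;
}
```

Separating the branches avoids a call that can never succeed on free pages and makes the two cases legible at the call site.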
Chris von Recklinghausen 569cbb051f mm/memory-failure.c: dissolve truncated hugetlb page
Bugzilla: https://bugzilla.redhat.com/2160210

commit ef526b17bc3399b8df25d574aa11fc36f89da80a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/memory-failure.c: dissolve truncated hugetlb page

    If me_huge_page meets a truncated but not yet freed hugepage, it won't be
    dissolved even if we hold the last refcnt, because the hugepage has a NULL
    page_mapping while not being an anonymous hugepage either.  Thus we lose
    the last chance to dissolve it into the buddy allocator to save healthy
    subpages.  Remove the PageAnon check to handle these hugepages too.

    Link: https://lkml.kernel.org/r/20220414114941.11223-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 7e822751d3 mm/memory-failure.c: minor cleanup for HWPoisonHandlable
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3f871370686ddf3c72207321eef8f6672ae957e4
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/memory-failure.c: minor cleanup for HWPoisonHandlable

    Patch series "A few fixup and cleanup patches for memory failure", v2.

    This series contains a patch to clean up the HWPoisonHandlable and another
    one to dissolve truncated hugetlb page.  More details can be found in the
    respective changelogs.

    This patch (of 2):

    The local variable movable can be removed by returning true directly. Also
    fix typo 'mirgate'. No functional change intended.

    Link: https://lkml.kernel.org/r/20220414114941.11223-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220414114941.11223-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen f9a62953cd mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED
Bugzilla: https://bugzilla.redhat.com/2160210

commit f361e2462e8cccdd9231aa3274690705a2ea35a2
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED

    In the already-hwpoisoned case, memory_failure() is supposed to return
    after releasing the page refcount taken for error handling.  But currently
    the refcount is not released when called with MF_COUNT_INCREASED, which
    makes the page refcount inconsistent.  This should be rare and
    non-critical, but it might be inconvenient in testing (unpoison doesn't
    work).

    Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
Chris von Recklinghausen 50055f3e44 mm/memory-failure.c: remove unnecessary (void*) conversions
Bugzilla: https://bugzilla.redhat.com/2160210

commit f142e70750a1ea36ba60fb4f24bc37713e921f73
Author: liqiong <liqiong@nfschina.com>
Date:   Thu Apr 28 23:16:01 2022 -0700

    mm/memory-failure.c: remove unnecessary (void*) conversions

    No need to cast (void *) to (struct hwp_walk *).

    Link: https://lkml.kernel.org/r/20220322142826.25939-1-liqiong@nfschina.com
    Signed-off-by: liqiong <liqiong@nfschina.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:53 -04:00
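The cleanup above relies on a C language rule: a `void *` converts implicitly to any object pointer type, so the explicit casts were redundant. A minimal sketch with an invented `hwp_walk_toy` struct, mirroring the shape of such a walker callback:

```c
#include <assert.h>

/* Toy stand-in for the pagetable-walk private data. */
struct hwp_walk_toy { unsigned long pfn; };

static unsigned long toy_callback(void *walk_private)
{
        /* Implicit conversion: no `(struct hwp_walk_toy *)` cast needed.
         * (In C++ the cast would be required, but this is C.) */
        struct hwp_walk_toy *hwp = walk_private;
        return hwp->pfn;
}
```

Dropping the cast is not just shorter: a spurious cast can silently hide a type mismatch if the struct name later changes, while the implicit conversion stays checked against the declared variable type.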
Nico Pache f60056bb70 mm,hwpoison: check mm when killing accessing process
commit 77677cdbc2aa4b5d5d839562793d3d126201d18d
Author: Shuai Xue <xueshuai@linux.alibaba.com>
Date:   Wed Sep 14 14:49:35 2022 +0800

    mm,hwpoison: check mm when killing accessing process

    The GHES code calls memory_failure_queue() from IRQ context to queue work
    into workqueue and schedule it on the current CPU.  Then the work is
    processed in memory_failure_work_func() by kworker and calls
    memory_failure().

    When a page is already poisoned, commit a3f5d80ea4 ("mm,hwpoison: send
    SIGBUS with error virutal address") made memory_failure() call
    kill_accessing_process() that:

        - holds mmap locking of current->mm
        - does pagetable walk to find the error virtual address
        - and sends SIGBUS to the current process with error info.

    However, the mm of kworker is not valid, resulting in a null-pointer
    dereference.  So check mm when killing the accessing process.

    [akpm@linux-foundation.org: remove unrelated whitespace alteration]
    Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
    Fixes: a3f5d80ea4 ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:43 -07:00
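The mm check described in the commit above can be illustrated with a minimal userspace sketch; the struct layout, function names, and return values here are stand-ins for the kernel internals, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: a kworker task has no user address space (mm == NULL). */
struct mm_struct { int users; };
struct task { struct mm_struct *mm; };

/* Hypothetical stand-in for kill_accessing_process(): it walks the
 * task's page tables, so it must never be entered with a NULL mm. */
static int kill_accessing_process(struct task *tsk)
{
    /* dereferences tsk->mm, which would crash for a kworker */
    return tsk->mm->users > 0 ? 0 : -1;
}

/* The fix: check the mm before walking it, as the commit describes. */
static int handle_already_poisoned(struct task *tsk)
{
    if (tsk->mm == NULL)
        return -1; /* kworker context: no address space to signal about */
    return kill_accessing_process(tsk);
}
```

The point of the sketch is only the ordering: the NULL test must come before any code that dereferences the mm.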
Nico Pache aa28eb0c17 mm/migration: return errno when isolate_huge_page failed
commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:15 2022 +0800

    mm/migration: return errno when isolate_huge_page failed

    We might fail to isolate a huge page, e.g. because the page is under
    migration, which cleared HPageMigratable.  We should return an errno
    in this case rather than always returning 1, which could confuse the
    user, i.e. the caller might think all of the memory was migrated while
    the hugetlb page was left behind.  We make the prototype of
    isolate_huge_page consistent with isolate_lru_page as suggested by
    Huang Ying, and rename isolate_huge_page to isolate_hugetlb as
    suggested by Muchun to improve readability.

    Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
    Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Huang Ying <ying.huang@intel.com>
    Reported-by: kernel test robot <lkp@intel.com> (build error)
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
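The interface change in the commit above can be sketched in a few lines of plain C; the page type and the specific failure reason are illustrative stand-ins for the hugetlb internals.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a hugetlb page; HPageMigratable is modeled as a bool. */
struct hpage { bool migratable; };

/* Old interface: a bool hides *why* isolation failed. */
static bool isolate_huge_page_old(struct hpage *p)
{
    return p->migratable;
}

/* New interface, matching isolate_lru_page(): 0 on success, a negative
 * errno (here -EBUSY) when the page cannot be isolated, e.g. because it
 * is already under migration. */
static int isolate_hugetlb(struct hpage *p)
{
    if (!p->migratable)
        return -EBUSY;
    return 0;
}
```

With the errno-returning form, a migration caller can distinguish "already being migrated" from success instead of collapsing both into a truthy value.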
Chris von Recklinghausen da6b17e5e9 mm, hwpoison: set PG_hwpoison for busy hugetlb pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 38f6d29397ccb9c191c4c91103e8123f518fdc10
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Jul 14 13:24:17 2022 +0900

    mm, hwpoison: set PG_hwpoison for busy hugetlb pages

    If memory_failure() fails to grab page refcount on a hugetlb page because
    it's busy, it returns without setting PG_hwpoison on it.  This not only
    loses a chance of error containment, but also breaks the rule that
    action_result() should be called only when memory_failure() does any
    handling work (even if that's just setting PG_hwpoison).  This
    inconsistency could harm code maintainability.

    So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.

    Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev
    Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Liu Shixin <liushixin2@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen 18b123b391 mm/memory-failure: disable unpoison once hw error happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit 67f22ba7750f940bcd7e1b12720896c505c2d63f
Author: zhenwei pi <pizhenwei@bytedance.com>
Date:   Wed Jun 15 17:32:09 2022 +0800

    mm/memory-failure: disable unpoison once hw error happens

    Currently unpoison_memory(unsigned long pfn) is designed for soft
    poison (hwpoison-inject) only.  Since 17fae1294a, the KPTE gets
    cleared on an x86 platform once hardware memory corrupts.

    Unpoisoning a hardware-corrupted page only puts the page back into the
    buddy allocator, so the kernel has a chance to access the page through
    the *NOT PRESENT* KPTE.  This leads to a BUG when accessing via the
    corrupted KPTE.

    As suggested by David and Naoya, disable the unpoison mechanism when a
    real HW error happens, to avoid a BUG like this:

     Unpoison: Software-unpoisoned page 0x61234
     BUG: unable to handle page fault for address: ffff888061234000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
     Oops: 0002 [#1] PREEMPT SMP NOPTI
     CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0.bm.1-amd64 #7
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
     RIP: 0010:clear_page_erms+0x7/0x10
     Code: ...
     RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
     RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
     RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
     RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
     R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
     R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
     FS:  00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     PKRU: 55555554
     Call Trace:
      <TASK>
      prep_new_page+0x151/0x170
      get_page_from_freelist+0xca0/0xe20
      ? sysvec_apic_timer_interrupt+0xab/0xc0
      ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
      __alloc_pages+0x17e/0x340
      __folio_alloc+0x17/0x40
      vma_alloc_folio+0x84/0x280
      __handle_mm_fault+0x8d4/0xeb0
      handle_mm_fault+0xd5/0x2a0
      do_user_addr_fault+0x1d0/0x680
      ? kvm_read_and_reset_apf_flags+0x3b/0x50
      exc_page_fault+0x78/0x170
      asm_exc_page_fault+0x27/0x30

    Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
    Fixes: 847ce401df ("HWPOISON: Add unpoisoning support")
    Fixes: 17fae1294a ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>    [5.8+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
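The one-way latch the commit above describes can be sketched as follows; this is a userspace-only sketch, and the names mirror the commit's description rather than quoting the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Once a real hardware error has been handled, unpoisoning is disabled
 * for good: the KPTE for the page may already be unmapped. */
static bool hw_memory_failure = false;

/* Called from the real memory_failure() path (not hwpoison-inject). */
static void note_hw_memory_failure(void)
{
    hw_memory_failure = true;
}

static int unpoison_memory(void)
{
    if (hw_memory_failure)
        return -EOPNOTSUPP; /* refuse: a real HW error happened earlier */
    return 0;               /* soft-poisoned page may be unpoisoned */
}
```

The latch is deliberately never reset: after one genuine hardware error, no later unpoison attempt can put a page with an unmapped KPTE back on the free lists.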
Chris von Recklinghausen 848ae57ee1 Revert "mm/memory-failure.c: fix race with changing page compound again"
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:16:02 2022 -0700

    Revert "mm/memory-failure.c: fix race with changing page compound again"

    Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
    page compound again") because now we fetch the page refcount under
    hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
    longer necessary.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen 0e870434e4 Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
Bugzilla: https://bugzilla.redhat.com/2120352

commit b4e61fc031b11dd807dffc46cebbf0e25966d3d1
Author: Xu Yu <xuyu@linux.alibaba.com>
Date:   Thu Apr 28 23:14:43 2022 -0700

    Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"

    Patch series "mm/memory-failure: rework fix on huge_zero_page splitting".

    This patch (of 2):

    This reverts commit d173d5417fb67411e623d394aab986d847e47dad.

    The commit d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in
    memory_failure()") explicitly skips huge_zero_page in memory_failure(), in
    order to avoid triggering VM_BUG_ON_PAGE on huge_zero_page in
    split_huge_page_to_list().

    This works, but Yang Shi thinks that,

        Raising BUG is overkilling for splitting huge_zero_page. The
        huge_zero_page can't be met from normal paths other than memory
        failure, but memory failure is a valid caller. So I tend to replace
        the BUG to WARN + returning -EBUSY. If we don't care about the
        reason code in memory failure, we don't have to touch memory
        failure.

    And for the issue that huge_zero_page will be set PG_has_hwpoisoned,
    Yang Shi comments that,

        The anonymous page fault doesn't check if the page is poisoned or
        not since it typically gets a fresh allocated page and assumes the
        poisoned page (isolated successfully) can't be reallocated again.
        But huge zero page and base zero page are reused every time. So no
        matter what fix we pick, the issue is always there.

    Finally, Yang, David, Anshuman and Naoya all agree to fix the bug, i.e.,
    to split huge_zero_page, in split_huge_page_to_list().

    This reverts the commit d173d5417fb6 ("mm/memory-failure.c: skip
    huge_zero_page in memory_failure()"), and the original bug will be fixed
    by the next patch.

    Link: https://lkml.kernel.org/r/872cefb182ba1dd686b0e7db1e6b2ebe5a4fff87.1651039624.git.xuyu@linux.alibaba.com
    Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
    Fixes: 6a46079cf5 ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
    Suggested-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen 8ca43f4c30 mm/memory-failure.c: skip huge_zero_page in memory_failure()
Bugzilla: https://bugzilla.redhat.com/2120352

commit d173d5417fb67411e623d394aab986d847e47dad
Author: Xu Yu <xuyu@linux.alibaba.com>
Date:   Thu Apr 21 16:35:37 2022 -0700

    mm/memory-failure.c: skip huge_zero_page in memory_failure()

    Kernel panic when injecting memory_failure for the global
    huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows.

      Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
      page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
      head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
      flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
      raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
      page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
      ------------[ cut here ]------------
      kernel BUG at mm/huge_memory.c:2499!
      invalid opcode: 0000 [#1] PREEMPT SMP PTI
      CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
      Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      RIP: 0010:split_huge_page_to_list+0x66a/0x880
      Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
      RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
      RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
      RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
      R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
      R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
      FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       try_to_split_thp_page+0x3a/0x130
       memory_failure+0x128/0x800
       madvise_inject_error.cold+0x8b/0xa1
       __x64_sys_madvise+0x54/0x60
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7fc3754f8bf9
      Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
      RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
      RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
      RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
      R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

    This makes huge_zero_page bail out explicitly before split in
    memory_failure(), thus the panic above won't happen again.

    Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
    Fixes: 6a46079cf5 ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
    Reported-by: Abaci <abaci@linux.alibaba.com>
    Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen a805faea7e mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 405ce051236cc65b30bbfe490b28ce60ae6aed85
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 21 16:35:33 2022 -0700

    mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

    There is a race condition between memory_failure_hugetlb() and hugetlb
    free/demotion, which causes setting PageHWPoison flag on the wrong page.
    The one simple result is that wrong processes can be killed, but another
    (more serious) one is that the actual error is left unhandled, so no one
    prevents later access to it, and that might lead to more serious results
    like consuming corrupted data.

    Think about the below race window:

      CPU 1                                   CPU 2
      memory_failure_hugetlb
      struct page *head = compound_head(p);
                                              hugetlb page might be freed to
                                              buddy, or even changed to another
                                              compound page.

      get_hwpoison_page -- page is not what we want now...

    The current code first does rough prechecks and then reconfirms after
    taking the refcount, but it was found that this makes the code overly
    complicated, so move the prechecks into a single hugetlb_lock range.

    A newly introduced function, try_memory_failure_hugetlb(), always takes
    hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
    memory_failure() is rare in principle, so should not be a big problem.

    Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
    Fixes: 761ad8d7c7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:06 -04:00
Chris von Recklinghausen b2c2b8a92b mm/memory-failure.c: make non-LRU movable pages unhandlable
Bugzilla: https://bugzilla.redhat.com/2120352

commit bf6445bc8f778590ac754b06a8fe82ce5a9f818a
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:50 2022 -0700

    mm/memory-failure.c: make non-LRU movable pages unhandlable

    We can not really handle non-LRU movable pages in memory failure.
    Typically they are balloon, zsmalloc, etc.

    Assuming we run into a base (4K) non-LRU movable page, we could reach as
    far as identify_page_state(), it should not fall into any category
    except me_unknown.

    For the non-LRU compound movable pages, they could be taken for
    transhuge pages but it's unexpected to split non-LRU movable pages using
    split_huge_page_to_list in memory_failure.  So we could just simply make
    non-LRU movable pages unhandlable to avoid these possible nasty cases.

    Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 210c0d8f7c mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 593396b86ef6f79c71e09c183eae28040ccfeedf
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:47 2022 -0700

    mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages

    Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
    into its one caller"), invalidate_inode_page() can invalidate the pages
    in the swap cache because the check of page->mapping != mapping is
    removed.  But invalidate_inode_page() is not expected to deal with
    pages in the swap cache.  Non-LRU movable pages can reach here too,
    and they're not page cache pages either.  Skip these pages by
    checking PageSwapCache and PageLRU.

    Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen fba8841df7 mm/memory-failure.c: fix race with changing page compound again
Bugzilla: https://bugzilla.redhat.com/2120352

commit 888af2701db79b9b27c7e37f9ede528a5ca53b76
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:44 2022 -0700

    mm/memory-failure.c: fix race with changing page compound again

    Patch series "A few fixup patches for memory failure", v2.

    This series contains a few patches to fix the race with the page
    changing compound state, make non-LRU movable pages unhandlable, and
    so on.  More details can be found in the respective changelogs.

    There is a race window where we got the compound_head, the hugetlb page
    could be freed to buddy, or even changed to another compound page just
    before we try to get hwpoison page.  Think about the below race window:

      CPU 1                                   CPU 2
      memory_failure_hugetlb
      struct page *head = compound_head(p);
                                              hugetlb page might be freed to
                                              buddy, or even changed to another
                                              compound page.

      get_hwpoison_page -- page is not what we want now...

    If this race happens, just bail out.  Also MF_MSG_DIFFERENT_PAGE_SIZE is
    introduced to record this event.

    [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

    Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
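The bail-out check in the race window above can be sketched like this; struct page and compound_head() are heavily simplified stand-ins for the real kernel structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified page model: a tail page points at its compound head. */
struct page { struct page *head; };

static struct page *compound_head(struct page *p)
{
    return p->head ? p->head : p;
}

/* After pinning the page, re-read the compound head; if it no longer
 * matches the head we saw earlier, the page was freed or reassembled
 * under us, so bail out rather than poison the wrong page. */
static int check_compound_stable(struct page *p, struct page *seen_head)
{
    if (compound_head(p) != seen_head)
        return -EBUSY;
    return 0;
}
```

The sketch captures the idea only: the "stale head" observation on CPU 1 is detected by repeating the lookup once the refcount pins the page.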
Chris von Recklinghausen e624cb473d mm/hwpoison: add in-use hugepage hwpoison filter judgement
Bugzilla: https://bugzilla.redhat.com/2120352

commit a06ad3c0c75297f0b0999b1a981e50224e690ee9
Author: luofei <luofei@unicloud.com>
Date:   Tue Mar 22 14:44:41 2022 -0700

    mm/hwpoison: add in-use hugepage hwpoison filter judgement

    After successfully obtaining the reference count of the huge page, it
    is still necessary to call hwpoison_filter() to make a filter
    judgement; otherwise the filtered hugepage will be unmapped and the
    related process may be killed.

    Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com
    Signed-off-by: luofei <luofei@unicloud.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 281e153b89 mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
Bugzilla: https://bugzilla.redhat.com/2120352

commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c
Author: luofei <luofei@unicloud.com>
Date:   Tue Mar 22 14:44:38 2022 -0700

    mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing for mce handler, but
    should return a distinct value, otherwise mce handler regards the error
    page has been identified and isolated, which may lead to calling
    set_mce_nospec() to change page attribute, etc.

    Here memory_failure() return -EOPNOTSUPP to indicate that the error
    event is filtered, mce handler should not take any action for this
    situation and hwpoison injector should treat as correct.

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei <luofei@unicloud.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
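The contract described in the commit above can be sketched as follows; hwpoison_filter() is mocked here by a simple pfn comparison, and the structure is illustrative rather than the kernel's actual control flow.

```c
#include <assert.h>
#include <errno.h>

/* Mock filter: nonzero means this pfn is filtered out. */
static int hwpoison_filter(unsigned long pfn, unsigned long wanted_pfn)
{
    return pfn != wanted_pfn;
}

/* Filtered events return -EOPNOTSUPP so the MCE handler knows the page
 * was neither identified nor isolated, and takes no further action
 * (e.g. no set_mce_nospec() page-attribute change). */
static int memory_failure_sketch(unsigned long pfn, unsigned long wanted_pfn)
{
    if (hwpoison_filter(pfn, wanted_pfn))
        return -EOPNOTSUPP;
    return 0; /* page handled and isolated */
}
```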
Chris von Recklinghausen b59a957edd mm/memory-failure.c: remove unnecessary PageTransTail check
Bugzilla: https://bugzilla.redhat.com/2120352

commit b04d3eebebf8372f83924db6c1e4fbdcab7cafc2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:33 2022 -0700

    mm/memory-failure.c: remove unnecessary PageTransTail check

    When we reach here, we're guaranteed to have a non-compound page as
    the thp has already been split.  Remove this unnecessary PageTransTail
    check.

    Link: https://lkml.kernel.org/r/20220218090118.1105-9-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen caeba3ae14 mm/memory-failure.c: remove obsolete comment in __soft_offline_page
Bugzilla: https://bugzilla.redhat.com/2120352

commit 2ab916790ff0bbaac557dc1238f08237dd7799cc
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:30 2022 -0700

    mm/memory-failure.c: remove obsolete comment in __soft_offline_page

    Since commit add05cecef ("mm: soft-offline: don't free target page in
    successful page migration"), set_migratetype_isolate logic is removed.
    Remove this obsolete comment.

    Link: https://lkml.kernel.org/r/20220218090118.1105-8-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen b699715562 mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()

Conflicts: mm/memory-failure.c - We already have
	869f7ee6f647 ("mm/rmap: Convert try_to_unmap() to take a folio")
	so keep calling try_to_unmap with a folio

Bugzilla: https://bugzilla.redhat.com/2120352

commit 357670f79efb7e520461d18bb093342605c7cbed
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:27 2022 -0700

    mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()

    Only for hugetlb pages in shared mappings, try_to_unmap should take
    semaphore in write mode here.  Rework the code to make it clear.

    Link: https://lkml.kernel.org/r/20220218090118.1105-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 8d8db99aae mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev
Bugzilla: https://bugzilla.redhat.com/2120352

commit 67ff51c6a6d2ef99cf35a937e59269dc9a0c7fc2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:24 2022 -0700

    mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev

    Since commit 03e5ac2fc3 ("mm: fix crash when using XFS on loopback"),
    page_mapping() can handle the Slab pages.  So remove this unnecessary
    PageSlab check and obsolete comment.

    Link: https://lkml.kernel.org/r/20220218090118.1105-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen d7ee05e98d mm/memory-failure.c: fix race with changing page more robustly
Bugzilla: https://bugzilla.redhat.com/2120352

commit 75ee64b3c9a9695726056e9ec527e11dbf286500
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:21 2022 -0700

    mm/memory-failure.c: fix race with changing page more robustly

    We only intend to deal with a non-compound page after we split the thp
    in memory_failure.  However, the page could have become a compound
    page again due to a race window.  If this happens, we retry once to
    hopefully handle the page in the next round.  Also remove the unneeded
    orig_head.  It's always equal to hpage, so we can use hpage directly
    and remove this redundant variable.

    Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen 798a62dcd4 mm/memory-failure.c: rework the signaling logic in kill_proc
Bugzilla: https://bugzilla.redhat.com/2120352

commit 49775047cf52a92e41444d41a0584180ec2c256b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:18 2022 -0700

    mm/memory-failure.c: rework the signaling logic in kill_proc

    BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and the
    target is current.  Rework the code to make this clear.

    Link: https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen e6b62fd6dc mm/memory-failure.c: catch unexpected -EFAULT from vma_address()
Bugzilla: https://bugzilla.redhat.com/2120352

commit a994402bc4714cefea5770b2d906cef5b0f4dc5c
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:15 2022 -0700

    mm/memory-failure.c: catch unexpected -EFAULT from vma_address()

    It's unexpected to walk the page table when vma_address() returns
    -EFAULT.  But dev_pagemap_mapping_shift() is called only after the vma
    associated with the error page has been found in
    collect_procs_{file,anon}, so vma_address() should not return -EFAULT
    except in the presence of a bug, as Naoya pointed out.  Use
    VM_BUG_ON_VMA() to catch such a bug here.

    Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen a0431d1544 mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap
Bugzilla: https://bugzilla.redhat.com/2120352

commit 577553f4897181dc8960351511c921018892e818
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:44:12 2022 -0700

    mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap

    Patch series "A few cleanup and fixup patches for memory failure", v3.

    This series contains a few patches to simplify the code logic, remove
    an unneeded variable and remove an obsolete comment.  We also make
    memory_failure() handle the race with a changing page more robustly.
    More details can be found in the respective changelogs.

    This patch (of 8):

    The flags argument always has MF_ACTION_REQUIRED and MF_MUST_KILL set,
    so we do not need to check these flags again.

    Link: https://lkml.kernel.org/r/20220218090118.1105-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220218090118.1105-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen b9d2f13533 mm/memory-failure.c: remove obsolete comment
Bugzilla: https://bugzilla.redhat.com/2120352

commit ae483c20062695324202d19e5283819b11b83eaa
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Tue Mar 22 14:44:03 2022 -0700

    mm/memory-failure.c: remove obsolete comment

    With the introduction of mf_mutex, most of the memory error handling
    process is mutually exclusive, so the in-line comment about the
    subtlety of double-checking PageHWPoison is no longer correct.  Remove
    it.

    Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:52 -04:00
Chris von Recklinghausen e5602006ba memory-failure: fetch compound_head after pgmap_pfn_valid()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 61e28cf0543c7d8e6ef88c3c305f727c5a21ba5b
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Sat Jan 29 13:41:01 2022 -0800

    memory-failure: fetch compound_head after pgmap_pfn_valid()

    memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
    dax_lock_page()).  For a devmap with compound pages, fetch the
    compound_head in case a tail-page memory failure is being handled.

    Currently this is a nop, but with the advent of compound pages in
    dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

    Without this fix memory-failure handling (i.e.  MCEs on pmem) with
    device-dax configured namespaces will regress (and crash).

    Link: https://lkml.kernel.org/r/20211202204422.26777-2-joao.m.martins@oracle.com
    Reported-by: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:43 -04:00
Chris von Recklinghausen 8bdd7409d0 mm: fix some comment errors
Conflicts: mm/swap.c - We already have
	ff042f4a9b05 ("mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu")
	which updated the comment. Keep the changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 0b8f0d870020dbd7037bfacbb73a9b3213470f90
Author: Quanfa Fu <fuqf0919@gmail.com>
Date:   Fri Jan 14 14:09:25 2022 -0800

    mm: fix some comment errors

    Link: https://lkml.kernel.org/r/20211101040208.460810-1-fuqf0919@gmail.com
    Signed-off-by: Quanfa Fu <fuqf0919@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:41 -04:00
Chris von Recklinghausen 5d4d6bace0 mm: shmem: don't truncate page if memory failure happens
Bugzilla: https://bugzilla.redhat.com/2120352

commit a7605426666196c5a460dd3de6f8dac1d3c21f00
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Jan 14 14:05:19 2022 -0800

    mm: shmem: don't truncate page if memory failure happens

    The current behavior of memory failure is to truncate the page cache
    regardless of whether the page is dirty or clean.  If the page is
    dirty, a later access will get the obsolete data from disk without any
    notification to the user.  This may cause silent data loss.  It is
    even worse for shmem, since shmem is an in-memory filesystem and
    truncating the page cache means discarding data blocks; a later read
    would return all zeros.

    The right approach is to keep the corrupted page in the page cache:
    any later access returns an error for syscalls or SIGBUS for page
    faults, until the file is truncated, hole punched or removed.  Regular
    storage-backed filesystems would be more complicated, so this patch
    focuses on shmem.  This also unblocks support for soft offlining shmem
    THPs.

    [akpm@linux-foundation.org: coding style fixes]
    [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
      Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
    [Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
     slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
     and Muchun Song <songmuchun@bytedance.com> proposed and reworked the
     error handling of shmem_write_begin() suggested by Linus]
      Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/

    Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
    Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ajay Garg <ajaygargnsit@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Andy Lavr <andy.lavr@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 5b74ce1a45 mm/memory_failure: constify static mm_walk_ops
Bugzilla: https://bugzilla.redhat.com/2120352

commit ba9eb3cef9e699e259f9ceefdbcd3ee83d3529e2
Author: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Date:   Fri Nov 5 13:41:01 2021 -0700

    mm/memory_failure: constify static mm_walk_ops

    The only usage of hwp_walk_ops is to pass its address to
    walk_page_range() which takes a pointer to const mm_walk_ops as
    argument.

    Make it const to allow the compiler to put it in read-only memory.

    Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com
    Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:28 -04:00
Aristeu Rozanski 6ed3b2ca9f mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 1825b93b626e99eb9a0f9f50342c7b2fa201b387
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Thu Apr 28 23:14:44 2022 -0700

    mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()

    The following VM_BUG_ON_FOLIO() is triggered when a memory error event
    happens on (thp/folio) pages which are about to be freed:

      [ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
      [ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
      [ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
      [ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000
      [ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      [ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
      [ 1160.251815] ------------[ cut here ]------------
      [ 1160.253438] kernel BUG at include/linux/mm.h:788!
      [ 1160.256162] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G            E     5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ #3
      [ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
      [ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd
      [ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00
      [ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292
      [ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000
      [ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff
      [ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff
      [ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000
      [ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000
      [ 1160.286641] FS:  00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000
      [ 1160.289498] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0
      [ 1160.293031] Call Trace:
      [ 1160.293724]  <TASK>
      [ 1160.294334]  get_hwpoison_page+0x47d/0x570
      [ 1160.295474]  memory_failure+0x106/0xaa0
      [ 1160.296474]  ? security_capable+0x36/0x50
      [ 1160.297524]  hard_offline_page_store+0x43/0x80
      [ 1160.298684]  kernfs_fop_write_iter+0x11c/0x1b0
      [ 1160.299829]  new_sync_write+0xf9/0x160
      [ 1160.300810]  vfs_write+0x209/0x290
      [ 1160.301835]  ksys_write+0x4f/0xc0
      [ 1160.302718]  do_syscall_64+0x3b/0x90
      [ 1160.303664]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 1160.304981] RIP: 0033:0x7eff54b018b7

    As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is
    called from dump_page("hwpoison: unhandlable page") in get_any_page().
    The below explains the mechanism of the race:

      CPU 0                                       CPU 1

        memory_failure
          get_hwpoison_page
            get_any_page
              dump_page
                compound = PageCompound
                                                    free_pages_prepare
                                                      page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP
                folio_entire_mapcount
                  VM_BUG_ON_FOLIO(!folio_test_large(folio))

    So replace dump_page() with the safer pr_err().

    Link: https://lkml.kernel.org/r/20220427053220.719866-1-naoya.horiguchi@linux.dev
    Fixes: 74e8ee4708a8 ("mm: Turn head_compound_mapcount() into folio_entire_mapcount()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:22 -04:00
Aristeu Rozanski 81b7032292 mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 9595d76942b8714627d670a7e7ae543812c731ae
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 1 23:33:08 2022 -0500

    mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()

    Add back page_lock_anon_vma_read() as a wrapper.  This saves a few calls
    to compound_head().  If any callers were passing a tail page before,
    this would have failed to lock the anon VMA as page->mapping is not
    valid for tail pages.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:19 -04:00
Aristeu Rozanski 903b22e482 mm/rmap: Convert try_to_unmap() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due missing af28a988b313

commit 869f7ee6f6477341f859c8b0949ae81caf9ca7f3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Feb 15 09:28:49 2022 -0500

    mm/rmap: Convert try_to_unmap() to take a folio

    Change all three callers and the worker function try_to_unmap_one().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 5a8509634f mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit d6c75dc22c755c567838f12f12a16f2a323ebd4e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Feb 13 15:22:28 2022 -0500

    mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()

    Some of the callers already have the address_space and can avoid calling
    folio_mapping() and checking if the folio was already truncated.  Also
    add kernel-doc and fix the return type (in case we ever support folios
    larger than 4TB).

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:17 -04:00
Patrick Talbert 407ad35116 Merge: mm: backport folio support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/678

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests with a stock kernel test run for comparison

This backport includes the base folio patches *without* touching any subsystems.
Patches are mostly straight forward converting functions to use folios.

v2: merge conflict, dropped 78525c74d9e7d1a6ce69bd4388f045f6e474a20b as contradicts the fact we're trying to not do subsystems converting in this MR

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-03 10:59:25 +02:00
Waiman Long bd1fb2084e vsprintf: Make %pGp print the hex value
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073625

commit 23efd0804c0a869dfb1e78470f80a27251317b7e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue, 19 Oct 2021 15:26:21 +0100

    vsprintf: Make %pGp print the hex value

    All existing users of %pGp want the hex value as well as the decoded
    flag names.  This looks awkward (passing the same parameter to printf
    twice), so move that functionality into the core.  If we want, we
    can make that optional with flag arguments to %pGp in the future.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org

Signed-off-by: Waiman Long <longman@redhat.com>
2022-04-08 21:09:33 -04:00
Aristeu Rozanski 98caaaf947 mm/memcg: Convert mem_cgroup_uncharge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit bbc6b703b21963e909f633cf7718903ed5094319
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat May 1 20:42:23 2021 -0400

    mm/memcg: Convert mem_cgroup_uncharge() to take a folio

    Convert all the callers to call page_folio().  Most of them were already
    using a head page, but a few of them I can't prove were, so this may
    actually fix a bug.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Rafael Aquini f8768f6cd4 mm/hwpoison: fix error page recovered but reported "not recovered"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 046545a661af2beec21de7b90ca0e35f05088a81
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Tue Mar 22 14:44:06 2022 -0700

    mm/hwpoison: fix error page recovered but reported "not recovered"

    When an uncorrected memory error is consumed there is a race between the
    CMCI from the memory controller reporting an uncorrected error with a
    UCNA signature, and the core reporting an SRAR signature machine check
    when the data is about to be consumed.

    If the CMCI wins that race, the page is marked poisoned when
    uc_decode_notifier() calls memory_failure() and the machine check
    processing code finds the page already poisoned.  It calls
    kill_accessing_process() to make sure a SIGBUS is sent, but it returns
    the wrong error code.

    Console log looks like this:

      mce: Uncorrected hardware memory error in user-access at 3710b3400
      Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
      Memory failure: 0x3710b3: already hardware poisoned
      Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
      mce: Memory error not recovered

    kill_accessing_process() is supposed to return -EHWPOISON to notify that
    SIGBUS is already set to the process and kill_me_maybe() doesn't have to
    send it again.  But current code simply fails to do this, so fix it to
    make sure to work as intended.  This change avoids the noise message
    "Memory error not recovered" and skips duplicate SIGBUSs.

    [tony.luck@intel.com: reword some parts of commit message]

    Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
    Fixes: a3f5d80ea4 ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Youquan Song <youquan.song@intel.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:44 -04:00
Rafael Aquini a95243c329 mm: don't include <linux/dax.h> in <linux/mempolicy.h>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2064990

This patch is a backport of the following upstream commit:
commit 96c84dde362a3e4ff67a12eaac2ea6e88e963c07
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 5 13:35:30 2021 -0700

    mm: don't include <linux/dax.h> in <linux/mempolicy.h>

    Not required at all, and having this causes a huge kernel rebuild as
    soon as something in dax.h changes.

    Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-03-27 00:48:08 -04:00
Herton R. Krzesinski 7c794ec2d4 Merge: Backport page unpoisoning fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/490

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

This patchset fixes reference counting issues that still exist in RHEL9 and
can be reproduced by soft poisoning/unpoisoning along with fixes to prevent
silent corruption in tmpfs and shmem when a page is poisoned.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-16 22:47:32 +00:00
Herton R. Krzesinski 8626a20e91 Merge: x86/sgx: Update SGX subsystem code upto v5.16-rc5
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/230

```
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1920028
Upstream Status: master branch of tip.git: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
                 except one commit from the upstream linux.git.

Update SGX subsystem code upto v5.16-rc5 as requested by Intel. All the commits
except one are from the master branch of tip.git.

Except one, all the commits apply cleanly, no conflicts, no changes from tip.git.

8th commit ("x86/sgx: Fix free page accounting") is from the upstream and has a conflict.
The conflict is trivially resolved by a context change so this commit applies cleanly.
The resulting code is the same as in the upstream.

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
```

Approved-by: Dean Nelson <dnelson@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-11 19:12:38 +00:00
Aristeu Rozanski d5f97bda11 mm/hwpoison: fix unpoison_memory()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit bf181c582588f8f7406d52f2ee228539b465f173
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Jan 14 14:09:09 2022 -0800

    mm/hwpoison: fix unpoison_memory()

    After recent soft-offline rework, error pages can be taken off from
    buddy allocator, but the existing unpoison_memory() does not properly
    undo the operation.  Moreover, due to the recent change on
    __get_hwpoison_page(), get_page_unless_zero() is hardly called for
    hwpoisoned pages.  So __get_hwpoison_page() highly likely returns -EBUSY
    (meaning to fail to grab page refcount) and unpoison just clears
    PG_hwpoison without releasing a refcount.  That does not lead to a
    critical issue like kernel panic, but unpoisoned pages never get back to
    buddy (leaked permanently), which is not good.

    To (partially) fix this, we need to identify "taken off" pages from
    other types of hwpoisoned pages.  We can't use refcount or page flags
    for this purpose, so a pseudo flag is defined by hacking ->private
    field.  Someone might think that put_page() is enough to cancel
    taken-off pages, but the normal free path contains some operations not
    suitable for the current purpose, and can fire VM_BUG_ON().

    Note that unpoison_memory() is now supposed to cancel hwpoison events
    injected only by madvise() or
    /sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
    injection, so please don't try to use unpoison when testing with MCE
    injection.

    [lkp@intel.com: report build failure for ARCH=i386]

    Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Ding Hui <dinghui@sangfor.com.cn>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:53 -05:00
Aristeu Rozanski 779d924882 mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit c9fdc4d5487a16bd1f003fc8b66e91f88efb50e6
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Jan 14 14:09:06 2022 -0800

    mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE

    These action_page_types are no longer used, so remove them.

    Link: https://lkml.kernel.org/r/20211115084006.3728254-3-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Acked-by: Yang Shi <shy828301@gmail.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ding Hui <dinghui@sangfor.com.cn>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:53 -05:00
Aristeu Rozanski 831f0ed950 mm/hwpoison: mf_mutex for soft offline and unpoison
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit 91d005479e06392617bacc114509d611b705eaac
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Jan 14 14:09:02 2022 -0800

    mm/hwpoison: mf_mutex for soft offline and unpoison

    Patch series "mm/hwpoison: fix unpoison_memory()", v4.

    The main purpose of this series is to sync unpoison code to recent
    changes around how hwpoison code takes page refcount.  Unpoison should
    work or simply fail (without crash) if impossible.

    The recent works of keeping hwpoison pages in shmem pagecache introduce
    a new state of hwpoisoned pages, but unpoison for such pages is not
    supported yet with this series.

    It seems that soft-offline and unpoison can be used as general purpose
    page offline/online mechanism (not in the context of memory error).  I
    think that we need some additional works to realize it because currently
    soft-offline and unpoison are assumed not to happen so frequently (print
    out too many messages for aggressive usecases).  But anyway this could
    be another interesting next topic.

    v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/
    v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/
    v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/

    This patch (of 3):

    Originally mf_mutex is introduced to serialize multiple MCE events, but
    it is not that useful to allow unpoison to run in parallel with
    memory_failure() and soft offline.  So apply mf_mutex to soft offline
    and unpoison.  The memory failure handler and soft offline handler get
    simpler with this.

    Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev
    Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ding Hui <dinghui@sangfor.com.cn>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:53 -05:00
Aristeu Rozanski ce8d02be33 mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit 2a57d83c78f889bf3f54eede908d0643c40d5418
Author: Liu Shixin <liushixin2@huawei.com>
Date:   Fri Dec 24 21:12:58 2021 -0800

    mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()

    Hulk Robot reported a panic in put_page_testzero() when testing
    madvise() with MADV_SOFT_OFFLINE.  The BUG() is triggered when retrying
    get_any_page().  This is because we keep the MF_COUNT_INCREASED flag on
    the second try, but the refcount is not increased.

        page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
        ------------[ cut here ]------------
        kernel BUG at include/linux/mm.h:737!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 5 PID: 2135 Comm: sshd Tainted: G    B             5.16.0-rc6-dirty #373
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        RIP: release_pages+0x53f/0x840
        Call Trace:
          free_pages_and_swap_cache+0x64/0x80
          tlb_flush_mmu+0x6f/0x220
          unmap_page_range+0xe6c/0x12c0
          unmap_single_vma+0x90/0x170
          unmap_vmas+0xc4/0x180
          exit_mmap+0xde/0x3a0
          mmput+0xa3/0x250
          do_exit+0x564/0x1470
          do_group_exit+0x3b/0x100
          __do_sys_exit_group+0x13/0x20
          __x64_sys_exit_group+0x16/0x20
          do_syscall_64+0x34/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
        Modules linked in:
        ---[ end trace e99579b570fe0649 ]---
        RIP: 0010:release_pages+0x53f/0x840

    Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com
    Fixes: b94e02822d ("mm,hwpoison: try to narrow window race for free pages")
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:52 -05:00
Aristeu Rozanski 86e5dde68a mm, hwpoison: fix condition in free hugetlb page path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit e37e7b0b3bd52ec4f8ab71b027bcec08f57f1b3b
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Dec 24 21:12:45 2021 -0800

    mm, hwpoison: fix condition in free hugetlb page path

    When a memory error hits a tail page of a free hugepage,
    __page_handle_poison() is expected to be called to isolate the error in
    4kB unit, but it's not called due to the outdated if-condition in
    memory_failure_hugetlb().  This loses the chance to isolate the error in
    the finer unit, so it's not optimal.  Drop the condition.

    This "(p != head && TestSetPageHWPoison(head))" condition is based on
    the old semantics of PageHWPoison on hugepages (where the PG_hwpoison
    flag was set on a subpage), so it is no longer necessary.  Now that
    PG_hwpoison is set on the head page for hugepages, concurrent error
    events on different subpages of a single hugepage are prevented by
    TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb().
    So dropping the condition should not reopen the race window originally
    mentioned in commit b985194c8c ("hwpoison, hugetlb:
    lock_page/unlock_page does not match for handling a free hugepage").

    [naoya.horiguchi@linux.dev: fix "HardwareCorrupted" counter]
      Link: https://lkml.kernel.org/r/20211220084851.GA1460264@u2004

    Link: https://lkml.kernel.org/r/20211210110208.879740-1-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Fei Luo <luofei@unicloud.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: <stable@vger.kernel.org>    [5.14+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:52 -05:00
Aristeu Rozanski df0caf6d7f mm: hwpoison: handle non-anonymous THP correctly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit 4966455d9100236fd6dd72b0cd00818435fdb25d
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Nov 5 13:41:14 2021 -0700

    mm: hwpoison: handle non-anonymous THP correctly

    Currently hwpoison doesn't handle non-anonymous THP, but THP support
    for tmpfs and read-only file cache has been available since v4.8.
    Such pages can be offlined by splitting the THP, just like anonymous
    THP.

    Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:52 -05:00
Aristeu Rozanski 8ec612ee47 mm: hwpoison: refactor refcount check handling
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1972220
Tested: with reproducer

commit dd0f230a0a80ff396c7ce587f16429f2a8131344
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Nov 5 13:41:07 2021 -0700

    mm: hwpoison: refactor refcount check handling

    Memory failure will report failure if the page still has an extra
    pinned refcount, other than the one from hwpoison, after the handler
    is done.  The check is not actually necessary for all handlers, so
    move it into the specific handlers.  This makes the following patch,
    which keeps shmem pages in the page cache, easier.

    There may be expected extra pin for some cases, for example, when the
    page is dirty and in swapcache.

    Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-02-07 11:31:52 -05:00
Vladis Dronov c5c23f81a4 x86/sgx: Hook arch_memory_failure() into mainline code
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1920028
Upstream Status: master branch of tip.git: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

commit 03b122da74b22fbe7cd98184fa5657a9ce13970c
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Oct 26 15:00:48 2021 -0700

    x86/sgx: Hook arch_memory_failure() into mainline code

    Add a call inside memory_failure() to call the arch specific code
    to check if the address is an SGX EPC page and handle it.

    Note the SGX EPC pages do not have a "struct page" entry, so the hook
    goes in at the same point as the device mapping hook.

    Pull the call to acquire the mutex earlier so the SGX errors are also
    protected.

    Make set_mce_nospec() skip SGX pages when trying to adjust
    the 1:1 map.

    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Tested-by: Reinette Chatre <reinette.chatre@intel.com>
    Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
2021-12-08 16:47:59 +01:00
Rafael Aquini a77ba4ce70 mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit eac96c3efdb593df1a57bb5b95dbe037bfa9a522
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Oct 28 14:36:11 2021 -0700

    mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

    When handling a shmem page fault, a THP with a corrupted subpage can
    be PMD-mapped if certain conditions are satisfied.  But the kernel is
    supposed to send SIGBUS when trying to map a hwpoisoned page.

    There are two paths which may do PMD map: fault around and regular
    fault.

    Before commit f9ce0be71d ("mm: Cleanup faultaround and finish_fault()
    codepaths") the fault-around path was even worse: the THP could be
    PMD-mapped as long as the VMA fit, regardless of which subpage was
    accessed and corrupted.  After that commit, the THP can still be
    PMD-mapped as long as the head page is not corrupted.

    In the regular fault path, the THP can be PMD-mapped as long as the
    corrupted page is not accessed and the VMA fits.

    This loophole could be fixed by iterating over every subpage to check
    whether any of them is hwpoisoned, but that is somewhat costly in the
    page fault path.

    So introduce a new page flag called HasHWPoisoned on the first tail
    page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
    subpage of THP is found hwpoisoned by memory failure and after the
    refcount is bumped successfully, then cleared when the THP is freed or
    split.

    The soft offline path doesn't need this, since the soft offline
    handler simply marks a subpage hwpoisoned once the subpage has been
    migrated successfully.  But shmem THPs were never split and migrated
    at all.

    Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
    Fixes: 800d8c63b2 ("shmem: add huge pages support")
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:14 -05:00
Rafael Aquini dc92a78efb mm: hwpoison: remove the unnecessary THP check
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c7cb42e94473aafe553c0f2a3d8ca904599399ed
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Oct 28 14:36:07 2021 -0700

    mm: hwpoison: remove the unnecessary THP check

    When handling a THP, hwpoison checked whether the THP was in the
    allocation or free stage, since hwpoison might mistreat it as a
    hugetlb page.  After commit 415c64c145 ("mm/memory-failure: split thp
    earlier in memory error handling") the problem has been fixed, so this
    check is no longer needed.  Remove it.  The side effect of the removal
    is that hwpoison may report "unsplit THP" instead of "unknown error"
    for shmem THP.  That does not seem like a big deal.

    The following patch "mm: filemap: check if THP has hwpoisoned subpage
    for PMD page fault" depends on this one; it fixes shmem THPs with
    hwpoisoned subpage(s) being wrongly PMD-mapped.  So this patch needs
    to be backported to -stable as well.

    Link: https://lkml.kernel.org/r/20211020210755.23964-2-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:13 -05:00
Rafael Aquini 50ec26a05d mm/memory_failure: fix the missing pte_unmap() call
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5c91c0e77b8f2681e2b269c8abb4c5acef434d5b
Author: Qi Zheng <zhengqi.arch@bytedance.com>
Date:   Fri Sep 24 15:44:03 2021 -0700

    mm/memory_failure: fix the missing pte_unmap() call

    The paired pte_unmap() call is missing before
    dev_pagemap_mapping_shift() returns.  Fix it.

    David says:
     "I guess this code never runs on 32bit / highmem, that's why we didn't
      notice so far".

    [akpm@linux-foundation.org: cleanup]

    Link: https://lkml.kernel.org/r/20210923122642.4999-1-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:58 -05:00
Rafael Aquini 8675e9c151 mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit acfa299a4a63a58e5e81a87cb16798f20d35f7d7
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Fri Sep 24 15:43:20 2021 -0700

    mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()

    Commit fcc00621d8 ("mm/hwpoison: retry with shake_page() for
    unhandlable pages") changed the return value of __get_hwpoison_page() to
    retry for transiently unhandlable cases.  However, __get_hwpoison_page()
    currently fails to properly judge buddy pages as handlable, so hard/soft
    offline for buddy pages always fail as "unhandlable page".  This is
    totally regrettable.

    So let's add is_free_buddy_page() in HWPoisonHandlable(), so that
    __get_hwpoison_page() returns different return values between buddy
    pages and unhandlable pages as intended.

    Link: https://lkml.kernel.org/r/20210909004131.163221-1-naoya.horiguchi@linux.dev
    Fixes: fcc00621d8 ("mm/hwpoison: retry with shake_page() for unhandlable pages")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:52 -05:00
Rafael Aquini 2e0da4572f mm/migrate: enable returning precise migrate_pages() success count
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5ac95884a784e822b8cbe3d4bd6e9f96b3b71e3f
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:13 2021 -0700

    mm/migrate: enable returning precise migrate_pages() success count

    Under normal circumstances, migrate_pages() returns the number of pages
    migrated.  In error conditions, it returns an error code.  When returning
    an error code, there is no way to know how many pages were migrated or not
    migrated.

    Make migrate_pages() return how many pages are demoted successfully for
    all cases, including when encountering errors.  Page reclaim behavior will
    depend on this in subsequent patches.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:08 -05:00
Rafael Aquini 51b404734c mm: fix panic caused by __page_handle_poison()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit f87060d345232c7d855167a43faf006e24afa999
Author: Michael Wang <yun.wang@linux.alibaba.com>
Date:   Thu Sep 2 14:58:40 2021 -0700

    mm: fix panic caused by __page_handle_poison()

    In commit 510d25c92e ("mm/hwpoison: disable pcp for
    page_handle_poison()"), __page_handle_poison() was introduced, and if we
    mark:

    RET_A = dissolve_free_huge_page();
    RET_B = take_page_off_buddy();

    then __page_handle_poison() was supposed to return TRUE when
    RET_A == 0 && RET_B == TRUE.

    But it failed to take care of the case where RET_A is -EBUSY or
    -ENOMEM: returning that value as a bool turns the error code into
    TRUE, which breaks the original logic.

    The result is a huge page on the freelist that is nevertheless
    treated as poisoned, leading to the final panic:

      kernel BUG at mm/internal.h:95!
      invalid opcode: 0000 [#1] SMP PTI
      skip...
      RIP: 0010:set_page_refcounted mm/internal.h:95 [inline]
      RIP: 0010:remove_hugetlb_page+0x23c/0x240 mm/hugetlb.c:1371
      skip...
      Call Trace:
       remove_pool_huge_page+0xe4/0x110 mm/hugetlb.c:1892
       return_unused_surplus_pages+0x8d/0x150 mm/hugetlb.c:2272
       hugetlb_acct_memory.part.91+0x524/0x690 mm/hugetlb.c:4017

    This patch replaces 'bool' with 'int' to handle RET_A correctly.

    Link: https://lkml.kernel.org/r/61782ac6-1e8a-4f6f-35e6-e94fce3b37f5@linux.alibaba.com
    Fixes: 510d25c92e ("mm/hwpoison: disable pcp for page_handle_poison()")
    Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Abaci <abaci@linux.alibaba.com>
    Cc: <stable@vger.kernel.org>    [5.14+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:01 -05:00
Rafael Aquini 3f004c5892 mm: hwpoison: dump page for unhandlable page
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 941ca063eb8ed01e66336b1f493e95b107024bc8
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Sep 2 14:58:37 2021 -0700

    mm: hwpoison: dump page for unhandlable page

    Currently only a very simple message is shown for an unhandlable
    page (e.g. a non-LRU page), like:

        soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

    That is not very helpful for further debugging; calling dump_page()
    shows more useful information.

    Call dump_page() in get_any_page() so the call is not duplicated in a
    couple of different places.  It may be called with pcp disabled and
    while holding the memory hotplug lock, but that should not be a big
    deal since the hwpoison handler is not called very often.

    [shy828301@gmail.com: remove redundant pr_info per Naoya Horiguchi]
      Link: https://lkml.kernel.org/r/20210824020946.195257-3-shy828301@gmail.com

    Link: https://lkml.kernel.org/r/20210819054116.266126-3-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Mackey <tdmackey@twitter.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:00 -05:00
Rafael Aquini 59d67e090c mm: hwpoison: don't drop slab caches for offlining non-LRU page
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit d0505e9f7dcec85da6634ec66da2b17656ee177b
Author: Yang Shi <shy828301@gmail.com>
Date:   Thu Sep 2 14:58:31 2021 -0700

    mm: hwpoison: don't drop slab caches for offlining non-LRU page

    In the current implementation of soft offline, if a non-LRU page is
    encountered, all the slab caches are dropped in order to free the
    page and offline it.  But if the page is not a slab page, all that
    effort is wasted.  And even if it is a slab page, there is no
    guarantee the page can be freed at all.

    The side effect and cost are quite high, however.  It not only drops
    the slab caches, it may also drop a significant amount of page cache
    associated with inode caches.  It can wipe out most of the working
    set just to offline a single page.  And the offline is not guaranteed
    to succeed at all; the success rate for real-life workloads is
    doubtful.

    Furthermore, the worst consequence is that the system may be locked
    up and unusable, since the page cache release may queue a huge amount
    of work for memcg release.

    We actually ran into such an unpleasant case in our production
    environment.  First, the workqueue running memory_failure_work_func
    locked up as below:

        BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
        Showing busy workqueues and worker pools:
        workqueue events: flags=0x0
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
          in-flight: 409271:memory_failure_work_func
          pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
        workqueue mm_percpu_wq: flags=0x8
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
          pending: vmstat_update
        workqueue cgroup_destroy: flags=0x0
          pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
            pending: css_release_work_fn

    There were over 12K css_release_work_fn queued, and this caused a few
    lockups due to the contention of worker pool lock with IRQ disabled, for
    example:

        NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
        Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
        CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
        Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
        Workqueue: events memory_failure_work_func
        RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
        Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
        RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
        RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
        RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
        R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
        R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
        FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
        Call Trace:
         __queue_work+0xd6/0x3c0
         queue_work_on+0x1c/0x30
         uncharge_batch+0x10e/0x110
         mem_cgroup_uncharge_list+0x6d/0x80
         release_pages+0x37f/0x3f0
         __pagevec_release+0x1c/0x50
         __invalidate_mapping_pages+0x348/0x380
         inode_lru_isolate+0x10a/0x160
         __list_lru_walk_one+0x7b/0x170
         list_lru_walk_one+0x4a/0x60
         prune_icache_sb+0x37/0x50
         super_cache_scan+0x123/0x1a0
         do_shrink_slab+0x10c/0x2c0
         shrink_slab+0x1f1/0x290
         drop_slab_node+0x4d/0x70
         soft_offline_page+0x1ac/0x5b0
         memory_failure_work_func+0x6a/0x90
         process_one_work+0x19e/0x340
         worker_thread+0x30/0x360
         kthread+0x116/0x130

    The lockup made the machine quite unusable.  It also wiped out most
    of the working set: the reclaimable slab caches were reduced from 12G
    to 300MB, and the page caches decreased from 17G to 4G.

    But the most disappointing thing is that all this effort does not
    take the page offline; it just returns:

        soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

    The aggressive behavior for non-LRU pages doesn't pay off, so it
    doesn't make much sense to keep it, considering the terrible side
    effects.

    Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reported-by: David Mackey <tdmackey@twitter.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:59 -05:00
Rafael Aquini e7a9f855eb mm/hwpoison: fix some obsolete comments
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit a21c184fe25eab36fb6efabae55333452171d53b
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 2 14:58:28 2021 -0700

    mm/hwpoison: fix some obsolete comments

    Since commit cb731d6c62 ("vmscan: per memory cgroup slab shrinkers"),
    shrink_node_slabs is renamed to drop_slab_node.  And doit argument is
    changed to forcekill since commit 6751ed65dc ("x86/mce: Fix
    siginfo_t->si_addr value for non-recoverable memory faults").

    Link: https://lkml.kernel.org/r/20210814105131.48814-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:58 -05:00
Rafael Aquini 64a6dffc94 mm/hwpoison: change argument struct page **hpagep to *hpage
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit ed8c2f492d4e7248a9c0493c444c47bed84d345d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 2 14:58:25 2021 -0700

    mm/hwpoison: change argument struct page **hpagep to *hpage

    It's unnecessary to pass in a struct page **hpagep because it's never
    modified.  Change it to *hpage to simplify the code.

    Link: https://lkml.kernel.org/r/20210814105131.48814-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:57 -05:00
Rafael Aquini b4810f66e5 mm/hwpoison: fix potential pte_unmap_unlock pte error
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit ea3732f7a1cf636284388988d1a1e56d5cba6044
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 2 14:58:22 2021 -0700

    mm/hwpoison: fix potential pte_unmap_unlock pte error

    If the first pte equals poisoned_pfn, i.e. check_hwpoisoned_entry()
    returns 1, the wrong pointer (ptep - 1) would be passed to
    pte_unmap_unlock().

    Link: https://lkml.kernel.org/r/20210814105131.48814-3-linmiaohe@huawei.com
    Fixes: ad9c59c24095 ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:56 -05:00
Rafael Aquini 4930f654cc mm/hwpoison: remove unneeded variable unmap_success
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit ae611d072c5c2968e2cc29431cf58094d8971b94
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Sep 2 14:58:19 2021 -0700

    mm/hwpoison: remove unneeded variable unmap_success

    Patch series "Cleanups and fixup for hwpoison"

    This series contains cleanups to remove unneeded variable, fix some
    obsolete comments and so on.  Also we fix potential pte_unmap_unlock on
    wrong pte.  More details can be found in the respective changelogs.

    This patch (of 4):

    unmap_success is used to indicate whether the page was successfully
    unmapped, but it is irrelevant for ZONE_DEVICE pages and is always
    true here.  Remove this unneeded variable.

    Link: https://lkml.kernel.org/r/20210814105131.48814-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210814105131.48814-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:56 -05:00
Rafael Aquini 5a88d17b6c mm: Fix comments mentioning i_mutex
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 9608703e488cf7a711c42c7ccd981c32377f7b78
Author: Jan Kara <jack@suse.cz>
Date:   Mon Apr 12 15:50:21 2021 +0200

    mm: Fix comments mentioning i_mutex

    inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
    comments still mentioning i_mutex.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Acked-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:21 -05:00
Naoya Horiguchi fcc00621d8 mm/hwpoison: retry with shake_page() for unhandlable pages
HWPoisonHandlable() sometimes returns false for typical user pages due
to races with ordinary memory events like transfers over LRU lists.  This
causes failures in hwpoison handling.

There is retry code for such cases, but it does not work because the
retry loop reaches the retry limit too quickly, before the page settles
down to a handlable state.  Let get_any_page() call shake_page() to fix
it.

[naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached]
  Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev

Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev
Fixes: 25182f05ff ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>		[5.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20 11:31:42 -07:00
Zhen Lei 041711ce7c mm: fix spelling mistakes
Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness

Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:02 -07:00
Hugh Dickins 36af67370e mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
before the page is accounted as unmapped.  It is unlikely to coincide with
hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would
do well to use it.

Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi 1fb08ac63b mm: rmap: make try_to_unmap() void function
Currently try_to_unmap() returns a bool by checking page_mapcount(),
but this may yield a false positive since page_mapcount() doesn't check
all subpages of a compound page.  total_mapcount() could be used
instead, but it is more expensive since it traverses all subpages.

Actually, most callers of try_to_unmap() don't care about the return
value at all.  So we just need to check whether the page is still mapped,
via page_mapped(), when necessary.  And page_mapped() does bail out early
when it finds a mapped subpage.
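The new calling convention can be sketched like this. Hypothetical stubs throughout: the one-field `struct page` and the always-succeeding `try_to_unmap()` are stand-ins; the shape of the caller is what the commit changes.

```c
#include <assert.h>
#include <stdbool.h>

struct page { int mapcount; };  /* stand-in for the real page mapcount */

/* try_to_unmap() now returns void; real unmapping may partially fail. */
static void try_to_unmap(struct page *p) { p->mapcount = 0; }

static bool page_mapped(const struct page *p) { return p->mapcount > 0; }

/* Caller pattern after the change: unmap, then query the mapping state
 * explicitly instead of trusting a bool return value. */
static bool unmap_success_after(struct page *p)
{
	try_to_unmap(p);
	return !page_mapped(p);
}
```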

Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Naoya Horiguchi 510d25c92e mm/hwpoison: disable pcp for page_handle_poison()
Recent changes by patch "mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists" makes kernels determine whether to use pcp by
pcp_allowed_order(), which breaks soft-offline for hugetlb pages.

Soft-offline dissolves a migration source page, then removes it from the
buddy free list, so it is assumed that any subpage of the soft-offlined
hugepage is recognized as a buddy page just after returning from
dissolve_free_huge_page().  pcp_allowed_order() returns true for hugetlb,
so this assumption no longer holds.

So disable pcp during dissolve_free_huge_page() and take_page_off_buddy()
to prevent soft-offlined hugepages from linking to pcp lists.
Soft-offline events should not be common, so the impact on performance
should be minimal.  And I think that the optimization of Mel's patch
could benefit hugetlb, so zone_pcp_disable() is called only in hwpoison
context.
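The disable/enable window described above can be sketched as below. Everything here is a hypothetical stand-in (the counter, the stubbed dissolve returning -EBUSY when pcplists are live); only the bracketing of the dissolve step by the pcp disable window reflects the patch.

```c
#include <assert.h>

static int pcp_disabled;	/* stand-in for zone pcplist state */

static void zone_pcp_disable(void) { pcp_disabled++; }
static void zone_pcp_enable(void)  { pcp_disabled--; }

/* Stub: succeeds only while pcplists are disabled, since only then do
 * freed subpages land directly on the buddy freelists. */
static int dissolve_free_huge_page_stub(void)
{
	return pcp_disabled ? 0 : -16;	/* -EBUSY */
}

static int page_handle_poison_window(void)
{
	int ret;

	zone_pcp_disable();	/* freed subpages go straight to buddy */
	ret = dissolve_free_huge_page_stub();
	zone_pcp_enable();
	return ret;
}
```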

Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Naoya Horiguchi 0ed950d1f2 mm,hwpoison: make get_hwpoison_page() call get_any_page()
__get_hwpoison_page() could fail to grab a refcount due to some race
condition, so it's helpful if we can handle it by retrying.  We already
have retry logic, so make get_hwpoison_page() call get_any_page() when
called from memory_failure().

As a result, get_hwpoison_page() can return negative values (i.e. error
codes), so some callers are also changed to handle error cases.
soft_offline_page() does nothing for -EBUSY because that is sufficient
and userspace can easily handle it.  unpoison_memory() is also
unchanged because it is broken and needs thorough fixes (to be done
later).
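The widened return convention changes the caller side, which can be sketched like so. The stubs are invented for illustration; the real function distinguishes a held reference (1), no reference needed (0), and errors (negative).

```c
#include <assert.h>

#define EBUSY 16

/* Stub: 1 = refcount taken, negative = error (e.g. a race lost). */
static int get_hwpoison_page_stub(int raced)
{
	return raced ? -EBUSY : 1;
}

static int soft_offline_page_stub(int raced)
{
	int ret = get_hwpoison_page_stub(raced);

	if (ret < 0)
		return ret;	/* propagate; userspace can retry on -EBUSY */
	/* ret == 1: we hold a reference and can proceed with offlining */
	return 0;
}
```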

Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:56 -07:00
Naoya Horiguchi a3f5d80ea4 mm,hwpoison: send SIGBUS with error virtual address
Currently, an action-required MCE on an already-hwpoisoned address does
send a SIGBUS to the current process, but the SIGBUS doesn't convey the
error virtual address.  That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
which walks the page tables to find the error virtual address.  It could
find multiple virtual addresses for the same error page, and it seems hard
to tell which one is correct.  But that's rare, and sending an incorrect
virtual address is better than no address at all.  So let's report the
first found virtual address for now.
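The "first found address wins" behavior can be sketched with a flat array in place of a real page-table walk. Entirely hypothetical: the array maps a page index to a pfn, and a hit reports that index's virtual address.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Walk a toy "page table" (index -> pfn) and report the first virtual
 * address that maps the poisoned pfn, or -1 if none is found. */
static long find_error_vaddr(const unsigned long *pfn_of_page, size_t npages,
			     unsigned long bad_pfn)
{
	for (size_t i = 0; i < npages; i++)
		if (pfn_of_page[i] == bad_pfn)
			return (long)(i * PAGE_SIZE);	/* first hit wins */
	return -1;
}
```

Even when the same pfn is mapped at several addresses, reporting the first one gives hwpoison-aware applications something usable, which the commit argues beats reporting nothing.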

[naoya.horiguchi@nec.com: fix walk_page_range() return]
  Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp

Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Naoya Horiguchi ea6d063010 mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
Currently me_huge_page() temporarily unlocks the page to perform some
actions, then locks it again later.  My testcase (which calls hard-offline
on some tail page in a hugetlb, then accesses an address in the hugetlb
range) showed that the page allocation code detects this page lock on a
buddy page and prints a "BUG: Bad page state" message.

check_new_page_bad() does not consider a page with __PG_HWPOISON as a bad
page, so this flag works as a kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page.  So stop locking the page again.  Actions to be taken depend on the
page type of the error, so page unlocking should be done in the ->action()
callbacks.  So let's make that the assumption and change all existing
callbacks accordingly.

Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Aili Yao 47af12bae1 mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in infinite loop of MCEs.

Currently memory_failure() returns 0 if it is called for an already
hwpoisoned page, so the caller, kill_me_maybe(), could return without
sending SIGBUS to the current process.  An action-required MCE is raised
when the current process accesses the broken memory, so no SIGBUS means
that the current process keeps running, accesses the error page again
soon, and runs into the MCE loop.

This issue can arise for example in the following scenarios:

 - Two or more threads access to the poisoned page concurrently. If
   local MCE is enabled, MCE handler independently handles the MCE
   events. So there's a race among MCE events, and the second or latter
   threads fall into the situation in question.

 - If there was a precedent memory error event and memory_failure() for
   the event failed to unmap the error page for some reason, the
   subsequent memory access to the error page triggers the MCE loop
   situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows memory error
handler to control how it sends signals to userspace.  And make sure
that any process touching a hwpoisoned page should get a SIGBUS even in
"already hwpoisoned" path of memory_failure() as is done in page fault
path.
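The caller-side consequence of the new error code can be sketched as below. The stubs and the `sigbus_sent` flag are invented; the shape matches the fix: on -EHWPOISON the caller itself must deliver the SIGBUS rather than silently returning.

```c
#include <assert.h>

#define EHWPOISON 133

/* Stub: now returns an error when the page was already hwpoisoned. */
static int memory_failure_stub(int already_poisoned)
{
	return already_poisoned ? -EHWPOISON : 0;
}

/* kill_me_maybe()-style caller: a SIGBUS must go out even on the
 * "already hwpoisoned" path, or the task re-executes the faulting
 * access and loops on MCEs. */
static int kill_me_maybe_stub(int already_poisoned, int *sigbus_sent)
{
	int ret = memory_failure_stub(already_poisoned);

	if (ret == -EHWPOISON) {
		*sigbus_sent = 1;  /* e.g. force_sig_mceerr(BUS_MCEERR_AR, ...) */
		return 0;
	}
	return ret;
}
```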

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
Tony Luck 171936ddaf mm/memory-failure: use a mutex to avoid memory_failure() races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.

I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1].  I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner.  I
know that this is not a perfect solution, but it should work for some
typical cases.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/

This patch (of 2):

There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page.  Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.

But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in a
mutex.

[akpm@linux-foundation.org: make mf_mutex local to memory_failure()]
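The serialization described above can be sketched with a function-local static mutex, as the patch does with mf_mutex. The bookkeeping (handled_count, the return values) is a stand-in for illustration only.

```c
#include <assert.h>
#include <pthread.h>

static int handled_count;	/* stand-in for "page already handled" state */

static int memory_failure_core(void)
{
	static pthread_mutex_t mf_mutex = PTHREAD_MUTEX_INITIALIZER;
	int ret;

	pthread_mutex_lock(&mf_mutex);
	/* Test-and-set of the HWPoison flag, unmapping, SIGBUS delivery...
	 * all serialized: only one CPU sees "not handled yet". */
	ret = handled_count++ ? -133 /* -EHWPOISON: someone beat us to it */ : 0;
	pthread_mutex_unlock(&mf_mutex);
	return ret;
}
```

With the mutex, a second CPU that consumed the same poison observes a distinct result instead of a premature "success" while the first CPU is still mid-handling.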

Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24 19:40:54 -07:00
yangerkun e8675d291a mm/memory-failure: make sure wait for page writeback in memory_failure
Our syzkaller triggered the "BUG_ON(!list_empty(&inode->i_wb_list))" in
clear_inode:

  kernel BUG at fs/inode.c:519!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
  CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
  Hardware name: linux,dummy-virt (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO)
  pc : clear_inode+0x280/0x2a8
  lr : clear_inode+0x280/0x2a8
  Call trace:
    clear_inode+0x280/0x2a8
    ext4_clear_inode+0x38/0xe8
    ext4_free_inode+0x130/0xc68
    ext4_evict_inode+0xb20/0xcb8
    evict+0x1a8/0x3c0
    iput+0x344/0x460
    do_unlinkat+0x260/0x410
    __arm64_sys_unlinkat+0x6c/0xc0
    el0_svc_common+0xdc/0x3b0
    el0_svc_handler+0xf8/0x160
    el0_svc+0x10/0x218
  Kernel panic - not syncing: Fatal exception

A crash dump of this problem shows that someone called __munlock_pagevec
to clear the page LRU without lock_page: do_mmap -> mmap_region -> do_munmap
-> munlock_vma_pages_range -> __munlock_pagevec.

As a result, memory_failure will call identify_page_state without
wait_on_page_writeback, and truncate_error_page will then clear the
mapping of this page.  end_page_writeback won't call
sb_clear_inode_writeback to clear inode->i_wb_list.  That triggers the
BUG_ON in clear_inode!

Fix it by checking PageWriteback too, to help determine whether we should
skip wait_on_page_writeback.
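The strengthened condition can be sketched as a predicate. A hypothetical sketch, not the kernel source: the two booleans stand in for PageLRU() and PageWriteback().

```c
#include <assert.h>
#include <stdbool.h>

/* Before the fix, only the LRU state was consulted; a page taken off the
 * LRU (e.g. by __munlock_pagevec without lock_page) could still be under
 * writeback, so PageWriteback must be considered too. */
static bool must_wait_writeback(bool page_lru, bool page_writeback)
{
	return page_lru || page_writeback;
}
```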

Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
Fixes: 0bc1f8b068 ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Naoya Horiguchi 25182f05ff mm,hwpoison: fix race with hugetlb page allocation
When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.

So make __get_hwpoison_page() check the page status a bit more for hugetlb
pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not in a transient
state.  It's notable that another new function, HWPoisonHandlable(), is
helpful to prevent a race against other transient page states (like a
generic compound page just before PageHuge becomes true).
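The "only take a reference in a stable state" idea can be sketched as below. The three-flag `struct page` and the combined predicate are invented stand-ins; the real code splits the checks between HWPoisonHandlable() and get_hwpoison_huge_page() under hugetlb_lock.

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	bool lru;			/* ordinary user page on an LRU list */
	bool hugetlb;			/* PageHuge */
	bool huge_freed_or_active;	/* hugetlb-specific stable states */
};

static bool hwpoison_handlable(const struct page *p)
{
	return p->lru || (p->hugetlb && p->huge_freed_or_active);
}

/* Refuse to grab a reference while the page is in a transient state,
 * e.g. a compound page mid-initialization; the caller retries instead. */
static bool get_hwpoison_page_check(const struct page *p)
{
	return hwpoison_handlable(p);
}
```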

Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes: ead07f6a86 ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>	[5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Jane Chu 4d75136be8 mm/memory-failure: unnecessary amount of unmapping
It appears that unmap_mapping_range() actually takes a 'size' as its third
argument rather than a location; the current calling convention causes an
unnecessary amount of unmapping to occur.
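The argument confusion can be shown arithmetically. A sketch, not kernel code: `unmapped_end()` just computes the exclusive end of the hole that a (start, holelen) pair covers.

```c
#include <assert.h>

/* unmap_mapping_range()'s third argument is a length ("holelen"),
 * not an end offset. */
static unsigned long unmapped_end(unsigned long start, unsigned long holelen)
{
	return start + holelen;	/* exclusive end of the unmapped hole */
}

/* Before the fix the caller effectively passed "start + size" as the
 * length, so the hole ran to start + (start + size): far too much. */
static unsigned long buggy_unmapped_end(unsigned long start, unsigned long size)
{
	return unmapped_end(start, start + size);
}
```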

Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com
Fixes: 6100e34b25 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:44 -07:00
Dan Williams 34dc45be45 mm: fix memory_failure() handling of dax-namespace metadata
Given that 'struct dev_pagemap' spans both data pages and metadata pages,
be careful to consult the altmap, if present, to delineate metadata.  In
fact the pfn_first() helper already identifies the first valid data pfn,
so export that helper for other code paths via pgmap_pfn_valid().

Other usages of get_dev_pagemap() are not a concern because those operate
on known data pfns, having been looked up by get_user_pages().  I.e.
metadata pfns are never user mapped.

Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 6100e34b25 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:00 -08:00
Aili Yao 30c9cf4927 mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events
When an uncorrected memory error is triggered by the process that accessed
the faulty address, it is an Action Required case for the current process
only; for other processes sharing the same page it is Action Optional.
Usually killing the current process is sufficient; other processes sharing
the same page will get signaled when they actually touch the poisoned
page.

But there is another scenario in which other processes sharing the same
page want to be signaled early, with PF_MCE_EARLY set.  In this case, we
should put them on the kill list and signal BUS_MCEERR_AO to them.

So in this patch, task_early_kill checks the current process if
force_early is set, and if the task is not current, the code falls back to
find_early_kill_thread() to check whether there is a PF_MCE_EARLY process
that cares about the error.

In kill_proc(), BUS_MCEERR_AR is only sent to current; other processes on
the kill list are signaled with BUS_MCEERR_AO.

Link: https://lkml.kernel.org/r/20210122132424.313c8f5f.yaoaili@kingsoft.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:32 -08:00
Dan Williams dad4e5b390 mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.

Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn, the kernel hangs at dax-device shutdown due to a leaked reference.

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a613 ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Oscar Salvador 6696d2a6f3 mm,hwpoison: fix printing of page flags
Format %pG expects a lower case 'p' in order to print the flags.
Fix it.

Link: https://lkml.kernel.org/r/20210108085202.4506-1-osalvador@suse.de
Fixes: 8295d535e2 ("mm,hwpoison: refactor get_any_page")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-12 18:12:54 -08:00
Oscar Salvador 3f4b815a43 mm,hwpoison: return -EBUSY when migration fails
Currently, we return -EIO when we fail to migrate the page.

Migrations' failures are rather transient as they can happen due to
several reasons, e.g: high page refcount bump, mapping->migrate_page
failing etc.  All meaning that at that time the page could not be
migrated, but that has nothing to do with an EIO error.

Let us return -EBUSY instead, as we do in case we failed to isolate the
page.

While at it, let us remove the "ret" print as its value does not change.

Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador 1e8aaedb18 mm,memory_failure: always pin the page in madvise_inject_error
madvise_inject_error() uses get_user_pages_fast to translate the address
we specified to a page.  After [1], we drop the extra reference count for
memory_failure() path.  That commit says that memory_failure wanted to
keep the pin in order to take the page out of circulation.

The truth is that we need to keep the page pinned, otherwise the page
might be re-used after the put_page() and we can end up messing with
someone else's memory.

E.g:

CPU0
process X					CPU1
 madvise_inject_error
  get_user_pages
   put_page
					page gets reclaimed
					process Y allocates the page
  memory_failure
   // We mess with process Y memory

madvise() is meant to operate on the caller's own address space, so
messing with pages that do not belong to us seems the wrong thing to do.
To avoid that, let us keep the page pinned for memory_failure as well.

Pages for DAX mappings will release this extra refcount in
memory_failure_dev_pagemap.

[1] ("23e7b5c2e271: mm, madvise_inject_error:
      Let memory_failure() optionally take a page reference")

Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
Fixes: 23e7b5c2e2 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador 47e431f43b mm,hwpoison: remove drain_all_pages from shake_page
get_hwpoison_page already drains pcplists, previously disabling them when
trying to grab a refcount.  We do not need shake_page to take care of it
anymore.

Link: https://lkml.kernel.org/r/20201204102558.31607-4-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Qian Cai <qcai@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador 2f7141600d mm,hwpoison: disable pcplists before grabbing a refcount
Currently, we have a sort of retry mechanism to make sure pages in
pcp-lists are spilled to the buddy system, so we can handle those.

We can save ourselves these extra checks with the new disable-pcplist
mechanism that is available with [1].

zone_pcplist_disable makes sure to 1) disable pcplists, so any page that
is freed from that point onwards will end up in the buddy system, and 2)
drain pcplists, so pages that are already in pcplists are spilled to
buddy.

With that, we can make a common entry point for grabbing a refcount from
both soft_offline and memory_failure paths that is guarded by
zone_pcplist_disable/zone_pcplist_enable.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/

Link: https://lkml.kernel.org/r/20201204102558.31607-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <qcai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador 8295d535e2 mm,hwpoison: refactor get_any_page
Patch series "HWPoison: Refactor get page interface", v2.

This patch (of 3):

When we want to grab a refcount via get_any_page, we call __get_any_page
that calls get_hwpoison_page to get the actual refcount.

get_any_page() is only there because we have a sort of retry mechanism in
case the page we met is unknown to us or if we raced with an allocation.

Also __get_any_page() prints some messages about the page type in case the
page was a free page or the page type was unknown, but if anything, we
only need to print a message in case the pagetype was unknown, as that is
reporting an error down the chain.

Let us merge get_any_page() and __get_any_page(), and let the message be
printed in soft_offline_page.  While we are at it, we can also remove the
'pfn' parameter as it is no longer used.

Link: https://lkml.kernel.org/r/20201204102558.31607-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201204102558.31607-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Cc: Qian Cai <qcai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador a8b2c2ce89 mm,hwpoison: take free pages off the buddy freelists
The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that are gatekeeper for poisoned pages.  Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this the same way we did for soft_offline [2], taking the page
off the buddy freelist so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory
stress system.

Just for a matter of reference, I put a dump_page() in compaction_alloc()
to trigger for HWPoison pages:

    page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
    flags: 0xfffffc0800000(hwpoison)
    raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
    raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
    page dumped because: compaction_alloc

    CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G            E     5.9.0-rc2-mm1-1-default+ #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
     dump_stack+0x6d/0x8b
     compaction_alloc+0xb2/0xc0
     migrate_pages+0x2a6/0x12a0
     compact_zone+0x5eb/0x11c0
     proactive_compact_node+0x89/0xf0
     kcompactd+0x2d0/0x3a0
     kthread+0x118/0x130
     ret_from_fork+0x22/0x30

After that, if e.g. a process faults in the page, it will get killed
unexpectedly.  Fix it by containing the page immediately.

Besides that, two more changes can be noticed:

* MF_DELAYED no longer suits, as we are fixing the issue by containing
  the page immediately; it no longer relies on the allocation-time checks
  to stop a HWPoison page from being handed over again unless it is
  unpoisoned, so we fixed the situation.  Because of that, let us use
  MF_RECOVERED from now on.

* The second block that handles PageBuddy pages is no longer needed:
  We call shake_page and then check whether the page is Buddy
  because shake_page calls drain_all_pages, which sends pcp-pages back to
  the buddy freelists, so we could have a chance to handle free pages.
  Currently, get_hwpoison_page already calls drain_all_pages, and we call
  get_hwpoison_page right before coming here, so we should be on the safe
  side.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/cover/11792607/

[osalvador@suse.de: take the poisoned subpage off the buddy frelists]
  Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de

Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Oscar Salvador 17e395b60f mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
Patch series "HWpoison: further fixes and cleanups", v5.

This patchset includes some more fixes and a cleanup.

Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy
freelist, since having them there has proved to be bad (see [1] and
patch#2's commit log).  Patch#3 does the same for hugetlb pages.

[1] https://lkml.org/lkml/2020/9/22/565

This patch (of 4):

A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor the
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retry
the check again.  It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.
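
The drain-and-retry pattern described above can be sketched in plain
userspace C. This is a minimal model with hypothetical names, not the
kernel implementation; it only illustrates why a 0-refcount !PageBuddy
page deserves one drain-and-recheck pass before reporting an error:

```c
#include <assert.h>

/* Hypothetical page states mirroring the cases the check must
 * distinguish: a page sitting on a per-CPU list (pcplist) has zero
 * refcount but is not PageBuddy, so it looks like an error at first. */
enum page_state { PCP_CACHED, ON_BUDDY, IN_USE };

struct fake_page { enum page_state state; int refcount; };

/* Stand-in for drain_all_pages(): spills pcp pages back to buddy. */
static void drain_all_pages_stub(struct fake_page *p)
{
	if (p->state == PCP_CACHED)
		p->state = ON_BUDDY;
}

/* The fix: on a 0-refcount !PageBuddy page, drain once and retry the
 * check instead of bailing out with an error immediately. */
static int get_any_page_sketch(struct fake_page *p)
{
	for (int pass = 0; pass < 2; pass++) {
		if (p->refcount > 0)
			return 1;		/* pinned in-use page */
		if (p->state == ON_BUDDY)
			return 0;		/* free page, handled as such */
		drain_all_pages_stub(p);	/* maybe a pcppage: spill, retry */
	}
	return -1;				/* still unrecognized: error */
}
```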

Link: https://lkml.kernel.org/r/20201013144447.6706-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201013144447.6706-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:44 -08:00
Shakeel Butt 013339df11 mm/rmap: always do TTU_IGNORE_ACCESS
Since commit 369ea8242c ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check.  More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page.  So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim.  In memcg reclaim, page_referenced() only accounts the accesses
from processes in the same memcg as the target page, but the unmapping
code considers accesses from all processes, decreasing the
effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:39 -08:00
Mike Kravetz 336bf30eb7 hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]

  LTP: starting move_pages12
  BUG: unable to handle page fault for address: ffffffffffffffe0
  ...
  RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
  Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
anon pages as the anon_vma_lock should be held in this case.  Further
analysis suggested that i_mmap_rwsem was not required to be held at all
when calling try_to_unmap for anon pages, as an anon page can never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
keep mapping valid while dropping page lock.

This patch does the following:

 - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
   calling try_to_unmap.

 - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
   will now simply do a 'trylock' while still holding the page lock. If
   the trylock fails, it will return NULL. This could impact the
   callers:

    - migration calling code will receive -EAGAIN and retry up to the
      hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
      killing (SIGKILL) of any mapping tasks instead of sending SIGBUS.

   Do note that this change in behavior only happens when there is a
   race. None of the standard kernel testing suites actually hit this
   race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:04 -08:00
Joonsoo Kim 5460875999 mm/memory-failure: remove a wrapper for alloc_migration_target()
There is a well-defined standard migration target callback.  Use it
directly.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Oscar Salvador b94e02822d mm,hwpoison: try to narrow window race for free pages
Aristeu Rozanski reported that a customer test case started to report
-EBUSY after the hwpoison rework patchset.

There is a race window between spotting a free page and taking it off its
buddy freelist, so it might be that by the time we try to take it off, the
page has been already allocated.

This patch tries to handle such race window by trying to handle the new
type of page again if the page was allocated under us.

Reported-by: Aristeu Rozanski <aris@ruivo.org>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Aristeu Rozanski <aris@ruivo.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:17 -07:00
Naoya Horiguchi 1f2481ddbe mm,hwpoison: double-check page count in __get_any_page()
Soft offlining could fail with EIO due to the race condition with hugepage
migration.  This issue became visible due to the change in the previous
patch that makes the soft offline handler take the page refcount on its
own.  We have no way to directly pin a zero-refcount page, and a page
considered to have zero refcount could be allocated just after the
first check.

This patch adds a second check to detect the race, giving us a chance
to handle it more reliably.
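
The double-check pattern can be sketched in userspace C. This is an
illustrative model with made-up names (including the injected "racer"
callback simulating a concurrent allocation), not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model: a "free" page may be allocated concurrently
 * between the first refcount check and the moment we act on it. */
struct fake_page { atomic_int refcount; };

/* First check says "free"; the added second check catches a page that
 * was allocated in between and reports it as in-use instead of
 * failing with EIO. */
static int get_any_page_double_check(struct fake_page *p,
				     void (*racer)(struct fake_page *))
{
	if (atomic_load(&p->refcount) != 0)
		return 1;		/* in-use from the start */
	if (racer)
		racer(p);		/* simulated racing allocation */
	if (atomic_load(&p->refcount) != 0)
		return 1;		/* second check: race detected */
	return 0;			/* genuinely free */
}

/* Simulated concurrent allocation grabbing a reference. */
static void alloc_race(struct fake_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
}
```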

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:17 -07:00
Naoya Horiguchi 5d1fd5dc87 mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
memory_failure() is supposed to call action_result() when it handles a
memory error event, but there's one missing case.  So let's add it.

I find that include/ras/ras_event.h has some other MF_MSG_* undefined, so
this patch also adds them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:17 -07:00
Oscar Salvador 5a2ffca3c2 mm,hwpoison: return 0 if the page is already poisoned in soft-offline
Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.

1) madvise:

        madvise_inject_error skips any poisoned page and continues
        the loop.
        If that was the only page to madvise, it returns 0.

2) /sys/devices/system/memory/:

        When calling soft_offline_page_store()->soft_offline_page(),
        we return -EBUSY in case the page is already poisoned.
        This is inconsistent with a) the above example and b)
        memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
let soft_offline_page return 0 if it finds the page already poisoned.

Please, note that this represents a user-api change, since now the return
error when calling soft_offline_page_store()->soft_offline_page() will be
different.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador 6b9a217eda mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page
Merging soft_offline_huge_page and __soft_offline_page let us get rid of
quite some duplicated code, and makes the code much easier to follow.

Now, __soft_offline_page will handle both normal and hugetlb pages.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-11-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador 79f5f8fab4 mm,hwpoison: rework soft offline for in-use pages
This patch changes the way we set and handle in-use poisoned pages.  Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safe net and
would skip that page.

This has proved to be wrong, as there are some pfn walkers out there,
like compaction, that only care about the page being in a buddy
freelist.

Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, do not call put_page directly from those paths.  Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison page in
free_pages_prepare, that just skips any poisoned page, so those pages
do not end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

  - wait until the page hits buddy's freelists
  - take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has already been allocated, so we cannot soft offline it.
But the user could always retry it.
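
The flow for normal pages can be sketched in userspace C. The names
below (page_handle_poison_sketch, free_pages_prepare_sketch) are
illustrative models of the description above, not the kernel functions:

```c
#include <assert.h>
#include <stdbool.h>

struct fake_page {
	bool hwpoison;
	int refcount;
	bool on_freelist;
};

static long poisoned_pages; /* stand-in for the poisoned-pages counter */

/* Sketch of page_handle_poison(): flag the page, leave it pinned with
 * refcount 1, and bump the counter. */
static void page_handle_poison_sketch(struct fake_page *p)
{
	p->hwpoison = true;
	p->refcount = 1;
	poisoned_pages++;
}

/* Sketch of the check placed in free_pages_prepare(): a poisoned page
 * is skipped and never reaches any pcplist/freelist. */
static bool free_pages_prepare_sketch(struct fake_page *p)
{
	if (p->hwpoison)
		return false;	/* do not free: keep it out of buddy */
	p->on_freelist = true;
	return true;
}
```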

* Hugetlb pages

  - We isolate-and-migrate them

After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided not to make the code overly complicated, and just fail
normally if the page was allocated in the meantime.

We can always build on top of this.

As a bonus, because of the way we now handle in-use pages, we no longer
need the put-as-isolation-migratetype dance that was guarding against
poisoned pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador 06be6ff3d2 mm,hwpoison: rework soft offline for free pages
When trying to soft-offline a free page, we need to first take it off the
buddy allocator.  Once we know it is out of reach, we can safely flag
it as poisoned.

take_page_off_buddy will be used to take a page meant to be poisoned off
the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
which splits a higher-order page in case our page belongs to one.

Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.
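
The splitting that break_down_buddy_pages performs can be modeled with
a short userspace sketch: repeatedly halve the free block, keeping the
half that contains the target pfn, until the target stands alone as an
order-0 page. This is a hypothetical illustration, not the kernel code:

```c
#include <assert.h>

/* Split a free block of the given order until target_pfn is isolated
 * as an order-0 page. The half that does NOT contain the target would
 * go back on the buddy freelists. Returns 1 if the target was
 * isolated. */
static int split_to_target(unsigned long block_pfn, unsigned int order,
			   unsigned long target_pfn)
{
	while (order > 0) {
		unsigned long half = 1ul << (order - 1);

		if (target_pfn >= block_pfn + half)
			block_pfn += half;	/* target in upper half */
		/* else: keep lower half; upper half returns to buddy */
		order--;
	}
	return block_pfn == target_pfn;	/* isolated as an order-0 page */
}
```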

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador 694bf0b0cd mm,hwpoison: unify THP handling for hard and soft offline
Place the THP's page handling in a helper and use it from both hard and
soft-offline machinery, so we get rid of some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-8-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador dd6e2402fa mm,hwpoison: kill put_hwpoison_page
After commit 4e41a30c6d ("mm: hwpoison: adjust for new thp
refcounting"), put_hwpoison_page got reduced to a put_page.  Let us just
use put_page instead.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-7-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Oscar Salvador 7e27f22c9e mm,hwpoison: unexport get_hwpoison_page and make it static
Since get_hwpoison_page is only used in memory-failure code now, let us
un-export it and make it private to that code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-5-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Naoya Horiguchi 1b473becde mm, hwpoison: remove recalculating hpage
hpage is never used after try_to_split_thp_page() in memory_failure(), so
we don't have to update hpage.  So let's not recalculate/use hpage.

Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-3-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Naoya Horiguchi 7d9d46ac87 mm,hwpoison: cleanup unused PageHuge() check
Patch series "HWPOISON: soft offline rework", v7.

This patchset fixes a couple of issues that the patchset Naoya sent [1]
contained due to rebasing problems and a misunderstanding.

Main focus of this series is to stabilize soft offline.  Historically soft
offlined pages have suffered from racy conditions because PageHWPoison is
used a little too aggressively, which (directly or indirectly) invades
other mm code which cares little about hwpoison.  This results in
unexpected behavior or kernel panic, which is very far from soft offline's
"do not disturb userspace or other kernel component" policy.  An example
of this can be found here [2].

Along with several cleanups, this series refactors and changes the way
soft offline works.  The main point of this change set is to contain the
target page "via buddy allocator" or in the migration path.  For the
former we first free
the target page as we do for normal pages, and once it has reached buddy
and it has been taken off the freelists, we flag it as HWpoison.  For the
latter we never get to release the page in unmap_and_move, so the page is
under our control and we can handle it in hwpoison code.

[1] https://patchwork.kernel.org/cover/11704083/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

This patch (of 14):

Drop the PageHuge check, which is dead code since memory_failure() forks
into memory_failure_hugetlb() for hugetlb pages.

memory_failure() and memory_failure_hugetlb() share some functions like
hwpoison_user_mappings() and identify_page_state(), so they should
properly handle 4kB pages, thp, and hugetlb pages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Qian Cai <cai@lca.pw>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Oscar Salvador <osalvador@suse.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20200922135650.1634-2-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Alex Shi 2c3125977e mm/memory-failure.c: remove unused macro `writeback'
Unlike the others, we don't use the macro writeback, so let's remove it
to tame the gcc warning:

mm/memory-failure.c:827: warning: macro "writeback" is not used
[-Wunused-macros]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lkml.kernel.org/r/1599715096-20369-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:32 -07:00
Xianting Tian c43bc03d0a mm/memory-failure: do pgoff calculation before for_each_process()
There is no need to calculate pgoff in each loop of for_each_process(), so
move it to the place before for_each_process(), which can save some CPU
cycles.
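
The hoisting can be illustrated with a small userspace sketch; the
names below stand in for page_to_pgoff() and the for_each_process()
loop and are purely hypothetical:

```c
#include <assert.h>

struct task { int id; };

/* Stand-in for page_to_pgoff(); counts invocations so the hoisting is
 * observable. */
static unsigned long page_to_pgoff_stub(unsigned long page_index,
					int *calls)
{
	(*calls)++;
	return page_index >> 1;	/* arbitrary stand-in computation */
}

/* Loop-invariant hoisting as in the patch: the pgoff does not depend
 * on the loop variable, so compute it once before iterating instead
 * of once per task. */
static int collect_procs_sketch(unsigned long page_index,
				const struct task *tasks, int ntasks,
				int *calls)
{
	unsigned long pgoff = page_to_pgoff_stub(page_index, calls);
	int hits = 0;

	for (int i = 0; i < ntasks; i++)
		if ((unsigned long)tasks[i].id == pgoff)
			hits++;
	return hits;
}
```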

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: http://lkml.kernel.org/r/20200818082647.34322-1-tian.xianting@h3c.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:32 -07:00
Christoph Hellwig f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Joonsoo Kim 19fc7bed25 mm/migrate: introduce a standard migration target allocation function
There are some similar functions for migration target allocation.  Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants.  This patch implements base migration target
allocation function.  In the following patches, variants will be converted
to use this function.

Changes should be mechanical, but, unfortunately, there are some
differences.  First, some callers' nodemask is assigned NULL since a NULL
nodemask will be considered as all available nodes, that is,
&node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
user-provided gfp_mask has it.  This is because a future caller of this
function requires this node constraint to be set.  Lastly, if the
provided nodeid
is NUMA_NO_NODE, nodeid is set up to the node where migration source
lives.  It helps to remove simple wrappers for setting up the nodeid.

Note that the PageHighmem() call in the previous function is changed to
open-code "is_highmem_idx()", since it is more readable.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:02 -07:00
Naoya Horiguchi 03151c6e0b mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread
An Action Required memory error should happen only when a processor is
about to access corrupted memory, so it is synchronous and only affects
the current process/thread.

Recently commit 872e9a205c ("mm, memory_failure: don't send
BUS_MCEERR_AO for action required error") fixed the issue that Action
Required memory could unnecessarily send SIGBUS to the processes which
share the error memory.  But we still have another issue that we could
send SIGBUS to a wrong thread.

This is because collect_procs() and task_early_kill() fail to add the
current process to the "to-kill" list.  This patch fixes that.  With
this fix, SIGBUS(BUS_MCEERR_AR) is never sent to a non-current
process/thread.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-3-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11 18:17:47 -07:00
Naoya Horiguchi 4e018b450a mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill
Patch series "hwpoison: fixes signaling on memory error"

This is a small patchset to solve issues in memory error handler to send
SIGBUS to proper process/thread as expected in configuration.  Please
see descriptions in individual patches for more details.

This patch (of 2):

Early-kill policy is controlled from two types of settings, one is
per-process setting prctl(PR_MCE_KILL) and the other is system-wide
setting vm.memory_failure_early_kill.  Users expect per-process setting
to override system-wide setting as many other settings do, but
early-kill setting doesn't work as such.

For example, if a system configures vm.memory_failure_early_kill to 1
(enabled), a process receives SIGBUS even if it's configured to
explicitly disable PF_MCE_KILL by prctl().  That's not desirable for
applications with their own policies.

This patch changes the priority of these two types of settings, by
checking sysctl_memory_failure_early_kill only when a given process has
the default kill policy.

Note that this patch is solving a thread choice issue too.

Originally, collect_procs() always chooses the main thread when
vm.memory_failure_early_kill is 1, even if the process has a dedicated
thread for memory error handling.  SIGBUS should be sent to the
dedicated thread if early-kill is enabled via
vm.memory_failure_early_kill as we are doing for PR_MCE_KILL_EARLY
processes.
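
The new precedence can be expressed as a small decision function. This
is a userspace sketch with hypothetical policy values modeled on
prctl(PR_MCE_KILL); it is not the kernel's actual check:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-process early-kill policy values. */
enum mce_policy { MCE_DEFAULT, MCE_EARLY, MCE_LATE };

/* The reordering described above: the system-wide sysctl is consulted
 * only when the process keeps the default policy, so an explicit
 * per-process setting always wins. */
static bool wants_early_kill(enum mce_policy proc, bool sysctl_early)
{
	switch (proc) {
	case MCE_EARLY:
		return true;
	case MCE_LATE:
		return false;	/* overrides sysctl_early == true */
	default:
		return sysctl_early;
	}
}
```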

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-1-git-send-email-naoya.horiguchi@nec.com
Link: http://lkml.kernel.org/r/1591321039-22141-2-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-11 18:17:47 -07:00
Linus Torvalds 118d6e9829 ACPI updates for 5.8-rc1
- Update the ACPICA code in the kernel to upstream revision
    20200430:
 
    * Move acpi_gbl_next_cmd_num definition (Erik Kaneda).
 
    * Ignore AE_ALREADY_EXISTS status in the disassembler when parsing
      create operators (Erik Kaneda).
 
    * Add status checks to the dispatcher (Erik Kaneda).
 
    * Fix required parameters for _NIG and _NIH (Erik Kaneda).
 
    * Make acpi_protocol_lengths static (Yue Haibing).
 
  - Fix ACPI table reference counting errors in several places, mostly
    in error code paths (Hanjun Guo).
 
  - Extend the Generic Event Device (GED) driver to support _Exx and
    _Lxx handler methods (Ard Biesheuvel).
 
  - Add new acpi_evaluate_reg() helper and modify the ACPI PCI hotplug
    code to use it (Hans de Goede).
 
  - Add new DPTF battery participant driver and make the DPFT power
    participant driver create more sysfs device attributes (Srinivas
    Pandruvada).
 
  - Improve the handling of memory failures in APEI (James Morse).
 
  - Add new blacklist entry for Acer TravelMate 5735Z to the backlight
    driver (Paul Menzel).
 
  - Add i2c address for thermal control to the PMIC driver (Mauro
    Carvalho Chehab).
 
  - Allow the ACPI processor idle driver to work on platforms with
    only one ACPI C-state present (Zhang Rui).
 
  - Fix kobject reference count leaks in error code paths in two
    places (Qiushi Wu).
 
  - Delete unused proc filename macros and make some symbols static
    (Pascal Terjan, Zheng Zengkai, Zou Wei).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VHb8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxVboQAIjYda2RhQANIlIvoEa+Qd2/FBd3HXgU
 Mv0LZ6y1xxxEZYeKne7zja1hzt5WetuZ1hZHGfg8YkXyrLqZGxfCIFbbhSA90BGG
 PGzFerGmOBNzB3I9SN6iQY7vSqoFHvQEV1PVh24d+aHWZqj2lnaRRq+GT54qbRLX
 /U3Hy5glFl8A/DCBP4cpoEjDr4IJHY68DathkDK2Ep2ybXV6B401uuqx8Su/OBd/
 MQmJTYI1UK/RYBXfdzS9TIZahnkxBbU1cnLFy08Ve2mawl5YsHPEbvm77a0yX2M6
 sOAerpgyzYNivAuOLpNIwhUZjpOY66nQuKAQaEl2cfRUkqt4nbmq7yDoH3d2MJLC
 /Ccz955rV2YyD1DtyV+PyT+HB+/EVwH/+UCZ+gsSbdHvOiwdFU6VaTc2eI1qq8K9
 4m5eEZFrAMPlvTzj/xVxr2Hfw1lbm23J5B5n7sM5HzYbT6MUWRQpvfV4zM3jTGz0
 rQd8JmcHVvZk/MV1mGrYHrN5TnGTLWpbS4Yv1lAQa6FP0N0NxzVud7KRfLKnCnJ1
 vh5yzW2fCYmVulJpuqxJDfXSqNV7n40CFrIewSp6nJRQXnWpImqHwwiA8fl51+hC
 fBL72Ey08EHGFnnNQqbebvNglsodRWJddBy43ppnMHtuLBA/2GVKYf2GihPbpEBq
 NHtX+Rd3vlWW
 =xH3i
 -----END PGP SIGNATURE-----

Merge tag 'acpi-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "These update the ACPICA code in the kernel to upstream revision
  20200430, fix several reference counting errors related to ACPI
  tables, add _Exx / _Lxx support to the GED driver, add a new
  acpi_evaluate_reg() helper, add new DPTF battery participant driver
  and extend the DPFT power participant driver, improve the handling of
  memory failures in the APEI code, add a blacklist entry to the
  backlight driver, update the PMIC driver and the processor idle
  driver, fix two kobject reference count leaks, and make a few janitory
  changes.

  Specifics:

   - Update the ACPICA code in the kernel to upstream revision 20200430:

      - Move acpi_gbl_next_cmd_num definition (Erik Kaneda).

      - Ignore AE_ALREADY_EXISTS status in the disassembler when parsing
        create operators (Erik Kaneda).

      - Add status checks to the dispatcher (Erik Kaneda).

      - Fix required parameters for _NIG and _NIH (Erik Kaneda).

      - Make acpi_protocol_lengths static (Yue Haibing).

   - Fix ACPI table reference counting errors in several places, mostly
     in error code paths (Hanjun Guo).

   - Extend the Generic Event Device (GED) driver to support _Exx and
     _Lxx handler methods (Ard Biesheuvel).

   - Add new acpi_evaluate_reg() helper and modify the ACPI PCI hotplug
     code to use it (Hans de Goede).

   - Add new DPTF battery participant driver and make the DPFT power
     participant driver create more sysfs device attributes (Srinivas
     Pandruvada).

   - Improve the handling of memory failures in APEI (James Morse).

   - Add new blacklist entry for Acer TravelMate 5735Z to the backlight
     driver (Paul Menzel).

   - Add i2c address for thermal control to the PMIC driver (Mauro
     Carvalho Chehab).

   - Allow the ACPI processor idle driver to work on platforms with only
     one ACPI C-state present (Zhang Rui).

   - Fix kobject reference count leaks in error code paths in two places
     (Qiushi Wu).

   - Delete unused proc filename macros and make some symbols static
     (Pascal Terjan, Zheng Zengkai, Zou Wei)"

* tag 'acpi-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
  ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
  ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
  ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
  ACPI: DPTF: Add battery participant driver
  ACPI: DPTF: Additional sysfs attributes for power participant driver
  ACPI: video: Use native backlight on Acer TravelMate 5735Z
  arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work
  ACPI: APEI: Kick the memory_failure() queue for synchronous errors
  mm/memory-failure: Add memory_failure_queue_kick()
  ACPI / PMIC: Add i2c address for thermal control
  ACPI: GED: add support for _Exx / _Lxx handler methods
  ACPI: Delete unused proc filename macros
  ACPI: hotplug: PCI: Use the new acpi_evaluate_reg() helper
  ACPI: utils: Add acpi_evaluate_reg() helper
  ACPI: debug: Make two functions static
  ACPI: sleep: Put the FACS table after using it
  ACPI: scan: Put SPCR and STAO table after using it
  ACPI: EC: Put the ACPI table after using it
  ACPI: APEI: Put the HEST table for error path
  ACPI: APEI: Put the error record serialization table for error path
  ...
2020-06-02 13:25:52 -07:00
Wetp Zhang 872e9a205c mm, memory_failure: don't send BUS_MCEERR_AO for action required error
Some processes don't want to be killed early, but in the "Action
Required" case they may also be killed by BUS_MCEERR_AO when sharing
memory with another process that is accessing the failed memory.  And
sending SIGBUS with BUS_MCEERR_AO for an action-required error is
strange, so ignore the non-current processes here.
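The rule this commit establishes can be modeled as a tiny predicate (illustrative user-space sketch, not the kernel's actual API):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the rule above: for an Action Required (AR) error, only the
 * task that touched the bad memory is signalled (BUS_MCEERR_AR); other
 * tasks sharing the page are no longer sent BUS_MCEERR_AO.  In the
 * Action Optional case, early-kill may still notify every mapper. */
static bool should_signal(bool action_required, bool is_current_task)
{
	if (!action_required)
		return true;		/* AO case: early-kill may notify everyone */
	return is_current_task;		/* AR case: skip non-current processes */
}
```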

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:10 -07:00
James Morse 062022315e mm/memory-failure: Add memory_failure_queue_kick()
The GHES code calls memory_failure_queue() from IRQ context to schedule
work on the current CPU so that memory_failure() can sleep.

For synchronous memory errors the arch code needs to know any signals
that memory_failure() will trigger are pending before it returns to
user-space, possibly when exiting from the IRQ.

Add a helper to kick the memory failure queue, to ensure the scheduled
work has happened. This has to be called from process context, so may
have been migrated from the original cpu. Pass the cpu the work was
queued on.

Change memory_failure_work_func() to permit being called on the 'wrong'
cpu.
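The queue/kick flow can be modeled in a few lines (a user-space toy with arrays standing in for the per-CPU work structures; names mirror the kernel helpers but this is illustrative only):

```c
#include <assert.h>

#define NR_CPUS   4
#define QUEUE_LEN 8

/* Toy model of the per-CPU memory-failure queue described above.  Work
 * is queued on the CPU that took the error; the kick may run in process
 * context on a different CPU after the task migrated, so it takes the
 * original CPU as a parameter and drains that CPU's queue. */
static unsigned long queue[NR_CPUS][QUEUE_LEN];
static int count[NR_CPUS];
static int handled;

static void mf_queue(int cpu, unsigned long pfn)
{
	queue[cpu][count[cpu]++] = pfn;		/* normally from IRQ context */
}

static void mf_queue_kick(int cpu)
{
	/* May execute on the "wrong" CPU; only the queue index matters. */
	while (count[cpu] > 0) {
		count[cpu]--;
		handled++;			/* stand-in for memory_failure() */
	}
}
```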

Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19 19:51:10 +02:00
Huang Ying 9de4f22a60 mm: code cleanup for MADV_FREE
Some comments for MADV_FREE is revised and added to help people understand
the MADV_FREE code, especially the page flag, PG_swapbacked.  This makes
page_is_file_cache() isn't consistent with its comments.  So the function
is renamed to page_is_file_lru() to make them consistent again.  All these
are put in one patch as one logical change.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:38 -07:00
Mike Kravetz c0d0381ade hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races.  These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2].  However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing.  However, during fault
processing we need to lock the page we will be adding.  Lock ordering
requires we take page lock before i_mmap_rwsem.  Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages.  This is not too invasive as hugetlbfs
processing is done separate from core mm in many places.  However, I don't
really like this idea.  Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races.  After catching a race, cleanup, backout, retry ...  etc,
as needed.  This can get really ugly, especially for huge page
reservations.  At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races.  Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults.  This is not the order
specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
isolated today.  Therefore, we use this alternative locking order for
PageHuge() pages.

         mapping->i_mmap_rwsem
           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
             page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma.  A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
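The lock ordering established for PageHuge() pages can be illustrated with plain pthread mutexes standing in for the kernel locks (the names mirror the kernel ones, but this is a user-space sketch, not kernel code):

```c
#include <assert.h>
#include <pthread.h>

/* Illustration of the documented ordering for hugetlb faults:
 *   i_mmap_rwsem -> hugetlb_fault_mutex -> page lock
 * i_mmap_rwsem is taken first, unlike the order used by core mm. */
static pthread_mutex_t i_mmap_rwsem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t hugetlb_fault_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static int fault_path(void)
{
	pthread_mutex_lock(&i_mmap_rwsem);	/* held across the whole fault */
	pthread_mutex_lock(&hugetlb_fault_mutex);
	pthread_mutex_lock(&page_lock);
	/* ... fault handling would run here, ptep stays valid ... */
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&hugetlb_fault_mutex);
	pthread_mutex_unlock(&i_mmap_rwsem);
	return 0;
}
```

Because every path acquires the three locks in this fixed order, huge_pmd_unshare (which needs i_mmap_rwsem in write mode) cannot clear a pud out from under a fault that is still using the ptep.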

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:32 -07:00
Yunfeng Ye 7506851837 mm/memory-failure.c: use page_shift() in add_to_kill()
page_shift() is supported after the commit 94ad933810 ("mm: introduce
page_shift()").

So replace with page_shift() in add_to_kill() for readability.
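The helper being adopted is equivalent to the open-coded form it replaces; a minimal model (compound_order passed as a plain argument for illustration):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* x86-64 base page */

/* Toy model: page_shift(page) is PAGE_SHIFT + compound_order(page),
 * which is what add_to_kill() previously computed by hand. */
static unsigned int page_shift(unsigned int compound_order)
{
	return PAGE_SHIFT + compound_order;
}
```

So an order-0 page yields shift 12, and an order-9 compound page (a 2MB hugepage on x86-64) yields shift 21.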

Link: http://lkml.kernel.org/r/543d8bc9-f2e7-3023-7c35-2e7ed67c0e82@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:04 -08:00
Naoya Horiguchi feec24a613 mm, soft-offline: convert parameter to pfn
Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn.  This discrepancy looks weird and makes
precheck on pfn validity tricky.  So let's align them.
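With a pfn-based interface, the validity precheck becomes trivial; a hedged user-space sketch (pfn_valid() stubbed, not the kernel implementation):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_PFN 1024UL	/* arbitrary bound for this toy model */

static bool pfn_valid(unsigned long pfn)
{
	return pfn < MAX_PFN;		/* stub for the real memmap check */
}

/* Sketch of the aligned interface: soft_offline_page() now takes a pfn,
 * like memory_failure(), so it can validate before touching struct page. */
static int soft_offline_page(unsigned long pfn)
{
	if (!pfn_valid(pfn))
		return -ENXIO;		/* precheck before pfn_to_page() */
	/* ... look up struct page from pfn and offline it ... */
	return 0;
}
```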

Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:04 -08:00
Jane Chu 996ff7a08d mm/memory-failure.c clean up around tk pre-allocation
add_to_kill() expects the first 'tk' to be pre-allocated and makes
subsequent allocations on an as-needed basis, which makes the code a
bit difficult to read.

Move all the allocation internal to add_to_kill() and drop the **tk
argument.

Link: http://lkml.kernel.org/r/1565112345-28754-2-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:04 -08:00
David Hildenbrand 96c804a6ae mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
We should check for pfn_to_online_page() to not access uninitialized
memmaps.  Reshuffle the code so we don't have to duplicate the error
message.

Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:31 -04:00
Jane Chu 3d7fed4ad8 mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once
Mmap /dev/dax more than once, then read the poison location using the
address from one of the mappings.  The other mappings, due to not having
the page mapped in, will cause SIGKILLs to be delivered to the process.
SIGKILL succeeds over SIGBUS, so the user process loses the opportunity
to handle the UE.

Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.

Details -

  ndctl inject-error --block=10 --count=1 namespace6.0

  ./read_poison -x dax6.0 -o 5120 -m 2
  mmaped address 0x7f5bb6600000
  mmaped address 0x7f3cf3600000
  doing local read at address 0x7f3cf3601400
  Killed

Console messages in instrumented kernel -

  mce: Uncorrected hardware memory error in user-access at edbe201400
  Memory failure: tk->addr = 7f5bb6601000
  Memory failure: address edbe201: call dev_pagemap_mapping_shift
  dev_pagemap_mapping_shift: page edbe201: no PUD
  Memory failure: tk->size_shift == 0
  Memory failure: Unable to find user space address edbe201 in read_poison
  Memory failure: tk->addr = 7f3cf3601000
  Memory failure: address edbe201: call dev_pagemap_mapping_shift
  Memory failure: tk->size_shift = 21
  Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
    => to deliver SIGKILL
  Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
    => to deliver SIGBUS
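The per-mapping decision the fix implies can be sketched as follows (illustrative user-space model; `size_shift` mirrors the tk->size_shift field from the log above):

```c
#include <assert.h>
#include <signal.h>

/* Sketch of the fix: when a poisoned page is mapped more than once and
 * one mapping was never faulted in (tk->size_shift == 0, no usable user
 * address), that mapping is skipped instead of forcing SIGKILL, so the
 * SIGBUS from the populated mapping can reach the handler. */
static int pick_signal(unsigned int size_shift)
{
	if (size_shift == 0)
		return 0;	/* no address to report: skip, don't SIGKILL */
	return SIGBUS;		/* deliver BUS_MCEERR_* with addr + size_shift */
}
```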

Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14 15:04:01 -07:00
Linus Torvalds fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Jane Chu 135e53514e mm/memory-failure.c: clarify error message
Some user who install SIGBUS handler that does longjmp out therefore
keeping the process alive is confused by the error message

  "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"

Slightly modify the error message to improve clarity.

Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:47 -07:00
Linus Torvalds 5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Christoph Hellwig 25b2995a35 mm: remove MEMORY_DEVICE_PUBLIC support
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:43 -03:00
Naoya Horiguchi faf53def3b mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft
offline for hugepages with overcommitting enabled.  That is caused by
suboptimal logic in the current soft-offline code.  See the following
part:

    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                            MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source
page was freed into buddy in migrate_pages(), but even in that case we
actually have a chance that set_hwpoison_free_buddy_page() succeeds.  So
the current code gives up offlining too early.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.

This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.
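The changed contract can be modeled as a small stub (user-space sketch; the page-state checks are boolean parameters here, not the kernel's real predicates):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch of the new contract: a page that is no longer PageHuge() has
 * already been dissolved into buddy pages, so dissolve_free_huge_page()
 * reports success instead of failing, letting soft offline proceed to
 * poison the raw page. */
static int dissolve_free_huge_page(bool page_huge, bool free_hugepage)
{
	if (!page_huge)
		return 0;	/* already dissolved: treat as success */
	if (!free_hugepage)
		return -EBUSY;	/* still in use: cannot dissolve now */
	/* ... dissolve the free hugepage into buddy pages ... */
	return 0;
}
```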

[n-horiguchi@ah.jp.nec.com: v3]
  Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Naoya Horiguchi b38e5962f8 mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e.  the result of
set_hwpoison_free_buddy_page()), but the current code does not work like
that.  This might lead us to misjudge the result when
set_hwpoison_free_buddy_page() fails.

Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.
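The corrected return-value propagation can be sketched like this (user-space model with both helpers stubbed as booleans; illustrative only):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch of the fix: soft offline succeeds only if the raw error page
 * was actually contained, i.e. only when set_hwpoison_free_buddy_page()
 * succeeds.  Previously its failure was silently ignored. */
static int soft_offline_free_page(bool dissolved, bool hwpoison_set)
{
	int ret = dissolved ? 0 : -EBUSY;

	if (!ret)
		ret = hwpoison_set ? 0 : -EBUSY;  /* propagate containment failure */
	return ret;
}
```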

Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Thomas Gleixner 1439f94c54 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 263
Based on 1 normalized pattern(s):

  this software may be redistributed and or modified under the terms
  of the gnu general public license gpl version 2 only as published by
  the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:28 +02:00
Eric W. Biederman f8eac9011b signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceerr, so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
zhongjiang 46612b751c mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
When soft_offline_in_use_page() runs on a thp tail page after pmd is
split, we trigger the following VM_BUG_ON_PAGE():

  Memory failure: 0x3755ff: non anonymous thp
  __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
  Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
  page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
  flags: 0x2fffff80000000()
  raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
  raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
  ------------[ cut here ]------------
  kernel BUG at ./include/linux/mm.h:519!

soft_offline_in_use_page() passed refcount and page lock from tail page
to head page, which is not needed because we can pass any subpage to
split_huge_page().

Naoya fixed a similar issue in c3901e722b ("mm: hwpoison: fix thp
split handling in memory_failure()"), but missed the soft offline path.

Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
Fixes: 61f5d698cc ("mm: re-enable THP")
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
Naoya Horiguchi 6376360ecb mm: hwpoison: use do_send_sig_info() instead of force_sig()
Currently memory_failure() is racy against a process exiting, which
results in a kernel crash by NULL pointer dereference.

The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes.  As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not the right thing to do.  OOM solves this issue by using
do_send_sig_info() as done in commit d2d393099d ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch is suggesting to do the same for hwpoison.  do_send_sig_info()
properly accesses the siglock via lock_task_sighand(), and is therefore
free from the reported race.

I confirmed that the reported bug reproduces with inserting some delay
in kill_procs(), and it never reproduces with this patch.

Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen on it because
force_sig_mceerr() is called only for synchronous processes (i.e.
BUS_MCEERR_AR happens only when some process accesses the corrupted
memory).

Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Mike Kravetz ddeaab32a8 hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
This reverts b43a999005

The reverted commit caused issues with migration and poisoning of anon
huge pages.  The LTP move_pages12 test triggers an "unable to handle
kernel NULL pointer" BUG with a stack similar to:

  RIP: 0010:down_write+0x1b/0x40
  Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

The purpose of the reverted patch was to fix some long existing races
with huge pmd sharing.  It used i_mmap_rwsem for this purpose with the
idea that this could also be used to address truncate/page fault races
with another patch.  Further analysis has determined that i_mmap_rwsem
can not be used to address all these hugetlbfs synchronization issues.
Therefore, revert this patch while working on another approach to the
underlying issues.

Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Mike Kravetz b43a999005 hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered that a huge pte pointer could become 'invalid' and point to
another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with the
  ptep.

- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
  called.

[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c99 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Matthew Wilcox 27359fd6e5 dax: Fix unlock mismatch with updated API
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.

In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry, it is possible that the pfn
used to perform the lookup is misaligned relative to the value retrieved
from the Xarray.

Change the API to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:

 WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
 RIP: 0010:dax_insert_entry+0x2b2/0x2d0
 [..]
 Call Trace:
  dax_iomap_pte_fault.isra.41+0x791/0xde0
  ext4_dax_huge_fault+0x16f/0x1f0
  ? up_read+0x1c/0xa0
  __do_fault+0x1f/0x160
  __handle_mm_fault+0x1033/0x1490
  handle_mm_fault+0x18b/0x3d0

Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-12-04 21:32:00 -08:00
Linus Torvalds 2923b27e54 libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
   recovery code has specific enabling for several possible page states
   that needs new enabling to handle poison in dax mappings. Teach
   memory_failure() about ZONE_DEVICE pages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
 OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
 75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
 5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
 BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
 3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
 F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
 T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
 MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
 oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
 /kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
 nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
 =Ftop
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
 "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
  backed mappings. The recovery code has specific enabling for several
  possible page states and needs new enabling to handle poison in dax
  mappings.

  In order to support reliable reverse mapping of user space addresses:

   1/ Add new locking in the memory_failure() rmap path to prevent races
      that would typically be handled by the page lock.

   2/ Since dev_pagemap pages are hidden from the page allocator and the
      "compound page" accounting machinery, add a mechanism to determine
      the size of the mapping that encompasses a given poisoned pfn.

   3/ Given pmem errors can be repaired, change the speculatively
      accessed poison protection, mce_unmap_kpfn(), to be reversible and
      otherwise allow ongoing access from the kernel.

  A side effect of this enabling is that MADV_HWPOISON becomes usable
  for dax mappings, however the primary motivation is to allow the
  system to survive userspace consumption of hardware-poison via dax.
  Specifically the current behavior is:

     mce: Uncorrected hardware memory error in user-access at af34214200
     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     mce: [Hardware Error]: Machine check events logged
     {1}[Hardware Error]: event severity: corrected
     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
     [..]
     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
     mce: Memory error not recovered
     <reboot>

  ...and with these changes:

     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
     Memory failure: 0x20cb00: recovery action for dax page: Recovered

  Given all the cross dependencies I propose taking this through
  nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
  folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
2018-08-25 18:43:59 -07:00