Commit Graph

491 Commits

Rafael Aquini 1fb67b833f fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 19d3e221807772f8443e565234a6fdc5a2b09d26
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Fri Jan 12 10:08:40 2024 -0800

    fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling

    has_extra_refcount() makes the assumption that the page cache adds a ref
    count of 1 and subtracts this in the extra_pins case.  Commit a08c7193e4f1
    (mm/filemap: remove hugetlb special casing in filemap.c) modifies
    __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases
    (including hugetlb) where nr is the number of pages in the folio.  We
    should adjust the number of references coming from the page cache by
    subtracting the number of pages rather than 1.

    In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong
    flag as, in the hugetlb case, memory-failure code calls
    folio_test_set_hwpoison() to indicate poison.  folio_test_hwpoison() is
    the correct function to test for that flag.

    After these fixes, the hugetlb hwpoison read selftest passes all cases.

    Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com
    Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519
    Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Acked-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Muchun Song <muchun.song@linux.dev>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Jiaqi Yan <jiaqiyan@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>    [6.7+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:19 -05:00
Rafael Aquini 5424fbf8f2 mm/memory-failure: cast index to loff_t before shifting it
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 39ebd6dce62d8cfe3864e16148927a139f11bc9a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 18 13:58:37 2023 +0000

    mm/memory-failure: cast index to loff_t before shifting it

    On 32-bit systems, we'll lose the top bits of index because arithmetic
    will be performed in unsigned long instead of unsigned long long.  This
    affects files over 4GB in size.

    Link: https://lkml.kernel.org/r/20231218135837.3310403-4-willy@infradead.org
    Fixes: 6100e34b25 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:53 -05:00
Rafael Aquini da5d8e6b19 mm/memory-failure: check the mapcount of the precise page
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit c79c5a0a00a9457718056b588f312baadf44e471
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 18 13:58:36 2023 +0000

    mm/memory-failure: check the mapcount of the precise page

    A process may map only some of the pages in a folio, and so it might be
    missed if it maps the poisoned page but not the head page, or it might
    be unnecessarily hit if it maps the head page but not the poisoned page.

    Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org
    Fixes: 7af446a841 ("HWPOISON, hugetlb: enable error handling path for hugepage")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:52 -05:00
Rafael Aquini b31af343ae mm/memory-failure: pass the folio and the page to collect_procs()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 376907f3a0b34a17e80417825f8cc1c40fcba81b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Dec 18 13:58:35 2023 +0000

    mm/memory-failure: pass the folio and the page to collect_procs()

    Patch series "Three memory-failure fixes".

    I've been looking at the memory-failure code and I believe I have found
    three bugs that need fixing -- one going all the way back to 2010!  I'll
    have more patches later to use folios more extensively but didn't want
    these bugfixes to get caught up in that.

    This patch (of 3):

    Both collect_procs_anon() and collect_procs_file() iterate over the VMA
    interval trees looking for a single pgoff, so it is wrong to look for the
    pgoff of the head page as is currently done.  However, it is also wrong to
    look at page->mapping of the precise page as this is invalid for tail
    pages.  Clear up the confusion by passing both the folio and the precise
    page to collect_procs().

    Link: https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20231218135837.3310403-2-willy@infradead.org
    Fixes: 415c64c145 ("mm/memory-failure: split thp earlier in memory error handling")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:52 -05:00
Rafael Aquini 4d2d9ce626 mm: convert isolate_page() to mf_isolate_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 761d79fbad2a424a240a351b898b54eb674d3bdc
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Nov 8 18:28:08 2023 +0000

    mm: convert isolate_page() to mf_isolate_folio()

    The only caller now has a folio, so pass it in and operate on it.  Saves
    many page->folio conversions and introduces only one folio->page
    conversion when calling isolate_movable_page().

    Link: https://lkml.kernel.org/r/20231108182809.602073-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:38 -05:00
Rafael Aquini db213d9c8f mm: convert soft_offline_in_use_page() to use a folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 049b26048dd287d52f6f6fbe5eafa301fdca5d37
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Nov 8 18:28:07 2023 +0000

    mm: convert soft_offline_in_use_page() to use a folio

    Replace the existing head-page logic with folio logic.

    Link: https://lkml.kernel.org/r/20231108182809.602073-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:37 -05:00
Rafael Aquini ddaa8e7b76 mm: use mapping_evict_folio() in truncate_error_page()
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 19369d866a8b89788cdc9b10c7b8c9b2777f806b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Nov 8 18:28:06 2023 +0000

    mm: use mapping_evict_folio() in truncate_error_page()

    We already have the folio and the mapping, so replace the call to
    invalidate_inode_page() with mapping_evict_folio().

    Link: https://lkml.kernel.org/r/20231108182809.602073-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:23:36 -05:00
Rafael Aquini f91158fc54 mm: convert DAX lock/unlock page to lock/unlock folio
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 91e79d22be75fec88ae58d274a7c9e49d6215099
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Aug 23 00:13:14 2023 +0100

    mm: convert DAX lock/unlock page to lock/unlock folio

    The one caller of DAX lock/unlock page already calls compound_head(), so
    use page_folio() instead, then use a folio throughout the DAX code to
    remove uses of page->mapping and page->index.

    [jane.chu@oracle.com: add comment to mf_generic_kill_procs(), simplify mf_generic_kill_procs:folio initialization]
      Link: https://lkml.kernel.org/r/20230908222336.186313-1-jane.chu@oracle.com
    Link: https://lkml.kernel.org/r/20230822231314.349200-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Jane Chu <jane.chu@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:22:12 -05:00
Rado Vrbovsky 1e56a49847 Merge: mm/huge_memory: don't unpoison huge_zero_folio
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4803

  
JIRA: https://issues.redhat.com/browse/RHEL-47802  
CVE: CVE-2024-40914  
  
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Audra Mitchell <aubaker@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-22 09:12:17 +00:00
Rado Vrbovsky 570a71d7db Merge: mm: update core code to v6.6 upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252

JIRA: https://issues.redhat.com/browse/RHEL-27743  
JIRA: https://issues.redhat.com/browse/RHEL-59459    
CVE: CVE-2024-46787    
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961  
  
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.    
This work follows up on the previous v6.5 update (RHEL-27742) and as such,    
the bulk of this changeset is comprised of refactoring and clean-ups of     
the internal implementation of several APIs as it further advances the     
conversion to FOLIOS, and follow up on the per-VMA locking changes.

Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow    
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,    
and we add a potential extra level of protection (assessment pending) to help    
on mitigating kernel heap exploits dubbed as "SlubStick".     
    
Follow-up fixes are omitted from this series either because they are irrelevant to     
the bits we support on RHEL or because they depend on bigger changesets introduced     
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.    

Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")    
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")   
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")    
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")    
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")    
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")    
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")    
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")    
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")    
    
Signed-off-by: Rafael Aquini <raquini@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-30 07:22:28 +00:00
Rado Vrbovsky 23dd8c7aad Merge: mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5271

JIRA: https://issues.redhat.com/browse/RHEL-52957

commit d75abd0d0bc29e6ebfebbf76d11b4067b35844af

Author: Waiman Long <longman@redhat.com>

Date:   Tue Aug 6 12:41:07 2024 -0400

    mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu

    The memory_failure_cpu structure is a per-cpu structure.  Access to its
    content requires the use of get_cpu_var() to lock in the current CPU and
    disable preemption.  The use of a regular spinlock_t for locking purpose
    is fine for a non-RT kernel.

    Since the integration of RT spinlock support into the v5.15 kernel, a
    spinlock_t in an RT kernel becomes a sleeping lock, and taking a
    sleeping lock in a preemption-disabled context is illegal, resulting in
    the following kind of warning.

      [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
      [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
      [12135.732252] preempt_count: 1, expected: 0
      [12135.732255] RCU nest depth: 2, expected: 2
        :
      [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
      [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
      [12135.732433] Call Trace:
      [12135.732436]  <TASK>
      [12135.732450]  dump_stack_lvl+0x57/0x81
      [12135.732461]  __might_resched.cold+0xf4/0x12f
      [12135.732479]  rt_spin_lock+0x4c/0x100
      [12135.732491]  memory_failure_queue+0x40/0xe0
      [12135.732503]  ghes_do_memory_failure+0x53/0x390
      [12135.732516]  ghes_do_proc.constprop.0+0x229/0x3e0
      [12135.732575]  ghes_proc+0xf9/0x1a0
      [12135.732591]  ghes_notify_hed+0x6a/0x150
      [12135.732602]  notifier_call_chain+0x43/0xb0
      [12135.732626]  blocking_notifier_call_chain+0x43/0x60
      [12135.732637]  acpi_ev_notify_dispatch+0x47/0x70
      [12135.732648]  acpi_os_execute_deferred+0x13/0x20
      [12135.732654]  process_one_work+0x41f/0x500
      [12135.732695]  worker_thread+0x192/0x360
      [12135.732715]  kthread+0x111/0x140
      [12135.732733]  ret_from_fork+0x29/0x50
      [12135.732779]  </TASK>

    Fix it by using a raw_spinlock_t for locking instead.

    Also move the pr_err() out of the lock critical section and after
    put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep
    with this call.

    [longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe]
      Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com
    Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Acked-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>

Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-10-25 16:49:07 +00:00
Rafael Aquini d56b8e98a8 mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * mm/memory-failure.c: minor context conflict due to out-of-order backport
      of commit fa422b353d21 ("mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind")

This patch is a backport of the following upstream commit:
commit d256d1cd8da1cbc4615de69df71c87ce623fec2f
Author: Tong Tiangen <tongtiangen@huawei.com>
Date:   Mon Aug 28 10:25:27 2023 +0800

    mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()

    We found a softlockup issue in our test, analyzed the logs, and found
    the relevant CPU call traces as follows:

    CPU0:
      _do_fork
        -> copy_process()
          -> write_lock_irq(&tasklist_lock)  //Disable irq,waiting for
                                             //tasklist_lock

    CPU1:
      wp_page_copy()
        ->pte_offset_map_lock()
          -> spin_lock(&page->ptl);        //Hold page->ptl
        -> ptep_clear_flush()
          -> flush_tlb_others() ...
            -> smp_call_function_many()
              -> arch_send_call_function_ipi_mask()
                -> csd_lock_wait()         //Waiting for other CPUs respond
                                           //IPI

    CPU2:
      collect_procs_anon()
        -> read_lock(&tasklist_lock)       //Hold tasklist_lock
          ->for_each_process(tsk)
            -> page_mapped_in_vma()
              -> page_vma_mapped_walk()
                -> map_pte()
                  ->spin_lock(&page->ptl)  //Waiting for page->ptl

    We can see that CPU1 is waiting for CPU0 to respond to the IPI, CPU0 is
    waiting for CPU2 to unlock tasklist_lock, and CPU2 is waiting for CPU1
    to unlock page->ptl.  As a result, softlockup is triggered.

    For collect_procs_anon(), what we're doing is task list iteration.
    During the iteration, with the help of call_rcu(), the task_struct
    object is freed only after one or more grace periods elapse.  The
    logic is as follows:

    release_task()
      -> __exit_signal()
        -> __unhash_process()
          -> list_del_rcu()

      -> put_task_struct_rcu_user()
        -> call_rcu(&task->rcu, delayed_put_task_struct)

    delayed_put_task_struct()
      -> put_task_struct()
      -> if (refcount_sub_and_test())
            __put_task_struct()
              -> free_task()

    Therefore, under the protection of the rcu lock, we can safely use
    get_task_struct() to ensure a safe reference to task_struct during the
    iteration.

    By removing the use of tasklist_lock in task list iteration, we can break
    the softlock chain above.

    The same logic can also be applied to:
     - collect_procs_file()
     - collect_procs_fsdax()
     - collect_procs_ksm()

    Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
    Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:22:12 -04:00
Rafael Aquini 3f524cd7d8 mm: memory-failure: use helper macro llist_for_each_entry_safe()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6379693e3c2683a7c86f395e878534731ac7ed06
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon Aug 7 19:41:25 2023 +0800

    mm: memory-failure: use helper macro llist_for_each_entry_safe()

    It's more convenient to use helper macro llist_for_each_entry_safe().
    No functional change intended.

    Link: https://lkml.kernel.org/r/20230807114125.3440802-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:14 -04:00
Rafael Aquini de3ff7240b mm: memory-failure: add PageOffline() check
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 7a8817f2c96e98d5e65a59e34ba9ea1ff6ed23bc
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Jul 27 19:56:43 2023 +0800

    mm: memory-failure: add PageOffline() check

    Memory failure is not interested in logically offlined pages.  Skip this
    type of page.

    Link: https://lkml.kernel.org/r/20230727115643.639741-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:04 -04:00
Rafael Aquini b4417c5eba mm/hwpoison: rename hwp_walk* to hwpoison_walk*
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 6885938c349c14b277305cd1129e7eb14f3e2c55
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Thu Jul 13 23:55:53 2023 +0000

    mm/hwpoison: rename hwp_walk* to hwpoison_walk*

    In the discussion of "Improve hugetlbfs read on HWPOISON hugepages" [1],
    Matthew Wilcox suggests hwp is a bad abbreviation of hwpoison, as hwp is
    already used as "an acronym by acpi, intel_pstate, some clock drivers, an
    ethernet driver, and a scsi driver"[1].

    So rename hwp_walk and hwp_walk_ops to hwpoison_walk and
    hwpoison_walk_ops respectively.

    raw_hwp_(page|list), *_raw_hwp, and raw_hwp_unreliable flag are other
    major appearances of "hwp".  However, given the "raw" hint in the name, it
    is easy to differentiate them from other "hwp" acronyms.  Since renaming
    them is not as straightforward as renaming hwp_walk*, they are not covered
    by this commit.

    [1] https://lore.kernel.org/lkml/20230707201904.953262-5-jiaqiyan@google.com/T/#me6fecb8ce1ad4d5769199c9e162a44bc88f7bdec

    Link: https://lkml.kernel.org/r/20230713235553.4121855-1-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:58 -04:00
Rafael Aquini 57d6367283 mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit b79f8eb408d0468df0d6082ed958b67d94adce65
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Thu Jul 13 00:18:31 2023 +0000

    mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON

    Add the functionality, is_raw_hwpoison_page_in_hugepage, to tell if a raw
    page in a hugetlb folio is HWPOISON.  This functionality relies on
    RawHwpUnreliable to be not set; otherwise hugepage's raw HWPOISON list
    becomes meaningless.

    is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
    with folio_set_hugetlb_hwpoison and folio_free_raw_hwp, which iterate,
    insert, or delete entries in raw_hwp_list.  llist itself doesn't ensure
    insertion and removal are synchronized with the llist_for_each_entry
    used by is_raw_hwpoison_page_in_hugepage (unless iterated entries are
    already deleted from the list).  Callers can minimize the overhead of
    lock cycles by first checking the HWPOISON flag of the folio.

    Exports this functionality to be immediately used in the read operation
    for hugetlbfs.

    Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:41 -04:00
Rafael Aquini a591d4daa4 mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 9e130c4b000b0a3f0bf4b4c8e714bfe3d06ff4cc
Author: Jiaqi Yan <jiaqiyan@google.com>
Date:   Thu Jul 13 00:18:30 2023 +0000

    mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp

    Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.

    Today when hardware memory is corrupted in a hugetlb hugepage, the
    kernel leaves the hugepage in the pagecache [1]; otherwise a future
    mmap or read would be subject to silent data corruption.  This is
    implemented by returning -EIO from hugetlbfs_read_iter immediately if
    the hugepage has the HWPOISON flag set.

    Since memory_failure already tracks the raw HWPOISON subpages in a
    hugepage, a natural improvement is possible: if userspace only asks for
    healthy subpages in the pagecache, kernel can return these data.

    This patchset implements this improvement.  It consists of three parts.
    The 1st commit exports the functionality to tell if a subpage inside a
    hugetlb hugepage is a raw HWPOISON page.  The 2nd commit teaches
    hugetlbfs_read_iter to return as many healthy bytes as possible.  The 3rd
    commit properly tests this new feature.

    [1] commit 8625147cafaa ("hugetlbfs: don't delete error page from pagecache")

    This patch (of 4):

    Traversal on llist (e.g.  llist_for_each_safe) is only safe AFTER entries
    are deleted from the llist.  Correct the way __folio_free_raw_hwp deletes
    and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
    kfree within llist_for_each_safe.

    As of today, concurrent adding, deleting, and traversal on raw_hwp_list
    from hugetlb.c and/or memory-failure.c are fine with each other.  Note
    this is guaranteed partly by the lock-free nature of llist, and partly by
    holding hugetlb_lock and/or mf_mutex.  For example, as llist_del_all is
    lock-free with itself, folio_clear_hugetlb_hwpoison()s from
    __update_and_free_hugetlb_folio and memory_failure won't need explicit
    locking when freeing the raw_hwp_list.  New code that manipulates
    raw_hwp_list must be careful to ensure the concurrency correctness.

    Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
    Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:40 -04:00
Rafael Aquini 9eefa7e56d mm: memory-failure: fix race window when trying to get hugetlb folio
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit d31155b8f29ce380f7816e54dee161db6d752909
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:16 2023 +0800

    mm: memory-failure: fix race window when trying to get hugetlb folio

    page_folio() is fetched before calling get_hwpoison_hugetlb_folio()
    without hugetlb_lock being held.  So hugetlb page could be demoted before
    get_hwpoison_hugetlb_folio() holding hugetlb_lock but after page_folio()
    is fetched.  So get_hwpoison_hugetlb_folio() will hold unexpected extra
    refcnt of hugetlb folio while leaving demoted page un-refcnted.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-9-linmiaohe@huawei.com
    Fixes: 25182f05ff ("mm,hwpoison: fix race with hugetlb page allocation")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:25 -04:00
Rafael Aquini af9ce223ca mm: memory-failure: fetch compound head after extra page refcnt is held
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit a363d1224b5add67a7cafab9fdb9f19d569fbe98
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:15 2023 +0800

    mm: memory-failure: fetch compound head after extra page refcnt is held

    A page might become a THP or a huge page, or be split, after the
    compound head is fetched but before the page refcount is bumped.  So
    hpage might be a tail page, leading to VM_BUG_ON_PAGE(PageTail(page))
    in PageTransHuge().

    Link: https://lkml.kernel.org/r/20230711055016.2286677-8-linmiaohe@huawei.com
    Fixes: 415c64c145 ("mm/memory-failure: split thp earlier in memory error handling")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:24 -04:00
Rafael Aquini 4662536738 mm: memory-failure: minor cleanup for comments and codestyle
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 5885c6a62533cbda19e9eceab619bde317de0c0d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:14 2023 +0800

    mm: memory-failure: minor cleanup for comments and codestyle

    Fix some wrong function names and grammar errors in comments.  Also
    remove an unneeded space after for_each_process.  No functional change
    intended.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:23 -04:00
Rafael Aquini 82488ad7ae mm: memory-failure: remove unneeded header files
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit e9c36f7aca7efee8318b12930b846464b9b5c7a3
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:13 2023 +0800

    mm: memory-failure: remove unneeded header files

    Remove some unneeded header files. No functional change intended.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:23 -04:00
Rafael Aquini 2db725e74e mm: memory-failure: use local variable huge to check hugetlb page
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 55c7ac4527086d52dedc5da4ee3d676bcc9a7691
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:12 2023 +0800

    mm: memory-failure: use local variable huge to check hugetlb page

    Use a local variable huge to check whether the page is a hugetlb page,
    avoiding multiple PageHuge() calls and saving CPU cycles.  PageHuge()
    will remain stable while the extra page refcnt is held.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:22 -04:00
Rafael Aquini c72854eb80 mm: memory-failure: don't account hwpoison_filter() filtered pages
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 80ee7cb271b52e5861eda3c67731c95fd55a2627
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:11 2023 +0800

    mm: memory-failure: don't account hwpoison_filter() filtered pages

    mf_generic_kill_procs() will return -EOPNOTSUPP when hwpoison_filter()
    filtered a dax page.  In that case, action_result() isn't expected to
    be called to update mf_stats.  This results in inaccurate but benign
    memory failure handling statistics.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:21 -04:00
Rafael Aquini 01ebbcd120 mm: memory-failure: ensure moving HWPoison flag to the raw error pages
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 92a025a790f82c278cc39b0997e9b3b6f3b69ee0
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:10 2023 +0800

    mm: memory-failure: ensure moving HWPoison flag to the raw error pages

    If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
    called from try_memory_failure_hugetlb() won't transfer the HWPoison
    flag to subpages while the folio's HWPoison flag is cleared.  So when
    trying to free this hugetlb page into buddy, folio_clear_hugetlb_hwpoison()
    is not called to move the HWPoison flag from the head page to the raw
    error pages, even if hugetlb_vmemmap_optimized is cleared by now.  This
    results in the HWPoisoned page being used again and a raw_hwp_page leak.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-3-linmiaohe@huawei.com
    Fixes: ac5fcde0a96a ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:20 -04:00
Rafael Aquini a13415a417 mm: memory-failure: remove unneeded PageHuge() check
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit dbe70dbb41ab45a9ea2fa537c9e6c9817477dfff
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jul 11 13:50:09 2023 +0800

    mm: memory-failure: remove unneeded PageHuge() check

    Patch series "A few fixup and cleanup patches for memory-failure", v2.

    This series contains a few fixup patches to fix inaccurate mf_stats, fix
    race window when trying to get hugetlb folio and so on.  Also there is
    minor cleanup for comments and codestyle.  More details can be found in
    the respective changelogs.

    This patch (of 8):

    PageHuge() check in me_huge_page() is just for potential problems.  Remove
    it as it's actually dead code and won't catch anything.

    Link: https://lkml.kernel.org/r/20230711055016.2286677-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20230711055016.2286677-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:20 -04:00
Rafael Aquini fb3ad56464 mm: memory-failure: fix potential page refcnt leak in memory_failure()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit d51b68469bc7804c34622f7f3d4889628d37cfd6
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Jul 1 15:28:37 2023 +0800

    mm: memory-failure: fix potential page refcnt leak in memory_failure()

    put_ref_page() is not called to drop the extra refcnt when the request
    comes from madvise in the case where the pfn is valid but pgmap is
    NULL, leading to a page refcnt leak.

    Link: https://lkml.kernel.org/r/20230701072837.1994253-1-linmiaohe@huawei.com
    Fixes: 1e8aaedb18 ("mm,memory_failure: always pin the page in madvise_inject_error")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:19 -04:00
Rafael Aquini 7b195b7166 mm: memory-failure: remove unneeded page state check in shake_page()
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit b7b618da0edc85280e1c9c8f4f5239571e7c1d3e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Wed Jun 28 09:49:29 2023 +0800

    mm: memory-failure: remove unneeded page state check in shake_page()

    Remove the unneeded PageLRU(p) and is_free_buddy_page(p) checks as slab
    caches are not shrunk now.  These checks can be added back when a
    lightweight range-based shrinker is available.

    Link: https://lkml.kernel.org/r/20230628014929.3441386-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:18 -04:00
Rafael Aquini da8f97b6fb mm: memory-failure: remove unneeded 'inline' annotation
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 1a7d018dc38b6851c602b448bdac2e78b46857db
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon Jun 26 19:43:43 2023 +0800

    mm: memory-failure: remove unneeded 'inline' annotation

    Remove unneeded 'inline' annotation from num_poisoned_pages_inc() and
    num_poisoned_pages_sub().  No functional change intended.

    Link: https://lkml.kernel.org/r/20230626114343.1846587-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:18:17 -04:00
Wander Lairson Costa 7f14ec4872
mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
JIRA: https://issues.redhat.com/browse/RHEL-52957

commit d75abd0d0bc29e6ebfebbf76d11b4067b35844af
Author: Waiman Long <longman@redhat.com>
Date:   Tue Aug 6 12:41:07 2024 -0400

    mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu

    The memory_failure_cpu structure is a per-cpu structure.  Access to its
    content requires the use of get_cpu_var() to lock in the current CPU and
    disable preemption.  The use of a regular spinlock_t for locking purpose
    is fine for a non-RT kernel.

    Since the integration of RT spinlock support into the v5.15 kernel, a
    spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping
    lock in a preemption disabled context is illegal resulting in the
    following kind of warning.

      [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
      [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
      [12135.732252] preempt_count: 1, expected: 0
      [12135.732255] RCU nest depth: 2, expected: 2
        :
      [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
      [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
      [12135.732433] Call Trace:
      [12135.732436]  <TASK>
      [12135.732450]  dump_stack_lvl+0x57/0x81
      [12135.732461]  __might_resched.cold+0xf4/0x12f
      [12135.732479]  rt_spin_lock+0x4c/0x100
      [12135.732491]  memory_failure_queue+0x40/0xe0
      [12135.732503]  ghes_do_memory_failure+0x53/0x390
      [12135.732516]  ghes_do_proc.constprop.0+0x229/0x3e0
      [12135.732575]  ghes_proc+0xf9/0x1a0
      [12135.732591]  ghes_notify_hed+0x6a/0x150
      [12135.732602]  notifier_call_chain+0x43/0xb0
      [12135.732626]  blocking_notifier_call_chain+0x43/0x60
      [12135.732637]  acpi_ev_notify_dispatch+0x47/0x70
      [12135.732648]  acpi_os_execute_deferred+0x13/0x20
      [12135.732654]  process_one_work+0x41f/0x500
      [12135.732695]  worker_thread+0x192/0x360
      [12135.732715]  kthread+0x111/0x140
      [12135.732733]  ret_from_fork+0x29/0x50
      [12135.732779]  </TASK>

    Fix it by using a raw_spinlock_t for locking instead.

    Also move the pr_err() out of the lock critical section and after
    put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep
    with this call.

    [longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe]
      Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com
    Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Acked-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2024-09-24 11:50:04 -03:00
Rafael Aquini a85223eeb8 mm: ptep_get() conversion
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
      these are already applied via RHEL commit 26418f1a34 ("Merge DRM
      changes from upstream v6.4..v6.5")
  * kernel/events/uprobes.c: minor context difference due to backport of upstream
      commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
      as part of mmu_notifier_invalidate_range_end()")
  * mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
      commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
  * mm/hugetlb.c: hunk dropped as it's unnecessary given the proactive work done
      on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
      within a BUG(): inconsistent pte comparison")
  * mm/ksm.c: context conflicts and differences on the 1st hunk are due to
      out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
      give up if pte_offset_map[_lock]() fails") being compensated for only now.
  * mm/memory.c: minor context difference on the 35th hunk due to backport of
      upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
  * mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
      upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
      MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
  * mm/migrate.c: minor context difference on the 2nd hunk due to backport of
      upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
      f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
      indicates write permissions")
  * mm/migrate_device.c: minor context difference on the 5th hunk due to backport
      of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
      TLBs as part of mmu_notifier_invalidate_range_end()")
  * mm/swapfile.c: minor context differences on the 1st and 2nd hunks due to
      backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
      entry type for hwpoisoned swapcache page")
  * mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
      upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
      lru_gen_look_around()")

This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Mon Jun 12 16:15:45 2023 +0100

    mm: ptep_get() conversion

    Convert all instances of direct pte_t* dereferencing to instead use
    ptep_get() helper.  This means that by default, the accesses change from a
    C dereference to a READ_ONCE().  This is technically the correct thing to
    do since where pgtables are modified by HW (for access/dirty) they are
    volatile and therefore we should always ensure READ_ONCE() semantics.

    But more importantly, by always using the helper, it can be overridden by
    the architecture to fully encapsulate the contents of the pte.  Arch code
    is deliberately not converted, as the arch code knows best.  It is
    intended that arch code (arm64) will override the default with its own
    implementation that can (e.g.) hide certain bits from the core code, or
    determine young/dirty status by mixing in state from another source.

    Conversion was done using Coccinelle:

    ----

    // $ make coccicheck \
    //          COCCI=ptepget.cocci \
    //          SPFLAGS="--include-headers" \
    //          MODE=patch

    virtual patch

    @ depends on patch @
    pte_t *v;
    @@

    - *v
    + ptep_get(v)

    ----

    Then reviewed and hand-edited to avoid multiple unnecessary calls to
    ptep_get(), instead opting to store the result of a single call in a
    variable, where it is correct to do so.  This aims to negate any cost of
    READ_ONCE() and will benefit arch-overrides that may be more complex.

    Included is a fix for an issue in an earlier version of this patch that
    was pointed out by kernel test robot.  The issue arose because config
    MMU=n elides definition of the ptep helper functions, including
    ptep_get().  HUGETLB_PAGE=n configs still define a simple
    huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
    So when both configs are disabled, this caused a build error because
    ptep_get() is not defined.  Fix by continuing to do a direct dereference
    when MMU=n.  This is safe because for this config the arch code cannot be
    trying to virtualize the ptes because none of the ptep helpers are
    defined.

    Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
    Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Dave Airlie <airlied@gmail.com>
    Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ian Rogers <irogers@google.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Jiri Olsa <jolsa@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:36:52 -04:00
Rafael Aquini db17223c45 mm: memory-failure: move sysctl register in memory_failure_init()
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 97de10a9932c363a4e4ee9bb5f2297e254fb1413
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon May 8 19:41:28 2023 +0800

    mm: memory-failure: move sysctl register in memory_failure_init()

    There is already a memory_failure_init(), don't add a new initcall, move
    register_sysctl_init() into it to cleanup a bit.

    Link: https://lkml.kernel.org/r/20230508114128.37081-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:21 -04:00
Aristeu Rozanski 0fc3f0b80f mm/huge_memory: don't unpoison huge_zero_folio
JIRA: https://issues.redhat.com/browse/RHEL-47802
Tested: by me, sanity. Can't poison the huge_zero_page through conventional means as it's not part of a valid section
CVE: CVE-2024-40914
Conflicts: we don't have is_huge_zero_folio() backported

commit fe6f86f4b40855a130a19aa589f9ba7f650423f4
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu May 16 20:26:08 2024 +0800

    mm/huge_memory: don't unpoison huge_zero_folio

    When I did memory failure tests recently, below panic occurs:

     kernel BUG at include/linux/mm.h:1135!
     invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
     CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14
     RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
     RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
     RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
     RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
     RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
     R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
     R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
     FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
     Call Trace:
      <TASK>
      do_shrink_slab+0x14f/0x6a0
      shrink_slab+0xca/0x8c0
      shrink_node+0x2d0/0x7d0
      balance_pgdat+0x33a/0x720
      kswapd+0x1f3/0x410
      kthread+0xd5/0x100
      ret_from_fork+0x2f/0x50
      ret_from_fork_asm+0x1a/0x30
      </TASK>
     Modules linked in: mce_inject hwpoison_inject
     ---[ end trace 0000000000000000 ]---
     RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
     RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
     RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
     RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
     RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
     R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
     R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
     FS:  0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0

    The root cause is that the HWPoison flag will be set for huge_zero_folio
    without increasing the folio refcnt.  But then unpoison_memory() will
    decrease the folio refcnt unexpectedly, as it appears to be a successfully
    hwpoisoned folio, leading to VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    when releasing huge_zero_folio.

    Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue.
    We're not prepared to unpoison huge_zero_folio yet.

    Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com
    Fixes: 478d134e9506 ("mm/huge_memory: do not overkill when splitting huge_zero_page")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: Xu Yu <xuyu@linux.alibaba.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-08-15 11:38:52 -04:00
Aristeu Rozanski b760bc52c3 mm/memory-failure: fix handling of dissolved but not taken off from buddy pages
JIRA: https://issues.redhat.com/browse/RHEL-45023
CVE: CVE-2024-39298
Tested: by me

commit 8cf360b9d6a840700e06864236a01a883b34bbad
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu May 23 15:12:17 2024 +0800

    mm/memory-failure: fix handling of dissolved but not taken off from buddy pages

    When I did memory failure tests recently, below panic occurs:

    page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
    flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
    raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
    raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
    ------------[ cut here ]------------
    kernel BUG at include/linux/page-flags.h:1009!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    RIP: 0010:__del_page_from_free_list+0x151/0x180
    RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
    RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
    RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
    RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
    R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
    R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
    FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
    Call Trace:
     <TASK>
     __rmqueue_pcplist+0x23b/0x520
     get_page_from_freelist+0x26b/0xe40
     __alloc_pages_noprof+0x113/0x1120
     __folio_alloc_noprof+0x11/0xb0
     alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
     __alloc_fresh_hugetlb_folio+0xe7/0x140
     alloc_pool_huge_folio+0x68/0x100
     set_max_huge_pages+0x13d/0x340
     hugetlb_sysctl_handler_common+0xe8/0x110
     proc_sys_call_handler+0x194/0x280
     vfs_write+0x387/0x550
     ksys_write+0x64/0xe0
     do_syscall_64+0xc2/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7ff916114887
    RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
    RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
    RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
    R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
    R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
     </TASK>
    Modules linked in: mce_inject hwpoison_inject
    ---[ end trace 0000000000000000 ]---

    And before the panic, there was a warning about bad page state:

    BUG: Bad page state in process page-types  pfn:8cee00
    page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
    flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
    page_type: 0xffffff7f(buddy)
    raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
    raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
    page dumped because: nonzero mapcount
    Modules linked in: mce_inject hwpoison_inject
    CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
    Call Trace:
     <TASK>
     dump_stack_lvl+0x83/0xa0
     bad_page+0x63/0xf0
     free_unref_page+0x36e/0x5c0
     unpoison_memory+0x50b/0x630
     simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
     debugfs_attr_write+0x42/0x60
     full_proxy_write+0x5b/0x80
     vfs_write+0xcd/0x550
     ksys_write+0x64/0xe0
     do_syscall_64+0xc2/0x1d0
     entry_SYSCALL_64_after_hwframe+0x77/0x7f
    RIP: 0033:0x7f189a514887
    RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
    RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
    RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
    R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
     </TASK>

    The root cause should be the below race:

     memory_failure
      try_memory_failure_hugetlb
       me_huge_page
        __page_handle_poison
         dissolve_free_hugetlb_folio
         drain_all_pages -- Buddy page can be isolated e.g. for compaction.
         take_page_off_buddy -- Failed as page is not in the buddy list.
                 -- Page can be putback into buddy after compaction.
        page_ref_inc -- Leads to buddy page with refcnt = 1.

    Then unpoison_memory() can unpoison the page and send the buddy page back
    into buddy list again leading to the above bad page state warning.  And
    bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy
    page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to
    allocate this page.

    Fix this issue by only treating __page_handle_poison() as successful when
    it returns 1.

    Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
    Fixes: ceaf8fbea79a ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-07-17 12:48:06 -04:00
Lucas Zampieri a887cb4a39 Merge: mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4178

JIRA: https://issues.redhat.com/browse/RHEL-35090  
JIRA: https://issues.redhat.com/browse/RHEL-35091  
CVE: CVE-2024-26987  
Tested: no reproducer, can't reproduce it myself  
  
commit 1983184c22dd84a4d95a71e5c6775c2638557dc7  
Author: Miaohe Lin <linmiaohe@huawei.com>  
Date:   Sun Apr 7 16:54:56 2024 +0800  
  
    mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled  
  
    When I did hard offline test with hugetlb pages, below deadlock occurs:  
  
    ======================================================  
    WARNING: possible circular locking dependency detected  
    6.8.0-11409-gf6cef5f8c37f #1 Not tainted  
    ------------------------------------------------------  
    bash/46904 is trying to acquire lock:  
    ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60  
  
    but task is already holding lock:  
    ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40  
  
    which lock already depends on the new lock.  
  
    the existing dependency chain (in reverse order) is:  
  
    -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:  
           __mutex_lock+0x6c/0x770  
           page_alloc_cpu_online+0x3c/0x70  
           cpuhp_invoke_callback+0x397/0x5f0  
           __cpuhp_invoke_callback_range+0x71/0xe0  
           _cpu_up+0xeb/0x210  
           cpu_up+0x91/0xe0  
           cpuhp_bringup_mask+0x49/0xb0  
           bringup_nonboot_cpus+0xb7/0xe0  
           smp_init+0x25/0xa0  
           kernel_init_freeable+0x15f/0x3e0  
           kernel_init+0x15/0x1b0  
           ret_from_fork+0x2f/0x50  
           ret_from_fork_asm+0x1a/0x30  
  
    -> #0 (cpu_hotplug_lock){++++}-{0:0}:  
           __lock_acquire+0x1298/0x1cd0  
           lock_acquire+0xc0/0x2b0  
           cpus_read_lock+0x2a/0xc0  
           static_key_slow_dec+0x16/0x60  
           __hugetlb_vmemmap_restore_folio+0x1b9/0x200  
           dissolve_free_huge_page+0x211/0x260  
           __page_handle_poison+0x45/0xc0  
           memory_failure+0x65e/0xc70  
           hard_offline_page_store+0x55/0xa0  
           kernfs_fop_write_iter+0x12c/0x1d0  
           vfs_write+0x387/0x550  
           ksys_write+0x64/0xe0  
           do_syscall_64+0xca/0x1e0  
           entry_SYSCALL_64_after_hwframe+0x6d/0x75  
  
    other info that might help us debug this:  
  
     Possible unsafe locking scenario:  
  
           CPU0                    CPU1  
           ----                    ----  
      lock(pcp_batch_high_lock);  
                                   lock(cpu_hotplug_lock);  
                                   lock(pcp_batch_high_lock);  
      rlock(cpu_hotplug_lock);  
  
     *** DEADLOCK ***  
  
    5 locks held by bash/46904:  
  
    stack backtrace:  
    CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1  
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014  
    Call Trace:  
     <TASK>  
     dump_stack_lvl+0x68/0xa0  
     check_noncircular+0x129/0x140  
     __lock_acquire+0x1298/0x1cd0  
     lock_acquire+0xc0/0x2b0  
     cpus_read_lock+0x2a/0xc0  
     static_key_slow_dec+0x16/0x60  
     __hugetlb_vmemmap_restore_folio+0x1b9/0x200  
     dissolve_free_huge_page+0x211/0x260  
     __page_handle_poison+0x45/0xc0  
     memory_failure+0x65e/0xc70  
     hard_offline_page_store+0x55/0xa0  
     kernfs_fop_write_iter+0x12c/0x1d0  
     vfs_write+0x387/0x550  
     ksys_write+0x64/0xe0  
     do_syscall_64+0xca/0x1e0  
     entry_SYSCALL_64_after_hwframe+0x6d/0x75  
    RIP: 0033:0x7fc862314887  
    Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24  
    RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001  
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887  
    RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001  
    RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff  
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c  
    R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00  
  
     In short, the scenario below breaks the lock dependency chain:
  
     memory_failure  
      __page_handle_poison  
       zone_pcp_disable -- lock(pcp_batch_high_lock)  
       dissolve_free_huge_page  
        __hugetlb_vmemmap_restore_folio  
         static_key_slow_dec  
          cpus_read_lock -- rlock(cpu_hotplug_lock)  
  
    Fix this by calling drain_all_pages() instead.  
  
     This issue could not occur before commit a6b40850c442 ("mm: hugetlb: replace
     hugetlb_free_vmemmap_enabled with a static_key"), which introduced
     rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while
     lock(pcp_batch_high_lock) is already held in __page_handle_poison().
  
    [linmiaohe@huawei.com: extend comment per Oscar]  
    [akpm@linux-foundation.org: reflow block comment]  
    Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com  
    Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")  
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>  
    Acked-by: Oscar Salvador <osalvador@suse.de>  
    Reviewed-by: Jane Chu <jane.chu@oracle.com>  
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>  
    Cc: <stable@vger.kernel.org>  
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  
  
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Audra Mitchell <aubaker@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-06-12 13:30:21 +00:00
Aristeu Rozanski 21aa1b07b5 mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
JIRA: https://issues.redhat.com/browse/RHEL-35090
JIRA: https://issues.redhat.com/browse/RHEL-35091
CVE: CVE-2024-26987
Tested: no reproducer, can't reproduce it myself

commit 1983184c22dd84a4d95a71e5c6775c2638557dc7
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sun Apr 7 16:54:56 2024 +0800

    mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

    When I did hard offline test with hugetlb pages, below deadlock occurs:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.8.0-11409-gf6cef5f8c37f #1 Not tainted
    ------------------------------------------------------
    bash/46904 is trying to acquire lock:
    ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

    but task is already holding lock:
    ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
           __mutex_lock+0x6c/0x770
           page_alloc_cpu_online+0x3c/0x70
           cpuhp_invoke_callback+0x397/0x5f0
           __cpuhp_invoke_callback_range+0x71/0xe0
           _cpu_up+0xeb/0x210
           cpu_up+0x91/0xe0
           cpuhp_bringup_mask+0x49/0xb0
           bringup_nonboot_cpus+0xb7/0xe0
           smp_init+0x25/0xa0
           kernel_init_freeable+0x15f/0x3e0
           kernel_init+0x15/0x1b0
           ret_from_fork+0x2f/0x50
           ret_from_fork_asm+0x1a/0x30

    -> #0 (cpu_hotplug_lock){++++}-{0:0}:
           __lock_acquire+0x1298/0x1cd0
           lock_acquire+0xc0/0x2b0
           cpus_read_lock+0x2a/0xc0
           static_key_slow_dec+0x16/0x60
           __hugetlb_vmemmap_restore_folio+0x1b9/0x200
           dissolve_free_huge_page+0x211/0x260
           __page_handle_poison+0x45/0xc0
           memory_failure+0x65e/0xc70
           hard_offline_page_store+0x55/0xa0
           kernfs_fop_write_iter+0x12c/0x1d0
           vfs_write+0x387/0x550
           ksys_write+0x64/0xe0
           do_syscall_64+0xca/0x1e0
           entry_SYSCALL_64_after_hwframe+0x6d/0x75

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(pcp_batch_high_lock);
                                   lock(cpu_hotplug_lock);
                                   lock(pcp_batch_high_lock);
      rlock(cpu_hotplug_lock);

     *** DEADLOCK ***

    5 locks held by bash/46904:
     #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
     #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
     #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
     #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
     #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

    stack backtrace:
    CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x68/0xa0
     check_noncircular+0x129/0x140
     __lock_acquire+0x1298/0x1cd0
     lock_acquire+0xc0/0x2b0
     cpus_read_lock+0x2a/0xc0
     static_key_slow_dec+0x16/0x60
     __hugetlb_vmemmap_restore_folio+0x1b9/0x200
     dissolve_free_huge_page+0x211/0x260
     __page_handle_poison+0x45/0xc0
     memory_failure+0x65e/0xc70
     hard_offline_page_store+0x55/0xa0
     kernfs_fop_write_iter+0x12c/0x1d0
     vfs_write+0x387/0x550
     ksys_write+0x64/0xe0
     do_syscall_64+0xca/0x1e0
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    RIP: 0033:0x7fc862314887
    Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
    RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
    RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
    RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
    R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

    In short, the scenario below breaks the lock dependency chain:

     memory_failure
      __page_handle_poison
       zone_pcp_disable -- lock(pcp_batch_high_lock)
       dissolve_free_huge_page
        __hugetlb_vmemmap_restore_folio
         static_key_slow_dec
          cpus_read_lock -- rlock(cpu_hotplug_lock)

    Fix this by calling drain_all_pages() instead.

    This issue could not occur before commit a6b40850c442 ("mm: hugetlb: replace
    hugetlb_free_vmemmap_enabled with a static_key"), which introduced
    rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while
    lock(pcp_batch_high_lock) is already held in __page_handle_poison().

    [linmiaohe@huawei.com: extend comment per Oscar]
    [akpm@linux-foundation.org: reflow block comment]
    Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
    Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Jane Chu <jane.chu@oracle.com>
    Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-05-22 14:40:10 -04:00
Nico Pache ca66b21fc3 mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
commit 2fde9e7f9e6dc38e1d7091b9705c22be945c8697
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Wed Jan 24 16:40:14 2024 +0800

    mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

    When I did soft offline stress test, a machine was observed to crash with
    the following message:

      kernel BUG at include/linux/memcontrol.h:554!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      RIP: 0010:folio_memcg+0xaf/0xd0
      Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
      RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
      RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
      RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
      RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
      R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
      R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
      FS:  00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       split_huge_page_to_list+0x4d/0x1380
       try_to_split_thp_page+0x3a/0xf0
       soft_offline_page+0x1ea/0x8a0
       soft_offline_page_store+0x52/0x90
       kernfs_fop_write_iter+0x118/0x1b0
       vfs_write+0x30b/0x430
       ksys_write+0x5e/0xe0
       do_syscall_64+0xb0/0x1b0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7f6c60d14697
      Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
      RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
      RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00

    The problem is that page->mapping is overloaded with slab->slab_list or
    slabs fields now, so slab pages could be taken as non-LRU movable pages if
    the field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set to
    LIST_POISON2.  These slab pages will be treated as THP later, leading to a
    crash in split_huge_page_to_list().

    Link: https://lkml.kernel.org/r/20240126065837.2100184-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20240124084014.1772906-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:32 -06:00
Nico Pache efb0b44050 mm: memory-failure: fix unexpected return value in soft_offline_page()
commit e2c1ab070fdc81010ec44634838d24fce9ff9e53
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Jun 27 19:28:08 2023 +0800

    mm: memory-failure: fix unexpected return value in soft_offline_page()

    When page_handle_poison() fails to handle the hugepage or free page in
    retry path, soft_offline_page() will return 0 while -EBUSY is expected in
    this case.

    Consequently the user will think soft_offline_page() succeeded when it in
    fact failed, and so will not try again later.

    Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
    Fixes: b94e02822d ("mm,hwpoison: try to narrow window race for free pages")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:28 -06:00
Nico Pache 0b91dbac20 mm: enable page walking API to lock vmas during the walk
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Aug 4 08:27:19 2023 -0700

    mm: enable page walking API to lock vmas during the walk

    walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas.  Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks.  With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them.  The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration.  Without this change a page can be
    concurrently faulted into the area and be left out of migration.

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Suggested-by: Jann Horn <jannh@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Liam Howlett <liam.howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:27 -06:00
Nico Pache fcd437b046 mm/memory-failure: fix hardware poison check in unpoison_memory()
commit 6c54312f9689fbe27c70db5d42eebd29d04b672e
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Mon Jul 17 11:18:12 2023 -0700

    mm/memory-failure: fix hardware poison check in unpoison_memory()

    It was pointed out[1] that using folio_test_hwpoison() is wrong as we need
    to check the individual page that has poison.  folio_test_hwpoison() only
    checks the head page so go back to using PageHWPoison().

    User-visible effects include existing hwpoison-inject tests possibly
    failing as unpoisoning a single subpage could lead to unpoisoning an
    entire folio.  Memory unpoisoning could also not work as expected as
    the function will break early due to only checking the head page and
    not the actually poisoned subpage.

    [1]: https://lore.kernel.org/lkml/ZLIbZygG7LqSI9xe@casper.infradead.org/

    Link: https://lkml.kernel.org/r/20230717181812.167757-1-sidhartha.kumar@oracle.com
    Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Nico Pache 5f1ce034db mm: memory-failure: avoid false hwpoison page mapped error info
commit faeb2ff2c1c5cb60ce0da193580b256c941f99ca
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Jul 27 19:56:42 2023 +0800

    mm: memory-failure: avoid false hwpoison page mapped error info

    folio->_mapcount is overloaded in SLAB, so folio_mapped() has to be done
    after folio_test_slab() is checked. Otherwise slab folio might be treated
    as a mapped folio leading to false 'Someone maps the hwpoison page' error
    info.

    Link: https://lkml.kernel.org/r/20230727115643.639741-4-linmiaohe@huawei.com
    Fixes: 230ac719c5 ("mm/hwpoison: don't try to unpoison containment-failed pages")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Nico Pache 2f93afe6ea mm: memory-failure: fix potential unexpected return value from unpoison_memory()
commit f29623e4a599c295cc8f518c8e4bb7848581a14d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Jul 27 19:56:41 2023 +0800

    mm: memory-failure: fix potential unexpected return value from unpoison_memory()

    If unpoison_memory() fails to clear the page's hwpoisoned flag, the return
    value ret is expected to be -EBUSY.  But when get_hwpoison_page() returns 1
    and clearing the hwpoisoned flag fails due to races, the return value will
    be an unexpected 1, confusing users.  And there's a code smell
    that the variable "ret" is used not only to save the return value of
    unpoison_memory(), but also the return value from get_hwpoison_page().
    Make a further cleanup by using another auto-variable solely to save the
    return value of get_hwpoison_page() as suggested by Naoya.

    Link: https://lkml.kernel.org/r/20230727115643.639741-3-linmiaohe@huawei.com
    Fixes: bf181c582588 ("mm/hwpoison: fix unpoison_memory()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Chris von Recklinghausen a3d0d37911 mm: ksm: support hwpoison for ksm page
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4248d0083ec5817eebfb916c54950d100b3468ee
Author: Longlong Xia <xialonglong1@huawei.com>
Date:   Fri Apr 14 10:17:41 2023 +0800

    mm: ksm: support hwpoison for ksm page

    hwpoison_user_mappings() is updated to support ksm pages, and
    collect_procs_ksm() is added to collect processes when the error hits a
    ksm page.
    The difference from collect_procs_anon() is that it also needs to traverse
    the rmap-item list on the stable node of the ksm page.  At the same time,
    add_to_kill_ksm() is added to handle ksm pages.  And
    task_in_to_kill_list() is added to avoid duplicate addition of tsk to the
    to_kill list.  This is because when scanning the list, if the pages that
    make up the ksm page all come from the same process, they may be added
    repeatedly.

    Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
    Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:01 -04:00
Chris von Recklinghausen 2c83d5cb71 mm: memory-failure: refactor add_to_kill()
Conflicts: mm/memory-failure.c - conflicts in the workaround for
	36537a67d356 ("mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon()")
	cause conflicts with this patch. Make the end result identical
	to upstream.

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 4f775086a6eee07c6ae4be4734d736e13b537351
Author: Longlong Xia <xialonglong1@huawei.com>
Date:   Fri Apr 14 10:17:40 2023 +0800

    mm: memory-failure: refactor add_to_kill()

    Patch series "mm: ksm: support hwpoison for ksm page", v2.

    Currently, ksm does not support hwpoison.  As ksm is being used more
    widely for deduplication at the system level, container level, and process
    level, supporting hwpoison for ksm has become increasingly important.
    However, ksm pages were not processed by hwpoison in 2009 [1].

    The main method of implementation:

    1. Refactor add_to_kill() and add new add_to_kill_*() to better
       accommodate the handling of different types of pages.

    2. Add collect_procs_ksm() to collect processes when the error hits a
       ksm page.

    3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to
       the to_kill list.

    4. Try_to_unmap ksm page (already supported).

    5. Handle related processes such as sending SIGBUS.

    Tested with poisoning to ksm page from
    1) different process
    2) one process

    and with/without memory_failure_early_kill set, the processes are killed
    as expected with the patchset.

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=01e00f880ca700376e1845cf7a2524ebe68e47d6

    This patch (of 2):

    page_address_in_vma() is used to find the user virtual address of the page
    in add_to_kill(), but it doesn't support ksm pages because the ksm
    page->index is unusable.  Add a ksm_addr parameter to add_to_kill() and
    let the caller pass it, and rename the function to __add_to_kill(), adding
    add_to_kill_anon_file() for handling anonymous pages and file pages, and
    add_to_kill_fsdax() for handling fsdax pages.

    Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei.com
    Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei.com
    Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
    Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:01 -04:00
Chris von Recklinghausen de9f16f4d7 mm: memory-failure: Move memory failure sysctls to its own file
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 8cbc82f3ec0d58961cf9c1e5d99e56741f4bf134
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Mar 20 15:40:10 2023 +0800

    mm: memory-failure: Move memory failure sysctls to its own file

    The sysctl_memory_failure_early_kill and memory_failure_recovery
    are only used in memory-failure.c, move them to its own file.

    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    [mcgrof: fix by adding empty ctl entry, this caused a crash]
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:47 -04:00
Chris von Recklinghausen 369c2efd68 mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT)
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 611b9fd80fb53e79079744fb29d8c1363f4b5c02
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Mon Mar 13 13:39:29 2023 +0800

    mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT)

    It's more clear and simple to just use IS_ENABLED(CONFIG_HWPOISON_INJECT)
    to check whether or not to enable HWPoison injector module instead of
    CONFIG_HWPOISON_INJECT/CONFIG_HWPOISON_INJECT_MODULE.

    Link: https://lkml.kernel.org/r/20230313053929.84607-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:16 -04:00
Aristeu Rozanski b495e79b31 mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6da6b1d4a7df8c35770186b53ef65d388398e139
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date:   Tue Feb 21 17:59:05 2023 +0900

    mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON

    After a memory error happens on a clean folio, a process unexpectedly
    receives SIGBUS when it accesses the error page.  This SIGBUS killing is
    pointless and simply degrades the level of RAS of the system, because the
    clean folio can be dropped without any data lost on memory error handling
    as we do for a clean pagecache.

    When memory_failure() is called on a clean folio, try_to_unmap() is called
    twice (one from split_huge_page() and one from hwpoison_user_mappings()).
    The root cause of the issue is that pte conversion to hwpoisoned entry is
    now done in the first call of try_to_unmap() because PageHWPoison is
    already set at this point, while it's actually expected to be done in the
    second call.  This behavior disturbs the error handling operation like
    removing pagecache, which results in the malfunction described above.

    So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
    when we really intend to convert pte to hwpoison entry.  This can prevent
    other callers of try_to_unmap() from accidentally converting to hwpoison
    entries.

    Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
    Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 837cf9f325 mm: change to return bool for isolate_movable_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit cd7755800eb54e8522f5e51f4e71e6494c1f1572
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:37 2023 +0800

    mm: change to return bool for isolate_movable_page()

    Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
    will care about the negative return value, thus we can convert the
    isolate_movable_page() to return a boolean value to make the code more
    clear when checking the movable page isolation state.

    No functional changes intended.

    [akpm@linux-foundation.org: remove unneeded comment, per Matthew]
    Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 39263f3448 mm: hugetlb: change to return bool for isolate_hugetlb()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9747b9e92418b61c2281561e0651803f1fad0159
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:36 2023 +0800

    mm: hugetlb: change to return bool for isolate_hugetlb()

    Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
    care about the negative value, thus we can convert the isolate_hugetlb()
    to return a boolean value to make code more clear when checking the
    hugetlb isolation state.  Moreover converts 2 users which will consider
    the negative value returned by isolate_hugetlb().

    No functional changes intended.

    [akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
    Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:24 -04:00
Aristeu Rozanski 4c96f5154f mm: change to return bool for isolate_lru_page()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Feb 15 18:39:35 2023 +0800

    mm: change to return bool for isolate_lru_page()

    The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
    care about the negative error of isolate_lru_page(), except one user in
    add_page_for_migration().  So we can convert the isolate_lru_page() to
    return a boolean value, which can help to make the code more clear when
    checking the return value of isolate_lru_page().

    Also convert all users' logic of checking the isolation state.

    No functional changes intended.

    Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reviewed-by: SeongJae Park <sj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:23 -04:00
Aristeu Rozanski edf79d9715 mm/hugetlb: convert isolate_hugetlb to folios
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 6aa3a920125e9f58891e2b5dc2efd4d0c1ff05a6
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date:   Fri Jan 13 16:30:50 2023 -0600

    mm/hugetlb: convert isolate_hugetlb to folios

    Patch series "continue hugetlb folio conversion", v3.

    This series continues the conversion of core hugetlb functions to use
    folios. This series converts many helper functions in the hugetlb fault
    path. This is in preparation for another series to convert the hugetlb
    fault code paths to operate on folios.

    This patch (of 8):

    Convert isolate_hugetlb() to take in a folio and convert its callers to
    pass a folio.  Using page_folio() to convert the callers to use a folio is
    safe, as isolate_hugetlb() operates on a head page.

    Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:20 -04:00