JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 19d3e221807772f8443e565234a6fdc5a2b09d26
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date: Fri Jan 12 10:08:40 2024 -0800
fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling
has_extra_refcount() makes the assumption that the page cache adds a ref
count of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1
(mm/filemap: remove hugetlb special casing in filemap.c) modifies
__filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases
(including hugtetlb) where nr is the number of pages in the folio. We
should adjust the number of references coming from the page cache by
subtracing the number of pages rather than 1.
In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong
flag as, in the hugetlb case, memory-failure code calls
folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is
the correct function to test for that flag.
After these fixes, the hugetlb hwpoison read selftest passes all cases.
Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com
Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519
Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: James Houghton <jthoughton@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org> [6.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 39ebd6dce62d8cfe3864e16148927a139f11bc9a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Dec 18 13:58:37 2023 +0000
mm/memory-failure: cast index to loff_t before shifting it
On 32-bit systems, we'll lose the top bits of index because arithmetic
will be performed in unsigned long instead of unsigned long long. This
affects files over 4GB in size.
Link: https://lkml.kernel.org/r/20231218135837.3310403-4-willy@infradead.org
Fixes: 6100e34b25 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit c79c5a0a00a9457718056b588f312baadf44e471
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Dec 18 13:58:36 2023 +0000
mm/memory-failure: check the mapcount of the precise page
A process may map only some of the pages in a folio, and might be missed
if it maps the poisoned page but not the head page. Or it might be
unnecessarily hit if it maps the head page, but not the poisoned page.
Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org
Fixes: 7af446a841 ("HWPOISON, hugetlb: enable error handling path for hugepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 376907f3a0b34a17e80417825f8cc1c40fcba81b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Dec 18 13:58:35 2023 +0000
mm/memory-failure: pass the folio and the page to collect_procs()
Patch series "Three memory-failure fixes".
I've been looking at the memory-failure code and I believe I have found
three bugs that need fixing -- one going all the way back to 2010! I'll
have more patches later to use folios more extensively but didn't want
these bugfixes to get caught up in that.
This patch (of 3):
Both collect_procs_anon() and collect_procs_file() iterate over the VMA
interval trees looking for a single pgoff, so it is wrong to look for the
pgoff of the head page as is currently done. However, it is also wrong to
look at page->mapping of the precise page as this is invalid for tail
pages. Clear up the confusion by passing both the folio and the precise
page to collect_procs().
Link: https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231218135837.3310403-2-willy@infradead.org
Fixes: 415c64c145 ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 761d79fbad2a424a240a351b898b54eb674d3bdc
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Nov 8 18:28:08 2023 +0000
mm: convert isolate_page() to mf_isolate_folio()
The only caller now has a folio, so pass it in and operate on it. Saves
many page->folio conversions and introduces only one folio->page
conversion when calling isolate_movable_page().
Link: https://lkml.kernel.org/r/20231108182809.602073-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 049b26048dd287d52f6f6fbe5eafa301fdca5d37
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Nov 8 18:28:07 2023 +0000
mm: convert soft_offline_in_use_page() to use a folio
Replace the existing head-page logic with folio logic.
Link: https://lkml.kernel.org/r/20231108182809.602073-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 19369d866a8b89788cdc9b10c7b8c9b2777f806b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Nov 8 18:28:06 2023 +0000
mm: use mapping_evict_folio() in truncate_error_page()
We already have the folio and the mapping, so replace the call to
invalidate_inode_page() with mapping_evict_folio().
Link: https://lkml.kernel.org/r/20231108182809.602073-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27745
This patch is a backport of the following upstream commit:
commit 91e79d22be75fec88ae58d274a7c9e49d6215099
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Wed Aug 23 00:13:14 2023 +0100
mm: convert DAX lock/unlock page to lock/unlock folio
The one caller of DAX lock/unlock page already calls compound_head(), so
use page_folio() instead, then use a folio throughout the DAX code to
remove uses of page->mapping and page->index.
[jane.chu@oracle.com: add comment to mf_generic_kill_procss(), simplify mf_generic_kill_procs:folio initialization]
Link: https://lkml.kernel.org/r/20230908222336.186313-1-jane.chu@oracle.com
Link: https://lkml.kernel.org/r/20230822231314.349200-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5252
JIRA: https://issues.redhat.com/browse/RHEL-27743
JIRA: https://issues.redhat.com/browse/RHEL-59459
CVE: CVE-2024-46787
Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4961
This MR brings RHEL9 core MM code up to upstream's v6.6 LTS level.
This work follows up on the previous v6.5 update (RHEL-27742) and as such,
the bulk of this changeset is comprised of refactoring and clean-ups of
the internal implementation of several APIs as it further advances the
conversion to FOLIOS, and follow up on the per-VMA locking changes.
Also, with the rebase to v6.6 LTS, we complete the infrastructure to allow
Control-flow Enforcement Technology, a.k.a. Shadow Stacks, for x86 builds,
and we add a potential extra level of protection (assessment pending) to help
on mitigating kernel heap exploits dubbed as "SlubStick".
Follow-up fixes are omitted from this series either because they are irrelevant to
the bits we support on RHEL or because they depend on bigger changesets introduced
upstream more recently. A follow-up ticket (RHEL-27745) will deal with these and other cases separately.
Omitted-fix: e540b8c5da04 ("mips: mm: add slab availability checking in ioremap_prot")
Omitted-fix: f7875966dc0c ("tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources")
Omitted-fix: df39038cd895 ("s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()")
Omitted-fix: 12bbaae7635a ("mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros")
Omitted-fix: fd1a745ce03e ("mm: support page_mapcount() on page_has_type() pages")
Omitted-fix: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType")
Omitted-fix: fa2690af573d ("mm: page_ref: remove folio_try_get_rcu()")
Omitted-fix: f442fa614137 ("mm: gup: stop abusing try_grab_folio")
Omitted-fix: cb0f01beb166 ("mm/mprotect: fix dax pud handling")
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Mark Salter <msalter@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Steve Best <sbest@redhat.com>
Approved-by: David Airlie <airlied@redhat.com>
Approved-by: Michal Schmidt <mschmidt@redhat.com>
Approved-by: Baoquan He <5820488-baoquan_he@users.noreply.gitlab.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5271
JIRA: https://issues.redhat.com/browse/RHEL-52957
commit d75abd0d0bc29e6ebfebbf76d11b4067b35844af
Author: Waiman Long <longman@redhat.com>
Date: Tue Aug 6 12:41:07 2024 -0400
mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
The memory_failure_cpu structure is a per-cpu structure. Access to its
content requires the use of get_cpu_var() to lock in the current CPU and
disable preemption. The use of a regular spinlock_t for locking purpose
is fine for a non-RT kernel.
Since the integration of RT spinlock support into the v5.15 kernel, a
spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping
lock in a preemption disabled context is illegal resulting in the
following kind of warning.
[12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
[12135.732252] preempt_count: 1, expected: 0
[12135.732255] RCU nest depth: 2, expected: 2
:
[12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
[12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
[12135.732433] Call Trace:
[12135.732436] <TASK>
[12135.732450] dump_stack_lvl+0x57/0x81
[12135.732461] __might_resched.cold+0xf4/0x12f
[12135.732479] rt_spin_lock+0x4c/0x100
[12135.732491] memory_failure_queue+0x40/0xe0
[12135.732503] ghes_do_memory_failure+0x53/0x390
[12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0
[12135.732575] ghes_proc+0xf9/0x1a0
[12135.732591] ghes_notify_hed+0x6a/0x150
[12135.732602] notifier_call_chain+0x43/0xb0
[12135.732626] blocking_notifier_call_chain+0x43/0x60
[12135.732637] acpi_ev_notify_dispatch+0x47/0x70
[12135.732648] acpi_os_execute_deferred+0x13/0x20
[12135.732654] process_one_work+0x41f/0x500
[12135.732695] worker_thread+0x192/0x360
[12135.732715] kthread+0x111/0x140
[12135.732733] ret_from_fork+0x29/0x50
[12135.732779] </TASK>
Fix it by using a raw_spinlock_t for locking instead.
Also move the pr_err() out of the lock critical section and after
put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep
with this call.
[longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe]
Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com
Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>
Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
* mm/memory-failure.c: minor context conflict due to out-of-order backport
of commit fa422b353d21 ("mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind")
This patch is a backport of the following upstream commit:
commit d256d1cd8da1cbc4615de69df71c87ce623fec2f
Author: Tong Tiangen <tongtiangen@huawei.com>
Date: Mon Aug 28 10:25:27 2023 +0800
mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()
We found a softlock issue in our test, analyzed the logs, and found that
the relevant CPU call trace as follows:
CPU0:
_do_fork
-> copy_process()
-> write_lock_irq(&tasklist_lock) //Disable irq,waiting for
//tasklist_lock
CPU1:
wp_page_copy()
->pte_offset_map_lock()
-> spin_lock(&page->ptl); //Hold page->ptl
-> ptep_clear_flush()
-> flush_tlb_others() ...
-> smp_call_function_many()
-> arch_send_call_function_ipi_mask()
-> csd_lock_wait() //Waiting for other CPUs respond
//IPI
CPU2:
collect_procs_anon()
-> read_lock(&tasklist_lock) //Hold tasklist_lock
->for_each_process(tsk)
-> page_mapped_in_vma()
-> page_vma_mapped_walk()
-> map_pte()
->spin_lock(&page->ptl) //Waiting for page->ptl
We can see that CPU1 waiting for CPU0 respond IPI,CPU0 waiting for CPU2
unlock tasklist_lock, CPU2 waiting for CPU1 unlock page->ptl. As a result,
softlockup is triggered.
For collect_procs_anon(), what we're doing is task list iteration, during
the iteration, with the help of call_rcu(), the task_struct object is freed
only after one or more grace periods elapse. the logic as follows:
release_task()
-> __exit_signal()
-> __unhash_process()
-> list_del_rcu()
-> put_task_struct_rcu_user()
-> call_rcu(&task->rcu, delayed_put_task_struct)
delayed_put_task_struct()
-> put_task_struct()
-> if (refcount_sub_and_test())
__put_task_struct()
-> free_task()
Therefore, under the protection of the rcu lock, we can safely use
get_task_struct() to ensure a safe reference to task_struct during the
iteration.
By removing the use of tasklist_lock in task list iteration, we can break
the softlock chain above.
The same logic can also be applied to:
- collect_procs_file()
- collect_procs_fsdax()
- collect_procs_ksm()
Link: https://lkml.kernel.org/r/20230828022527.241693-1-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 6379693e3c2683a7c86f395e878534731ac7ed06
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Mon Aug 7 19:41:25 2023 +0800
mm: memory-failure: use helper macro llist_for_each_entry_safe()
It's more convenient to use helper macro llist_for_each_entry_safe().
No functional change intended.
Link: https://lkml.kernel.org/r/20230807114125.3440802-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 7a8817f2c96e98d5e65a59e34ba9ea1ff6ed23bc
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Thu Jul 27 19:56:43 2023 +0800
mm: memory-failure: add PageOffline() check
Memory failure is not interested in logically offlined pages. Skip this
type of page.
Link: https://lkml.kernel.org/r/20230727115643.639741-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 6885938c349c14b277305cd1129e7eb14f3e2c55
Author: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu Jul 13 23:55:53 2023 +0000
mm/hwpoison: rename hwp_walk* to hwpoison_walk*
In the discussion of "Improve hugetlbfs read on HWPOISON hugepages" [1],
Matthew Wilcox suggests hwp is a bad abbreviation of hwpoison, as hwp is
already used as "an acronym by acpi, intel_pstate, some clock drivers, an
ethernet driver, and a scsi driver"[1].
So rename hwp_walk and hwp_walk_ops to hwpoison_walk and
hwpoison_walk_ops respectively.
raw_hwp_(page|list), *_raw_hwp, and raw_hwp_unreliable flag are other
major appearances of "hwp". However, given the "raw" hint in the name, it
is easy to differentiate them from other "hwp" acronyms. Since renaming
them is not as straightforward as renaming hwp_walk*, they are not covered
by this commit.
[1] https://lore.kernel.org/lkml/20230707201904.953262-5-jiaqiyan@google.com/T/#me6fecb8ce1ad4d5769199c9e162a44bc88f7bdec
Link: https://lkml.kernel.org/r/20230713235553.4121855-1-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit b79f8eb408d0468df0d6082ed958b67d94adce65
Author: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu Jul 13 00:18:31 2023 +0000
mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON
Add the functionality, is_raw_hwpoison_page_in_hugepage, to tell if a raw
page in a hugetlb folio is HWPOISON. This functionality relies on
RawHwpUnreliable to be not set; otherwise hugepage's raw HWPOISON list
becomes meaningless.
is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
with folio_set_hugetlb_hwpoison and folio_free_raw_hwp who iterate,
insert, or delete entry in raw_hwp_list. llist itself doesn't ensure
insertion and removal are synchornized with the llist_for_each_entry used
by is_raw_hwpoison_page_in_hugepage (unless iterated entries are already
deleted from the list). Caller can minimize the overhead of lock cycles
by first checking HWPOISON flag of the folio.
Exports this functionality to be immediately used in the read operation
for hugetlbfs.
Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 9e130c4b000b0a3f0bf4b4c8e714bfe3d06ff4cc
Author: Jiaqi Yan <jiaqiyan@google.com>
Date: Thu Jul 13 00:18:30 2023 +0000
mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp
Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.
Today when hardware memory is corrupted in a hugetlb hugepage, kernel
leaves the hugepage in pagecache [1]; otherwise future mmap or read will
suject to silent data corruption. This is implemented by returning -EIO
from hugetlb_read_iter immediately if the hugepage has HWPOISON flag set.
Since memory_failure already tracks the raw HWPOISON subpages in a
hugepage, a natural improvement is possible: if userspace only asks for
healthy subpages in the pagecache, kernel can return these data.
This patchset implements this improvement. It consist of three parts.
The 1st commit exports the functionality to tell if a subpage inside a
hugetlb hugepage is a raw HWPOISON page. The 2nd commit teaches
hugetlbfs_read_iter to return as many healthy bytes as possible. The 3rd
commit properly tests this new feature.
[1] commit 8625147cafaa ("hugetlbfs: don't delete error page from pagecache")
This patch (of 4):
Traversal on llist (e.g. llist_for_each_safe) is only safe AFTER entries
are deleted from the llist. Correct the way __folio_free_raw_hwp deletes
and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
kfree within llist_for_each_safe.
As of today, concurrent adding, deleting, and traversal on raw_hwp_list
from hugetlb.c and/or memory-failure.c are fine with each other. Note
this is guaranteed partly by the lock-free nature of llist, and partly by
holding hugetlb_lock and/or mf_mutex. For example, as llist_del_all is
lock-free with itself, folio_clear_hugetlb_hwpoison()s from
__update_and_free_hugetlb_folio and memory_failure won't need explicit
locking when freeing the raw_hwp_list. New code that manipulates
raw_hwp_list must be careful to ensure the concurrency correctness.
Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit d31155b8f29ce380f7816e54dee161db6d752909
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:16 2023 +0800
mm: memory-failure: fix race window when trying to get hugetlb folio
page_folio() is fetched before calling get_hwpoison_hugetlb_folio()
without hugetlb_lock being held. So hugetlb page could be demoted before
get_hwpoison_hugetlb_folio() holding hugetlb_lock but after page_folio()
is fetched. So get_hwpoison_hugetlb_folio() will hold unexpected extra
refcnt of hugetlb folio while leaving demoted page un-refcnted.
Link: https://lkml.kernel.org/r/20230711055016.2286677-9-linmiaohe@huawei.com
Fixes: 25182f05ff ("mm,hwpoison: fix race with hugetlb page allocation")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit a363d1224b5add67a7cafab9fdb9f19d569fbe98
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:15 2023 +0800
mm: memory-failure: fetch compound head after extra page refcnt is held
Page might become thp, huge page or being splited after compound head is
fetched but before page refcnt is bumped. So hpage might be a tail page
leading to VM_BUG_ON_PAGE(PageTail(page)) in PageTransHuge().
Link: https://lkml.kernel.org/r/20230711055016.2286677-8-linmiaohe@huawei.com
Fixes: 415c64c145 ("mm/memory-failure: split thp earlier in memory error handling")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 5885c6a62533cbda19e9eceab619bde317de0c0d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:14 2023 +0800
mm: memory-failure: minor cleanup for comments and codestyle
Fix some wrong function names and grammar error in comments. Also remove
unneeded space after for_each_process. No functional change intended.
Link: https://lkml.kernel.org/r/20230711055016.2286677-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit e9c36f7aca7efee8318b12930b846464b9b5c7a3
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:13 2023 +0800
mm: memory-failure: remove unneeded header files
Remove some unneeded header files. No functional change intended.
Link: https://lkml.kernel.org/r/20230711055016.2286677-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 55c7ac4527086d52dedc5da4ee3d676bcc9a7691
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:12 2023 +0800
mm: memory-failure: use local variable huge to check hugetlb page
Use local variable huge to check whether page is hugetlb page to avoid
calling PageHuge() multiple times to save cpu cycles. PageHuge() will be
stable while extra page refcnt is held.
Link: https://lkml.kernel.org/r/20230711055016.2286677-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 80ee7cb271b52e5861eda3c67731c95fd55a2627
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:11 2023 +0800
mm: memory-failure: don't account hwpoison_filter() filtered pages
mf_generic_kill_procs() will return -EOPNOTSUPP when hwpoison_filter()
filtered dax page. In that case, action_result() isn't expected to be
called to update mf_stats. This will results in inaccurate but benign
memory failure handling statistics.
Link: https://lkml.kernel.org/r/20230711055016.2286677-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 92a025a790f82c278cc39b0997e9b3b6f3b69ee0
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:10 2023 +0800
mm: memory-failure: ensure moving HWPoison flag to the raw error pages
If hugetlb_vmemmap_optimized is enabled, folio_clear_hugetlb_hwpoison()
called from try_memory_failure_hugetlb() won't transfer HWPoison flag to
subpages while folio's HWPoison flag is cleared. So when trying to free
this hugetlb page into buddy, folio_clear_hugetlb_hwpoison() is not called
to move HWPoison flag from head page to the raw error pages even if now
hugetlb_vmemmap_optimized is cleared. This will results in HWPoisoned
page being used again and raw_hwp_page leak.
Link: https://lkml.kernel.org/r/20230711055016.2286677-3-linmiaohe@huawei.com
Fixes: ac5fcde0a96a ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit dbe70dbb41ab45a9ea2fa537c9e6c9817477dfff
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jul 11 13:50:09 2023 +0800
mm: memory-failure: remove unneeded PageHuge() check
Patch series "A few fixup and cleanup patches for memory-failure", v2.
This series contains a few fixup patches to fix inaccurate mf_stats, fix
race window when trying to get hugetlb folio and so on. Also there is
minor cleanup for comments and codestyle. More details can be found in
the respective changelogs.
This patch (of 8):
PageHuge() check in me_huge_page() is just for potential problems. Remove
it as it's actually dead code and won't catch anything.
Link: https://lkml.kernel.org/r/20230711055016.2286677-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20230711055016.2286677-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit d51b68469bc7804c34622f7f3d4889628d37cfd6
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Sat Jul 1 15:28:37 2023 +0800
mm: memory-failure: fix potential page refcnt leak in memory_failure()
put_ref_page() is not called to drop extra refcnt when comes from madvise
in the case pfn is valid but pgmap is NULL leading to page refcnt leak.
Link: https://lkml.kernel.org/r/20230701072837.1994253-1-linmiaohe@huawei.com
Fixes: 1e8aaedb18 ("mm,memory_failure: always pin the page in madvise_inject_error")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit b7b618da0edc85280e1c9c8f4f5239571e7c1d3e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Wed Jun 28 09:49:29 2023 +0800
mm: memory-failure: remove unneeded page state check in shake_page()
Remove unneeded PageLRU(p) and is_free_buddy_page(p) check as slab caches
are not shrunk now. This check can be added back when a lightweight range
based shrinker is available.
Link: https://lkml.kernel.org/r/20230628014929.3441386-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27743
This patch is a backport of the following upstream commit:
commit 1a7d018dc38b6851c602b448bdac2e78b46857db
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Mon Jun 26 19:43:43 2023 +0800
mm: memory-failure: remove unneeded 'inline' annotation
Remove unneeded 'inline' annotation from num_poisoned_pages_inc() and
num_poisoned_pages_sub(). No functional change intended.
Link: https://lkml.kernel.org/r/20230626114343.1846587-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-52957
commit d75abd0d0bc29e6ebfebbf76d11b4067b35844af
Author: Waiman Long <longman@redhat.com>
Date: Tue Aug 6 12:41:07 2024 -0400
mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
The memory_failure_cpu structure is a per-cpu structure. Access to its
content requires the use of get_cpu_var() to lock in the current CPU and
disable preemption. The use of a regular spinlock_t for locking purpose
is fine for a non-RT kernel.
Since the integration of RT spinlock support into the v5.15 kernel, a
spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping
lock in a preemption disabled context is illegal resulting in the
following kind of warning.
[12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
[12135.732252] preempt_count: 1, expected: 0
[12135.732255] RCU nest depth: 2, expected: 2
:
[12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
[12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
[12135.732433] Call Trace:
[12135.732436] <TASK>
[12135.732450] dump_stack_lvl+0x57/0x81
[12135.732461] __might_resched.cold+0xf4/0x12f
[12135.732479] rt_spin_lock+0x4c/0x100
[12135.732491] memory_failure_queue+0x40/0xe0
[12135.732503] ghes_do_memory_failure+0x53/0x390
[12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0
[12135.732575] ghes_proc+0xf9/0x1a0
[12135.732591] ghes_notify_hed+0x6a/0x150
[12135.732602] notifier_call_chain+0x43/0xb0
[12135.732626] blocking_notifier_call_chain+0x43/0x60
[12135.732637] acpi_ev_notify_dispatch+0x47/0x70
[12135.732648] acpi_os_execute_deferred+0x13/0x20
[12135.732654] process_one_work+0x41f/0x500
[12135.732695] worker_thread+0x192/0x360
[12135.732715] kthread+0x111/0x140
[12135.732733] ret_from_fork+0x29/0x50
[12135.732779] </TASK>
Fix it by using a raw_spinlock_t for locking instead.
Also move the pr_err() out of the lock critical section and after
put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep
with this call.
[longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe]
Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com
Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
* drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c: hunks dropped as
these are already applied via RHEL commit 26418f1a34 ("Merge DRM
changes from upstream v6.4..v6.5")
* kernel/events/uprobes.c: minor context difference due to backport of upstream
commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary TLBs
as part of mmu_notifier_invalidate_range_end()")
* mm/gup.c: minor context difference on the 2nd hunk due to backport of upstream
commit d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
* mm/hugetlb.c: hunk dropped as it's unecessary given the proactive work done
on the backport of upstream commit 191fcdb6c9cf ("mm/hugetlb.c: fix a bug
within a BUG(): inconsistent pte comparison")
* mm/ksm.c: context conflicts and differences on the 1st hunk are due to
out-of-order backport of upstream commit 04dee9e85cf5 ("mm/various:
give up if pte_offset_map[_lock]() fails") being compensated for only now.
* mm/memory.c: minor context difference on the 35th hunk due to backport of
upstream commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in COW mappings")
* mm/mempolicy.c: minor context difference on the 1st hunk due to backport of
upstream commit 24526268f4e3 ("mm: mempolicy: keep VMA walk if both
MPOL_MF_STRICT and MPOL_MF_MOVE are specified")
* mm/migrate.c: minor context difference on the 2nd hunk due to backport of
upstream commits 161e393c0f63 ("mm: Make pte_mkwrite() take a VMA"), and
f3ebdf042df4 ("mm: don't check VMA write permissions if the PTE/PMD
indicates write permissions")
* mm/migrate_device.c: minor context difference on the 5th hunk due to backport
of upstream commit ec8832d007cb ("mmu_notifiers: don't invalidate secondary
TLBs as part of mmu_notifier_invalidate_range_end()")
* mm/swapfile.c: minor contex differences on the 1st and 2nd hunks due to
backport of upstream commit f985fc322063 ("mm/swapfile: fix wrong swap
entry type for hwpoisoned swapcache page")
* mm/vmscan.c: minor context difference on the 3rd hunk due to backport of
upstream commit c28ac3c7eb94 ("mm/mglru: skip special VMAs in
lru_gen_look_around()")
This patch is a backport of the following upstream commit:
commit c33c794828f21217f72ce6fc140e0d34e0d56bff
Author: Ryan Roberts <ryan.roberts@arm.com>
Date: Mon Jun 12 16:15:45 2023 +0100
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27742
This patch is a backport of the following upstream commit:
commit 97de10a9932c363a4e4ee9bb5f2297e254fb1413
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Mon May 8 19:41:28 2023 +0800
mm: memory-failure: move sysctl register in memory_failure_init()
There is already a memory_failure_init(), don't add a new initcall, move
register_sysctl_init() into it to cleanup a bit.
Link: https://lkml.kernel.org/r/20230508114128.37081-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
commit e2c1ab070fdc81010ec44634838d24fce9ff9e53
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Tue Jun 27 19:28:08 2023 +0800
mm: memory-failure: fix unexpected return value in soft_offline_page()
When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.
Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.
Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes: b94e02822d ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit 49b0638502da097c15d46cd4e871dbaa022caf7c
Author: Suren Baghdasaryan <surenb@google.com>
Date: Fri Aug 4 08:27:19 2023 -0700
mm: enable page walking API to lock vmas during the walk
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change a concurrent page
can be faulted into the area and be left out of migration.
Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit 6c54312f9689fbe27c70db5d42eebd29d04b672e
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date: Mon Jul 17 11:18:12 2023 -0700
mm/memory-failure: fix hardware poison check in unpoison_memory()
It was pointed out[1] that using folio_test_hwpoison() is wrong as we need
to check the indiviual page that has poison. folio_test_hwpoison() only
checks the head page so go back to using PageHWPoison().
User-visible effects include existing hwpoison-inject tests possibly
failing as unpoisoning a single subpage could lead to unpoisoning an
entire folio. Memory unpoisoning could also not work as expected as
the function will break early due to only checking the head page and
not the actually poisoned subpage.
[1]: https://lore.kernel.org/lkml/ZLIbZygG7LqSI9xe@casper.infradead.org/
Link: https://lkml.kernel.org/r/20230717181812.167757-1-sidhartha.kumar@oracle.com
Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit faeb2ff2c1c5cb60ce0da193580b256c941f99ca
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Thu Jul 27 19:56:42 2023 +0800
mm: memory-failure: avoid false hwpoison page mapped error info
folio->_mapcount is overloaded in SLAB, so folio_mapped() has to be done
after folio_test_slab() is checked. Otherwise slab folio might be treated
as a mapped folio leading to false 'Someone maps the hwpoison page' error
info.
Link: https://lkml.kernel.org/r/20230727115643.639741-4-linmiaohe@huawei.com
Fixes: 230ac719c5 ("mm/hwpoison: don't try to unpoison containment-failed pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
commit f29623e4a599c295cc8f518c8e4bb7848581a14d
Author: Miaohe Lin <linmiaohe@huawei.com>
Date: Thu Jul 27 19:56:41 2023 +0800
mm: memory-failure: fix potential unexpected return value from unpoison_memory()
If unpoison_memory() fails to clear page hwpoisoned flag, return value ret
is expected to be -EBUSY. But when get_hwpoison_page() returns 1 and
fails to clear page hwpoisoned flag due to races, return value will be
unexpected 1 leading to users being confused. And there's a code smell
that the variable "ret" is used not only to save the return value of
unpoison_memory(), but also the return value from get_hwpoison_page().
Make a further cleanup by using another auto-variable solely to save the
return value of get_hwpoison_page() as suggested by Naoya.
Link: https://lkml.kernel.org/r/20230727115643.639741-3-linmiaohe@huawei.com
Fixes: bf181c582588 ("mm/hwpoison: fix unpoison_memory()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit 4248d0083ec5817eebfb916c54950d100b3468ee
Author: Longlong Xia <xialonglong1@huawei.com>
Date: Fri Apr 14 10:17:41 2023 +0800
mm: ksm: support hwpoison for ksm page
hwpoison_user_mappings() is updated to support ksm pages, and add
collect_procs_ksm() to collect processes when the error hit an ksm page.
The difference from collect_procs_anon() is that it also needs to traverse
the rmap-item list on the stable node of the ksm page. At the same time,
add_to_kill_ksm() is added to handle ksm pages. And
task_in_to_kill_list() is added to avoid duplicate addition of tsk to the
to_kill list. This is because when scanning the list, if the pages that
make up the ksm page all come from the same process, they may be added
repeatedly.
Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Conflicts: mm/memory-failure.c - conflicts in the workaround for
36537a67d356 ("mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon()")
cause conflicts with this patch. Make the end result identical
to upstream.
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit 4f775086a6eee07c6ae4be4734d736e13b537351
Author: Longlong Xia <xialonglong1@huawei.com>
Date: Fri Apr 14 10:17:40 2023 +0800
mm: memory-failure: refactor add_to_kill()
Patch series "mm: ksm: support hwpoison for ksm page", v2.
Currently, ksm does not support hwpoison. As ksm is being used more
widely for deduplication at the system level, container level, and process
level, supporting hwpoison for ksm has become increasingly important.
However, ksm pages were not processed by hwpoison in 2009 [1].
The main method of implementation:
1. Refactor add_to_kill() and add new add_to_kill_*() to better
accommodate the handling of different types of pages.
2. Add collect_procs_ksm() to collect processes when the error hit an
ksm page.
3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to
the to_kill list.
4. Try_to_unmap ksm page (already supported).
5. Handle related processes such as sending SIGBUS.
Tested with poisoning to ksm page from
1) different process
2) one process
and with/without memory_failure_early_kill set, the processes are killed
as expected with the patchset.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?h=01e00f880ca700376e1845cf7a2524ebe68e47d6
This patch (of 2):
The page_address_in_vma() is used to find the user virtual address of page
in add_to_kill(), but it doesn't support ksm due to the ksm page->index
unusable, add an ksm_addr as parameter to add_to_kill(), let's the caller
to pass it, also rename the function to __add_to_kill(), and adding
add_to_kill_anon_file() for handling anonymous pages and file pages,
adding add_to_kill_fsdax() for handling fsdax pages.
Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei
.com
Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei
.com
Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit 8cbc82f3ec0d58961cf9c1e5d99e56741f4bf134
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Mon Mar 20 15:40:10 2023 +0800
mm: memory-failure: Move memory failure sysctls to its own file
The sysctl_memory_failure_early_kill and memory_failure_recovery
are only used in memory-failure.c, move them to its own file.
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
[mcgrof: fix by adding empty ctl entry, this caused a crash]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit 611b9fd80fb53e79079744fb29d8c1363f4b5c02
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Mon Mar 13 13:39:29 2023 +0800
mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT)
It's more clear and simple to just use IS_ENABLED(CONFIG_HWPOISON_INJECT)
to check whether or not to enable HWPoison injector module instead of
CONFIG_HWPOISON_INJECT/CONFIG_HWPOISON_INJECT_MODULE.
Link: https://lkml.kernel.org/r/20230313053929.84607-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 6da6b1d4a7df8c35770186b53ef65d388398e139
Author: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Tue Feb 21 17:59:05 2023 +0900
mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
After a memory error happens on a clean folio, a process unexpectedly
receives SIGBUS when it accesses the error page. This SIGBUS killing is
pointless and simply degrades the level of RAS of the system, because the
clean folio can be dropped without any data lost on memory error handling
as we do for a clean pagecache.
When memory_failure() is called on a clean folio, try_to_unmap() is called
twice (one from split_huge_page() and one from hwpoison_user_mappings()).
The root cause of the issue is that pte conversion to hwpoisoned entry is
now done in the first call of try_to_unmap() because PageHWPoison is
already set at this point, while it's actually expected to be done in the
second call. This behavior disturbs the error handling operation like
removing pagecache, which results in the malfunction described above.
So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
when we really intend to convert pte to hwpoison entry. This can prevent
other callers of try_to_unmap() from accidentally converting to hwpoison
entries.
Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit cd7755800eb54e8522f5e51f4e71e6494c1f1572
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:37 2023 +0800
mm: change to return bool for isolate_movable_page()
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
will care about the negative return value, thus we can convert the
isolate_movable_page() to return a boolean value to make the code more
clear when checking the movable page isolation state.
No functional changes intended.
[akpm@linux-foundation.org: remove unneeded comment, per Matthew]
Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 9747b9e92418b61c2281561e0651803f1fad0159
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:36 2023 +0800
mm: hugetlb: change to return bool for isolate_hugetlb()
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
care about the negative value, thus we can convert the isolate_hugetlb()
to return a boolean value to make code more clear when checking the
hugetlb isolation state. Moreover converts 2 users which will consider
the negative value returned by isolate_hugetlb().
No functional changes intended.
[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit f7f9c00dfafffd7a5a1a5685e2d874c64913e2ed
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed Feb 15 18:39:35 2023 +0800
mm: change to return bool for isolate_lru_page()
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not
care about the negative error of isolate_lru_page(), except one user in
add_page_for_migration(). So we can convert the isolate_lru_page() to
return a boolean value, which can help to make the code more clear when
checking the return value of isolate_lru_page().
Also convert all users' logic of checking the isolation state.
No functional changes intended.
Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
commit 6aa3a920125e9f58891e2b5dc2efd4d0c1ff05a6
Author: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Date: Fri Jan 13 16:30:50 2023 -0600
mm/hugetlb: convert isolate_hugetlb to folios
Patch series "continue hugetlb folio conversion", v3.
This series continues the conversion of core hugetlb functions to use
folios. This series converts many helper funtions in the hugetlb fault
path. This is in preparation for another series to convert the hugetlb
fault code paths to operate on folios.
This patch (of 8):
Convert isolate_hugetlb() to take in a folio and convert its callers to
pass a folio. Use page_folio() to convert the callers to use a folio is
safe as isolate_hugetlb() operates on a head page.
Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>