Commit Graph

687 Commits

Rafael Aquini 9f8a34b521 mm: remember young/dirty bit for page migrations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168392

This patch is a backport of the following upstream commit:
commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Aug 11 12:13:29 2022 -0400

    mm: remember young/dirty bit for page migrations

    When page migration happens, we always ignore the young/dirty bit settings
    in the old pgtable, mark the page as old in the new page table using
    either pte_mkold() or pmd_mkold(), and keep the pte clean.

    That's fine functionally, but it's not friendly to page reclaim because
    the page being moved can be actively accessed during the procedure.  Not
    to mention that hardware setting the young bit can bring quite some
    overhead on some systems, e.g.  x86_64 needs a few hundred nanoseconds to
    set the bit.  The same slowdown applies to the dirty bit when the memory
    is first written after page migration.

    Actually we can easily remember the A/D bit configuration and recover the
    information after the page is migrated.  To achieve it, define a new set
    of bits in the migration swap offset field to cache the A/D bits for old
    pte.  Then when removing/recovering the migration entry, we can recover
    the A/D bits even if the page changed.

    One thing to mention is that here we used max_swapfile_size() to detect
    how many swp offset bits we have, and we'll only enable this feature if we
    know the swp offset is big enough to store both the PFN value and the A/D
    bits.  Otherwise the A/D bits are dropped like before.
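
    A minimal user-space sketch of the encoding idea; the constants, helper
    names and bit positions below are illustrative assumptions, not the
    kernel's actual swapops.h definitions:

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative layout: low bits hold the PFN, two bits above cache A/D. */
        #define DEMO_PFN_BITS   58
        #define DEMO_PFN_MASK   ((1ULL << DEMO_PFN_BITS) - 1)
        #define DEMO_YOUNG_BIT  (1ULL << DEMO_PFN_BITS)
        #define DEMO_DIRTY_BIT  (1ULL << (DEMO_PFN_BITS + 1))

        /* Stand-in for the max_swapfile_size() capacity check described above. */
        static bool demo_offset_supports_ad(uint64_t max_offset)
        {
                return max_offset >= (DEMO_DIRTY_BIT | DEMO_YOUNG_BIT | DEMO_PFN_MASK);
        }

        static uint64_t demo_make_entry(uint64_t pfn, bool young, bool dirty)
        {
                uint64_t off = pfn & DEMO_PFN_MASK;

                if (young)
                        off |= DEMO_YOUNG_BIT;
                if (dirty)
                        off |= DEMO_DIRTY_BIT;
                return off;
        }

        int main(void)
        {
                uint64_t e = demo_make_entry(0x1234, true, false);

                assert((e & DEMO_PFN_MASK) == 0x1234);
                printf("supported=%d young=%d dirty=%d\n",
                       demo_offset_supports_ad(UINT64_MAX),
                       !!(e & DEMO_YOUNG_BIT), !!(e & DEMO_DIRTY_BIT));
                return 0;
        }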

    Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Andi Kleen <andi.kleen@intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2023-04-03 10:16:25 -04:00
Chris von Recklinghausen b9ef7d0aa3 memory tiering: hot page selection with hint page fault latency
Bugzilla: https://bugzilla.redhat.com/2160210

commit 33024536bafd9129f1d16ade0974671c648700ac
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Jul 13 16:39:51 2022 +0800

    memory tiering: hot page selection with hint page fault latency

    Patch series "memory tiering: hot page selection", v4.

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory nodes need to be identified.
    Essentially, the original NUMA balancing implementation selects the most
    recently accessed (MRU) pages to promote.  But this isn't a perfect
    algorithm to identify the hot pages, because pages with quite low access
    frequency may eventually be accessed given that the NUMA balancing page
    table scanning period can be quite long (e.g.  60 seconds).  So in this
    patchset, we implement a new hot page identification algorithm based on
    the latency between NUMA balancing page table scanning and the hint page
    fault, which is a kind of most frequently accessed (MFU) algorithm.

    In NUMA balancing memory tiering mode, if there are hot pages in the slow
    memory node and cold pages in the fast memory node, we need to
    promote/demote hot/cold pages between the fast and slow memory nodes.

    A choice is to promote/demote as fast as possible.  But the CPU cycles
    and memory bandwidth consumed by the high promoting/demoting throughput
    will hurt the latency of some workloads, because of inflated access
    latency and contention on the slow memory bandwidth.

    A way to resolve this issue is to restrict the max promoting/demoting
    throughput.  It will take longer to finish the promoting/demoting.  But
    the workload latency will be better.  This is implemented in this patchset
    as the page promotion rate limit mechanism.

    The promotion hot threshold is workload and system configuration
    dependent.  So in this patchset, a method to adjust the hot threshold
    automatically is implemented.  The basic idea is to control the number of
    the candidate promotion pages to match the promotion rate limit.

    We used the pmbench memory accessing benchmark to test the patchset on a
    2-socket server system with DRAM and PMEM installed.  The test results
    are as follows,

                    pmbench score           promote rate
                     (accesses/s)                   MB/s
                    -------------           ------------
    base              146887704.1                  725.6
    hot selection     165695601.2                  544.0
    rate limit        162814569.8                  165.2
    auto adjustment   170495294.0                  136.9

    From the results above,

    With the hot page selection patch [1/3], the pmbench score increases by
    about 12.8%, and the promote rate (overhead) decreases by about 25.0%,
    compared with the base kernel.

    With the rate limit patch [2/3], the pmbench score decreases by about
    1.7%, and the promote rate decreases by about 69.6%, compared with the
    hot page selection patch.

    With the threshold auto adjustment patch [3/3], the pmbench score
    increases by about 4.7%, and the promote rate decreases by about 17.1%,
    compared with the rate limit patch.

    Baolin helped to test the patchset with MySQL on a machine which contains
    1 DRAM node (30G) and 1 PMEM node (126G).

    sysbench /usr/share/sysbench/oltp_read_write.lua \
    ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120

    The tps can be improved about 5%.

    This patch (of 3):

    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory node need to be identified.  Essentially,
    the original NUMA balancing implementation selects the most recently
    accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
    identify the hot pages, because pages with quite low access frequency may
    eventually be accessed given that the NUMA balancing page table scanning
    period can be quite long (e.g.  60 seconds).  A most frequently accessed
    (MFU) algorithm is better.

    So, in this patch we implement a better hot page selection algorithm,
    based on NUMA balancing page table scanning and hint page faults, as
    follows,

    - When the page tables of the processes are scanned to change PTE/PMD
      to be PROT_NONE, the current time is recorded in struct page as scan
      time.

    - When the page is accessed, a hint page fault will occur.  The scan
      time is read from the struct page, and the hint page fault latency
      is defined as

        hint page fault time - scan time

    The shorter the hint page fault latency of a page is, the higher the
    probability that its access frequency is high.  So the hint page fault
    latency is a better estimation of whether a page is hot or cold.

    It's hard to find some extra space in struct page to hold the scan time.
    Fortunately, we can reuse some bits used by the original NUMA balancing.

    NUMA balancing uses some bits in struct page to store the page's last
    accessing CPU and PID (see page_cpupid_xchg_last()), which are used by
    the multi-stage node selection algorithm to avoid migrating pages that
    are shared-accessed by multiple NUMA nodes back and forth.  But for pages
    in the slow memory node, even if they are shared-accessed by multiple
    NUMA nodes, as long as the pages are hot, they need to be promoted to the
    fast memory node.  So the accessing CPU and PID information is
    unnecessary for the slow memory pages.  We can reuse these bits in struct
    page to record the scan time.  For the fast memory pages, these bits are
    used as before.

    For the hot threshold, the default value is 1 second, which works well in
    our performance test.  All pages with hint page fault latency < hot
    threshold will be considered hot.

    It's hard for users to determine the hot threshold.  So we don't provide a
    kernel ABI to set it, just provide a debugfs interface for advanced users
    to experiment.  We will continue to work on a hot threshold automatic
    adjustment mechanism.

    The downside of the above method is that the response time to a change in
    the workload's hot spot may be much longer.  For example,

    - A previous cold memory area becomes hot

    - The hint page fault will be triggered.  But the hint page fault
      latency isn't shorter than the hot threshold.  So the pages will
      not be promoted.

    - When the memory area is scanned again, maybe after a scan period,
      the hint page fault latency measured will be shorter than the hot
      threshold and the pages will be promoted.

    To mitigate this, if there is enough free space in the fast memory node,
    the hot threshold is not used and all pages are promoted upon the hint
    page fault for fast response.
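
    A compressed user-space sketch of the selection logic described above
    (the one-second default and the free-space override come from this log;
    the millisecond time source and the structure names are illustrative):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define DEMO_HOT_THRESHOLD_MS 1000ULL   /* default hot threshold: 1 second */

        struct demo_page {
                uint64_t scan_time_ms;  /* recorded when the PTE was made PROT_NONE */
        };

        /* Decide at hint-fault time whether to promote a slow-memory page. */
        static bool demo_should_promote(const struct demo_page *page,
                                        uint64_t fault_time_ms,
                                        bool fast_node_has_free_space)
        {
                uint64_t latency = fault_time_ms - page->scan_time_ms;

                if (fast_node_has_free_space)   /* fast response when there is room */
                        return true;
                return latency < DEMO_HOT_THRESHOLD_MS;
        }

        int main(void)
        {
                struct demo_page p = { .scan_time_ms = 0 };

                printf("%d %d\n", demo_should_promote(&p, 200, false),
                       demo_should_promote(&p, 5000, false));
                return 0;
        }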

    Thanks to Zhong Jiang, who reported and tested the fix for a bug when
    disabling memory tiering mode dynamically.

    Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: osalvador <osalvador@suse.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:32 -04:00
Chris von Recklinghausen 7b890b4b92 fs: Remove aops->migratepage()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9d0ddc0cb575fd41ff16131b06e08e1feac43b81
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 11:53:31 2022 -0400

    fs: Remove aops->migratepage()

    With all users converted to migrate_folio(), remove this operation.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 0b234f928b hugetlb: Convert to migrate_folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit b890ec2a2c2d962f71ba31ae291f8fd252b46258
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 10:47:21 2022 -0400

    hugetlb: Convert to migrate_folio

    This involves converting migrate_huge_page_move_mapping().  We also need a
    folio variant of hugetlb_set_page_subpool(), but that's for a later patch.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:30 -04:00
Chris von Recklinghausen 0243261dbd mm/migrate: Add filemap_migrate_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 2ec810d59602f0e08847f986ef8e16469722496f
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 12:55:08 2022 -0400

    mm/migrate: Add filemap_migrate_folio()

    There is nothing iomap-specific about iomap_migratepage(), and it fits
    a pattern used by several other filesystems, so move it to mm/migrate.c,
    convert it to be filemap_migrate_folio() and convert the iomap filesystems
    to use it.
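
    The filesystem-side change then typically looks like the sketch below,
    where example_aops stands in for any iomap-based filesystem's
    address_space_operations (abridged and illustrative, not a real tree
    diff):

        /* before: per-filesystem wiring of the old hook */
        static const struct address_space_operations example_aops = {
                ...
                .migratepage    = iomap_migratepage,
        };

        /* after: the generic helper, wired through the new hook */
        static const struct address_space_operations example_aops = {
                ...
                .migrate_folio  = filemap_migrate_folio,
        };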

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 30ef2db262 mm/migrate: Convert migrate_page() to migrate_folio()
Conflicts:
	drivers/gpu/drm/i915/gem/i915_gem_userptr.c - We already have
		7a3deb5bcc ("Merge DRM changes from upstream v5.19..v6.0")
		so it already has the change from this patch.
	drop changes to fs/btrfs/disk-io.c - unsupported config

Bugzilla: https://bugzilla.redhat.com/2160210

commit 541846502f4fe826cd7c16e4784695ac90736585
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 10:27:41 2022 -0400

    mm/migrate: Convert migrate_page() to migrate_folio()

    Convert all callers to pass a folio.  Most have the folio
    already available.  Switch all users from aops->migratepage to
    aops->migrate_folio.  Also turn the documentation into kerneldoc.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 88a958f9d7 mm/migrate: Convert expected_page_refs() to folio_expected_refs()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 108ca8358139bec4232319debfb20bafdaf4f877
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 16:25:10 2022 -0400

    mm/migrate: Convert expected_page_refs() to folio_expected_refs()

    Now that both callers have a folio, convert this function to
    take a folio & rename it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 21be3dd4c9 mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Conflicts: drop changes to fs/ext2/inode.c, fs/ocfs2/aops.c - unsupported
	configs

Bugzilla: https://bugzilla.redhat.com/2160210

commit 67235182a41c1bd6b32806a1556a1d299b84212b
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 10:20:31 2022 -0400

    mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()

    Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
    buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
    declarations to buffer.h and switch all filesystems that have wired
    them up.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen b19eba0311 mm/migrate: Convert writeout() to take a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 2be7fa10c028019f7b2fee11238987762567d41e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 09:41:03 2022 -0400

    mm/migrate: Convert writeout() to take a folio

    Use a folio throughout this function.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen b4520b216e mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 8faa8ef5dd11abe119ad0c8ccd39f2064ca7ed0e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 09:34:36 2022 -0400

    mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()

    Use a folio throughout.  migrate_page() will be converted to
    migrate_folio() later.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 94f7ecb397 fs: Add aops->migrate_folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5490da4f06d182ba944706875029e98fe7f6b821
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jun 6 09:00:16 2022 -0400

    fs: Add aops->migrate_folio

    Provide a folio-based replacement for aops->migratepage.  Update the
    documentation to document migrate_folio instead of migratepage.
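
    For reference, the shape of the new hook is roughly the following (an
    abridged sketch of the address_space_operations member; the exact
    prototype in the tree may differ in detail):

        struct address_space_operations {
                ...
                /* migrate the contents of src to dst, replacing ->migratepage */
                int (*migrate_folio)(struct address_space *mapping,
                                     struct folio *dst, struct folio *src,
                                     enum migrate_mode mode);
                ...
        };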

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen f064030a3a mm: Convert all PageMovable users to movable_operations
Conflicts: Documentation/vm/page_migration.rst - We already have
	ee65728e103b ("docs: rename Documentation/vm to Documentation/mm")
	so make the change to Documentation/mm/page_migration.rst instead

Bugzilla: https://bugzilla.redhat.com/2160210

commit 68f2736a858324c3ec852f6c2cddd9d1c777357d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jun 7 15:38:48 2022 -0400

    mm: Convert all PageMovable users to movable_operations

    These drivers are rather uncomfortably hammered into the
    address_space_operations hole.  They aren't filesystems and don't behave
    like filesystems.  They just need their own movable_operations structure,
    which we can point to directly from page->mapping.
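
    A driver-side sketch of the new scheme (the demo_* callbacks are
    hypothetical; the three operations mirror the old isolate/migrate/putback
    address space operations):

        static const struct movable_operations demo_movable_ops = {
                .isolate_page   = demo_isolate_page,
                .migrate_page   = demo_migrate_page,
                .putback_page   = demo_putback_page,
        };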

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:29 -04:00
Chris von Recklinghausen 3af92583d0 mm/migrate: convert move_to_new_page() into move_to_new_folio()
Bugzilla: https://bugzilla.redhat.com/2160210

commit e7e3ffeb274f1ff5bc68bb9135128e1ba14a7d53
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu May 12 20:23:05 2022 -0700

    mm/migrate: convert move_to_new_page() into move_to_new_folio()

    Pass in the folios that we already have in each caller.  Saves a
    lot of calls to compound_head().
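
    The caller-side pattern is roughly (illustrative, abridged):

        /* before: page-based, each helper re-derives the head page */
        rc = move_to_new_page(newpage, page, mode);

        /* after: convert once with page_folio(), then stay on folios */
        rc = move_to_new_folio(page_folio(newpage), page_folio(page), mode);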

    Link: https://lkml.kernel.org/r/20220504182857.4013401-27-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:07 -04:00
Chris von Recklinghausen 6fdf93c91a mm: convert sysfs input to bool using kstrtobool()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 717aeab42943efa7cfa876b3b687c6ff36eae867
Author: Jagdish Gediya <jvgediya@linux.ibm.com>
Date:   Thu May 12 20:22:59 2022 -0700

    mm: convert sysfs input to bool using kstrtobool()

    Sysfs input conversion to the corresponding bool value, e.g.  "false" or
    "0" to false, "true" or "1" to true, is currently handled through strncmp
    at multiple places.  Use kstrtobool() to convert sysfs input to a bool
    value.
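
    The resulting store-handler pattern looks roughly like this (the
    attribute and the demo_enabled variable are placeholders):

        static bool demo_enabled;       /* placeholder knob backing the attribute */

        static ssize_t demo_enabled_store(struct kobject *kobj,
                                          struct kobj_attribute *attr,
                                          const char *buf, size_t count)
        {
                ssize_t ret = kstrtobool(buf, &demo_enabled);

                if (ret)
                        return ret;     /* propagate the error, per the note above */
                return count;
        }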

    [akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy]
    Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.com
    Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:05 -04:00
Chris von Recklinghausen 6379ebb826 fs: Change try_to_free_buffers() to take a folio
Conflicts: drop changes to fs/hfs/inode.c fs/hfsplus/inode.c fs/ocfs2/aops.c
	fs/reiserfs/inode.c fs/reiserfs/journal.c - unsupported configs

Bugzilla: https://bugzilla.redhat.com/2160210

commit 68189fef88c7d02eb92e038be3d6428ebd0d2945
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun May 1 01:08:08 2022 -0400

    fs: Change try_to_free_buffers() to take a folio

    All but two of the callers already have a folio; pass a folio into
    try_to_free_buffers().  This removes the last user of cancel_dirty_page()
    so remove that wrapper function too.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:02 -04:00
Chris von Recklinghausen 6dc0a6993a mm: untangle config dependencies for demote-on-reclaim
Conflicts: include/linux/migrate.h - The backport of
	20f9ba4f9952 ("mm: migrate: make demotion knob depend on migration")
	ended up placing the declaration of numa_demotion_enabled a few
	lines above where this patch expected. Remove it by hand.

Bugzilla: https://bugzilla.redhat.com/2160210

commit 7d6e2d96384556a4f30547803be1f606eb805a62
Author: Oscar Salvador <osalvador@suse.de>
Date:   Thu Apr 28 23:16:09 2022 -0700

    mm: untangle config dependencies for demote-on-reclaim

    At the time demote-on-reclaim was introduced, it was tied to
    CONFIG_HOTPLUG_CPU + CONFIG_MIGRATE, but that is not really accurate.

    The only two things we need to depend on are CONFIG_NUMA + CONFIG_MIGRATE,
    so clean this up.  Furthermore, we only register the hotplug memory
    notifier when the system has CONFIG_MEMORY_HOTPLUG.

    Link: https://lkml.kernel.org/r/20220322224016.4574-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Suggested-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen fb7969cc21 mm: migrate: simplify the refcount validation when migrating hugetlb mapping
Bugzilla: https://bugzilla.redhat.com/2160210

commit 9c42fe4e30a9b934b1de66c2edca196563221392
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Thu Apr 28 23:16:09 2022 -0700

    mm: migrate: simplify the refcount validation when migrating hugetlb mapping

    There is no need to validate the hugetlb page's refcount before trying to
    freeze it to the expected refcount; instead we can just rely on
    page_ref_freeze() to simplify the validation.

    Moreover we are always under the page lock when migrating the hugetlb page
    mapping, which means nowhere else can remove it from the page cache, so we
    can remove the xas_load() validation under the i_pages lock.
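
    Abridged, the simplification amounts to something like this (locking and
    unwinding omitted; illustrative rather than the exact hunk):

        /* before: check the refcount and the xarray slot, then freeze anyway */
        if (page_count(page) != expected_count || xas_load(&xas) != page)
                return -EAGAIN;
        if (!page_ref_freeze(page, expected_count))
                return -EAGAIN;

        /* after: page_ref_freeze() already fails when the count is wrong */
        if (!page_ref_freeze(page, expected_count))
                return -EAGAIN;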

    Link: https://lkml.kernel.org/r/eb2fbbeaef2b1714097b9dec457426d682ee0635.1649676424.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 7670ca8a08 mm/migration: remove some duplicated codes in migrate_pages
Conflicts: mm/migrate.c - We already have
	69a041ff5058 ("mm/migration: fix potential page refcounts leak in migrate_pages")
	There is a difference in surrounding context

Bugzilla: https://bugzilla.redhat.com/2160210

commit f430893b01e78e0b2e21f9bd1633a778c063993e
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:08 2022 -0700

    mm/migration: remove some duplicated codes in migrate_pages

    Remove the duplicated code in migrate_pages to simplify it.  Minor
    readability improvement.  No functional change intended.

    Link: https://lkml.kernel.org/r/20220318111709.60311-9-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen fbdc578054 mm/migration: avoid unneeded nodemask_t initialization
Bugzilla: https://bugzilla.redhat.com/2160210

commit 91925ab8cc2a05ab0e524830247e1d66ba4e4e19
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:08 2022 -0700

    mm/migration: avoid unneeded nodemask_t initialization

    Avoid unneeded next_pass and this_pass initialization, as they're always
    set before use, to save possible cpu cycles when there are plenty of
    nodes in the system.

    Link: https://lkml.kernel.org/r/20220318111709.60311-8-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 4ffa0469cc mm/migration: use helper macro min in do_pages_stat
Bugzilla: https://bugzilla.redhat.com/2160210

commit 3eefb826c5a627084eccec788f0236a070988dae
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:07 2022 -0700

    mm/migration: use helper macro min in do_pages_stat

    We can use the helper macro min to set chunk_nr and simplify the code.
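
    That is, roughly (the chunk-size constant name from mm/migrate.c is
    assumed):

        /* before */
        chunk_nr = DO_PAGES_STAT_CHUNK_NR;
        if (chunk_nr > nr_pages)
                chunk_nr = nr_pages;

        /* after */
        chunk_nr = min(nr_pages, (unsigned long)DO_PAGES_STAT_CHUNK_NR);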

    Link: https://lkml.kernel.org/r/20220318111709.60311-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen f9324c9a19 mm/migration: use helper function vma_lookup() in add_page_for_migration
Bugzilla: https://bugzilla.redhat.com/2160210

commit cb1c37b1c65d9e5450af2ea6ec8916c5cd23a2e7
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:07 2022 -0700

    mm/migration: use helper function vma_lookup() in add_page_for_migration

    We can use the helper function vma_lookup() to look up the needed vma and
    simplify the code.
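
    That is, roughly (abridged; illustrative rather than the exact hunk):

        /* before: find_vma() may return the next VMA, so recheck vm_start */
        vma = find_vma(mm, addr);
        if (!vma || addr < vma->vm_start || !vma_migratable(vma))
                goto out;

        /* after: vma_lookup() only returns a VMA that contains addr */
        vma = vma_lookup(mm, addr);
        if (!vma || !vma_migratable(vma))
                goto out;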

    Link: https://lkml.kernel.org/r/20220318111709.60311-6-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 2f7774c154 mm/migration: remove unneeded local variable page_lru
Bugzilla: https://bugzilla.redhat.com/2160210

commit b75454e10101af3a11b24ca7e14917b6c8874688
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:07 2022 -0700

    mm/migration: remove unneeded local variable page_lru

    We can use page_is_file_lru() directly to help account the isolated pages
    to simplify the code a bit.

    Link: https://lkml.kernel.org/r/20220318111709.60311-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 3ecfd3982a mm/migration: remove unneeded local variable mapping_locked
Bugzilla: https://bugzilla.redhat.com/2160210

commit 5202978b48786703a6cec94596067b3fafd1f734
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:07 2022 -0700

    mm/migration: remove unneeded local variable mapping_locked

    Patch series "A few cleanup and fixup patches for migration", v2.

    This series contains a few patches to remove unneeded variables and a
    jump label, and to use helpers to simplify the code.  We also fix some
    bugs such as a page refcount leak, an invalid node access and so on.
    More details can be found in the respective changelogs.

    This patch (of 11):

    When mapping_locked is true, TTU_RMAP_LOCKED is always set in ttu.  We
    can check ttu instead, so mapping_locked can be removed.  And since ttu
    is now either 0 or TTU_RMAP_LOCKED, change '|=' to '=' to reflect this.

    Link: https://lkml.kernel.org/r/20220318111709.60311-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20220318111709.60311-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Chris von Recklinghausen 5cea471a90 mm/vmscan: make sure wakeup_kswapd with managed zone
Bugzilla: https://bugzilla.redhat.com/2160210

commit bc53008eea55330f485c956338d3c59f96c70c08
Author: Wei Yang <richard.weiyang@gmail.com>
Date:   Thu Apr 28 23:16:03 2022 -0700

    mm/vmscan: make sure wakeup_kswapd with managed zone

    wakeup_kswapd() only wakes up kswapd when the zone is managed.

    The two callers of wakeup_kswapd() work from a node perspective:

      * wake_all_kswapds
      * numamigrate_isolate_page

    If we pick up a !managed zone, this is not what we expect.

    This patch makes sure we pick up a managed zone for wakeup_kswapd().  It
    also uses managed_zone in migrate_balanced_pgdat() to get the proper
    zone.
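
    A sketch of the numamigrate_isolate_page() side of the change (abridged;
    illustrative rather than the exact hunk):

        /* pick the highest zone that actually has managed pages */
        for (z = pgdat->nr_zones - 1; z >= 0; z--) {
                if (managed_zone(pgdat->node_zones + z))
                        break;
        }
        wakeup_kswapd(pgdat->node_zones + z, 0, order, ZONE_MOVABLE);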

    [richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
      Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
    Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:54 -04:00
Nico Pache 8c112f97c0 mm: Clear page->private when splitting or migrating a page
commit b653db77350c7307a513b81856fe53e94cf42446
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Jun 19 10:37:32 2022 -0400

    mm: Clear page->private when splitting or migrating a page

    In our efforts to remove uses of PG_private, we have found folios with
    the private flag clear and folio->private not-NULL.  That is the root
    cause behind 642d51fb0775 ("ceph: check folio PG_private bit instead
    of folio->private").  It can also affect a few other filesystems that
    haven't yet reported a problem.

    compaction_alloc() can return a page with uninitialised page->private,
    and rather than checking all the callers of migrate_pages(), just zero
    page->private after calling get_new_page().  Similarly, the tail pages
    from split_huge_page() may also have an uninitialised page->private.
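
    The migrate_pages() side of the fix is roughly (illustrative placement):

        newpage = get_new_page(page, private);
        if (!newpage)
                return -ENOMEM;
        newpage->private = 0;   /* don't inherit whatever the allocator left here */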

    Reported-by: Xiubo Li <xiubli@redhat.com>
    Tested-by: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:44 -07:00
Nico Pache 8960a08ee7 mm: migration: fix the FOLL_GET failure on following huge page
commit 831568214883e0a9940f776771343420306d2341
Author: Haiyue Wang <haiyue.wang@intel.com>
Date:   Fri Aug 12 16:49:21 2022 +0800

    mm: migration: fix the FOLL_GET failure on following huge page

    Not all huge page APIs support the FOLL_GET option, so the move_pages()
    syscall will fail to get the page node information for some huge pages.

    For example, on x86 with Linux 5.19 and the 1GB huge page API
    follow_huge_pud(): it will return a NULL page for FOLL_GET when
    move_pages() is called with a NULL 'nodes' parameter, and the 'status'
    parameter has a '-2' error in the array.

    Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
          Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

    But these huge page APIs don't support FOLL_GET:
      1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
      2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
         It will cause WARN_ON_ONCE for FOLL_GET.
      3. follow_huge_pgd() in mm/hugetlb.c

    This is a temporary solution to mitigate the side effect of the race
    condition fix, by calling follow_page() with FOLL_GET set for huge pages.

    After following huge pages with FOLL_GET is supported, this fix can be
    reverted safely.

    Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
    Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
    Fixes: 4cd614841c06 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
    Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:42 -07:00
Nico Pache 4cd418ee41 mm/migration: fix potential pte_unmap on an not mapped pte
commit ad1ac596e8a8c4b06715dfbd89853eb73c9886b2
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:16 2022 +0800

    mm/migration: fix potential pte_unmap on an not mapped pte

    __migration_entry_wait and migration_entry_wait_on_locked assume the pte
    is always mapped by the caller.  But this is not the case when they are
    called from migration_entry_wait_huge and follow_huge_pmd.  Add a
    hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL)
    to fix this issue.

    Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
    Fixes: 30dad30922 ("mm: migration: add migrate_entry_wait_huge()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Nico Pache aa28eb0c17 mm/migration: return errno when isolate_huge_page failed
commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:15 2022 +0800

    mm/migration: return errno when isolate_huge_page failed

    We might fail to isolate a huge page due to, e.g., the page being under
    migration, which clears HPageMigratable.  We should return an errno in
    this case rather than always returning 1, which could confuse the user,
    i.e.  the caller might think all of the memory has been migrated while
    the hugetlb page is left behind.  We make the prototype of
    isolate_huge_page consistent with isolate_lru_page as suggested by Huang
    Ying, and rename isolate_huge_page to isolate_hugetlb as suggested by
    Muchun to improve the readability.
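
    The caller-visible change is roughly (illustrative, abridged):

        /* before: a 1/0 return, so a failed hugetlb isolation looked like success */
        if (PageHuge(page))
                isolate_huge_page(page, pagelist);

        /* after: 0 or -EBUSY, matching isolate_lru_page(), so errors propagate */
        if (PageHuge(page))
                err = isolate_hugetlb(page, pagelist);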

    Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
    Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Suggested-by: Huang Ying <ying.huang@intel.com>
    Reported-by: kernel test robot <lkp@intel.com> (build error)
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Nico Pache b8767107f7 mm/migration: remove unneeded lock page and PageMovable check
commit 160088b3b6d7946e456caa379dcdfc8702c66274
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Mon May 30 19:30:14 2022 +0800

    mm/migration: remove unneeded lock page and PageMovable check

    When a non-lru movable page is freed from under us, __ClearPageMovable
    must have been done.  So we can remove the unneeded lock page and
    PageMovable check here.  Also, free_pages_prepare() will clear
    PG_isolated for us, so we can further remove ClearPageIsolated, as
    suggested by David.

    Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: kernel test robot <lkp@intel.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Peter Xu <peterx@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:39 -07:00
Chris von Recklinghausen 30e9a2455a mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Bugzilla: https://bugzilla.redhat.com/2120352

commit 6c287605fd56466e645693eff3ae7c08fba56e0a
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:44 2022 -0700

    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.

    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.

    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE vs.  PMD-mapped
    anonymous pages require more thought, some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page and avoid updating a whole bunch of subpages all of the time for a
    simple PMD mapping of a THP.

    For simplicity, this commit mostly talks about "anonymous pages", while
    it's for THP actually "the part of an anonymous folio referenced via a
    page table entry".

    To not spill PG_anon_exclusive code all over the mm code-base, we let the
    anon rmap code to handle all PG_anon_exclusive logic it can easily handle.

    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
    pin (FOLL_PIN) on an anonymous page referenced via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.

    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE, follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.

    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:

    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.

    * For KSM, this means sharing will fail.  For swap this means unmapping
      will fail.  For migration this means migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry.

    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved, however swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.

    I. Details about basic handling

    I.1. Fresh anonymous pages

    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism fresh anonymous pages come into life (besides migration code
    where we copy the page->mapping), all fresh anonymous pages will start out
    as exclusive.

    I.2. COW reuse handling of anonymous pages

    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.

    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.

    I.3. Migration handling

    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.

    Writable migration entries implicitly point at shared anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.

    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
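
    Sketched from the description above, the try_to_migrate() side looks
    roughly like this (heavily abridged; the surrounding rmap walk and TLB
    flushing are omitted):

        anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(subpage);
        ...
        if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
                /* GUP pins detected: restore the PTE and fail the unmap */
                set_pte_at(mm, address, pvmw.pte, pteval);
                ret = false;
                break;
        }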

    I.4. Swapout handling

    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases, however, it requires more thought, care, and a way to store
    that information in swap entries.

    I.5. Swapin handling

    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

    I.6. THP handling

    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.

    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.

    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.

    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.

    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.

    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.

    a) Present anonymous pages

    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).

    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table and handle it just like present anonymous pages.

    b) Device private entry

    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.

    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because they cannot get pinned.

    c) HW poison entries

    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.

    d) Migration entries

    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.

    I.8. mprotect() handling

    mprotect() only has to properly handle the new readable-exclusive
    migration entry:

    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.

    II. Migration and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:

    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.

    2. GUP-fast pins the page due to lack of synchronization

    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry

    4. Migration fails due to the GUP pin (failing to freeze the refcount)

    5. Migration entries are restored. PG_anon_exclusive is lost

    -> We have a pinned page that is not marked exclusive anymore.

    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.

    III. Swapout and GUP-fast

    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:

    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.

    2. GUP-fast pins the page due to lack of synchronization.

    -> We have a pinned page that is not marked exclusive anymore.

    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.

    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen d42a301175 mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 28c5209dfd5f86f4398ce01bfac8508b2c4d4050
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: pass rmap flags to hugepage_add_anon_rmap()

    Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for
    page_add_anon_rmap() now.  RMAP_COMPOUND is implicit for hugetlb pages and
    ignored.

    Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:11 -04:00
Chris von Recklinghausen ab8c3870a8 mm/rmap: remove do_page_add_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit f1e2db12e45baaa2d366f87c885968096c2ff5aa
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: remove do_page_add_anon_rmap()

    ... and instead convert page_add_anon_rmap() to accept flags.

    Passing flags instead of bools is usually nicer either way, and we want to
    more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
    that an anonymous page is exclusive: for example, when restoring an
    anonymous page from a writable migration entry.

    This is a preparation for marking an anonymous page inside
    page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.

    Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen d8f21270d3 mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
Bugzilla: https://bugzilla.redhat.com/2120352

commit fb3d824d1a46c5bb0584ea88f32dc2495544aebf
Author: David Hildenbrand <david@redhat.com>
Date:   Mon May 9 18:20:43 2022 -0700

    mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()

    ...  and move the special check for pinned pages into
    page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
    via a new pageflag, clearing it only after making sure that there are no
    GUP pins on the anonymous page.

    We really only care about pins on anonymous pages, because they are prone
    to getting replaced in the COW handler once mapped R/O.  For !anon pages
    in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
    that; at least I could not come up with an example where it matters.

    Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
    know we're dealing with anonymous pages.  Also, drop the handling of
    pinned pages from copy_huge_pud() and add a comment if ever supporting
    anonymous pages on the PUD level.

    This is a preparation for tracking exclusivity of anonymous pages in the
    rmap code, and disallowing marking a page shared (-> failing to duplicate)
    if there are GUP pins on a page.

    Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:10 -04:00
Chris von Recklinghausen 6697b528b0 mm: handling Non-LRU pages returned by vm_normal_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3218f8712d6bba1812efd5e0d66c1e15134f2a91
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:11 2022 -0500

    mm: handling Non-LRU pages returned by vm_normal_pages

    With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
    device-managed anonymous pages that are not LRU pages.  Although they
    behave like normal pages for the purposes of mapping in CPU page tables
    and for COW, they do not support LRU lists, NUMA migration or THP.

    Callers to follow_page() currently don't expect ZONE_DEVICE pages,
    however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
    ZONE_DEVICE pages in applicable users of follow_page() as well.

    Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
    Signed-off-by: Alex Sierra <alex.sierra@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>       [v2]
    Reviewed-by: Alistair Popple <apopple@nvidia.com>       [v6]
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:09 -04:00
Chris von Recklinghausen bae6e93d2e mm/migration: fix possible do_pages_stat_array racing with memory offline
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4cd614841c06338a087769ee3cfa96718784d1f5
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:08 2022 -0700

    mm/migration: fix possible do_pages_stat_array racing with memory offline

    When follow_page peeks a page, the page could be migrated and then be
    offlined while it's still being used by the do_pages_stat_array().  Use
    FOLL_GET to hold the page refcnt to fix this potential race.

    Link: https://lkml.kernel.org/r/20220318111709.60311-12-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Acked-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
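
The do_pages_stat_array() path fixed here backs the query mode of the move_pages(2) syscall: passing a NULL nodes array only reports where each page currently resides. A minimal userspace sketch of that mode, assuming libnuma's <numaif.h> is installed (link with -lnuma); it merely exercises the interface and does not reproduce the race:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            void *buf;
            int status = 0;

            if (posix_memalign(&buf, psz, psz))
                    return 1;
            *(volatile char *)buf = 1;      /* fault the page in */

            /* nodes == NULL: query only; each status slot receives the
             * node id, or a negative errno such as -ENOENT */
            if (move_pages(0 /* self */, 1, &buf, NULL, &status, 0)) {
                    perror("move_pages");
                    return 1;
            }
            printf("status = %d (node id, or negative errno)\n", status);
            free(buf);
            return 0;
    }
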
Chris von Recklinghausen 0b694c170d mm/migration: fix potential invalid node access for reclaim-based migration
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3f26c88bd66cd8ab1731763c68df7fe23a7671c0
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:08 2022 -0700

    mm/migration: fix potential invalid node access for reclaim-based migration

    If we failed to setup hotplug state callbacks for mm/demotion:online in
    some corner cases, node_demotion will be left uninitialized.  An invalid
    node might be returned from next_demotion_node() when doing reclaim-based
    migration.  Use kcalloc to allocate node_demotion to fix the issue.

    Link: https://lkml.kernel.org/r/20220318111709.60311-11-linmiaohe@huawei.com
    Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen c692ee03d8 mm/migration: fix potential page refcounts leak in migrate_pages
Bugzilla: https://bugzilla.redhat.com/2120352

commit 69a041ff505806c95b24b8d5cab43e66aacd91e6
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Thu Apr 28 23:16:08 2022 -0700

    mm/migration: fix potential page refcounts leak in migrate_pages

    In -ENOMEM case, there might be some subpages of fail-to-migrate THPs left
    in thp_split_pages list.  We should move them back to migration list so
    that they could be put back to the right list by the caller; otherwise the
    page refcnt will be leaked here.  Also adjust nr_failed and nr_thp_failed
    accordingly to make the vm events accounting more accurate.

    Link: https://lkml.kernel.org/r/20220318111709.60311-10-linmiaohe@huawei.com
    Fixes: b5bade978e9b ("mm: migrate: fix the return value of migrate_pages()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen 42154efcec mm: only re-generate demotion targets when a numa node changes its N_CPU state
Conflicts: include/linux/migrate.h - 'extern bool numa_demotion_enabled;'
	is missing surrounding context to this patch so the lack of it causes
	a merge conflict. Don't add it in.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 734c15700cdf9062ae98d8b131c6fe873dfad26d
Author: Oscar Salvador <osalvador@suse.de>
Date:   Tue Mar 22 14:47:37 2022 -0700

    mm: only re-generate demotion targets when a numa node changes its N_CPU state

    Abhishek reported that after patch [1], hotplug operations are taking
    roughly double the expected time.  [2]

    The reason behind this is that the CPU callbacks that
    migrate_on_reclaim_init() sets always call set_migration_target_nodes()
    whenever a CPU is brought up/down.

    But we only care about numa nodes going from having cpus to becoming
    cpuless, and vice versa, as that influences the demotion_target order.

    We do already have two CPU callbacks (vmstat_cpu_online() and
    vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
    callbacks in migrate_on_reclaim_init() and only call
    set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
    numa node changes its N_CPU state.

    [1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
    [2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/

    [osalvador@suse.de: add feedback from Huang Ying]
      Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de

    Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
    Fixes: 884a6e5d1f93b ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reported-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:55 -04:00
Chris von Recklinghausen 63534db797 NUMA balancing: optimize page placement for memory tiering system
Bugzilla: https://bugzilla.redhat.com/2120352

commit c574bbe917036c8968b984c82c7b13194fe5ce98
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:23 2022 -0700

    NUMA balancing: optimize page placement for memory tiering system

    With the advent of various new memory types, some machines will have
    multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
    memory subsystem of these machines can be called memory tiering system,
    because the performance of the different types of memory are usually
    different.

    In such a system, because of changes in the memory access pattern etc.,
    some pages in the slow memory may become globally hot.  So in this
    patch, the NUMA balancing mechanism is enhanced to optimize the page
    placement among the different memory types according to hot/cold
    status dynamically.

    In a typical memory tiering system, there are CPUs, fast memory and slow
    memory in each physical NUMA node.  The CPUs and the fast memory will be
    put in one logical node (called fast memory node), while the slow memory
    will be put in another (faked) logical node (called slow memory node).
    That is, the fast memory is regarded as local while the slow memory is
    regarded as remote.  So it's possible for the recently accessed pages in
    the slow memory node to be promoted to the fast memory node via the
    existing NUMA balancing mechanism.

    The original NUMA balancing mechanism will stop migrating pages if the
    free memory of the target node drops below the high watermark.  This
    is a reasonable policy if there's only one memory type.  But it makes
    the original NUMA balancing mechanism almost useless for optimizing
    page placement among different memory types.  Details are as follows.

    It's the common case that the working-set size of the workload is
    larger than the size of the fast memory nodes.  Otherwise, it's
    unnecessary to use the slow memory at all.  So, there are almost never
    enough free pages in the fast memory nodes, and the globally hot
    pages in the slow memory node cannot be promoted to the fast memory
    node.  To solve the issue, we have 2 choices as follows,

    a. Ignore the free pages watermark checking when promoting hot pages
       from the slow memory node to the fast memory node.  This will
       create some memory pressure in the fast memory node, thus trigger
       the memory reclaiming.  So that, the cold pages in the fast memory
       node will be demoted to the slow memory node.

    b. Define a new watermark called wmark_promo which is higher than
       wmark_high, and have kswapd reclaiming pages until free pages reach
       such watermark.  The scenario is as follows: when we want to promote
       hot-pages from a slow memory to a fast memory, but fast memory's free
       pages would go lower than high watermark with such promotion, we wake
       up kswapd with wmark_promo watermark in order to demote cold pages and
       free us up some space.  So, next time we want to promote hot-pages we
       might have a chance of doing so.

    The choice "a" may create high memory pressure in the fast memory node.
    If the memory pressure of the workload is high, the memory pressure
    may become so high that the memory allocation latency of the workload
    is influenced, e.g.  the direct reclaiming may be triggered.

    The choice "b" works much better at this aspect.  If the memory
    pressure of the workload is high, the hot pages promotion will stop
    earlier because its allocation watermark is higher than that of the
    normal memory allocation.  So in this patch, choice "b" is implemented.
    A new zone watermark (WMARK_PROMO) is added.  Which is larger than the
    high watermark and can be controlled via watermark_scale_factor.

    In addition to the original page placement optimization among sockets,
    the NUMA balancing mechanism is extended to be used to optimize page
    placement according to hot/cold among different memory types.  So the
    sysctl user space interface (numa_balancing) is extended in a backward
    compatible way as follows, so that the users can enable/disable this
    functionality individually.

    The sysctl is converted from a Boolean value to a bit field.  The
    definition of the flags is,

    - 0: NUMA_BALANCING_DISABLED
    - 1: NUMA_BALANCING_NORMAL
    - 2: NUMA_BALANCING_MEMORY_TIERING

    We have tested the patch with the pmbench memory accessing benchmark
    with the 80:20 read/write ratio and the Gauss access address
    distribution on a 2 socket Intel server with Optane DC Persistent
    Memory Model.  The test results show that the pmbench score can
    improve up to 95.9%.

    Thanks to Andrew Morton for helping to fix the document format error.

    Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
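
As a usage note, the extended sysctl is exposed at /proc/sys/kernel/numa_balancing, and the bit values are the ones listed above. A minimal sketch (needs root) that enables both the classic balancing and the tiering promotion by writing the OR-ed mask NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING = 3:

    #include <stdio.h>

    int main(void)
    {
            const char *path = "/proc/sys/kernel/numa_balancing";
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return 1;
            }
            /* 0 = disabled, 1 = normal, 2 = memory tiering (bit field) */
            fprintf(f, "%d\n", 1 | 2);
            fclose(f);
            return 0;
    }
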
Chris von Recklinghausen b714bae950 NUMA Balancing: add page promotion counter
Conflicts: mm/migrate.c - We already have
	c185e494ae0c ("mm/migrate: Use a folio in migrate_misplaced_transhuge_page()")
	so keep its arguments to migrate_pages

Bugzilla: https://bugzilla.redhat.com/2120352

commit e39bb6be9f2b39a6dbaeff484361de76021b175d
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:20 2022 -0700

    NUMA Balancing: add page promotion counter

    Patch series "NUMA balancing: optimize memory placement for memory tiering s
ystem", v13

    With the advent of various new memory types, some machines will have
    multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
    memory subsystem of these machines can be called memory tiering system,
    because the performance of the different types of memory are different.

    After commit c221c0b030 ("device-dax: "Hotplug" persistent memory for
    use like normal RAM"), the PMEM could be used as the cost-effective
    volatile memory in separate NUMA nodes.  In a typical memory tiering
    system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
    CPUs and the DRAM will be put in one logical node, while the PMEM will
    be put in another (faked) logical node.

    To optimize the system overall performance, the hot pages should be
    placed in DRAM node.  To do that, we need to identify the hot pages in
    the PMEM node and migrate them to DRAM node via NUMA migration.

    In the original NUMA balancing, there are already a set of existing
    mechanisms to identify the pages recently accessed by the CPUs in a node
    and migrate the pages to the node.  So we can reuse these mechanisms to
    build the mechanisms to optimize the page placement in the memory
    tiering system.  This is implemented in this patchset.

    On the other hand, the cold pages should be placed in the PMEM node.  So, we
    also need to identify the cold pages in the DRAM node and migrate them
    to PMEM node.

    In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
    mechanism to demote the cold DRAM pages to PMEM node under memory
    pressure is implemented.  Based on that, the cold DRAM pages can be
    demoted to PMEM node proactively to free some memory space on DRAM node
    to accommodate the promoted hot PMEM pages.  This is implemented in this
    patchset too.

    We have tested the solution with the pmbench memory accessing benchmark
    with the 80:20 read/write ratio and the Gauss access address
    distribution on a 2 socket Intel server with Optane DC Persistent Memory
    Model.  The test results show that the pmbench score can improve up to
    95.9%.

    This patch (of 3):

    In a system with multiple memory types, e.g.  DRAM and PMEM, the CPU
    and DRAM in one socket will be put in one NUMA node as before, while
    the PMEM will be put in another NUMA node as described in the
    description of the commit c221c0b030 ("device-dax: "Hotplug"
    persistent memory for use like normal RAM").  So, the NUMA balancing
    mechanism will identify all PMEM accesses as remote access and try to
    promote the PMEM pages to DRAM.

    To distinguish the number of inter-type promoted pages from that of
    inter-socket migrated pages, a new vmstat counter is added.  The
    counter is per-node (counted in the target node), so it can be used to
    identify promotion imbalance among the NUMA nodes.

    Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
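
The new counter can be watched from userspace. A small sketch that scans /proc/vmstat for it; the item name is assumed here to start with "pgpromote" (pgpromote_success on kernels carrying this patch), and the per-node values live in /sys/devices/system/node/nodeN/vmstat:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/vmstat", "r");
            char line[128];

            if (!f) {
                    perror("/proc/vmstat");
                    return 1;
            }
            while (fgets(line, sizeof(line), f))
                    if (!strncmp(line, "pgpromote", 9))
                            fputs(line, stdout);    /* e.g. "pgpromote_success 1234" */
            fclose(f);
            return 0;
    }
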
Chris von Recklinghausen 4ba2af444f mm/migrate: fix race between lock page and clear PG_Isolated
Bugzilla: https://bugzilla.redhat.com/2120352

commit 356ea3865687926e5da7579d1f3351d3f0a322a1
Author: andrew.yang <andrew.yang@mediatek.com>
Date:   Tue Mar 22 14:46:08 2022 -0700

    mm/migrate: fix race between lock page and clear PG_Isolated

    When memory is tight, the system may start to compact memory to satisfy
    demands for large contiguous memory.  If one process tries to lock a
    memory page that is being locked and isolated for compaction, it may
    wait a long time or even forever.  This is because compaction performs
    a non-atomic PG_Isolated clear while holding the page lock; this may
    overwrite the PG_waiters bit set by the process that couldn't obtain
    the page lock and added itself to the wait queue for the lock.

      CPU1                            CPU2
      lock_page(page); (successful)
                                      lock_page(); (failed)
      __ClearPageIsolated(page);      SetPageWaiters(page) (may be overwritten)
      unlock_page(page);

    The solution is to not perform non-atomic operations on page flags while
    holding the page lock.

    Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
    Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: "Vlastimil Babka" <vbabka@suse.cz>
    Cc: David Howells <dhowells@redhat.com>
    Cc: "William Kucharski" <william.kucharski@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Marc Zyngier <maz@kernel.org>
    Cc: Nicholas Tang <nicholas.tang@mediatek.com>
    Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
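
A userspace analogy of the flag-update problem, sketched with C11 atomics; the bit names are invented stand-ins and kernel page flags use set_bit()/clear_bit() rather than stdatomic, so this only illustrates why a plain load/modify/store can lose a bit another thread sets concurrently while a single atomic read-modify-write cannot:

    #include <stdatomic.h>
    #include <stdio.h>

    #define FLAG_ISOLATED (1u << 0)         /* stand-in for PG_isolated */
    #define FLAG_WAITERS  (1u << 1)         /* stand-in for PG_waiters  */

    static _Atomic unsigned int flags = FLAG_ISOLATED;

    /* racy pattern (__ClearPageIsolated-style): a concurrent
     * atomic_fetch_or(&flags, FLAG_WAITERS) landing between the load and
     * the store is silently overwritten */
    static void clear_isolated_nonatomic(void)
    {
            unsigned int old = atomic_load(&flags);

            atomic_store(&flags, old & ~FLAG_ISOLATED);
    }

    /* safe pattern: one atomic RMW clears only its own bit */
    static void clear_isolated_atomic(void)
    {
            atomic_fetch_and(&flags, ~FLAG_ISOLATED);
    }

    int main(void)
    {
            clear_isolated_atomic();
            atomic_fetch_or(&flags, FLAG_WAITERS);
            printf("flags = 0x%x\n", atomic_load(&flags));  /* 0x2, waiters kept */

            /* same result single-threaded, but unsafe under concurrency */
            clear_isolated_nonatomic();
            printf("flags = 0x%x\n", atomic_load(&flags));  /* still 0x2 */
            return 0;
    }
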
Chris von Recklinghausen 40a64b9480 mm,migrate: fix establishing demotion target
Bugzilla: https://bugzilla.redhat.com/2120352

commit fc89213a636c3735eb3386f10a34c082271b4192
Author: Huang Ying <ying.huang@intel.com>
Date:   Tue Mar 22 14:46:05 2022 -0700

    mm,migrate: fix establishing demotion target

    In commit ac16ec835314 ("mm: migrate: support multiple target nodes
    demotion"), after the first demotion target node is found, we will
    continue to check the next candidate obtained via find_next_best_node().
    This is to find all demotion target nodes with same NUMA distance.  But
    one side effect of find_next_best_node() is that the candidate node
    returned will be set in the "used" parameter; even if the candidate node
    doesn't pass the following NUMA distance check, it will not be used as a
    demotion target node for the following nodes.  For example,
    for system as follows,

    node distances:
    node   0   1   2   3
      0:  10  21  17  28
      1:  21  10  28  17
      2:  17  28  10  28
      3:  28  17  28  10

    when we establish demotion target node for node 0, in the first round node
    2 is added to the demotion target node set.  Then in the second round,
    node 3 is checked and failed because distance(0, 3) > distance(0, 2).  But
    node 3 is set in "used" nodemask too.  When we establish demotion target
    node for node 1, there is no available node.  This is wrong, node 3 should
    be set as the demotion target of node 1.

    To fix this, if the candidate node fails to pass the distance check, it
    will be cleared in the "used" nodemask so that it can be used for the
    following nodes.

    The bug can be reproduced and fixed with this patch on a 2 socket server
    machine with DRAM and PMEM.

    Link: https://lkml.kernel.org/r/20220128055940.1792614-1-ying.huang@intel.com
    Fixes: ac16ec835314 ("mm: migrate: support multiple target nodes demotion")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Xunlei Pang <xlpang@linux.alibaba.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:54 -04:00
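
A standalone simulation of the fixed selection loop, assuming nodes 0 and 1 are the CPU nodes and using the distance table quoted above; next_best_node() is a crude stand-in for the kernel's find_next_best_node(), kept only to show why a rejected candidate must be cleared from the "used" mask again:

    #include <stdio.h>

    #define NR_NODES 4

    static const int dist[NR_NODES][NR_NODES] = {
            {10, 21, 17, 28},
            {21, 10, 28, 17},
            {17, 28, 10, 28},
            {28, 17, 28, 10},
    };

    /* pick the closest node not yet in *used and, like the real helper,
     * mark it used as a side effect */
    static int next_best_node(int src, unsigned int *used)
    {
            int best = -1;

            for (int n = 0; n < NR_NODES; n++) {
                    if (*used & (1u << n))
                            continue;
                    if (best < 0 || dist[src][n] < dist[src][best])
                            best = n;
            }
            if (best >= 0)
                    *used |= 1u << best;
            return best;
    }

    int main(void)
    {
            unsigned int used = (1u << 0) | (1u << 1);      /* CPU nodes */

            for (int src = 0; src <= 1; src++) {
                    int target = next_best_node(src, &used);
                    int best_dist = target < 0 ? -1 : dist[src][target];

                    printf("node %d demotes to:", src);
                    while (target >= 0) {
                            if (dist[src][target] != best_dist) {
                                    /* the fix: return the rejected
                                     * candidate to the pool */
                                    used &= ~(1u << target);
                                    break;
                            }
                            printf(" %d", target);
                            target = next_best_node(src, &used);
                    }
                    printf("\n");
            }
            return 0;
    }

Dropping the "used &= ~(1u << target)" line reproduces the reported behaviour in this toy model: the rejected candidate stays consumed by node 0's scan and node 1 ends up with no demotion target.
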
Chris von Recklinghausen cd6f34840e mm/fs: delete PF_SWAPWRITE
Bugzilla: https://bugzilla.redhat.com/2120352

commit b698f0a1773f7df73f2bb4bfe0e597ea1bb3881f
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Mar 22 14:45:38 2022 -0700

    mm/fs: delete PF_SWAPWRITE

    PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8e ("mm:
    vmscan: do not writeback filesystem pages in direct reclaim").

    Coincidentally, NeilBrown's current patch "remove inode_congested()"
    deletes may_write_to_inode(), which appeared to be the one function which
    took notice of PF_SWAPWRITE.  But if you study the old logic, and the
    conditions under which may_write_to_inode() was called, you discover that
    flag and function have been pointless for a decade.

    Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Jan Kara <jack@suse.de>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 1daa2a436a mm: remove unneeded local variable follflags
Bugzilla: https://bugzilla.redhat.com/2120352

commit 87d2762e22f3ea6885862cb1fd419b77a5bcd8f7
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Tue Mar 22 14:45:29 2022 -0700

    mm: remove unneeded local variable follflags

    We can pass FOLL_GET | FOLL_DUMP to follow_page directly to simplify the
    code a bit in add_page_for_migration and split_huge_pages_pid.

    Link: https://lkml.kernel.org/r/20220311072002.35575-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen 0070af694f mm/gup: follow_pfn_pte(): -EEXIST cleanup
Bugzilla: https://bugzilla.redhat.com/2120352

commit 65462462ffb28fddf13d46c628c4fc55878ab397
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Tue Mar 22 14:39:40 2022 -0700

    mm/gup: follow_pfn_pte(): -EEXIST cleanup

    Remove a quirky special case from follow_pfn_pte(), and adjust its
    callers to match.  Caller changes include:

    __get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and
    its variants should handle PFN-only entries by stopping early, if the
    caller expected **pages to be filled in.  This makes for a more reliable
    API, as compared to the previous approach of skipping over such entries
    (and thus leaving them silently unwritten).

    move_pages(): squash the -EEXIST error return from follow_page() into
    -EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is
    not.

    Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com
    Signed-off-by: John Hubbard <jhubbard@nvidia.com>
    Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Alex Williamson <alex.williamson@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:49 -04:00
Chris von Recklinghausen cf5ab9070a mm: move the migrate_vma_* device migration code into its own file
Conflicts:
        mm/migrate.c - The backport of
                ab09243aa95a ("mm/migrate.c: remove MIGRATE_PFN_LOCKED")
                had a conflict because of the backports of
                413248faac ("mm/rmap: Convert try_to_migrate() to folios")
                and
                4eecb8b9163d ("mm/migrate: Convert remove_migration_ptes() to folios")
                which leads to a difference in deleted code.
        mm/migrate_device.c - because of 413248faac and 4eecb8b9163d add
                code to use folios for calls to try_to_migrate and
                remove_migration_ptes

Bugzilla: https://bugzilla.redhat.com/2120352

commit 76cbbead253ddcae9878be0d702208bb1e4fac6f
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:38 2022 +1100

    mm: move the migrate_vma_* device migration code into its own file

    Split the code used to migrate to and from ZONE_DEVICE memory from
    migrate.c into a new file.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen 19855bbeb7 mm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Bugzilla: https://bugzilla.redhat.com/2120352
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

commit aaf7d70cc595c78d27e915451e93a4459cfc36f3
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:37 2022 +1100

    mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

    Make the flow a little more clear and prepare for adding a new
    ZONE_DEVICE memory type.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-13-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen af1b8a016a mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Bugzilla: https://bugzilla.redhat.com/2120352
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

commit 1776c0d102482d4aeccd56e404285bc47a481be8
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:37 2022 +1100

    mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

    Make the flow a little more clear and prepare for adding a new
    ZONE_DEVICE memory type.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-12-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen 4cc30e4e8f mm: remove the extra ZONE_DEVICE struct page refcount
Conflicts: mm/internal.h - We already have
	09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
	ece1ed7bfa12 ("mm/gup: Add try_get_folio() and try_grab_folio()")
	so keep declarations for boot_nodestats and try_grab_folio

Bugzilla: https://bugzilla.redhat.com/2120352
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

commit 27674ef6c73f0c9096a9827dc5d6ba9fc7808422
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Feb 16 15:31:36 2022 +1100

    mm: remove the extra ZONE_DEVICE struct page refcount

    ZONE_DEVICE struct pages have an extra reference count that complicates
    the code for put_page() and several places in the kernel that need to
    check the reference count to see that a page is not being used (gup,
    compaction, migration, etc.). Clean up the code so the reference count
    doesn't need to be treated specially for ZONE_DEVICE pages.

    Note that this excludes the special idle page wakeup for fsdax pages,
    which still happens at refcount 1.  This is a separate issue and will
    be sorted out later.  Given that only fsdax pages require the
    notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
    symbol can go away and be replaced with a FS_DAX check for this hook
    in the put_page fastpath.

    Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.

    Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Karol Herbst <kherbst@redhat.com>
    Cc: Lyude Paul <lyude@redhat.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen 44cfedb08b mm/migrate: remove redundant variables used in a for-loop
Bugzilla: https://bugzilla.redhat.com/2120352

commit f1e8db04b68cc56edc5baee5c7cb1f9b79c3da7e
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Fri Jan 14 14:08:53 2022 -0800

    mm/migrate: remove redundant variables used in a for-loop

    The variable addr is being set and incremented in a for-loop but not
    actually being used.  It is redundant and so addr and also variable
    start can be removed.

    Link: https://lkml.kernel.org/r/20211221185729.609630-1-colin.i.king@gmail.com
    Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
Chris von Recklinghausen 02d53bdab7 mm/migrate: move node demotion code to near its user
Bugzilla: https://bugzilla.redhat.com/2120352

commit dcee9bf5bf2f59c173f3645ac2274595ac6c6aea
Author: Huang Ying <ying.huang@intel.com>
Date:   Fri Jan 14 14:08:49 2022 -0800

    mm/migrate: move node demotion code to near its user

    Now, node_demotion and next_demotion_node() are placed between
    __unmap_and_move() and unmap_and_move().  This hurts code readability.
    So move them near their users in the file.  There's no functionality
    change in this patch.

    Link: https://lkml.kernel.org/r/20211206031227.3323097-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Wei Xu <weixugc@google.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
Chris von Recklinghausen 958854130f mm: migrate: add more comments for selecting target node randomly
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7813a1b5257b8eb2cb915cd08e7ba857070fdfd3
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Jan 14 14:08:46 2022 -0800

    mm: migrate: add more comments for selecting target node randomly

    As Yang Shi suggested [1], it will be helpful to explain why we should
    select target node randomly now if there are multiple target nodes.

    [1] https://lore.kernel.org/all/CAHbLzkqSqCL+g7dfzeOw8fPyeEC0BBv13Ny1UVGHDkadnQdR=g@mail.gmail.com/

    Link: https://lkml.kernel.org/r/c31d36bd097c6e9e69fc0f409c43b78e53e64fc2.1637766801.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Xunlei Pang <xlpang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
Chris von Recklinghausen 1fc5c04222 mm: migrate: support multiple target nodes demotion
Bugzilla: https://bugzilla.redhat.com/2120352

commit ac16ec835314677dd7405dfb5a5e007c3ca424c7
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Jan 14 14:08:43 2022 -0800

    mm: migrate: support multiple target nodes demotion

    We have some machines with multiple memory types like below, which have
    one fast (DRAM) memory node and two slow (persistent memory) memory
    nodes.  According to the current node demotion policy, if node 0 fills up,
    its memory should be migrated to node 1; when node 1 fills up, its
    memory will be migrated to node 2: node 0 -> node 1 -> node 2 -> stop.

    But this is not an efficient and suitable memory migration route for our
    machine with multiple slow memory nodes.  Since the distance between
    node 0 to node 1 and node 0 to node 2 is equal, and memory migration
    between slow memory nodes will increase persistent memory bandwidth
    greatly, which will hurt the whole system's performance.

    Thus for this case, we can treat the slow memory node 1 and node 2 as a
    whole slow memory region, and we should migrate memory from node 0 to
    node 1 and node 2 if node 0 fills up.

    This patch changes the node_demotion data structure to support multiple
    target nodes, and establishes the migration path to support multiple
    target nodes with validating if the node distance is the best or not.

      available: 3 nodes (0-2)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      node 0 size: 62153 MB
      node 0 free: 55135 MB
      node 1 cpus:
      node 1 size: 127007 MB
      node 1 free: 126930 MB
      node 2 cpus:
      node 2 size: 126968 MB
      node 2 free: 126878 MB
      node distances:
      node   0   1   2
        0:  10  20  20
        1:  20  10  20
        2:  20  20  10

    Link: https://lkml.kernel.org/r/00728da107789bb4ed9e0d28b1d08fd8056af2ef.1636697263.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
    Cc: Xunlei Pang <xlpang@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
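
The demotion order is derived from the firmware-provided node distances, which can be inspected from userspace. A small sketch that dumps /sys/devices/system/node/nodeN/distance (assuming consecutively numbered nodes), producing a table like the one quoted above:

    #include <stdio.h>

    int main(void)
    {
            char path[64], line[256];

            for (int n = 0; ; n++) {
                    snprintf(path, sizeof(path),
                             "/sys/devices/system/node/node%d/distance", n);
                    FILE *f = fopen(path, "r");

                    if (!f)
                            break;  /* no more nodes */
                    if (fgets(line, sizeof(line), f))
                            printf("node %d distances: %s", n, line);
                    fclose(f);
            }
            return 0;
    }
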
Chris von Recklinghausen 6a1def8719 mm: migrate: correct the hugetlb migration stats
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5d39a7ebc8be70e30176aed6f98f799bfa7439d6
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Jan 14 14:08:37 2022 -0800

    mm: migrate: correct the hugetlb migration stats

    Correct the migration stats for hugetlb with using compound_nr() instead
    of thp_nr_pages(), meanwhile change 'nr_failed_pages' to record the
    number of normal pages failed to migrate, including THP and hugetlb, and
    'nr_succeeded' will record the number of normal pages migrated
    successfully.

    [baolin.wang@linux.alibaba.com: fix docs, per Mike]
      Link: https://lkml.kernel.org/r/141bdfc6-f898-3cc3-f692-726c5f6cb74d@linux.alibaba.com

    Link: https://lkml.kernel.org/r/71a4b6c22f208728fe8c78ad26375436c4ff9704.1636275127.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
Chris von Recklinghausen b42b2cc58d mm: migrate: fix the return value of migrate_pages()
Bugzilla: https://bugzilla.redhat.com/2120352

commit b5bade978e9b8f42521ccef711642bd21313cf44
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Fri Jan 14 14:08:34 2022 -0800

    mm: migrate: fix the return value of migrate_pages()

    Patch series "Improve the migration stats".

    According to talk with Zi Yan [1], this patch set changes the return
    value of migrate_pages() to avoid returning a number which is larger
    than the number of pages the users tried to migrate by move_pages()
    syscall.  Also fix the hugetlb migration stats and migration stats in
    trace_mm_compaction_migratepages().

    [1] https://lore.kernel.org/linux-mm/7E44019D-2A5D-4BA7-B4D5-00D4712F1687@nvidia.com/

    This patch (of 3):

    As Zi Yan pointed out, the syscall move_pages() can return a
    non-migrated number larger than the number of pages the users tried to
    migrate, when a THP page fails to migrate.  This is confusing for
    users.

    Since other migration scenarios do not care about the actual number of
    non-migrated pages, except for the memory compaction migration (which
    will be fixed in a following patch), we can change the return value
    to return the number of {normal page, THP, hugetlb} instead to avoid
    this issue, and the number of THP splits will be considered as the
    number of non-migrated THP, no matter how many subpages of the THP are
    migrated successfully.  Meanwhile we should still keep the migration
    counters using the number of normal pages.

    Link: https://lkml.kernel.org/r/cover.1636275127.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/6486fabc3e8c66ff613e150af25e89b3147977a6.1636275127.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Signed-off-by: Zi Yan <ziy@nvidia.com>
    Co-developed-by: Zi Yan <ziy@nvidia.com>
    Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:40 -04:00
Chris von Recklinghausen 46a6ceff68 mm/migrate.c: remove MIGRATE_PFN_LOCKED
Conflicts:
	drivers/gpu/drm/amd/amdkfd/kfd_migrate.c,
	drivers/gpu/drm/nouveau/nouveau_dmem.c -
		Changes done as part of CS9 commit
		75030c7eac ("Merge DRM changes from upstream v5.15..v5.16")
	mm/migrate.c -
		CS9 commit
		413248faac ("mm/rmap: Convert try_to_migrate() to folios")
		changed try_to_migrate to take a folio. Change the
		try_to_migrate call that this patch adds to call page_folio on
		the page argument.
		The conflict in CS9 commit
		ca19554894 ("mm/rmap: Convert rmap_walk() to take a folio")
		changed the first argument to remove_migration_pte, which
		causes a merge conflict with this patch. Remove lines in this
		hunk as if it were called with the old arguments.
		CS9 commit
		86b6e00a7b ("mm/migrate: Convert remove_migration_ptes() to folios")
		changed the unlock_page call to folio_unlock. The put_page call
		this patch adds would be redundant since folio_unlock does an
		implied put_page, so just leave the folio_unlock call

Bugzilla: https://bugzilla.redhat.com/2120352

commit ab09243aa95a72bac5c71e852773de34116f8d0f
Author: Alistair Popple <apopple@nvidia.com>
Date:   Wed Nov 10 20:32:40 2021 -0800

    mm/migrate.c: remove MIGRATE_PFN_LOCKED

    MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
    source page was already locked during migrate_vma_collect().  If it
    wasn't, then a second attempt is made to lock the page.  However, if
    the first attempt failed it's unlikely a second attempt will succeed,
    and the retry adds complexity.  So clean this up by removing the retry
    and MIGRATE_PFN_LOCKED flag.

    Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
    set, but nothing actually checks that.

    Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Chris von Recklinghausen d041f203bc mm: migrate: make demotion knob depend on migration
Conflicts: include/linux/migrate.h - The presence of
	19138349ed59 ("mm/migrate: Add folio_migrate_flags()")
	715cbfd6c5c5 ("mm/migrate: Add folio_migrate_copy()")
	and
	3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()")
	causes a merge conflict due to differing context. Add the 2 lines for
	numa_demotion_enabled just below the declaration for
	migrate_page_move_mapping

Bugzilla: https://bugzilla.redhat.com/2120352

commit 20f9ba4f995247bb79e243741b8fdddbd76dd923
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Nov 5 13:43:35 2021 -0700

    mm: migrate: make demotion knob depend on migration

    The memory demotion needs to call migrate_pages() to do the jobs.  And
    it is controlled by a knob, however, the knob doesn't depend on
    CONFIG_MIGRATION.  The knob could be turned on even though MIGRATION is
    disabled, this will not cause any crash since migrate_pages() would just
    return -ENOSYS.  But it is definitely not optimal to go through demotion
    path then retry regular swap every time.

    And it doesn't make too much sense to have the knob visible to the users
    when !MIGRATION.  Move the related code from mempolicy.[h|c] to
    migrate.[h|c].

    Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Acked-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
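
For reference, the knob this commit relocates is the reclaim-based demotion switch exposed (on kernels with the feature) at /sys/kernel/mm/numa/demotion_enabled. A minimal sketch that reports its current state by printing the file's raw content:

    #include <stdio.h>

    int main(void)
    {
            const char *path = "/sys/kernel/mm/numa/demotion_enabled";
            char buf[16] = "";
            FILE *f = fopen(path, "r");

            if (!f) {
                    perror(path);   /* e.g. CONFIG_MIGRATION=n or old kernel */
                    return 1;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("numa demotion enabled: %s", buf);
            fclose(f);
            return 0;
    }
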
Chris von Recklinghausen d311f2f0fb mm: change page type prior to adding page table entry
Conflicts:
* mm/memory.c: minor fuzz due to missing upstream commit (and related series)
  b756a3b5e7 ("mm: device exclusive memory access")

* mm/migrate.c: minor fuzz due to backport commits
  86b6e00a7b ("mm/migrate: Convert remove_migration_ptes() to folios")
  e462accf60 ("mm/munlock: protect the per-CPU pagevec by a local_lock_t")

Bugzilla: https://bugzilla.redhat.com/2120352

commit 1eba86c096e35e3cc83de1ad2c26f2d70470211b
Author: Pasha Tatashin <pasha.tatashin@soleen.com>
Date:   Fri Jan 14 14:06:29 2022 -0800

    mm: change page type prior to adding page table entry

    Patch series "page table check", v3.

    Ensure that some memory corruptions are prevented by checking at the
    time of insertion of entries into user page tables that there is no
    illegal sharing.

    We have recently found a problem [1] that has existed in the kernel since 4.14.
    The problem was caused by broken page ref count and led to memory
    leaking from one process into another.  The problem was accidentally
    detected by studying a dump of one process and noticing that one page
    contains memory that should not belong to this process.

    There are some other page->_refcount related problems that were recently
    fixed: [2], [3] which potentially could also lead to illegal sharing.

    In addition to hardening refcount [4] itself, this work is an attempt to
    prevent this class of memory corruption issues.

    It uses a simple state machine that is independent from regular MM logic
    to check for illegal sharing at time pages are inserted and removed from
    page tables.

    [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com
    [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com
    [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com

    This patch (of 4):

    There are a few places where we first update the entry in the user page
    table, and later change the struct page to indicate that this is
    anonymous or file page.

    In most places, however, we first configure the page metadata and then
    insert entries into the page table.  Page table check will use the
    information from struct page to verify the type of entry being inserted.

    Change the order in all places to first update struct page, and later to
    update page table.

    This means that we first do calls that may change the type of page (anon
    or file):

            page_move_anon_rmap
            page_add_anon_rmap
            do_page_add_anon_rmap
            page_add_new_anon_rmap
            page_add_file_rmap
            hugepage_add_anon_rmap
            hugepage_add_new_anon_rmap

    And after that do calls that add entries to the page table:

            set_huge_pte_at
            set_pte_at

    Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com
    Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Will Deacon <will@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Masahiro Yamada <masahiroy@kernel.org>
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:13 -04:00
Waiman Long e462accf60 mm/munlock: protect the per-CPU pagevec by a local_lock_t
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671
Conflicts: A minor fuzz in mm/migrate.c due to missing upstream commit
	   1eba86c096e3 ("mm: change page type prior to adding page
	   table entry"). Pulling it, however, will require taking in
	   a number of additional patches. So it is not done here.

commit adb11e78c5dc5e26774acb05f983da36447f7911
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri, 1 Apr 2022 11:28:33 -0700

    mm/munlock: protect the per-CPU pagevec by a local_lock_t

    The access to mlock_pvec is protected by disabling preemption via
    get_cpu_var() or implicit by having preemption disabled by the caller
    (in mlock_page_drain() case).  This breaks on PREEMPT_RT since
    folio_lruvec_lock_irq() acquires a sleeping lock in this section.

    Create struct mlock_pvec which consists of the local_lock_t and the
    pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
    Replace mlock_page_drain() with a _local() version which is invoked on
    the local CPU and acquires the local_lock_t and a _remote() version
    which uses the pagevec from a remote CPU which went offline.

    Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
Waiman Long 361ecfe9c4 mm/migration: add trace events for base page and HugeTLB migrations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2109671

commit 4cc79b3303f224a920f3aff21f3d231749d73384
Author: Anshuman Khandual <anshuman.khandual@arm.com>
Date:   Thu, 24 Mar 2022 18:10:01 -0700

    mm/migration: add trace events for base page and HugeTLB migrations

    This adds two trace events for base page and HugeTLB page migrations.
    These events, closely follow the implementation details like setting and
    removing of PTE migration entries, which are essential operations for
    migration.  The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
    <events/migration.h> and <events/tlb.h> based trace events.  Hence drop
    redundant CREATE_TRACE_POINTS from other places which could have otherwise
    conflicted during build.

    Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-07-21 14:50:55 -04:00
Aristeu Rozanski 171b821b74 mm/migrate: Use a folio in migrate_misplaced_transhuge_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due to missing e39bb6be9f2b39a

commit c185e494ae0ceb126d89b8e3413ed0a1132e05d3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jul 6 10:50:39 2021 -0400

    mm/migrate: Use a folio in migrate_misplaced_transhuge_page()

    Unify alloc_misplaced_dst_page() and alloc_misplaced_dst_page_thp().
    Removes an assumption that compound pages are HPAGE_PMD_ORDER.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:22 -04:00
Aristeu Rozanski 47f9297e7a mm/migrate: Use a folio in alloc_migration_target()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit ffe06786b54039edcecb51a54061ee8d81036a19
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Apr 4 14:35:04 2022 -0400

    mm/migrate: Use a folio in alloc_migration_target()

    This removes an assumption that a large folio is HPAGE_PMD_ORDER
    as well as letting us remove the call to prep_transhuge_page()
    and a few hidden calls to compound_head().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:22 -04:00
Aristeu Rozanski da85506184 mm: replace multiple dcache flush with flush_dcache_folio()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 3150be8fa89e4d1064d250bb3f8ea3665d1ec5e9
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:42:11 2022 -0700

    mm: replace multiple dcache flush with flush_dcache_folio()

    Simplify the code by using flush_dcache_folio().

    Link: https://lkml.kernel.org/r/20220210123058.79206-8-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lars Persson <lars.persson@axis.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:21 -04:00
Aristeu Rozanski 7e13068c92 mm: fix missing cache flush for all tail pages of compound page
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2771739a7162782c0aa6424b2e3dd874e884a15d
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue Mar 22 14:41:56 2022 -0700

    mm: fix missing cache flush for all tail pages of compound page

    The D-cache maintenance inside move_to_new_page() only considers one
    page, so there is still a D-cache maintenance issue for the tail pages
    of a compound page (e.g. THP or HugeTLB).

    THP migration is only enabled on x86_64, arm64 and powerpc; of these,
    arm64 and powerpc need to maintain consistency between the I-cache and
    the D-cache, which depends on flush_dcache_page().

    There are no issues on arm64 and powerpc, however, since their icache
    flush functions already consider compound pages.  HugeTLB migration is
    enabled on arm, arm64, mips, parisc, powerpc, riscv, s390 and sh; arm
    handles the compound page cache flush in flush_dcache_page(), but most
    of the others do not.

    In theory, the issue exists on many architectures.  Fix this by not
    using flush_dcache_folio() since it is not backportable.
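
    A hedged sketch of the shape of the fix in the migration copy path:
    flush every subpage of the destination compound page instead of only
    the head page.

            int i, nr = compound_nr(newpage);

            /* flush_dcache_page() handles a single page, so walk all
             * the subpages of the compound destination page: */
            for (i = 0; i < nr; i++)
                    flush_dcache_page(newpage + i);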

    Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
    Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Lars Persson <lars.persson@axis.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:20 -04:00
Aristeu Rozanski ca19554894 mm/rmap: Convert rmap_walk() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: due to the lack of ab09243aa95a7 we need to convert the call to remove_migration_pte() from migrate_vma_prepare(); also removing the temporary folio variable in try_to_migrate_one() and in try_to_unmap_one() that was introduced to allow building (for bisect) given the lack of af28a988b313

commit 2f031c6f042cb8a9b221a8b6b80e69de5170f830
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jan 29 16:06:53 2022 -0500

    mm/rmap: Convert rmap_walk() to take a folio

    This ripples all the way through to every calling and called function
    from rmap.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:19 -04:00
Aristeu Rozanski 86b6e00a7b mm/migrate: Convert remove_migration_ptes() to folios
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: memory_device.c doesn't exist, so the changes are made in memory.c. ab09243aa95a72bac5c71e852773de34116f8d0f wasn't backported, so put_page() is kept unchanged in migrate_vma_unmap()

commit 4eecb8b9163df82c87c91764a02fff228ef25f6d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 28 23:32:59 2022 -0500

    mm/migrate: Convert remove_migration_ptes() to folios

    Convert the implementation and all callers.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 413248faac mm/rmap: Convert try_to_migrate() to folios
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context due to missing af28a988b313a; mm/migrate_device.c doesn't exist, so the changes are made in mm/migrate.c; ab09243aa95a72 is missing, so the changes had to be adapted to the current version of migrate_vma_unmap()

commit 4b8554c527f3cfa183f6c06d231a9387873205a0
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 28 14:29:43 2022 -0500

    mm/rmap: Convert try_to_migrate() to folios

    Convert the callers to pass a folio and the try_to_migrate_one()
    worker to use a folio throughout.  Fixes an assumption that a
    folio must be <= PMD size.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 880848c57e mm: Convert page_vma_mapped_walk to work on PFNs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 2aff7a4755bed2870ee23b75bc88cdc8d76cdd03
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Feb 3 11:40:17 2022 -0500

    mm: Convert page_vma_mapped_walk to work on PFNs

    page_mapped_in_vma() really just wants to walk one page, but as the
    code stands, if passed the head page of a compound page, it will
    walk every page in the compound page.  Extract pfn/nr_pages/pgoff
    from the struct page early, so they can be overridden by
    page_mapped_in_vma().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:18 -04:00
Aristeu Rozanski 101cf08986 mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit eed05e54d275b3cfc5d8c79843c5276a5878e94a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu Feb 3 09:06:08 2022 -0500

    mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK

    Instead of declaring a struct page_vma_mapped_walk directly,
    use these helpers to allow us to transition to a PFN approach in the
    following patches.
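
    A hedged usage sketch (the macro arguments follow the mainline
    helper; pvmw, folio, vma and address are illustrative locals):

            /* instead of open-coding a struct page_vma_mapped_walk
             * initializer in every rmap walker: */
            DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

            while (page_vma_mapped_walk(&pvmw)) {
                    /* ... operate on the mapped pte/pmd ... */
            }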

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:17 -04:00
Aristeu Rozanski e7286c475a mm/munlock: page migration needs mlock pagevec drained
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due to missing 1eba86c096e35

commit b74355078b6554271371532a5daa3b1a3db620f9
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:38:47 2022 -0800

    mm/munlock: page migration needs mlock pagevec drained

    Page migration of a VM_LOCKED page tends to fail, because when the old
    page is unmapped, it is put on the mlock pagevec with raised refcount,
    which then fails the freeze.

    At first I thought this would be fixed by a local mlock_page_drain() at
    the upper rmap_walk() level - which would have nicely batched all the
    munlocks of that page; but tests show that the task can too easily move
    to another cpu, leaving pagevec residue behind which fails the migration.

    So try_to_migrate_one() now drains the local pagevec after
    page_remove_rmap() from a VM_LOCKED vma; the same is done in
    try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same
    treatment; and in remove_migration_pte() - not important when
    successfully inserting a new page, but necessary when hoping to retry
    after failure.

    Any new pagevec runs the risk of adding a new way of stranding, and we
    might discover other corners where mlock_page_drain() or lru_add_drain()
    would now help.
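
    A minimal sketch of where the drain lands in the rmap unmap paths;
    the helper is spelled mlock_page_drain() in this series (it becomes
    mlock_page_drain_local() later), and subpage/vma are the walker's
    locals:

            page_remove_rmap(subpage, vma, false);
            if (vma->vm_flags & VM_LOCKED)
                    /* flush this CPU's mlock pagevec so a stray pagevec
                     * reference does not defeat the migration freeze */
                    mlock_page_drain(smp_processor_id());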

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski d7b21f40a6 mm/migrate: __unmap_and_move() push good newpage to LRU
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit c3096e6782b733158bf34f6bbb4567808d4e0740
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:33:17 2022 -0800

    mm/migrate: __unmap_and_move() push good newpage to LRU

    Compaction, NUMA page movement, THP collapse/split, and memory failure
    do isolate unevictable pages from their "LRU", losing the record of
    mlock_count in doing so (isolators are likely to use page->lru for their
    own private lists, so mlock_count has to be presumed lost).

    That's unfortunate, and we should put in some work to correct that: one
    can imagine a function to build up the mlock_count again - but it would
    require i_mmap_rwsem for read, so be careful where it's called.  Or
    page_referenced_one() and try_to_unmap_one() might do that extra work.

    But one place that can very easily be improved is page migration's
    __unmap_and_move(): a small adjustment to where the successful new page
    is put back on LRU, and its mlock_count (if any) is built back up by
    remove_migration_ptes().

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski 4b2aa38f6e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites
Conflicts: context difference due to the lack of f4c4a3f484 and differences due to the RHEL-only 44740bc20b

commit cea86fe246b694a191804b47378eb9d77aefabec
Author: Hugh Dickins <hughd@google.com>
Date:   Mon Feb 14 18:26:39 2022 -0800

    mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

    Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
    inline functions which check (vma->vm_flags & VM_LOCKED) before calling
    mlock_page() and munlock_page() in mm/mlock.c.

    Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
    because we have understandable difficulty in accounting pte maps of THPs,
    and if passed a PageHead page, mlock_page() and munlock_page() cannot
    tell whether it's a pmd map to be counted or a pte map to be ignored.

    Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
    others, and use that to call mlock_vma_page() at the end of the page
    adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
    beginning? unimportant, but end was easier for assertions in testing).

    No page lock is required (although almost all adds happen to hold it):
    delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
    Certainly page lock did serialize with page migration, but I'm having
    difficulty explaining why that was ever important.

    Mlock accounting on THPs has been hard to define, differed between anon
    and file, involved PageDoubleMap in some places and not others, required
    clear_page_mlock() at some points.  Keep it simple now: just count the
    pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

    page_add_new_anon_rmap() callers unchanged: they have long been calling
    lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
    handling (it also checks for not VM_SPECIAL: I think that's overcautious,
    and inconsistent with other checks, that mmap_region() already prevents
    VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski f05fb4175c mm/migrate.c: rework migration_entry_wait() to not take a pageref
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit ffa65753c43142f3b803486442813744da71cff2
Author: Alistair Popple <apopple@nvidia.com>
Date:   Fri Jan 21 22:10:46 2022 -0800

    mm/migrate.c: rework migration_entry_wait() to not take a pageref

    This fixes the FIXME in migrate_vma_check_page().

    Before migrating a page migration code will take a reference and check
    there are no unexpected page references, failing the migration if there
    are.  When a thread faults on a migration entry it will take a temporary
    reference to the page to wait for the page to become unlocked signifying
    the migration entry has been removed.

    This reference is dropped just prior to waiting on the page lock,
    however the extra reference can cause migration failures so it is
    desirable to avoid taking it.

    As migration code already has a reference to the migrating page an extra
    reference to wait on PG_locked is unnecessary so long as the reference
    can't be dropped whilst setting up the wait.

    When faulting on a migration entry the ptl is taken to check the
    migration entry.  Removing a migration entry also requires the ptl, and
    migration code won't drop its page reference until after the migration
    entry has been removed.  Therefore retaining the ptl of a migration
    entry is sufficient to ensure the page has a reference.  Reworking
    migration_entry_wait() to hold the ptl until the wait setup is complete
    means the extra page reference is no longer needed.
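
    A simplified sketch of the reworked pte-level wait, as a fragment of
    __migration_entry_wait() (ptep and ptl come from the caller; the
    locked-wait helper and its signature follow this series; error paths
    are omitted):

            pte_t pte;

            spin_lock(ptl);
            pte = *ptep;
            if (!is_swap_pte(pte) ||
                !is_migration_entry(pte_to_swp_entry(pte))) {
                    pte_unmap_unlock(ptep, ptl);
                    return;
            }
            /* keep the ptl held until the wait is fully set up: the page
             * cannot lose its migration reference while the migration
             * entry exists, so no temporary get_page() is needed.  The
             * helper releases the ptl itself once queued on the page
             * lock waitqueue. */
            migration_entry_wait_on_locked(pte_to_swp_entry(pte), ptep, ptl);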

    [apopple@nvidia.com: v5]
      Link: https://lkml.kernel.org/r/20211213033848.1973946-1-apopple@nvidia.com

    Link: https://lkml.kernel.org/r/20211118020754.954425-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Ralph Campbell <rcampbell@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:12 -04:00
Aristeu Rozanski dbfc7e3a84 mm: Use multi-index entries in the page cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 6b24ca4a1a8d4ee3221d6d44ddbb99f542e4bda3
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Jun 27 22:19:08 2020 -0400

    mm: Use multi-index entries in the page cache

    We currently store large folios as 2^N consecutive entries.  While this
    consumes rather more memory than necessary, it also turns out to be buggy.
    A writeback operation which starts within a tail page of a dirty folio will
    not write back the folio as the xarray's dirty bit is only set on the
    head index.  With multi-index entries, the dirty bit will be found no
    matter where in the folio the operation starts.

    This does end up simplifying the page cache slightly, although not as
    much as I had hoped.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:11 -04:00
Aristeu Rozanski 6cf24c91a2 filemap: Add folio_put_wait_locked()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 9f2b04a25a41b1f41b3cead4f56854a4192ec5b0
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Aug 16 23:36:31 2021 -0400

    filemap: Add folio_put_wait_locked()

    Convert all three callers of put_and_wait_on_page_locked() to
    folio_put_wait_locked().  This shrinks the kernel overall by 19 bytes.
    filemap_update_page() shrinks by 19 bytes while __migration_entry_wait()
    is unchanged.  folio_put_wait_locked() is 14 bytes smaller than
    put_and_wait_on_page_locked(), but pmd_migration_entry_wait() grows by
    14 bytes.  It removes the assumption from pmd_migration_entry_wait()
    that pages cannot be larger than a PMD (which is true today, but
    may be interesting to explore in the future).

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:07 -04:00
Aristeu Rozanski b50af80e1f mm: migrate: simplify the file-backed pages validation when migrating its mapping
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 0ef024621417fa3fcdeb2c3320f90ee34e18a5d9
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Nov 10 20:32:37 2021 -0800

    mm: migrate: simplify the file-backed pages validation when migrating its mapping

    There is no need to validate the file-backed page's refcount before
    trying to freeze the page's expected refcount; instead we can rely on
    folio_ref_freeze() to validate whether the page has the expected
    refcount before migrating its mapping.

    Moreover we are always under the page lock when migrating the page
    mapping, which means nowhere else can remove it from the page cache, so
    we can remove the xas_load() validation under the i_pages lock.
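
    A minimal sketch of the resulting flow in the mapping-migration path
    (xas and expected_count as in folio_migrate_mapping(); illustrative
    only):

            xas_lock_irq(&xas);
            /* the freeze itself validates the refcount, so no separate
             * file-backed refcount check is needed beforehand */
            if (!folio_ref_freeze(folio, expected_count)) {
                    xas_unlock_irq(&xas);
                    return -EAGAIN;
            }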

    Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:06 -04:00
Aristeu Rozanski a1619add04 mm/migrate: Add folio_migrate_copy()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 715cbfd6c5c595bc8b7a6f9ad1fe9fec0122bb20
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri May 7 15:05:06 2021 -0400

    mm/migrate: Add folio_migrate_copy()

    This is the folio equivalent of migrate_page_copy(), which is retained
    as a wrapper for filesystems which are not yet converted to folios.
    Also convert copy_huge_page() to folio_copy().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:29 -04:00
Aristeu Rozanski fe68c2c86a mm/migrate: Add folio_migrate_flags()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 19138349ed59b90ce58aca319b873eca2e04ad43
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri May 7 15:26:29 2021 -0400

    mm/migrate: Add folio_migrate_flags()

    Turn migrate_page_states() into a wrapper around folio_migrate_flags().
    Also convert two functions only called from folio_migrate_flags() to
    be folio-based.  ksm_migrate_page() becomes folio_migrate_ksm() and
    copy_page_owner() becomes folio_copy_owner().  folio_migrate_flags()
    alone shrinks by two thirds -- 1967 bytes down to 642 bytes.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:29 -04:00
Aristeu Rozanski bbd102df36 mm/migrate: Add folio_migrate_mapping()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 3417013e0d183be9b42d794082eec0ec1c5b5f15
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri May 7 07:28:40 2021 -0400

    mm/migrate: Add folio_migrate_mapping()

    Reimplement migrate_page_move_mapping() as a wrapper around
    folio_migrate_mapping().  Saves 193 bytes of kernel text.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:29 -04:00
Aristeu Rozanski 51f787b88f mm/memcg: Convert mem_cgroup_migrate() to take folios
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit d21bba2b7d0ae19dd1279e10aee61c37a17aba74
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Thu May 6 18:14:59 2021 -0400

    mm/memcg: Convert mem_cgroup_migrate() to take folios

    Convert all callers of mem_cgroup_migrate() to call page_folio() first.
    They all look like they're using head pages already, but this proves it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Mike Rapoport <rppt@linux.ibm.com>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Aristeu Rozanski bddf5b2fad mm/memcg: Convert mem_cgroup_charge() to take a folio
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit 8f425e4ed0eb3ef0b2d85a9efccf947ca6aa9b1c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jun 25 09:27:04 2021 -0400

    mm/memcg: Convert mem_cgroup_charge() to take a folio

    Convert all callers of mem_cgroup_charge() to call page_folio() on the
    page they're currently passing in.  Many of them will be converted to
    use folios themselves soon.
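
    A one-line usage sketch of the conversion at a typical call site:

            /* callers now look up the folio explicitly: */
            ret = mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL);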

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Howells <dhowells@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:28 -04:00
Rafael Aquini 31ba6b59a4 mm/migrate: fix CPUHP state to update node demotion order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit a6a0251c6fce496744121b4e08c899f45270dbcc
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Oct 18 15:15:35 2021 -0700

    mm/migrate: fix CPUHP state to update node demotion order

    The node demotion order needs to be updated during CPU hotplug, because
    whether a NUMA node has CPUs may influence the demotion order.  The
    update function should be called during CPU online/offline after the
    node_states[N_CPU] has been updated.  That is done in
    CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
    CPU offline.  But in commit 884a6e5d1f93 ("mm/migrate: update node
    demotion order on hotplug events"), the function to update node demotion
    order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
    doesn't satisfy the order requirement.

    For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
    and P2, P3 in S1), the demotion order is

     - S0 -> NUMA_NO_NODE
     - S1 -> NUMA_NO_NODE

    After P2 and P3 are offlined, because S1 has no CPU now, the demotion
    order should have been changed to

     - S0 -> S1
     - S1 -> NO_NODE

    but it isn't changed, because the order updating callback for CPU
    hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
    the demotion order is changed to the expected order as above.

    So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
    CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
    CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
    update function on them.
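
    A hedged sketch of the registration described above (the state names
    follow the commit; the callback and state-string names are taken
    from the mm/migrate.c code of that era and may differ slightly):

            /* runs after CPUHP_MM_VMSTAT_DEAD has updated
             * node_states[N_CPU] on CPU offline: */
            cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD,
                                      "mm/demotion:offline",
                                      NULL, migration_offline_cpu);

            /* runs after CPUHP_AP_ONLINE_DYN on CPU online: */
            cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE,
                              "mm/demotion:online",
                              migration_online_cpu, NULL);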

    Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
    Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:03 -05:00
Rafael Aquini 0b0bc854f7 mm/migrate: add CPU hotplug to demotion #ifdef
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 76af6a054da4055305ddb28c5eb151b9ee4f74f9
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Mon Oct 18 15:15:32 2021 -0700

    mm/migrate: add CPU hotplug to demotion #ifdef

    Once upon a time, the node demotion updates were driven solely by memory
    hotplug events.  But now, there are handlers for both CPU and memory
    hotplug.

    However, the #ifdef around the code checks only memory hotplug.  A
    system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
    hotplug events.

    Update the #ifdef around the common code.  Add memory and CPU-specific
    #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
    function warnings when their Kconfig option is off.
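
    A hedged outline of the resulting #ifdef structure (condensed):

            #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_HOTPLUG_CPU)
            /* common demotion-order update code */

            #ifdef CONFIG_MEMORY_HOTPLUG
            /* memory hotplug notifier and its handler */
            #endif

            #ifdef CONFIG_HOTPLUG_CPU
            /* CPU hotplug callbacks */
            #endif

            #endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_HOTPLUG_CPU */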

    [arnd@arndb.de: rework hotplug_memory_notifier() stub]
      Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org

    Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
    Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:02 -05:00
Rafael Aquini fe6c0243f4 mm/migrate: optimize hotplug-time demotion order updates
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 295be91f7ef0027fca2f2e4788e99731aa931834
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Mon Oct 18 15:15:29 2021 -0700

    mm/migrate: optimize hotplug-time demotion order updates

    Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.

    This contains two fixes for the "automatic demotion" code which was
    merged into 5.15:

     * Fix a memory hotplug performance regression by suppressing any
       real action on irrelevant hotplug events.

     * Ensure CPU hotplug handler is registered when memory hotplug
       is disabled.

    This patch (of 2):

    == tl;dr ==

    Automatic demotion opted for a simple, lazy approach to handling hotplug
    events.  This noticeably slows down memory hotplug[1].  Optimize away
    updates to the demotion order when memory hotplug events should have no
    effect.

    This has no effect on CPU hotplug.  There is no known problem on the CPU
    side and any work there will be in a separate series.

    == Background ==

    Automatic demotion is a memory migration strategy to ensure that new
    allocations have room in faster memory tiers on tiered memory systems.
    The kernel maintains an array (node_demotion[]) to drive these
    migrations.

    The node_demotion[] path is calculated by starting at nodes with CPUs
    and then "walking" to nodes with memory.  Only hotplug events which
    online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
    actually affect the migration order.

    == Problem ==

    However, the current code is lazy.  It completely regenerates the
    migration order on *any* CPU or memory hotplug event.  The logic was
    that these events are extremely rare and that the overhead from
    indiscriminate order regeneration is minimal.

    Part of the update logic involves a synchronize_rcu(), which is a pretty
    big hammer.  Its overhead was large enough to be detected by some 0day
    tests that watch memory hotplug performance[1].

    == Solution ==

    Add a new helper (node_demotion_topo_changed()) which can differentiate
    between superfluous and impactful hotplug events.  Skip the expensive
    update operation for superfluous events.
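
    A hedged sketch of the filtering idea for the memory-hotplug side;
    the helper name is taken from the text above and the mainline check
    may be implemented differently:

            static bool node_demotion_topo_changed(struct memory_notify *arg)
            {
                    /* only events that bring a whole node online/offline
                     * can change node_demotion[] */
                    return arg->status_change_nid >= 0;
            }

            /* in the memory-hotplug notifier:
             *
             *      if (!node_demotion_topo_changed(arg))
             *              return notifier_from_errno(0);
             *
             * ...and only then take the synchronize_rcu()-heavy
             * regeneration path. */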

    == Aside: Locking ==

    It took me a few moments to declare the locking to be safe enough for
    node_demotion_topo_changed() to work.  It all hinges on the memory
    hotplug lock:

    During memory hotplug events, 'mem_hotplug_lock' is held for write.
    This ensures that two memory hotplug events can not be called
    simultaneously.

    CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
    mutual exclusion between CPU hotplug events.  In addition, the demotion
    code acquires and holds the mem_hotplug_lock for read during its CPU
    hotplug handlers.  This provides mutual exclusion between the demotion
    memory hotplug callbacks and the CPU hotplug callbacks.

    This effectively allows the migration target generation code to be
    treated as if it were single-threaded.

    1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/

    Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
    Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
    Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:44:02 -05:00
Rafael Aquini 1e9ca1d563 compat: remove some compat entry points
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 59ab844eed9c6b01d32dcb27b57accc23771b324
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed Sep 8 15:18:25 2021 -0700

    compat: remove some compat entry points

    These are all handled correctly when calling the native system call entry
    point, so remove the special cases.

    Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:38 -05:00
Rafael Aquini 8677fa9398 mm: simplify compat_sys_move_pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5b1b561ba73c8ab9c98e5dfd14dc7ee47efb6530
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed Sep 8 15:18:17 2021 -0700

    mm: simplify compat_sys_move_pages

    The compat move_pages() implementation uses compat_alloc_user_space() for
    converting the pointer array.  Moving the compat handling into the
    function itself is a bit simpler and lets us avoid the
    compat_alloc_user_space() call.
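
    A hedged sketch of the in-function compat conversion; the helper
    name and shape approximate the upstream change:

            static int get_compat_pages_array(const void __user *chunk_pages[],
                                              const void __user * __user *pages,
                                              unsigned long chunk_nr)
            {
                    compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
                    compat_uptr_t p;
                    int i;

                    /* widen the 32-bit user pointers directly, replacing
                     * the old compat_alloc_user_space() round trip */
                    for (i = 0; i < chunk_nr; i++) {
                            if (get_user(p, pages32 + i))
                                    return -EFAULT;
                            chunk_pages[i] = compat_ptr(p);
                    }
                    return 0;
            }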

    Link: https://lkml.kernel.org/r/20210727144859.4150043-4-arnd@kernel.org
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:36 -05:00
Rafael Aquini b0d1178768 mm: migrate: change to use bool type for 'page_was_mapped'
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 213ecb3157514486a9ae6848a298b91a79cc2e2a
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Sep 8 15:18:06 2021 -0700

    mm: migrate: change to use bool type for 'page_was_mapped'

    Change to use bool type for 'page_was_mapped' variable making it more
    readable.

    Link: https://lkml.kernel.org/r/ce1279df18d2c163998c403e0b5ec6d3f6f90f7a.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:34 -05:00
Rafael Aquini 3af7de30cd mm: migrate: fix the incorrect function name in comments
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 68a9843f14b6b0d1ce023721814403253d8e9153
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Sep 8 15:18:03 2021 -0700

    mm: migrate: fix the incorrect function name in comments

    Since commit a98a2f0c8c ("mm/rmap: split migration into its own
    function"), migration pte establishment has been split into a separate
    try_to_migrate() function, so update the related comments.

    Link: https://lkml.kernel.org/r/5b824bad6183259c916ae6cf42f81d14c6118b06.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:33 -05:00
Rafael Aquini 990de964c6 mm: migrate: introduce a local variable to get the number of pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 2b9b624f5aef6af608edf541fed973948e27004c
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Wed Sep 8 15:18:01 2021 -0700

    mm: migrate: introduce a local variable to get the number of pages

    Use thp_nr_pages() instead of compound_nr() to get the number of pages for
    THP page, meanwhile introducing a local variable 'nr_pages' to avoid
    getting the number of pages repeatedly.

    Link: https://lkml.kernel.org/r/a8e331ac04392ee230c79186330fb05e86a2aa77.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:32 -05:00
Rafael Aquini 7e946c3efe mm/migrate: correct kernel-doc notation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit c9bd7d183673b5136e56210003e1d94338d47c45
Author: Randy Dunlap <rdunlap@infradead.org>
Date:   Thu Sep 2 15:00:36 2021 -0700

    mm/migrate: correct kernel-doc notation

    Use the expected "Return:" format to prevent a kernel-doc warning.

    mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node'

    Link: https://lkml.kernel.org/r/20210808203151.10632-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:27 -05:00
Rafael Aquini 2e0da4572f mm/migrate: enable returning precise migrate_pages() success count
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5ac95884a784e822b8cbe3d4bd6e9f96b3b71e3f
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:13 2021 -0700

    mm/migrate: enable returning precise migrate_pages() success count

    Under normal circumstances, migrate_pages() returns the number of pages
    migrated.  In error conditions, it returns an error code.  When returning
    an error code, there is no way to know how many pages were migrated or not
    migrated.

    Make migrate_pages() return how many pages are demoted successfully for
    all cases, including when encountering errors.  Page reclaim behavior will
    depend on this in subsequent patches.
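
    A hedged usage sketch of the extended interface, roughly as the
    reclaim demotion path uses it (alloc_demote_page and target_nid are
    that caller's names):

            unsigned int nr_succeeded = 0;
            int err;

            err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
                                target_nid, MIGRATE_ASYNC, MR_DEMOTION,
                                &nr_succeeded);

            /* even when err < 0, nr_succeeded reports how many pages
             * were actually demoted */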

    Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:08 -05:00
Rafael Aquini 76c8ee391f mm/migrate: update node demotion order on hotplug events
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 884a6e5d1f93b5032e5d6dd2a183f8b3f008416b
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Thu Sep 2 14:59:09 2021 -0700

    mm/migrate: update node demotion order on hotplug events

    Reclaim-based migration is attempting to optimize data placement in memory
    based on the system topology.  If the system changes, so must the
    migration ordering.

    The implementation is conceptually simple and entirely unoptimized.  On
    any memory or CPU hotplug events, assume that a node was added or removed
    and recalculate all migration targets.  This ensures that the
    node_demotion[] array is always ready to be used in case the new reclaim
    mode is enabled.

    This recalculation is far from optimal, most glaringly that it does not
    even attempt to figure out whether the hotplug event would have any *actual*
    effect on the demotion order.  But, given the expected paucity of hotplug
    events, this should be fine.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-2-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-3-ying.huang@intel.com
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:07 -05:00
Rafael Aquini 489bee842d mm/numa: automatically generate node migration order
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 79c28a41672278283fa72e03d0bf80e6644d4ac4
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Thu Sep 2 14:59:06 2021 -0700

    mm/numa: automatically generate node migration order

    Patch series "Migrate Pages in lieu of discard", v11.

    We're starting to see systems with more and more kinds of memory such as
    Intel's implementation of persistent memory.

    Let's say you have a system with some DRAM and some persistent memory.
    Today, once DRAM fills up, reclaim will start and some of the DRAM
    contents will be thrown out.  Allocations will, at some point, start
    falling over to the slower persistent memory.

    That has two nasty properties.  First, the newer allocations can end up in
    the slower persistent memory.  Second, reclaimed data in DRAM are just
    discarded even if there are gobs of space in persistent memory that could
    be used.

    This patchset implements a solution to these problems.  At the end of the
    reclaim process in shrink_page_list() just before the last page refcount
    is dropped, the page is migrated to persistent memory instead of being
    dropped.

    While I've talked about a DRAM/PMEM pairing, this approach would function
    in any environment where memory tiers exist.

    This is not perfect.  It "strands" pages in slower memory and never brings
    them back to fast DRAM.  Huang Ying has follow-on work which repurposes
    NUMA balancing to promote hot pages back to DRAM.

    This is also all based on an upstream mechanism that allows persistent
    memory to be onlined and used as if it were volatile:

            http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

    With that, the DRAM and PMEM in each socket will be represented as 2
    separate NUMA nodes, with the CPUs sitting in the DRAM node.  So the
    general inter-NUMA demotion mechanism introduced in the patchset can
    migrate the cold DRAM pages to the PMEM node.

    We have tested the patchset with the postgresql and pgbench.  On a
    2-socket server machine with DRAM and PMEM, the kernel with the patchset
    can improve the score of pgbench up to 22.1% compared with that of the
    DRAM only + disk case.  This comes from the reduced disk read throughput
    (which reduces up to 70.8%).

    == Open Issues ==

     * Memory policies and cpusets that, for instance, restrict allocations
       to DRAM can be demoted to PMEM whenever they opt in to this
       new mechanism.  A cgroup-level API to opt-in or opt-out of
       these migrations will likely be required as a follow-on.
     * Could be more aggressive about where anon LRU scanning occurs
       since it no longer necessarily involves I/O.  get_scan_count()
       for instance says: "If we have no swap space, do not bother
       scanning anon pages"

    This patch (of 9):

    Prepare for the kernel to auto-migrate pages to other memory nodes with a
    node migration table.  This allows creating single migration target for
    each NUMA node to enable the kernel to do NUMA page migrations instead of
    simply discarding colder pages.  A node with no target is a "terminal
    node", so reclaim acts normally there.  The migration target does not
    fundamentally _need_ to be a single node, but this implementation starts
    there to limit complexity.

    When memory fills up on a node, memory contents can be automatically
    migrated to another node.  The biggest problems are knowing when to
    migrate and to where the migration should be targeted.

    The most straightforward way to generate the "to where" list would be to
    follow the page allocator fallback lists.  Those lists already tell us if
    memory is full where to look next.  It would also be logical to move
    memory in that order.

    But, the allocator fallback lists have a fatal flaw: most nodes appear in
    all the lists.  This would potentially lead to migration cycles (A->B,
    B->A, A->B, ...).

    Instead of using the allocator fallback lists directly, keep a separate
    node migration ordering.  But, reuse the same data used to generate page
    allocator fallback in the first place: find_next_best_node().

    This means that the firmware data used to populate node distances
    essentially dictates the ordering for now.  It should also be
    architecture-neutral since all NUMA architectures have a working
    find_next_best_node().

    RCU is used to allow lock-less reads of node_demotion[] and to prevent
    demotion cycles from being observed.  If multiple reads of node_demotion[] are
    performed, a single rcu_read_lock() must be held over all reads to ensure
    no cycles are observed.  Details are as follows.

    === What does RCU provide? ===

    Imagine a simple loop which walks down the demotion path looking
    for the last node:

            terminal_node = start_node;
            while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                    terminal_node = node_demotion[terminal_node];
            }

    The initial values are:

            node_demotion[0] = 1;
            node_demotion[1] = NUMA_NO_NODE;

    and are updated to:

            node_demotion[0] = NUMA_NO_NODE;
            node_demotion[1] = 0;

    What guarantees that the cycle is not observed:

            node_demotion[0] = 1;
            node_demotion[1] = 0;

    and would loop forever?

    With RCU, a rcu_read_lock/unlock() can be placed around the loop.  Since
    the write side does a synchronize_rcu(), the loop that observed the old
    contents is known to be complete before the synchronize_rcu() has
    completed.

    RCU, combined with disable_all_migrate_targets(), ensures that the old
    migration state is not visible by the time __set_migration_target_nodes()
    is called.

    === What does READ_ONCE() provide? ===

    READ_ONCE() forbids the compiler from merging or reordering successive
    reads of node_demotion[].  This ensures that any updates are *eventually*
    observed.

    Consider the above loop again.  The compiler could theoretically read the
    entirety of node_demotion[] into local storage (registers) and never go
    back to memory, and *permanently* observe bad values for node_demotion[].
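
    A combined sketch of the reader side under both rules, with
    rcu_read_lock() around the walk and READ_ONCE() on each load:

            int node = start_node, next;

            rcu_read_lock();
            while ((next = READ_ONCE(node_demotion[node])) != NUMA_NO_NODE)
                    node = next;
            rcu_read_unlock();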

    Note: RCU does not provide any universal compiler-ordering
    guarantees:

            https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

    This code is unused for now.  It will be called later in the
    series.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:42:06 -05:00
Aneesh Kumar K.V b5916c0254 mm/migrate: fix NR_ISOLATED corruption on 64-bit
Similar to commit 2da9f6305f ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages.  With an
unsigned int type, negating nr_pages wraps around to a huge unsigned
value, which then converts to a large positive signed long.

Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30 10:14:39 -07:00
Matthew Wilcox (Oracle) 79789db03f mm: Make copy_huge_page() always available
Rewrite copy_huge_page() and move it into mm/util.c so it's always
available.  Fixes an exposure of uninitialised memory on configurations
with HUGETLB and UFFD enabled and MIGRATION disabled.

Fixes: 8cc5fcbb5b ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-12 11:30:56 -07:00
Alistair Popple 6b49bf6ddb mm: rename migrate_pgmap_owner
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer.  This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events.  Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.

Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple a98a2f0c8c mm/rmap: split migration into its own function
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.

Several simplifications can also be made in try_to_migrate_one() based on
the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page.  This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
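
As a rough sketch of that dispatch (the TTU flag set shown here is an
assumption; the real unmap_page() passes a few more flags):

  static void unmap_page(struct page *page)
  {
          enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;

          /*
           * Anon pages need migration entries to be preserved across the
           * split; file pages can simply be unmapped and faulted back in.
           */
          if (PageAnon(page))
                  try_to_migrate(page, ttu_flags);
          else
                  try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
  }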

Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple 4dd845b5a3 mm/swapops: rework swap entry manipulation code
Both migration and device private pages use special swap entries that are
manipulated by a range of inline functions.  The arguments to these are
somewhat inconsistent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
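
For illustration, a sketch of creating and decoding a migration entry with
the reworked helpers (the helper names follow this series; the pteval and
page variables are assumed context):

  swp_entry_t entry;

  if (pte_write(pteval))
          entry = make_writable_migration_entry(page_to_pfn(page));
  else
          entry = make_readable_migration_entry(page_to_pfn(page));

  /* later: the type check and pfn lookup need no flag arguments */
  if (is_migration_entry(entry))
          page = pfn_swap_entry_to_page(entry);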

Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple af5cdaf822 mm: remove special swap entry functions
Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory.  To support atomic operations on
a shared virtual memory page, such a device needs access to that page that
is exclusive of the CPU.  This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers, which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes.  For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn.  Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results in
shorter code that is easier to understand.

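A sketch of the common helper, based on the description above (simplified
relative to the upstream definition):

  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  {
          struct page *p = pfn_to_page(swp_offset(entry));

          /*
           * Any use of migration entries may only occur while the
           * corresponding page is locked.
           */
          BUG_ON(is_migration_entry(entry) && !PageLocked(p));

          return p;
  }
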
Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Yang Shi 662aeea753 mm: migrate: check mapcount for THP instead of refcount
The generic migration path already checks the refcount, so there is no need
to check it here.  But the old code actually prevented migrating shared THP
(mapped by multiple processes), so bail out early if the mapcount is > 1 to
keep that behavior.

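A sketch of the early bail-out described above (its exact placement inside
migrate_misplaced_page() is an assumption about the surrounding context):

  /* Do not migrate THP mapped by multiple processes */
  if (PageTransHuge(page) && total_mapcount(page) > 1)
          goto out;
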
Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi b0b515bfb3 mm: migrate: don't split THP for misplaced NUMA page
The old behavior didn't split the THP if migration failed due to lack of
memory on the target node, but the generic THP migration path does split it,
so keep the old behavior for misplaced NUMA page migration.

Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi c5fc5c3ae0 mm: migrate: account THP NUMA migration counters correctly
Now that both base page and THP NUMA migration are done via
migrate_misplaced_page(), keep the counters correct for THP.

Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi c5b5a3dd2c mm: thp: refactor NUMA fault handling
When the THP NUMA fault support was added THP migration was not supported
yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
THP migration has been supported since v4.14, so it doesn't make much sense
to keep a separate THP migration implementation rather than using the
generic migration code.

This patch reworks the NUMA fault handling to use generic migration
implementation to migrate misplaced page.  There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails just restore the old normal PMD.

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but the THP code has since been reworked a lot and the
anon_vma lock no longer seems to be required to avoid that race.

The page refcount elevation while holding the ptl should prevent the THP
from being split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
and remove all the dead and duplicate code.

[dan.carpenter@oracle.com: fix a double unlock bug]
  Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda

Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Muchun Song 6acfb5ba15 mm: migrate: fix missing update page_private to hugetlb_page_subpool
Commit d6995da311 ("hugetlb: use page.private for hugetlb specific
page flags") converted page.private to hold hugetlb specific page flags,
so we should use hugetlb_page_subpool() to get the subpool pointer instead
of page_private().

This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
is now used for hugetlb page specific flags.  At migration time, the only
flag which could be set is HPageVmemmapOptimized.  This flag will only be
set if the new vmemmap reduction feature is enabled.  In addition,
!page_mapping() implies an anonymous mapping.  So, this will prevent
migration of hugetlb pages in anonymous mappings if the vmemmap reduction
feature is enabled.

In addition, that if statement checked for the rare race condition of a
page being migrated while in the process of being freed.  Since that check
is now wrong, we could leak hugetlb subpool usage counts.

The commit forgot to update it in the page migration routine.  So fix it.
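
A sketch of the kind of change involved (the accessor name comes from the
log; the exact condition in mm/migrate.c is abbreviated here):

  /* before: page_private() now holds hugetlb page flags, not the subpool */
  if (page_private(hpage) && !page_mapping(hpage))
          rc = -EBUSY;

  /* after: read the subpool through the dedicated accessor */
  if (hugetlb_page_subpool(hpage) && !page_mapping(hpage))
          rc = -EBUSY;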

[songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
  Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com

Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
Fixes: d6995da311 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Mina Almasry 8cc5fcbb5b mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller after
allocating a page that consumed a reservation.  In this case there may be
a fault that double consumes the reservation.  To handle this, we free the
allocated page, fix the reservations, and allocate a temporary hugetlb
page and return that to the caller.  When the caller does the copy outside
of the lock, we again check the cache, and allocate a page consuming the
reservation, and copy over the contents.

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
        2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the
test runs number of free/resv hugepages is correct.

[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
  Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
[almasrymina@google.com: fix allocation error check and copy func name]
  Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com

Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Christophe Leroy 79c1c594f4 mm/hugetlb: change parameters of arch_make_huge_pte()
Patch series "Implement huge VMAP and VMALLOC on powerpc 8xx", v2.

This series implements huge VMAP and VMALLOC on powerpc 8xx.

Powerpc 8xx has 4 page sizes:
- 4k
- 16k
- 512k
- 8M

At present, vmalloc and vmap only support huge pages which are
leaf at the PMD level.

Here the PMD level is 4M; it doesn't correspond to any supported
page size.

For now, implement use of 16k and 512k pages which is done
at PTE level.

Support of 8M pages will be implemented later, it requires use of
hugepd tables.

To allow this, the architecture provides two functions:
- arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
page size to use. A stub returning PAGE_SIZE is provided when the
architecture doesn't provide this function.
- arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
what page shift to use for a given area size. A stub returning
PAGE_SHIFT is provided when the architecture doesn't provide this
function.

This patch (of 5):

At present, arch_make_huge_pte() has the following prototype:

  pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
			   struct page *page, int writable);

vma is used to get the pages shift or size.
vma is also used on Sparc to get vm_flags.
page is not used.
writable is not used.

In order to use this function without a vma, replace vma by shift and
flags.  Also remove the unused parameters.

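After the change the prototype becomes (as implied by the description
above):

  pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);

and a generic fallback for architectures that do not override it can be
sketched as:

  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                         vm_flags_t flags)
  {
          return pte_mkhuge(entry);
  }
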
Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Muchun Song ad2fa3717b mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

 2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3). virtio-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.

Vmemmap pages are allocated from the page freeing context.  In order for
those allocations not to be disruptive (e.g. trigger the OOM killer),
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because a
non-sleeping allocation would be too fragile and it could fail too easily
under memory pressure.  GFP_ATOMIC or other modes that access memory
reserves are not used because we want to prevent consuming reserves under
heavy hugetlb freeing.

[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:25 -07:00
Liam Howlett 059b8b4875 mm/migrate: use vma_lookup() in do_pages_stat_array()
Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
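
For illustration, the lookup pattern described above (a sketch rather than
the exact do_pages_stat_array() hunk):

  err = -EFAULT;
  vma = vma_lookup(mm, addr);
  if (!vma)
          goto set_status;        /* address is not mapped by any VMA */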

Link: https://lkml.kernel.org/r/20210521174745.2219620-20-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Xu Yu ffc90cbb29 mm, thp: use head page in __migration_entry_wait()
We noticed that a hung task can happen in a corner-case but practical
scenario when CONFIG_PREEMPT_NONE is enabled, as follows.

Process 0                       Process 1                     Process 2..Inf
split_huge_page_to_list
    unmap_page
        split_huge_pmd_address
                                __migration_entry_wait(head)
                                                              __migration_entry_wait(tail)
    remap_page (roll back)
        remove_migration_ptes
            rmap_walk_anon
                cond_resched

Here __migration_entry_wait(tail) occurs in kernel space, e.g.,
copy_to_user in fstat, which will immediately fault again without
rescheduling, and thus fully occupy the CPU.

When there are too many processes performing __migration_entry_wait on the
tail page, remap_page will never get to run after cond_resched.

Make __migration_entry_wait operate on the compound head page, so that it
waits for remap_page to complete, whether the THP is split successfully or
rolled back.

Note that put_and_wait_on_page_locked helps to drop the page reference
acquired with get_page_unless_zero, as soon as the page is on the wait
queue, before actually waiting.  So splitting the THP is only prevented
for a brief interval.
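
A sketch of the core of the fix (assuming the surrounding
__migration_entry_wait() context; helper names vary between kernel
versions):

  page = migration_entry_to_page(entry);
  page = compound_head(page);     /* wait on the head page, so tail waiters
                                     also block until remap_page() is done */
  if (!get_page_unless_zero(page))
          goto out;
  pte_unmap_unlock(ptep, ptl);
  put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);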

Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
Fixes: ba98828088 ("thp: add option to setup migration entries during PMD split")
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Liam Mark 7bc1aec5e2 mm: cma: add trace events for CMA alloc perf testing
Add cma and migrate trace events to enable CMA allocation performance to
be measured via ftrace.

[georgi.djakov@linaro.org: add the CMA instance name to the cma_alloc_start trace event]
  Link: https://lkml.kernel.org/r/20210326155414.25006-1-georgi.djakov@linaro.org

Link: https://lkml.kernel.org/r/20210324160740.15901-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin 7ee820ee72 Revert "mm: migrate: skip shared exec THP for NUMA balancing"
This reverts commit c77c5cbafe.

Since commit c77c5cbafe ("mm: migrate: skip shared exec THP for NUMA
balancing"), NUMA balancing skips shared exec transhuge pages.
But this enhancement is not suitable for transhuge pages, because
page_mapcount() is required to be 1 here since no migration pte dance is
done.  On the other hand, a shared exec transhuge page will leave
migrate_misplaced_page() with the pte entry untouched and the page locked.
Thus the page fault for NUMA will be triggered again and a deadlock occurs
when we start waiting for the page lock held by ourselves.

Yang Shi said:

 "Thanks for catching this. By relooking the code I think the other
  important reason for removing this is
  migrate_misplaced_transhuge_page() actually can't see shared exec
  file THP at all since page_lock_anon_vma_read() is called before
  and if page is not anonymous page it will just restore the PMD
  without migrating anything.
  The pages for private mapped file vma may be anonymous pages due to
  COW but they can't be THP so it won't trigger THP numa fault at all. I
  think this is why no bug was reported. I overlooked this in the first
  place."

Link: https://lkml.kernel.org/r/20210325131524.48181-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin 843e1be108 mm/migrate.c: use helper migrate_vma_collect_skip() in migrate_vma_collect_hole()
It is better to use the helper function migrate_vma_collect_skip() to skip
the unexpected case, and it also helps remove some duplicated code.
Move migrate_vma_collect_skip() above migrate_vma_collect_hole() to avoid a
compiler warning.

Link: https://lkml.kernel.org/r/20210325131524.48181-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin 34f5e9b9d1 mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
If the zone device page does not belong to un-addressable device memory,
the variable entry will be uninitialized and ultimately lead to an
indeterminate pte entry.  Fix this unexpected case and warn about it.

Link: https://lkml.kernel.org/r/20210325131524.48181-4-linmiaohe@huawei.com
Fixes: df6ad69838 ("mm/device-public-memory: device memory cache coherent with CPU")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin a04840c684 mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case
It's guaranteed that in the 'else' case of the rc == MIGRATEPAGE_SUCCESS
check, rc does not equal MIGRATEPAGE_SUCCESS.  Remove this unnecessary
check.

Link: https://lkml.kernel.org/r/20210325131524.48181-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin 606a6f71a2 mm/migrate.c: make putback_movable_page() static
Patch series "Cleanup and fixup for mm/migrate.c", v3.

This series contains cleanups to remove an unnecessary VM_BUG_ON_PAGE and an
rc != MIGRATEPAGE_SUCCESS check.  It also uses a helper function to remove
some duplicated code.  What's more, this fixes a potential deadlock in the
NUMA balancing shared exec THP case, and so on.  More details can be found
in the respective changelogs.

This patch (of 5):

putback_movable_page() is only called by putback_movable_pages(), and we
know the page is locked and both PageMovable() and PageIsolated() are
checked right before calling putback_movable_page().  So make it static and
remove all 3 VM_BUG_ON_PAGE() checks.

Link: https://lkml.kernel.org/r/20210325131524.48181-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210325131524.48181-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Minchan Kim 361a2a229f mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]
Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable].  There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

Note: migrate_prep_local in compaction.c is changed into lru_add_drain to
keep the old behavior and avoid the CPU scheduling cost of involving many
other CPUs.

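The resulting calling convention is roughly (a sketch; the migrate_pages()
argument list and the mtc allocation control are assumed context):

  lru_cache_disable();                    /* was: migrate_prep() */
  ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
  if (ret)
          putback_movable_pages(&pagelist);
  lru_cache_enable();                     /* was: migrate_finish() */
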
Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Minchan Kim d479960e44 mm: disable LRU pagevec during the migration temporarily
An LRU pagevec holds a refcount on its pages until the pagevec is drained.
It can prevent migration since the refcount of the page is greater than
what the migration logic expects.  To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.

However, that's not enough, because pages coming into the pagevec after the
draining call could still stay in the pagevec and keep preventing page
migration.  Since some callers of migrate_pages have retry logic with LRU
draining, the page would migrate on the next trial, but it is still fragile
in that it doesn't close the fundamental race between upcoming LRU pages
entering the pagevec and migration, so the migration failure could cause a
contiguous memory allocation failure in the end.

To close the race, this patch disables lru caches(i.e, pagevec) during
ongoing migration until migrate is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (i.e. a fallback to sync migration)
with the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
			..
			..

  if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
         /* log the pfn and error code for each late CMA migration failure */
         printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
         dump_page(page, "fail to migrate");
  }

The test repeatedly launched Android apps with CMA allocation running in
the background every five seconds.  The total CMA allocation count was
about 500 during the testing.  With this patch, the dump_page count was
reduced from 400 to 30.

The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure.  This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once.  This is also in
line with the pcp allocator caches, which are disabled for the offlining as
well.

Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Matthew Wilcox (Oracle) 84172f4bb7 mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask
There are only two callers of __alloc_pages() so prune the thicket of
alloc_page variants by combining the two functions together.  Current
callers of __alloc_pages() simply add an extra 'NULL' parameter and
current callers of __alloc_pages_nodemask() call __alloc_pages() instead.

Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:42 -07:00
Shakeel Butt b603894248 mm: memcg: add swapcache stat for memcg v2
This patch adds swapcache stat for the cgroup v2.  The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup.  The main motivation behind exposing the
swapcache stat is for enabling users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.

Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
workload but without control on the exact proportion of memory and swap.
Cgroup v2 provides separate limits for memory and swap which enables more
control on the exact usage of memory and swap individually for the
workload.

With some small subtleties, v1's memsw limit can be replaced by the
sum of v2's memory and swap limits.  However the alternative for memsw
usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
stat enables that alternative.  Adding the memory usage and swap usage and
subtracting the swapcache will approximate the memsw usage.  This will
help in the transparent migration of the workloads depending on memsw
usage and limit to v2' memory and swap counters.

The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics.  A single usage metric is simpler
to use and reason about for them.

(2) The memsw usage metric hides the underlying system's swap setup from
the applications.  Applications with multiple instances running in a
datacenter with heterogeneous systems (some have swap and some don't) will
keep seeing a consistent view of their usage.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]

Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Matthew Wilcox (Oracle) 4805462598 mm/filemap: pass a sleep state to put_and_wait_on_page_locked
This is prep work for the next patch, but I think at least one of the
current callers would prefer a killable sleep to an uninterruptible one.

Link: https://lkml.kernel.org/r/20210122160140.223228-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Muchun Song 71a64f618b mm: migrate: do not migrate HugeTLB page whose refcount is one
All pages isolated for migration have an elevated reference count, so
seeing a reference count equal to 1 means that the last user of the page
has dropped its reference, the page has become unused, and it doesn't make
much sense to migrate it anymore.

This has been done for regular pages and this patch does the same for
hugetlb pages.  Although the likelihood of the race is rather small for
hugetlb pages, it makes sense to keep the two code paths in sync.

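A sketch of the check this adds on the hugetlb path (mirroring the
regular-page check; the exact placement in unmap_and_move_huge_page() is an
assumption):

  if (page_count(hpage) == 1) {
          /* page was freed from under us, so we are done */
          putback_active_hugepage(hpage);
          return MIGRATEPAGE_SUCCESS;
  }
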
Link: https://lkml.kernel.org/r/20210115124942.46403-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-05 11:03:47 -08:00
Shakeel Butt 5c447d274f mm: fix numa stats for thp migration
Currently the kernel is not correctly updating the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration.  Fix that.

For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, there is currently no need to
handle THP migration because the kernel still does not have write support
for file THP, but to be more future proof this patch adds the THP support
for those stats as well.

Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Shakeel Butt 8a8792f600 mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.

That was not an issue until commit 5f9a4f4a70 ("mm: memcontrol: add the
missing numa_stat interface for cgroup v2") recently exposed numa stats for
the memcg.

So fix the file_dirty per-memcg numa stat.

Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a70 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Haitao Shi 8958b24911 mm: fix some spelling mistakes in comments
Fix some spelling mistakes in comments:
	udpate ==> update
	succesful ==> successful
	exmaple ==> example
	unneccessary ==> unnecessary
	stoping ==> stopping
	uknown ==> unknown

Link: https://lkml.kernel.org/r/20201127011747.86005-1-shihaitao1@huawei.com
Signed-off-by: Haitao Shi <shihaitao1@huawei.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 22:46:19 -08:00
Stephen Zhang d85c6db4cc mm: migrate: remove unused parameter in migrate_vma_insert_page()
"dst" parameter to migrate_vma_insert_page() is not used anymore.

Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=tHLDc95kQkqU-rEnAg@mail.gmail.com
Signed-off-by: Stephen Zhang <starzhangzsd@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:46 -08:00
Yang Shi d532e2e57e mm: migrate: return -ENOSYS if THP migration is unsupported
In the current implementation unmap_and_move() would return -ENOMEM if THP
migration is unsupported, and then the THP would be split.  If the split
failed, it just exited without trying to migrate the other pages.  That
doesn't make much sense since there may be enough free memory to migrate
other pages and there may be a lot of base pages on the list.

Return -ENOSYS to be consistent with hugetlb.  And if the THP split fails,
just skip it and try the other pages on the list.

Only skip the whole list and exit when free memory is really low.

Link: https://lkml.kernel.org/r/20201113205359.556831-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Yang Shi 236c32eb10 mm: migrate: clean up migrate_prep{_local}
migrate_prep{_local} never fails, so it is pointless to have a return
value and to check it.

Link: https://lkml.kernel.org/r/20201113205359.556831-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Yang Shi c77c5cbafe mm: migrate: skip shared exec THP for NUMA balancing
NUMA balancing skips shared exec base pages.  Since
CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec
THPs, so skip such THPs for NUMA balancing as well.

And Willy's regular filesystem THP support patches could create shared
exec THPs even without that config.

In addition, page_is_file_lru() is used to tell whether the page is file
cache or not, but it filters out shmem pages.  Putting executables in shmem
to achieve a performance gain via shmem-THP sounds like a typical use case,
so it seems worth skipping migration for such a case too.

Link: https://lkml.kernel.org/r/20201113205359.556831-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Yang Shi dd4ae78a21 mm: migrate: simplify the logic for handling permanent failure
When unmap_and_move{_huge_page}() returns !-EAGAIN and
!MIGRATEPAGE_SUCCESS, the page would be put back to the LRU or the proper
list if it is a non-LRU movable page.  But the callers always call
putback_movable_pages() to put the failed pages back later on, so it seems
not very efficient to put every single page back immediately, and the code
looks convoluted.

Put the failed page on a separate list, then splice the list to migrate
list when all pages are tried.  It is the caller's responsibility to call
putback_movable_pages() to handle failures.  This also makes the code
simpler and more readable.

After the change the rules are:
    * Success: non hugetlb page will be freed, hugetlb page will be put
               back
    * -EAGAIN: stay on the from list
    * -ENOMEM: stay on the from list
    * Other errno: put on ret_pages list then splice to from list

The from list will be empty iff all pages are migrated successfully; it
was not so before.  This has no impact on existing callsites.

Link: https://lkml.kernel.org/r/20201113205359.556831-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Yang Shi d12b8951ad mm: truncate_complete_page() does not exist any more
Patch series "mm: misc migrate cleanup and improvement", v3.

This patch (of 5):

Commit 9f4e41f471 ("mm: refactor truncate_complete_page()") refactored
truncate_complete_page() and it no longer exists, so correct the comments
in vmscan and migrate to avoid confusion.

Link: https://lkml.kernel.org/r/20201113205359.556831-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20201113205359.556831-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Ralph Campbell 5e5dda81a0 mm/migrate.c: optimize migrate_vma_pages() mmu notifier
When migrating a zero page or pte_none() anonymous page to device private
memory, migrate_vma_setup() will initialize the src[] array with a NULL
PFN.  This lets the device driver allocate device private memory and clear
it instead of DMAing a page of zeros over the device bus.

Since the source page didn't exist at the time, no struct page was locked
nor a migration PTE inserted into the CPU page tables.  The actual PTE
insertion happens in migrate_vma_pages() when it tries to insert the
device private struct page PTE into the CPU page tables.
migrate_vma_pages() has to call the mmu notifiers again since another
device could fault on the same page before the page table locks are
acquired.

Allow device drivers to optimize the invalidation similar to
migrate_vma_setup() by calling mmu_notifier_range_init() which sets struct
mmu_notifier_range event type to MMU_NOTIFY_MIGRATE and the
migrate_pgmap_owner field.

Link: https://lkml.kernel.org/r/20201021191335.10916-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Long Li ab9dd4f8a1 mm/migrate.c: fix comment spelling
The word in the comment is misspelled, it should be "include".

Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong
Signed-off-by: Long Li <lonuxli.64@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Shakeel Butt 013339df11 mm/rmap: always do TTU_IGNORE_ACCESS
Since commit 369ea8242c ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check.  More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the
absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table
access check before trying to unmap the page.  So, at worst, reclaim will
miss accesses in a very short window if we remove the page table access
check from the unmapping code.

There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim.  In memcg reclaim, page_referenced() only accounts accesses from
processes in the same memcg as the target page, but the unmapping code
considers accesses from all processes, decreasing the effectiveness of
memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:39 -08:00
Mike Kravetz 336bf30eb7 hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]

  LTP: starting move_pages12
  BUG: unable to handle page fault for address: ffffffffffffffe0
  ...
  RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
  Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
anon pages as the anon_vma_lock should be held in this case.  Further
analysis suggested that i_mmap_rwsem was not required to be held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
keep mapping valid while dropping page lock.

This patch does the following:

 - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
   calling try_to_unmap.

 - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
   will now simply do a 'trylock' while still holding the page lock. If
   the trylock fails, it will return NULL. This could impact the
   callers:

    - migration calling code will receive -EAGAIN and retry up to the
      hard coded limit (10).

    - memory error code will treat the page as BUSY.  This will force
      killing any mapping tasks with SIGKILL instead of SIGBUS.

   Do note that this change in behavior only happens when there is a
   race. None of the standard kernel testing suites actually hit this
   race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:04 -08:00
Miaohe Lin 4dc200cee1 mm/migrate: avoid possible unnecessary process right check in kernel_move_pages()
There is no need to check whether this process has the right to modify the
specified process when they are the same.  We can also skip the security
hook call when a process is modifying its own pages.  Add a helper function
to handle these cases.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christopher Lameter <cl@linux.com>
Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Oscar Salvador 79f5f8fab4 mm,hwpoison: rework soft offline for in-use pages
This patch changes the way we set and handle in-use poisoned pages.  Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safety net and
would skip that page.

This has proved to be wrong, as we have pfn walkers out there, like
compaction, that only care about the page being in a buddy freelist.

Although those might not be the only users, having poisoned pages in the
buddy allocator seems like a bad idea, as we should only have free pages
that are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind of pages
we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

Either way, we do not call put_page() directly from those paths.  Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.

page_handle_poison sets the HWPoison flag and does the last put_page.

Down the chain, we placed a check for HWPoison pages in
free_pages_prepare() that simply skips any poisoned page, so those pages
never end up in any pcplist/freelist.

After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.
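
Put together, the flow described above looks roughly like this (a sketch
that follows the description, not the exact code):

    static void page_handle_poison(struct page *page)
    {
            SetPageHWPoison(page);
            put_page(page);         /* last put; the page enters the freeing path */
            page_ref_inc(page);     /* keep it pinned with refcount == 1 */
            num_poisoned_pages_inc();
    }

    /* and in free_pages_prepare(): */
    if (unlikely(PageHWPoison(page)) && !order)
            return false;           /* never reaches a pcplist or buddy freelist */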

If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:

  - wait until the page hits buddy's freelists
  - take it off, and flag it

The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has already been allocated and we cannot soft offline it.
But the user can always retry.

* Hugetlb pages

  - We isolate-and-migrate them

After a successful migration, we call dissolve_free_huge_page(), and we
set HWPoison on the page if that succeeds.
Hugetlb handling is slightly different, though.

While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided not to make the code overly complicated and to just fail
normally if the page was allocated in the meantime.

We can always build on top of this.

As a bonus, because of the way we now handle in-use pages, we no longer
need the put-as-isolation-migratetype dance that was guarding against
poisoned pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Ralph Campbell f1f4f3ab54 mm/migrate: remove obsolete comment about device public
Device public memory never had an in tree consumer and was removed in
commit 25b2995a35 ("mm: remove MEMORY_DEVICE_PUBLIC support").  Delete
the obsolete comment.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20200827190735.12752-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:36 -07:00
Ralph Campbell 4257889124 mm/migrate: remove cpages-- in migrate_vma_finalize()
The variable struct migrate_vma->cpages is only used in
migrate_vma_setup().  There is no need to decrement it in
migrate_vma_finalize() since it is never checked.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20200827190735.12752-1-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:35 -07:00
Linus Torvalds 3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, separating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Zi Yan 6c5c7b9f33 mm/migrate: correct thp migration stats
PageTransHuge() returns true for both thp and hugetlb, so the thp stats
were counting both thp and hugetlb migrations.  Exclude hugetlb migrations
by setting the is_thp variable correctly.

Also clean up the thp handling code while we are at it.
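
The fix essentially amounts to excluding hugetlb when computing is_thp,
along the lines of:

    /* PageTransHuge() is also true for hugetlb, so filter hugetlb out */
    is_thp = PageTransHuge(page) && !PageHuge(page);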

Fixes: 1a5bae25e3 ("mm/vmstat: add events for THP migration without split")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26 10:33:57 -07:00
Christoph Hellwig f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.
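
In effect the test flips from negative to positive, roughly as below (the
old flag names are from the subject line; the new positive flag name is
assumed here):

    /* before: two negative flags that were always set and tested together */
    bool writeback_old = !(bdi->capabilities &
                           (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY));

    /* after: a single positive capability bit */
    bool writeback_new = bdi->capabilities & BDI_CAP_WRITEBACK;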

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Hugh Dickins a333e3e73b mm: migration of hugetlbfs page skip memcg
hugetlbfs pages do not participate in memcg: so although they do find most
of migrate_page_states() useful, it would be better if they did not call
into mem_cgroup_migrate() - where Qian Cai reported that LTP's
move_pages12 triggers the warning in Alex Shi's prospective commit
"mm/memcg: warning on !memcg after readahead page charged".

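The change amounts to guarding the memcg transfer in migrate_page_states(),
roughly:

    /* hugetlbfs pages are never charged to a memcg, so skip the transfer */
    if (!PageHuge(page))
            mem_cgroup_migrate(page, newpage);
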
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19 13:13:38 -07:00
Ralph Campbell 3d321bf82c mm/migrate: preserve soft dirty in remove_migration_pte()
The code to remove a migration PTE and replace it with a device private
PTE was not copying the soft dirty bit from the migration entry.  This
could lead to page contents not being marked dirty when faulting the page
back from device private memory.
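
Concretely, when remove_migration_pte() rebuilds a device private entry, the
soft dirty bit is now carried over, roughly:

    if (pte_swp_soft_dirty(*pvmw.pte))
            pte = pte_swp_mksoft_dirty(pte);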

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05 12:14:30 -07:00
Ralph Campbell 6128763fc3 mm/migrate: remove unnecessary is_zone_device_page() check
Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".

I happened to notice this from code inspection after seeing Alistair
Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").

This patch (of 2):

The check for is_zone_device_page() and is_device_private_page() is
unnecessary since the latter is sufficient to determine if the page is a
device private page.  Simplify the code for easier reading.
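
In other words, a single test is enough; roughly (surrounding context
illustrative):

    /* is_device_private_page() already implies is_zone_device_page() */
    if (is_device_private_page(newpage))
            entry = make_device_private_entry(newpage,
                                              vma->vm_flags & VM_WRITE);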

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05 12:14:30 -07:00
Alistair Popple ad7df764b7 mm/rmap: fixup copying of soft dirty and uffd ptes
During memory migration a pte is temporarily replaced with a migration
swap pte.  Some pte bits from the existing mapping such as the soft-dirty
and uffd write-protect bits are preserved by copying these to the
temporary migration swap pte.

However these bits are not stored at the same location for swap and
non-swap ptes.  Therefore testing these bits requires using the
appropriate helper function for the given pte type.

Unfortunately, several code locations were found where the wrong helper
function was being used to test the soft_dirty and uffd_wp bits, which leads
to them getting incorrectly set or cleared during page migration.

Fix these by using the correct tests based on pte type.
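
For example, while the source pte is still present its bits are tested with
the regular helpers, but once a pte has been turned into a swap pte the
*_swp_* variants must be used; roughly:

    if (pte_present(pte)) {
            if (pte_soft_dirty(pte))
                    swp_pte = pte_swp_mksoft_dirty(swp_pte);
            if (pte_uffd_wp(pte))
                    swp_pte = pte_swp_mkuffd_wp(swp_pte);
    } else {
            if (pte_swp_soft_dirty(pte))
                    swp_pte = pte_swp_mksoft_dirty(swp_pte);
            if (pte_swp_uffd_wp(pte))
                    swp_pte = pte_swp_mkuffd_wp(swp_pte);
    }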

Fixes: a5430dda8a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05 12:14:30 -07:00
Alistair Popple ebdf8321ee mm/migrate: fixup setting UFFD_WP flag
Commit f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
introduced support for tracking the uffd wp bit during page migration.
However the non-swap PTE variant was used to set the flag for zone device
private pages which are a type of swap page.

This leads to corruption of the swap offset if the original PTE has the
uffd_wp flag set.
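
The fix is essentially to use the swap-pte variant when building the device
private (swap) pte, roughly:

    swp_pte = swp_entry_to_pte(entry);  /* device private entry is a swap pte */
    if (pte_uffd_wp(pte))
            swp_pte = pte_swp_mkuffd_wp(swp_pte);   /* was: pte_mkuffd_wp() */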

Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05 12:14:30 -07:00
Matthew Wilcox (Oracle) 6c357848b4 mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Joonsoo Kim a097631160 mm/mempolicy: use a standard migration target allocation callback
There is a well-defined migration target allocation callback.  Use it.
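
A usage sketch of that callback (values illustrative; the callback and its
control structure are described in the entry below):

    struct migration_target_control mtc = {
            .nid = NUMA_NO_NODE,            /* allocate near the source page */
            .gfp_mask = GFP_HIGHUSER_MOVABLE,
    };

    migrate_pages(&pagelist, alloc_migration_target, NULL,
                  (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);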

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:02 -07:00
Joonsoo Kim 19fc7bed25 mm/migrate: introduce a standard migration target allocation function
There are some similar functions for migration target allocation.  Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants.  This patch implements base migration target
allocation function.  In the following patches, variants will be converted
to use this function.

Changes should be mechanical but, unfortunately, there are some
differences.  First, some callers' nodemask is assigned NULL, since a NULL
nodemask is considered to mean all available nodes, that is,
&node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
the user-provided gfp_mask has it.  This is because a future caller of this
function requires this node constraint to be set.  Lastly, if the provided
nodeid is NUMA_NO_NODE, nodeid is set to the node where the migration
source lives.  This helps remove simple wrappers for setting up the nodeid.

Note that the PageHighmem() call in the previous function is changed to an
open-coded "is_highmem_idx()" check, since that is more readable.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]
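
The resulting interface looks roughly like this (a sketch based on the
description above; callers pass the control structure via the private
argument):

    struct migration_target_control {
            int nid;                /* NUMA_NO_NODE: use the source page's node */
            nodemask_t *nmask;      /* NULL: all nodes in node_states[N_MEMORY] */
            gfp_t gfp_mask;         /* hugetlb: hugetlb gfp plus caller's __GFP_THISNODE */
    };

    struct page *alloc_migration_target(struct page *page, unsigned long private);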

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:02 -07:00