Commit Graph

Bill O'Donnell fb54bc42ca xfs: force all buffers to be written during btree bulk load
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 13ae04d8d45227c2ba51e188daf9fc13d08a1b12
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Dec 15 10:03:27 2023 -0800

    xfs: force all buffers to be written during btree bulk load

    While stress-testing online repair of btrees, I noticed periodic
    assertion failures from the buffer cache about buffers with incorrect
    DELWRI_Q state.  Looking further, I observed this race between the AIL
    trying to write out a btree block and repair zapping a btree block after
    the fact:

    AIL:    Repair0:

    pin buffer X
    delwri_queue:
    set DELWRI_Q
    add to delwri list

            stale buf X:
            clear DELWRI_Q
            does not clear b_list
            free space X
            commit

    delwri_submit   # oops

    Worse yet, I discovered that running the same repair over and over in a
    tight loop can result in a second race that causes data integrity
    problems with the repair:

    AIL:    Repair0:        Repair1:

    pin buffer X
    delwri_queue:
    set DELWRI_Q
    add to delwri list

            stale buf X:
            clear DELWRI_Q
            does not clear b_list
            free space X
            commit

                            find free space X
                            get buffer
                            rewrite buffer
                            delwri_queue:
                            set DELWRI_Q
                            already on a list, do not add
                            commit

                            BAD: committed tree root before all blocks written

    delwri_submit   # too late now

    I traced this to my own misunderstanding of how the delwri lists work,
    particularly with regards to the AIL's buffer list.  If a buffer is
    logged and committed, the buffer can end up on that AIL buffer list.  If
    btree repairs are run twice in rapid succession, it's possible that the
    first repair will invalidate the buffer and free it before the next time
    the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
    buffer state without removing the buffer from its delwri list.  The
    buffer doesn't know which list it's on, so it cannot know which lock to
    take to protect the list for a removal.

    If the second repair allocates the same block, it will then recycle the
    buffer to start writing the new btree block.  Meanwhile, if the AIL
    wakes up and walks the buffer list, it will ignore the buffer because it
    can't lock it, and go back to sleep.

    When the second repair calls delwri_queue to put the buffer on the
    list of buffers to write before committing the new btree, it will set
    DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
    buffer list, it won't add it to the bulkload buffer's list.

    This is incorrect, because the bulkload caller relies on delwri_submit
    to ensure that all the buffers have been sent to disk /before/
    committing the new btree root pointer.  This ordering requirement is
    required for data consistency.

    Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
    drop it, so the next thread to walk through the btree will trip over a
    debug assertion on that flag.

    To fix this, create a new function that waits for the buffer to be
    removed from any other delwri lists before adding the buffer to the
    caller's delwri list.  By waiting for the buffer to clear both the
    delwri list and any potential delwri wait list, we can be sure that
    repair will initiate writes of all buffers and report all write errors
    back to userspace instead of committing the new structure.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:00 -06:00
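The two races in the commit above come down to one invariant: a buffer's DELWRI_Q flag and its delwri list membership can go out of sync. The following is a minimal user-space model of that state machine, assuming simplified semantics; `delwri_queue`, `stale_buf`, and `delwri_queue_here` here are illustrative stand-ins, not the actual xfs_buf code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the race: a buffer tracks whether DELWRI_Q is set and
 * whether it still sits on somebody's delwri list.  The bug is that
 * staling clears the flag but cannot unlink from an unknown list.
 */
struct buf {
	bool delwri_q;	/* _XBF_DELWRI_Q-like flag */
	bool on_list;	/* still linked on some b_list */
};

/* Old behaviour: a buffer that looks queued is silently skipped. */
static bool delwri_queue(struct buf *bp)
{
	if (bp->on_list)
		return false;	/* already on a list, do not add */
	bp->delwri_q = true;
	bp->on_list = true;
	return true;
}

/* Staling clears the flag but not the (unknown) list membership. */
static void stale_buf(struct buf *bp)
{
	bp->delwri_q = false;
}

/* Fixed behaviour: wait for the buffer to leave every other delwri
 * list before queueing it on ours.  The loop stands in for waiting on
 * the AIL to drop the buffer. */
static bool delwri_queue_here(struct buf *bp)
{
	while (bp->on_list)
		bp->on_list = false;
	return delwri_queue(bp);
}
```

In the model, a second `delwri_queue` after `stale_buf` returns false (the buffer is missed, as in the race), while `delwri_queue_here` waits and then queues it, so delwri_submit will write it before the new root is committed.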
Bill O'Donnell d0e7df7358 xfs: allow scanning ranges of the buffer cache for live buffers
JIRA: https://issues.redhat.com/browse/RHEL-57114

commit 9ed851f695c71d325758f8c18e265da9316afd26
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Thu Aug 10 07:48:03 2023 -0700

    xfs: allow scanning ranges of the buffer cache for live buffers

    After an online repair, we need to invalidate buffers representing the
    blocks from the old metadata that we're replacing.  It's possible that
    parts of a tree that were previously cached in memory are no longer
    accessible due to media failure or other corruption on interior nodes,
    so repair figures out the old blocks from the reverse mapping data and
    scans the buffer cache directly.

    In other words, online fsck needs to find all the live (i.e. non-stale)
    buffers for a range of fsblocks so that it can invalidate them.

    Unfortunately, the current buffer cache code triggers asserts if the
    rhashtable lookup finds a non-stale buffer of a different length than
    the key we searched for.  For regular operation this is desirable, but
    for this repair procedure, we don't care since we're going to forcibly
    stale the buffer anyway.  Add an internal lookup flag to avoid the
    assert.  Skip buffers that are already XBF_STALE.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-10-15 10:46:22 -05:00
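The behaviour change above can be sketched as a lookup that tolerates a length mismatch when scanning for live buffers. XBF_LIVESCAN and XBF_STALE are the flag names from the commit; everything else below is an illustrative user-space model, not the kernel lookup path.

```c
#include <assert.h>
#include <stddef.h>

#define XBF_STALE	(1u << 0)
#define XBF_LIVESCAN	(1u << 1)	/* internal lookup flag */

struct buf {
	unsigned int	flags;
	size_t		length;
};

/*
 * Model of the rhashtable lookup's compare step for one candidate
 * buffer: regular lookups assert that the length matches; a live scan
 * accepts any non-stale buffer, since repair will stale it anyway.
 */
static struct buf *lookup(struct buf *bp, size_t len, unsigned int flags)
{
	if (flags & XBF_LIVESCAN) {
		if (bp->flags & XBF_STALE)
			return NULL;	/* skip already-stale buffers */
		return bp;		/* length mismatch tolerated */
	}
	/* regular operation: a mismatch would indicate a cache bug */
	assert(bp->length == len);
	return bp;
}
```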
Rafael Aquini 9a083ffc2d list_lru: allow explicit memcg and NUMA node selection
JIRA: https://issues.redhat.com/browse/RHEL-40684
Conflicts:
    * include/linux/list_lru.h: minor context differences due to missing
         upstream v6.8 commit 7679e14098c9 ("mm: list_lru: Update kernel
         documentation to follow the requirements")

This patch is a backport of the following upstream commit:
commit 0a97c01cd20bb96359d8c9dedad92a061ed34e0b
Author: Nhat Pham <nphamcs@gmail.com>
Date:   Thu Nov 30 11:40:18 2023 -0800

    list_lru: allow explicit memcg and NUMA node selection

    Patch series "workload-specific and memory pressure-driven zswap
    writeback", v8.

    There are currently several issues with zswap writeback:

    1. There is only a single global LRU for zswap, making it impossible to
       perform workload-specific shrinking - a memcg under memory pressure
       cannot determine which pages in the pool it owns, and often ends up
       writing pages from other memcgs. This issue has been previously
       observed in practice and mitigated by simply disabling
       memcg-initiated shrinking:

       https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

       But this solution leaves a lot to be desired, as we still do not
       have an avenue for a memcg to free up its own memory locked up in
       the zswap pool.

    2. We only shrink the zswap pool when the user-defined limit is hit.
       This means that if we set the limit too high, cold data that are
       unlikely to be used again will reside in the pool, wasting precious
       memory. It is hard to predict how much zswap space will be needed
       ahead of time, as this depends on the workload (specifically, on
       factors such as memory access patterns and compressibility of the
       memory pages).

    This patch series solves these issues by separating the global zswap LRU
    into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e.
    memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
    shrinker does not have any parameter that must be tuned by the user, and
    can be opted in or out on a per-memcg basis.

    As a proof of concept, we ran the following synthetic benchmark: build the
    linux kernel in a memory-limited cgroup, and allocate some cold data in
    tmpfs to see if the shrinker could write them out and improved the overall
    performance.  Depending on the amount of cold data generated, we observe
    from 14% to 35% reduction in kernel CPU time used in the kernel builds.

    This patch (of 6):

    The interface of list_lru is based on the assumption that the list node
    and the data it represents are allocated on the correct node/memcg.
    While this assumption is valid for existing slab object LRUs
    such as dentries and inodes, it is undocumented, and rather inflexible for
    certain potential list_lru users (such as the upcoming zswap shrinker and
    the THP shrinker).  It has caused us a lot of issues during our
    development.

    This patch changes the list_lru interface so that the caller must explicitly
    specify numa node and memcg when adding and removing objects.  The old
    list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
    list_lru_del_obj(), respectively.

    It also extends the list_lru API with a new function, list_lru_putback,
    which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
    does not increment the LRU node count (as list_lru_isolate does not
    decrement the node count).  list_lru_putback also allows for explicit
    memcg and NUMA node selection.

    Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
    Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
    Signed-off-by: Nhat Pham <nphamcs@gmail.com>
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Seth Jennings <sjenning@redhat.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vitaly Wool <vitaly.wool@konsulko.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-06-28 12:24:14 -04:00
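The interface change above can be illustrated with a toy model (memcg dimension omitted for brevity): the renamed add/del take the NUMA node explicitly, and putback relinks an isolated object without touching the node count. The `_model` suffixes mark these as sketches of the renamed list_lru functions, not the kernel implementations.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2

struct lru_obj {
	bool linked;		/* on some per-node list */
};

struct list_lru {
	long nr_items[NR_NODES];	/* per-node counts */
};

/* caller names the node explicitly, as list_lru_add_obj() callers do */
static void list_lru_add_obj_model(struct list_lru *lru, struct lru_obj *o,
				   int nid)
{
	o->linked = true;
	lru->nr_items[nid]++;
}

static void list_lru_del_obj_model(struct list_lru *lru, struct lru_obj *o,
				   int nid)
{
	o->linked = false;
	lru->nr_items[nid]--;
}

/* isolate unlinks but, per the commit message, does not drop the count */
static void list_lru_isolate_model(struct lru_obj *o)
{
	o->linked = false;
}

/* putback undoes an isolate: relink without incrementing the count */
static void list_lru_putback_model(struct list_lru *lru, struct lru_obj *o,
				   int nid)
{
	(void)lru;
	(void)nid;	/* node (and memcg) chosen explicitly by the caller */
	o->linked = true;
}
```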
Chris von Recklinghausen 41d58c77af mm: vmscan: refactor updating current->reclaim_state
Conflicts: mm/slob.c - We already have
	6630e950d532 ("mm/slob: remove slob.c")
	so the file is gone.

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c7b23b68e2aa93f86a206222d23ccd9a21f5982a
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Thu Apr 13 10:40:34 2023 +0000

    mm: vmscan: refactor updating current->reclaim_state

    During reclaim, we keep track of pages reclaimed from other means than
    LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
    which we stash a pointer to in current task_struct.

    However, we keep track of more than just reclaimed slab pages through
    this.  We also use it for clean file pages dropped through pruned inodes,
    and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
    helper function that wraps updating it through current, so that future
    changes to this logic are contained within include/linux/swap.h.

    Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:59 -04:00
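The refactor above funnels all "reclaimed outside the LRU" accounting through one helper that updates the reclaim_state stashed in the current task. A user-space sketch, where the helper name follows the one added by this series and the task plumbing is simplified for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct reclaim_state {
	unsigned long reclaimed;	/* was reclaimed_slab; now counts
					 * slab pages, pruned-inode
					 * pagecache and freed xfs buffer
					 * pages alike */
};

/* stand-in for the kernel's per-task `current` pointer */
static struct task {
	struct reclaim_state *reclaim_state;
} current_task;

/* single entry point for non-LRU reclaim accounting */
static void mm_account_reclaimed_pages(unsigned long pages)
{
	/* no-op outside reclaim, when no state is stashed on the task */
	if (current_task.reclaim_state)
		current_task.reclaim_state->reclaimed += pages;
}
```

Callers (slab shrinkers, inode pruning, buffer freeing) all call the helper instead of touching `current->reclaim_state` directly, so future changes to the logic stay contained in one place.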
Ming Lei ca8eaf1249 fs,block: yield devices early
JIRA: https://issues.redhat.com/browse/RHEL-29564
Conflicts: drop change on f2fs, bcachefs and reiserfs; context
	difference on ext4 & fs/super.c change.

commit 22650a99821dda3d05f1c334ea90330b4982de56
Author: Christian Brauner <brauner@kernel.org>
Date:   Tue Mar 26 13:47:22 2024 +0100

    fs,block: yield devices early

    Currently a device is only really released once the umount returns to
    userspace due to how file closing works. That ultimately could cause
    an old umount assumption to be violated, namely that a concurrent umount
    and mount don't fail. So an exclusively held device with a temporary holder should
    be yielded before the filesystem is gone. Add a helper that allows
    callers to do that. This also allows us to remove the two holder ops
    that Linus wasn't excited about.

    Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
    Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 10:39:09 +08:00
Ming Lei 0c712a8085 xfs: port block device access to files
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 1b9e2d90141c5e25faefbb7891f0ed8606aa02cf
Author: Christian Brauner <brauner@kernel.org>
Date:   Tue Jan 23 14:26:24 2024 +0100

    xfs: port block device access to files

    Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 10:18:37 +08:00
Ming Lei 079721ca6f xfs: Convert to bdev_open_by_path()
JIRA: https://issues.redhat.com/browse/RHEL-29262

commit e340dd63f6a11402424b3d77e51149bce8fcba7d
Author: Jan Kara <jack@suse.cz>
Date:   Wed Sep 27 11:34:34 2023 +0200

    xfs: Convert to bdev_open_by_path()

    Convert xfs to use bdev_open_by_path() and pass the handle around.

    CC: "Darrick J. Wong" <djwong@kernel.org>
    CC: linux-xfs@vger.kernel.org
    Acked-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz
    Acked-by: "Darrick J. Wong" <djwong@kernel.org>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-03-19 10:07:46 +08:00
Ming Lei 6cba6f3195 fs: use the super_block as holder when mounting file systems
JIRA: https://issues.redhat.com/browse/RHEL-29262
Conflicts: drop f2fs change which isn't needed for rhel9

commit 2ea6f68932f73a6a9d82160d3ad0a49a5a6bb183
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Aug 2 17:41:25 2023 +0200

    fs: use the super_block as holder when mounting file systems

    The file system type is not a very useful holder as it doesn't allow us
    to go back to the actual file system instance.  Pass the super_block instead
    which is useful when passed back to the file system driver.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Message-Id: <20230802154131.2221419-7-hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-03-19 10:07:44 +08:00
Ming Lei f9ca79532a xfs: close the external block devices in xfs_mount_free
JIRA: https://issues.redhat.com/browse/RHEL-29262

commit 35a93b148b0363dca23c3db1cc9d48100eb8b276
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Aug 9 15:05:38 2023 -0700

    xfs: close the external block devices in xfs_mount_free

    blkdev_put must not be called under sb->s_umount to avoid a lock order
    reversal with disk->open_mutex.  Move closing the buftargs into ->kill_sb
    to achieve that.  Note that the flushing of the disk caches and the
    invalidation of the block device mapping need to stay in ->put_super as the main
    block device is closed in kill_block_super already.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Message-Id: <20230809220545.1308228-7-hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-03-19 10:07:43 +08:00
Ming Lei 47287f1914 xfs: close the RT and log block devices in xfs_free_buftarg
JIRA: https://issues.redhat.com/browse/RHEL-29262

commit 41233576e9a4515dc9b0bd1cbbb896b520a1f486
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Aug 9 15:05:37 2023 -0700

    xfs: close the RT and log block devices in xfs_free_buftarg

    Closing the block devices logically belongs in xfs_free_buftarg, so
    instead of open coding it in the caller, move it there and add a check
    for s_bdev so that the main device isn't closed, as that's done by the
    VFS helper.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Message-Id: <20230809220545.1308228-6-hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-03-19 10:07:43 +08:00
Bill O'Donnell dc554f866b xfs: invalidate block device page cache during unmount
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 032e160305f6872e590c77f11896fb28365c6d6c
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Nov 28 17:24:42 2022 -0800

    xfs: invalidate block device page cache during unmount

    Every now and then I see fstests failures on aarch64 (64k pages) that
    trigger on the following sequence:

    mkfs.xfs $dev
    mount $dev $mnt
    touch $mnt/a
    umount $mnt
    xfs_db -c 'path /a' -c 'print' $dev

    99% of the time this succeeds, but every now and then xfs_db cannot find
    /a and fails.  This turns out to be a race involving udev/blkid, the
    page cache for the block device, and the xfs_db process.

    udev is triggered whenever anyone closes a block device or unmounts it.
    The default udev rules invoke blkid to read the fs super and create
    symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
    through the page cache.

    xfs_db also uses buffered reads to examine metadata.  There is no
    coordination between xfs_db and udev, which means that they can run
    concurrently.  Note there is no coordination between the kernel and
    blkid either.

    On a system with 64k pages, the page cache can cache the superblock and
    the root inode (and hence the root dir) with the same 64k page.  If
    udev spawns blkid after the mkfs and the system is busy enough that it
    is still running when xfs_db starts up, they'll both read from the same
    page in the pagecache.

    The unmount writes updated inode metadata to disk directly.  The XFS
    buffer cache does not use the bdev pagecache, nor does it invalidate the
    pagecache on umount.  If the above scenario occurs, the pagecache no
    longer reflects what's on disk, xfs_db reads the stale metadata, and
    fails to find /a.  Most of the time this succeeds because closing a bdev
    invalidates the page cache, but when processes race, everyone loses.

    Fix the problem by invalidating the bdev pagecache after flushing the
    bdev, so that xfs_db will see up to date metadata.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:17 -06:00
Bill O'Donnell c6108fc126 xfs: implement ->notify_failure() for XFS
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

Conflicts: change to xfs_perag_get() and xfs_perag_put() api from previous out
	   of order patch from upstream fa044ae70 xfs: pass perag to xfs_read_agf
	   required changes to xfs_notify_failure.c

commit 6f643c57d57c56d4677bc05f1fca2ef3f249797c
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:30 2022 +0800

    xfs: implement ->notify_failure() for XFS

    Introduce xfs_notify_failure.c to handle failure related works, such as
    implement ->notify_failure(), register/unregister dax holder in xfs, and
    so on.

    If the rmap feature of XFS is enabled, we can query it to find files and
    metadata which are associated with the corrupt data.  For now all we do is
    kill processes with that file mapped into their address spaces, but future
    patches could actually do something about corrupt metadata.

    After that, the memory failure needs to notify the processes who are using
    those files.

    Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-16 10:34:27 -05:00
Bill O'Donnell f4673910d2 dax: introduce holder for dax_device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

Conflicts: dropped hunks for erofs and ext2

commit 8012b866085523758780850087102421dbcce522
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:25 2022 +0800

    dax: introduce holder for dax_device

    Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.

    The patchset fsdax-rmap is aimed to support shared pages tracking for
    fsdax.

    It moves owner tracking from dax_associate_entry() to the pmem device driver,
    by introducing an interface ->memory_failure() for struct pagemap.  This
    interface is called by memory_failure() in mm, and implemented by pmem
    device.

    Then call holder operations to find the filesystem which the corrupted
    data located in, and call filesystem handler to track files or metadata
    associated with this page.

    Finally we are able to try to fix the corrupted data in filesystem and do
    other necessary processing, such as killing processes who are using the
    files affected.

    The call trace is like this:
    memory_failure()
    |* fsdax case
    |------------
    |pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
    | dax_holder_notify_failure()      =>
    |  dax_device->holder_ops->notify_failure() =>
    |                                     - xfs_dax_notify_failure()
    |  |* xfs_dax_notify_failure()
    |  |--------------------------
    |  |   xfs_rmap_query_range()
    |  |    xfs_dax_failure_fn()
    |  |    * corrupted on metadata
    |  |       try to recover data, call xfs_force_shutdown()
    |  |    * corrupted on file data
    |  |       try to recover data, call mf_dax_kill_procs()
    |* normal case
    |-------------
    |mf_generic_kill_procs()

    The patchset fsdax-reflink attempts to add CoW support for fsdax, and
    takes XFS, which has both reflink and fsdax features, as an example.

    One of the key mechanisms needed to be implemented in fsdax is CoW.  Copy
    the data from srcmap before we actually write data to the destination
    iomap.  And we just copy range in which data won't be changed.

    Another mechanism is range comparison.  In page cache case, readpage() is
    used to load data on disk to page cache in order to be able to compare
    data.  In fsdax case, readpage() does not work.  So, we need another
    compare data with direct access support.

    With the two mechanisms implemented in fsdax, we are able to make reflink
    and fsdax work together in XFS.

    This patch (of 14):

    To easily track the filesystem from a pmem device, we introduce a holder for
    dax_device structure, and also its operation.  This holder is used to
    remember who is using this dax_device:

     - When it is the backend of a filesystem, the holder will be the
       instance of this filesystem.
     - When this pmem device is one of the targets in a mapped device, the
       holder will be this mapped device.  In this case, the mapped device
       has its own dax_device and it will follow the first rule.  So that we
       can finally track to the filesystem we needed.

    The holder and holder_ops will be set when a filesystem is being
    mounted, or a target device is being activated.

    Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
    Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-06 13:41:02 -05:00
Bill O'Donnell 426777a415 xfs: xfs_buf cache destroy isn't RCU safe
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 231f91ab504ecebcb88e942341b3d7dd91de45f1
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jul 18 18:20:37 2022 -0700

    xfs: xfs_buf cache destroy isn't RCU safe

    Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
    report a non-empty xfs_buf slab on module remove. This isn't easy
    to reproduce, but is clearly a side effect of converting the buffer
    cache to RCU freeing and lockless lookups. Sachin bisected and
    Darrick hit it when testing the patchset directly.

    Turns out that the xfs_buf slab is not destroyed when all the other
    XFS slab caches are destroyed. Instead, it's got its own little
    wrapper function that gets called separately, and so it doesn't have
    an rcu_barrier() call in it that is needed to drain all the rcu
    callbacks before the slab is destroyed.

    Fix it by removing the xfs_buf_init/terminate wrappers that just
    allocate and destroy the xfs_buf slab, and move them to the same
    place that all the other slab caches are set up and destroyed.

    Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com>
    Fixes: 298f34224506 ("xfs: lockless buffer lookup")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:48 -05:00
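The ordering bug above can be modelled with two counters: RCU freeing queues callbacks, and the slab may only be destroyed once those callbacks have run. rcu_barrier() is the real kernel primitive the fix relies on; the functions below are an illustrative simulation of the callback queue, not kernel code.

```c
#include <assert.h>

static int pending_callbacks;	/* queued call_rcu() frees */
static int slab_objects;	/* live objects in the xfs_buf slab */

/* freeing via RCU: the object disappears "later", not immediately */
static void queue_rcu_free(void)
{
	pending_callbacks++;
}

/* models rcu_barrier(): wait until every queued callback has run */
static void rcu_barrier_model(void)
{
	slab_objects -= pending_callbacks;
	pending_callbacks = 0;
}

/* destroying a non-empty slab is the failure xfs/435 reported */
static int destroy_slab(void)
{
	return slab_objects == 0 ? 0 : -1;
}
```

Without the barrier, module removal can reach slab destruction while callbacks are still pending; moving the xfs_buf slab teardown next to the other caches puts it behind the common rcu_barrier() call.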
Bill O'Donnell 2481f637d7 xfs: lockless buffer lookup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 298f342245066309189d8637ca7339d56840c3e1
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 12:05:07 2022 +1000

    xfs: lockless buffer lookup

    Now that we have a standalone fast path for buffer lookup, we can
    easily convert it to use rcu lookups. When we continually hammer the
    buffer cache with trylock lookups, we end up with a huge amount of
    lock contention on the per-ag buffer hash locks:

    -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
       - 92.67% xfs_inodegc_worker
          - 92.13% xfs_inode_unlink
             - 91.52% xfs_inactive_ifree
                - 85.63% xfs_read_agi
                   - 85.61% xfs_trans_read_buf_map
                      - 85.59% xfs_buf_read_map
                         - xfs_buf_get_map
                            - 85.55% xfs_buf_find
                               - 72.87% _raw_spin_lock
                                  - do_raw_spin_lock
                                       71.86% __pv_queued_spin_lock_slowpath
                               - 8.74% xfs_buf_rele
                                  - 7.88% _raw_spin_lock
                                     - 7.88% do_raw_spin_lock
                                          7.63% __pv_queued_spin_lock_slowpath
                               - 1.70% xfs_buf_trylock
                                  - 1.68% down_trylock
                                     - 1.41% _raw_spin_lock_irqsave
                                        - 1.39% do_raw_spin_lock
                                             __pv_queued_spin_lock_slowpath
                               - 0.76% _raw_spin_unlock
                                    0.75% do_raw_spin_unlock

    This is basically hammering the pag->pag_buf_lock from lots of CPUs
    doing trylocks at the same time. Most of the buffer trylock
    operations ultimately fail after we've done the lookup, so we're
    really hammering the buf hash lock whilst making no progress.

    We can also see significant spinlock traffic on the same lock just
    under normal operation when lots of tasks are accessing metadata
    from the same AG, so let's avoid all this by converting the lookup
    fast path to leverage the rhashtable's ability to do rcu protected
    lookups.

    We avoid races with the buffer release path by using
    atomic_inc_not_zero() on the buffer hold count. Any buffer that is
    in the LRU will have a non-zero count, thereby allowing the lockless
    fast path to be taken in most cache hit situations. If the buffer
    hold count is zero, then it is likely going through the release path
    so in that case we fall back to the existing lookup miss slow path.

    The slow path will then do an atomic lookup and insert under the
    buffer hash lock and hence serialise correctly against buffer
    release freeing the buffer.

    The use of rcu protected lookups means that buffer handles now need
    to be freed by RCU callbacks (same as inodes). We still free the
    buffer pages before the RCU callback - we won't be trying to access
    them at all on a buffer that has zero references - but we need the
    buffer handle itself to be present for the entire rcu protected read
    side to detect a zero hold count correctly.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:47 -05:00
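The hold-count game described above can be sketched with C11 atomics. atomic_inc_not_zero() is the kernel primitive named in the message; the CAS loop below is a plausible user-space equivalent, and lookup_fast() is an illustrative stand-in for the RCU-protected fast path, not the actual xfs_buf code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* take a hold only if the count is currently non-zero */
static bool atomic_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* on CAS failure, 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

struct buf {
	atomic_int b_hold;	/* buffer hold count */
};

/*
 * Returns true if we took a hold locklessly; false means "fall back to
 * the hash-lock slow path", which serialises against the release path
 * freeing the buffer.
 */
static bool lookup_fast(struct buf *bp)
{
	return atomic_inc_not_zero(&bp->b_hold);
}
```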
Bill O'Donnell 07f23f7901 xfs: remove a superflous hash lookup when inserting new buffers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 32dd4f9c506b1bf147c24cf05423cd893bc06e38
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 12:04:43 2022 +1000

    xfs: remove a superflous hash lookup when inserting new buffers

    Currently on the slow path insert we repeat the initial hash table
    lookup before we attempt the insert, resulting in two traversals
    of the hash table to ensure the insert is valid. The rhashtable API
    provides a method for an atomic lookup and insert operation, so we
    can avoid one of the hash table traversals by using this method.

    Adapted from a large patch containing this optimisation by Christoph
    Hellwig.
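The single-traversal semantics can be sketched in userspace. The kernel API in question, rhashtable_lookup_get_insert_fast(), returns the existing object on a key collision and NULL on a successful insert; the toy bucket below models that contract (names illustrative):

```c
#include <stddef.h>

struct entry {
    unsigned long key;
    struct entry *next;
};

/* One-pass lookup-or-insert on a hash bucket: return the existing entry
 * if the key is already present, otherwise link in 'new' and return NULL,
 * all in a single traversal. */
static struct entry *lookup_get_insert(struct entry **bucket, struct entry *new)
{
    for (struct entry *e = *bucket; e != NULL; e = e->next)
        if (e->key == new->key)
            return e;          /* already cached: caller drops 'new' */
    new->next = *bucket;       /* miss: insert during the same walk */
    *bucket = new;
    return NULL;
}
```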

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:46 -05:00
Bill O'Donnell 2e7bfb55f1 xfs: reduce the number of atomic when locking a buffer after lookup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit d8d9bbb0ee6c79191b704d88c8ae712b89e0d2bb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 12:04:38 2022 +1000

    xfs: reduce the number of atomic when locking a buffer after lookup

    Avoid an extra atomic operation in the non-trylock case by only
    doing a trylock if the XBF_TRYLOCK flag is set. This follows the
    pattern in the IO path with NOWAIT semantics where the
    "trylock-fail-lock" path showed 5-10% reduced throughput compared to
    just using a single lock call when not under NOWAIT conditions. So
    make that same change here, too.

    See commit 942491c9e6 ("xfs: fix AIM7 regression") for details.
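The locking change can be sketched with a pthread mutex standing in for the buffer semaphore; XBF_TRYLOCK's value here is assumed for the sketch only:

```c
#include <pthread.h>
#include <stdbool.h>

#define XBF_TRYLOCK 0x1u   /* assumed flag value, for this sketch only */

/* Lock the buffer: only the NOWAIT case pays for a trylock; the
 * blocking case issues a single lock call with no preceding trylock,
 * avoiding the extra atomic of the trylock-fail-lock sequence. */
static bool buf_lock(pthread_mutex_t *lock, unsigned int flags)
{
    if (flags & XBF_TRYLOCK)
        return pthread_mutex_trylock(lock) == 0;  /* nowait semantics */
    pthread_mutex_lock(lock);   /* blocking path: one lock operation */
    return true;
}
```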

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [hch: split from a larger patch]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:46 -05:00
Bill O'Donnell 28879ba059 xfs: merge xfs_buf_find() and xfs_buf_get_map()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 348000804a0f4dea74219a927e081d6e7dee792f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 12:04:31 2022 +1000

    xfs: merge xfs_buf_find() and xfs_buf_get_map()

    Now that we have factored xfs_buf_find(), we can separate
    xfs_buf_get_map() into distinct fast and slow paths. We start by
    moving the lookup map and perag setup to _get_map(), and then move
    all the specifics of the fast path lookup into xfs_buf_lookup()
    and call it directly from _get_map(). We then move all the slow path
    code to xfs_buf_find_insert(), which is now also called directly
    from _get_map(). As such, xfs_buf_find() now goes away.
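The resulting composition can be sketched with a toy cache; the function names mirror the patch, everything else is illustrative:

```c
#include <stdbool.h>

#define NBUFS 16
static bool cached[NBUFS];     /* toy stand-in for the buffer cache */

static bool buf_lookup(int blkno)            /* fast path: pure lookup */
{
    return blkno >= 0 && blkno < NBUFS && cached[blkno];
}

static bool buf_find_insert(int blkno)       /* slow path: insert on miss */
{
    if (blkno < 0 || blkno >= NBUFS)
        return false;
    cached[blkno] = true;
    return true;
}

static bool buf_get_map(int blkno)           /* composes the two directly */
{
    if (buf_lookup(blkno))
        return true;           /* cache hit: slow path never entered */
    return buf_find_insert(blkno);
}
```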

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:46 -05:00
Bill O'Donnell 2b398f6f24 xfs: break up xfs_buf_find() into individual pieces
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit de67dc575434dca8d60b1e181ed5dd296392ffce
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 12:02:46 2022 +1000

    xfs: break up xfs_buf_find() into individual pieces

    xfs_buf_find() is made up of three main parts: lookup, insert and
    locking. The interactions with xfs_buf_get_map() require it to be
    called twice - once for a pure lookup, and again on lookup failure
    so the insert path can be run. We want to simplify this down a lot,
    so split it into a fast path lookup, a slow path insert and a "lock
    the found buffer" helper. This will then let us integrate these
    operations more effectively into xfs_buf_get_map() in future
    patches.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:46 -05:00
Bill O'Donnell ac5ff39607 xfs: rework xfs_buf_incore() API
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 85c73bf726e41be276bcad3325d9a8aef10be289
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 22:05:18 2022 +1000

    xfs: rework xfs_buf_incore() API

    Make it consistent with the other buffer APIs: return an error,
    with the buffer passed back through an out parameter.
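The reworked API shape, modeled in userspace (the struct name, the cached block number, and the function signature are illustrative, not the actual xfs_buf_incore() prototype):

```c
#include <errno.h>
#include <stddef.h>

struct xbuf { long blkno; };

/* New-style shape: status is the return value and the buffer comes
 * back through an out parameter, like the other buffer lookup APIs. */
static int buf_incore(long blkno, struct xbuf **bpp)
{
    static struct xbuf cached = { .blkno = 7 };  /* pretend block 7 is cached */

    if (blkno != cached.blkno) {
        *bpp = NULL;
        return -ENOENT;        /* cache miss reported as an error code */
    }
    *bpp = &cached;
    return 0;                  /* success: *bpp points at the buffer */
}
```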

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:41 -05:00
Bill O'Donnell ca371f43d0 xfs: convert buffer flags to unsigned.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit b9b3fe152e4966cf8562630de67aa49e2f9c9222
Author: Dave Chinner <david@fromorbit.com>
Date:   Thu Apr 21 08:44:59 2022 +1000

    xfs: convert buffer flags to unsigned.

    5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
    fields to be unsigned. This manifests as a compiler error such as:

    /kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
      TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
      ^
    /kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
         __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
         ^
    /kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
      { XBF_UNMAPPED,  "UNMAPPED" }
        ^
    /kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
         __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
                                            ^
    /kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
    /kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
     #define XBF_UNMAPPED  (1 << 31)/* do not map the buffer */

    as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
    unsigned long for the flag. Since this results in the value of
    XBF_UNMAPPED causing a signed integer overflow, the result is
    technically undefined behavior, which gcc-5 does not accept as an
    integer constant.

    This is based on a patch from Arnd Bergman <arnd@arndb.de>.
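A minimal illustration of the fix: the shift must start from an unsigned literal so bit 31 never overflows a signed int (the overflowing form is undefined behavior, which gcc-5 refuses to treat as an integer constant in the tracepoint flag tables):

```c
/* well-defined unsigned constant; (1 << 31) would overflow signed int */
#define XBF_UNMAPPED (1u << 31)   /* do not map the buffer */

/* flags stored in unsigned fields now match the constant's type */
struct flags_holder { unsigned int flags; };
```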

    Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:00 -05:00
Bill O'Donnell 176d497387 xfs: check buffer pin state after locking in delwri_submit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit dbd0f5299302f8506637592e2373891a748c6990
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:10 2022 -0700

    xfs: check buffer pin state after locking in delwri_submit

    AIL flushing can get stuck here:

    [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
    [316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
    [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
    [316649.014112] Call Trace:
    [316649.014841]  <TASK>
    [316649.015492]  __schedule+0x30d/0x9e0
    [316649.017745]  schedule+0x55/0xd0
    [316649.018681]  io_schedule+0x4b/0x80
    [316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
    [316649.021850]  __xfs_buf_submit+0x14a/0x230
    [316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
    [316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
    [316649.025931]  xfsaild+0x27e/0x9d0
    [316649.028283]  kthread+0xf6/0x120
    [316649.030602]  ret_from_fork+0x1f/0x30

    in the situation where flushing gets preempted between the unpin
    check and the buffer trylock under nowait conditions:

            blk_start_plug(&plug);
            list_for_each_entry_safe(bp, n, buffer_list, b_list) {
                    if (!wait_list) {
                            if (xfs_buf_ispinned(bp)) {
                                    pinned++;
                                    continue;
                            }
    Here >>>>>>
                            if (!xfs_buf_trylock(bp))
                                    continue;

    This means submission is stuck until something else triggers a log
    force to unpin the buffer.

    To get onto the delwri list to begin with, the buffer pin state has
    already been checked, and hence it's relatively rare we get a race
    between flushing and encountering a pinned buffer in delwri
    submission to begin with. Further, to increase the pin count the
    buffer has to be locked, so the only way we can hit this race
    without failing the trylock is to be preempted between the pincount
    check seeing zero and the trylock being run.

    Hence to avoid this problem, just invert the order of trylock vs
    pin check. We shouldn't hit that many pinned buffers here, so
    optimising away the trylock for pinned buffers should not matter for
    performance at all.
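The corrected ordering can be modeled with a pthread mutex standing in for the buffer lock; this is a sketch of the control flow, not the kernel code:

```c
#include <pthread.h>
#include <stdbool.h>

struct dbuf {
    pthread_mutex_t lock;
    int pin_count;       /* raised only while the buffer lock is held */
};

/* nowait delwri submission step with the inverted order: take the
 * trylock first, then check the pin state under the lock, so a pin
 * can no longer slip in between the check and the lock. */
static bool delwri_try_submit(struct dbuf *bp)
{
    if (pthread_mutex_trylock(&bp->lock) != 0)
        return false;                    /* contended: skip this buffer */
    if (bp->pin_count > 0) {
        pthread_mutex_unlock(&bp->lock); /* pinned: needs a log force */
        return false;
    }
    /* ... would submit the buffer for I/O here ... */
    pthread_mutex_unlock(&bp->lock);
    return true;
}
```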

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:50 -05:00
Jeff Moyer 358fa83614 Merge branch 'main' into 'guilt/pmem-9.2'
Several patches to this file were backported out of order.  The result of this merge resolution matches upstream after the inclusion of all of the patches we have backported.

# Conflicts:
#   fs/iomap/buffered-io.c
2023-03-30 20:35:46 +00:00
Chris von Recklinghausen 8dced2b153 mm: shrinkers: provide shrinkers with names
Bugzilla: https://bugzilla.redhat.com/2160210

commit e33c267ab70de4249d22d7eab1cc7d68a889bac2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Tue May 31 20:22:24 2022 -0700

    mm: shrinkers: provide shrinkers with names

    Currently shrinkers are anonymous objects.  For debugging purposes they
    can be identified by count/scan function names, but it's not always
    useful: e.g.  for superblock's shrinkers it's nice to have at least an
    idea of to which superblock the shrinker belongs.

    This commit adds names to shrinkers.  register_shrinker() and
    prealloc_shrinker() functions are extended to take a format and arguments
    to generate a name.

    In some cases it's not possible to determine a good name at the time when
    a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
    provided.

    The expected format is:
        <subsystem>-<shrinker_type>[:<instance>]-<id>
    For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

    After this change the shrinker debugfs directory looks like:
      $ cd /sys/kernel/debug/shrinker/
      $ ls
        dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
        mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
        mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
        rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
        sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
        sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
        sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
        sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
        sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
        sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
        sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
        sb-dax-11           sb-proc-45       sb-tmpfs-35
        sb-debugfs-7        sb-proc-46       sb-tmpfs-40
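The naming convention above can be reproduced with a small formatter; the real API (register_shrinker() and friends taking printf-style arguments) builds names like this internally. A sketch, with an assumed helper name:

```c
#include <stdio.h>
#include <stddef.h>

/* Build a shrinker debugfs name following the documented convention:
 * <subsystem>-<shrinker_type>[:<instance>]-<id> */
static void shrinker_name(char *buf, size_t len, const char *subsys,
                          const char *type, const char *instance, int id)
{
    if (instance)
        snprintf(buf, len, "%s-%s:%s-%d", subsys, type, instance, id);
    else
        snprintf(buf, len, "%s-%s-%d", subsys, type, id);
}
```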

    [roman.gushchin@linux.dev: fix build warnings]
      Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
      Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Kent Overstreet <kent.overstreet@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Chris von Recklinghausen cae92f5bda mm: introduce memalloc_retry_wait()
Conflicts:
	drop changes to fs/f2fs/gc.c fs/f2fs/data.c fs/f2fs/inode.c
	fs/f2fs/node.c fs/f2fs/recovery.c fs/f2fs/segment.c fs/f2fs/super.c -
		unsupported config
	fs/ext4/inline.c - The backport of
		36d116e99da7 ("ext4: Use scoped memory APIs in ext4_da_write_begin()")
		added an include of linux/sched/mm.h, citing the lack of this
		patch in its conflicts section. Keep it.

Bugzilla: https://bugzilla.redhat.com/2160210

commit 4034247a0d6ab281ba3293798ce67af494d86129
Author: NeilBrown <neilb@suse.de>
Date:   Fri Jan 14 14:07:14 2022 -0800

    mm: introduce memalloc_retry_wait()

    Various places in the kernel - largely in filesystems - respond to a
    memory allocation failure by looping around and re-trying.  Some of
    these cannot conveniently use __GFP_NOFAIL, for reasons such as:

     - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
     - a need to check for the process being signalled between failures
     - the possibility that other recovery actions could be performed
     - the allocation is quite deep in support code, and passing down an
       extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

    Many of these currently use congestion_wait() which (in almost all
    cases) simply waits the given timeout - congestion isn't tracked for
    most devices.

    It isn't clear what the best delay is for loops, but it is clear that
    the various filesystems shouldn't be responsible for choosing a timeout.

    This patch introduces memalloc_retry_wait(), which takes on that
    responsibility.  Code that wants to retry a memory allocation can call
    this function passing the GFP flags that were used.  It will wait
    however is appropriate.

    For now, it only considers __GFP_NORETRY and whatever
    gfpflags_allow_blocking() tests.  If blocking is allowed without
    __GFP_NORETRY, then alloc_page either made some reclaim progress, or
    waited for a while, before failing.  So there is no need for much
    further waiting.  memalloc_retry_wait() will wait until the current
    jiffie ends.  If this condition is not met, then alloc_page() won't have
    waited much if at all.  In that case memalloc_retry_wait() waits about
    200ms.  This is the delay that most current loops use.
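The retry pattern the helper standardises, modeled in userspace: retry_wait() stands in for memalloc_retry_wait(), and the flaky allocator exists purely to exercise the loop.

```c
#include <stddef.h>

static int wait_calls;     /* counts backoff waits, visible for testing */

/* Stand-in for memalloc_retry_wait(): the kernel version inspects the
 * GFP flags and sleeps to the end of the current jiffy or ~200ms. */
static void retry_wait(void)
{
    wait_calls++;
}

/* Loop on allocation failure and delegate the "how long to wait"
 * decision to the single shared helper instead of a local timeout. */
static void *alloc_retrying(void *(*try_alloc)(void), int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        void *p = try_alloc();
        if (p)
            return p;
        retry_wait();
    }
    return NULL;
}

static int failures_left = 2;          /* simulate two failed attempts */
static void *flaky_alloc(void)
{
    static char mem[8];
    return failures_left-- > 0 ? NULL : (void *)mem;
}
```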

    linux/sched/mm.h needs to be included in some files now,
    but linux/backing-dev.h does not.

    Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
    Signed-off-by: NeilBrown <neilb@suse.de>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:45 -04:00
Jeff Moyer 7f25e45b02 dax: return the partition offset from fs_dax_get_by_bdev
Bugzilla: https://bugzilla.redhat.com/2162211
Conflicts: dropped ext2 and erofs hunks, as they're not supported in RHEL.
  Fixed up ext4 conflict due to RHEL differences.

commit cd913c76f489def1a388e3a5b10df94948ede3f5
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Nov 29 11:21:59 2021 +0100

    dax: return the partition offset from fs_dax_get_by_bdev
    
    Prepare for the removal of the block_device from the DAX I/O path by
    returning the partition offset from fs_dax_get_by_bdev so that the file
    systems have it at hand for use during I/O.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-14 10:54:18 -04:00
Jeff Moyer cf1518f312 xfs: move dax device handling into xfs_{alloc,free}_buftarg
Bugzilla: https://bugzilla.redhat.com/2162211

commit 5b5abbefec1bea98abba8f1cffcf72c11c32a92d
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Nov 29 11:21:55 2021 +0100

    xfs: move dax device handling into xfs_{alloc,free}_buftarg
    
    Hide the DAX device lookup from the xfs_super.c code.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-14 10:54:17 -04:00
Frantisek Hrbata 64e5412cb6 Merge: XFS update to v5.16
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1508

Bugzilla: https://bugzilla.redhat.com/2125724

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>

Approved-by: Bill O'Donnell <bodonnel@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Eric Sandeen <esandeen@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-23 02:46:01 -05:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildebrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted: fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Carlos Maiolino 25a40d32f8 xfs: rename _zone variables to _cache
Bugzilla: https://bugzilla.redhat.com/2125724

Conflicts:
	Small conflict at xfs_inode_alloc() due to out of order
	backport. Inode alloc using kmem_cache_alloc() has been
	converted to use alloc_inode_sb() before this patch.

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 182696fb021fc196e5cbe641565ca40fcf0f885a)
2022-10-21 12:50:46 +02:00
Carlos Maiolino d912d565bb xfs: remove kmem_zone typedef
Bugzilla: https://bugzilla.redhat.com/2125724

Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit e7720afad068a6729d9cd3aaa08212f2f5a7ceff)
2022-10-21 12:50:46 +02:00
Chris von Recklinghausen 8219fb6550 remove bdi_congested() and wb_congested() and related functions
Bugzilla: https://bugzilla.redhat.com/2120352

commit b9b1335e640308acc1b8f26c739b804c80a6c147
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 22 14:39:10 2022 -0700

    remove bdi_congested() and wb_congested() and related functions

    These functions are no longer useful as no BDIs report congestions any
    more.

    Removing the test on bdi_write_congested() in current_may_throttle()
    could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
    is set.

    So replace the calls by 'false' and simplify the code - and remove the
    functions.

    [akpm@linux-foundation.org: fix build]

    Link: https://lkml.kernel.org/r/164549983742.9187.2570198746005819592.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>   [nilfs]
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Paolo Valente <paolo.valente@linaro.org>
    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:49 -04:00
Ming Lei f0231c3baa fs/xfs: Use the enum req_op and blk_opf_t types
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511

commit d03025aef8676e826b69f8e3ec9bb59a5ad0c31d
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Thu Jul 14 11:07:28 2022 -0700

    fs/xfs: Use the enum req_op and blk_opf_t types

    Improve static type checking by using the enum req_op type for variables
    that represent a request operation and the new blk_opf_t type for the
    combination of a request operation with request flags.
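The idea can be modeled in a few lines; the mask and flag values below are illustrative, not the kernel's actual bit layout:

```c
typedef unsigned int blk_opf_t;   /* op + flags carried as one word */

enum req_op {                     /* bare operations get their own enum */
    REQ_OP_READ  = 0,
    REQ_OP_WRITE = 1,
};

#define REQ_SYNC ((blk_opf_t)(1u << 3))   /* illustrative flag bit only */

/* Combining an op with flags is now an explicit, visible conversion
 * rather than silent mixing of untyped integers. */
static blk_opf_t op_with_flags(enum req_op op, blk_opf_t flags)
{
    return (blk_opf_t)op | flags;
}

static enum req_op op_from_opf(blk_opf_t opf)
{
    return (enum req_op)(opf & 0x7u);     /* low bits hold the operation */
}
```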

    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20220714180729.1065367-63-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-10-12 09:20:22 +08:00
Brian Foster 556fc6c4a0 xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 01728b44ef1b714756607be0210fbcf60c78efce
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:13 2022 -0700

    xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight

    I've been chasing a recent resurgence in generic/388 recovery
    failure and/or corruption events. The events have largely been
    uninitialised inode chunks being tripped over in log recovery
    such as:

     XFS (pmem1): User initiated shutdown received.
     pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
     XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
     XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
     XFS (pmem1): Unmounting Filesystem
     XFS (pmem1): Mounting V5 Filesystem
     XFS (pmem1): Starting recovery (logdev: internal)
     XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
     XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
     XFS (pmem1): Unmount and run xfs_repair
     XFS (pmem1): First 128 bytes of corrupted metadata buffer:
     00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
     XFS (pmem1): log mount/recovery failed: error -117
     XFS (pmem1): log mount failed

    There have been isolated random other issues, too - xfs_repair fails
    because it finds some corruption in symlink blocks, rmap
    inconsistencies, etc - but they are nowhere near as common as the
    uninitialised inode chunk failure.

    The problem has clearly happened at runtime before recovery has run;
    I can see the ICREATE log item in the log shortly before the
    actively recovered range of the log. This means the ICREATE was
    definitely created and written to the log, but for some reason the
    tail of the log has been moved past the ordered buffer log item that
    tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
    log moving past the ICREATE log item before the inode chunk buffer
    is written to disk.

    Tracing the fsstress processes that are running when the filesystem
    shut down immediately pin-pointed the problem:

    user shutdown marks xfs_mount as shutdown

             godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
    .....

    aild tries to push ordered inode cluster buffer

      xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
      xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae

    xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
    calls xfs_iflush_abort() to kill writeback of the inode.
    Inode is removed from AIL, drops cluster buffer reference.

      xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
      xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7

    .....

    All inodes on cluster buffer are aborted, then the cluster buffer
    itself is aborted and removed from the AIL *without writeback*:

      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
      xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL

    The inode buffer was at 7/20344 when it was removed from the AIL.

      xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
      xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39

    .....

    Userspace is still running, doing stuff. An fsstress process runs
    syncfs() or sync(), and we end up in sync_fs_one_sb(), which issues
    a log force. This pushes on the CIL:

            fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
               <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f

    And the CIL fills up iclogs with pending changes. This picks up
    the current tail from the AIL:

               <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
               <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
               <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
               <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448

    And it moves the tail of the log to 7/21000 from 7/20344. This
    *moves the tail of the log beyond the ICREATE transaction* that was
    at 7/20344 and pinned by the inode cluster buffer that was cancelled
    above.

    ....

             godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
              godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
              godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/

    And finally the log itself is now shutdown, stopping all further
    writes to the log. But this is too late to prevent the corruption
    caused by moving the tail of the log forwards after we start
    cancelling writeback.

    The fundamental problem here is that we are using the wrong shutdown
    checks for log items. We've long conflated mount shutdown with log
    shutdown state, and I started separating that recently with the
    atomic shutdown state changes in commit b36d4651e165 ("xfs: make
    forced shutdown processing atomic"). The changes in that commit
    series are directly responsible for being able to diagnose this
    issue because it clearly separated mount shutdown from log shutdown.

    Essentially, once we start cancelling writeback of log items and
    removing them from the AIL because the filesystem is shut down, we
    *cannot* update the journal because we may have cancelled the items
    that pin the tail of the log. That moves the tail of the log
    forwards without having written the metadata back, hence we have
    corrupt in memory state and writing to the journal propagates that
    to the on-disk state.

    What commit b36d4651e165 makes clear is that log item state needs to
    change relative to log shutdown, not mount shutdown. IOWs, anything
    that aborts metadata writeback needs to check log shutdown state
    because log items directly affect log consistency. Having them check
    mount shutdown state introduces the above race condition where we
    cancel metadata writeback before the log shuts down.

    To fix this, this patch works through all log items and converts
    shutdown checks to use xlog_is_shutdown() rather than
    xfs_is_shutdown(), so that we don't start aborting metadata
    writeback before we shut off journal writes.

    AFAICT, this race condition is a zero day IO error handling bug in
    XFS that dates back to the introduction of XLOG_IO_ERROR,
    XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster 21a25a1300 xfs: rename buffer cache index variable b_bn
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 4c7f65aea7b7fe66c08f8f7304c1ea3f7a871d5a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:48:54 2021 -0700

    xfs: rename buffer cache index variable b_bn

    To stop external users from using b_bn as the disk address of the
    buffer, rename it to b_rhash_key to indicate that it is the buffer
    cache index, not the block number of the buffer. Code that needs the
    disk address should use xfs_buf_daddr() to obtain it.

    Do the rename and clean up any of the remaining internal b_bn users.
    Also clean up any remaining b_bn cruft that is now unused.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:36 -04:00
Brian Foster 74f147b83b xfs: introduce xfs_buf_daddr()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 04fcad80cd068731a779fb442f78234732683755
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:57 2021 -0700

    xfs: introduce xfs_buf_daddr()

    Introduce a helper function xfs_buf_daddr() to extract the disk
    address of the buffer from the struct xfs_buf. This will replace
    direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as
    the XFS_BUF_ADDR() macro.

    This patch introduces the helper function and replaces all uses of
    XFS_BUF_ADDR() as this is just a simple sed replacement.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:36 -04:00
Brian Foster d179379de4 xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 75c8c50fa16a23f8ac89ea74834ae8ddd1558d75
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:53 2021 -0700

    xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown

    Remove the shouty macro and instead use the inline function that
    matches other state/feature check wrapper naming. This conversion
    was done with sed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster a672539203 xfs: convert remaining mount flags to state flags
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2e973b2cd4cdb993be94cca4c33f532f1ed05316
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert remaining mount flags to state flags

    The remaining mount flags kept in m_flags are actually runtime state
    flags. These change dynamically, so they really should be updated
    atomically so we don't potentially lose an update due to racing
    modifications.

    Convert these remaining flags to be stored in m_opstate and use
    atomic bitops to set and clear the flags. This also adds a couple of
    simple wrappers for common state checks - read only and shutdown.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster d54a790d1d xfs: replace xfs_sb_version checks with feature flag checks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 38c26bfd90e1999650d5ef40f90d721f05916643
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:37 2021 -0700

    xfs: replace xfs_sb_version checks with feature flag checks

    Convert the xfs_sb_version_hasfoo() to checks against
    mp->m_features. Checks of the superblock itself during disk
    operations (e.g. in the read/write verifiers and the to/from disk
    formatters) are not converted - they operate purely on the
    superblock state. Everything else should use the mount features.

    Large parts of this conversion were done with sed with commands like
    this:

    for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
            sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
    done

    With manual cleanups for things like "xfs_has_extflgbit" and other
    little inconsistencies in naming.

    The result is a lot less typing to check features and an XFS binary
    size reduced by a bit over 3kB:

    $ size -t fs/xfs/built-in.a
            text       data     bss     dec     hex filename
    before  1130866  311352     484 1442702  16038e (TOTALS)
    after   1127727  311352     484 1439563  15f74b (TOTALS)

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 7fe76aa101 xfs: remove kmem_alloc_io()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 98fe2c3cef21b784e2efd1d9d891430d95b4f073
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Aug 9 10:10:01 2021 -0700

    xfs: remove kmem_alloc_io()

    Since commit 59bb47985c ("mm, sl[aou]b: guarantee natural alignment
    for kmalloc(power-of-two)"), the core slab code now guarantees slab
    alignment in all situations sufficient for IO purposes (i.e. minimum
    of 512 byte alignment of >= 512 byte sized heap allocations), so we
    no longer need the workaround in the XFS code to provide this
    guarantee.

    Replace the use of kmem_alloc_io() with kmem_alloc() or
    kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
    interface altogether.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:24 -04:00
Ming Lei 794cb59448 block: pass a block_device and opf to bio_alloc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917
Conflicts: drop change on scsi/ufs/ufshpb.c & ntfs3, deal with
 trivial conflicts for iomap & erofs.

commit 07888c665b405b1cd3577ddebfeb74f4717a84c4
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jan 24 10:11:05 2022 +0100

    block: pass a block_device and opf to bio_alloc

    Pass the block_device and operation that we plan to use this bio for to
    bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
    passthrough case on a raw request_queue and to temporarily avoid
    refactoring some nasty code.

    Also move the gfp_mask argument after the nr_vecs argument for a much
    more logical calling convention matching what most of the kernel does.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:56:19 +08:00
Ming Lei 9ce4bd7cc3 block: remove the bd_bdi in struct block_device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit a11d7fc2d05fb509cd9e33d4093507d6eda3ad53
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Aug 9 16:17:44 2021 +0200

    block: remove the bd_bdi in struct block_device

    Just retrieve the bdi from the disk.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:36:56 +08:00
Dave Chinner b5071ada51 xfs: remove xfs_blkdev_issue_flush
It's a one line wrapper around blkdev_issue_flush(). Just replace it
with direct calls to blkdev_issue_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21 10:05:46 -07:00
Shaokun Zhang 9bb38aa080 xfs: remove redundant initialization of variable error
'error' will be initialized, so clean up the redundant initialization.

Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-18 08:14:31 -07:00
Darrick J. Wong c3eabd3650 xfs: initial agnumber -> perag conversions for shrink
If we want to use active references to the perag to be able to gate
 shrink removing AGs and hence perags safely, we've got a fair bit of
 work to do to actually use perags in all the places we need to.
 
 There's a lot of code that iterates ag numbers and then
 looks up perags from that, often multiple times for the same perag
 in the one operation. If we want to use reference counted perags for
 access control, then we need to convert all these uses to perag
 iterators, not agno iterators.
 
 [Patches 1-4]
 
 The first step of this is consolidating all the perag management -
 init, free, get, put, etc into a common location. This is spread all
 over the place right now, so move it all into libxfs/xfs_ag.[ch].
 This does expose kernel only bits of the perag to libxfs and hence
 userspace, so the structures and code are rearranged to minimise the
 number of ifdefs that need to be added to the userspace codebase.
 The perag iterator in xfs_icache.c is promoted to a first class API
 and expanded to the needs of the code as required.
 
 [Patches 5-10]
 
 These are the first basic perag iterator conversions and changes to
 pass the perag down the stack from those iterators where
 appropriate. A lot of this is obvious, simple changes, though in
 some places we stop passing the perag down the stack because the
 code enters into an as yet unconverted subsystem that still uses raw
 AGs.
 
 [Patches 11-16]
 
 These replace the agno passed in the btree cursor for per-ag btree
 operations with a perag that is passed to the cursor init function.
 The cursor takes its own reference to the perag, and the reference
 is dropped when the cursor is deleted. Hence we get reference
 coverage for the entire time the cursor is active, even if the code
 that initialised the cursor drops its reference before the cursor
 or any of its children (duplicates) have been deleted.
 
 The first patch adds the perag infrastructure for the cursor, the
 next four patches convert a btree cursor at a time, and the last
 removes the agno from the cursor once it is unused.
 
 [Patches 17-21]
 
 These patches are a demonstration of the simplifications and
 cleanups that come from plumbing the perag through interfaces that
 select and then operate on a specific AG. In this case the inode
 allocation algorithm does up to three walks across all AGs before it
 either allocates an inode or fails. Two of these walks are purely
 just to select the AG, and even then it doesn't guarantee inode
 allocation success so there's a third walk if the selected AG
 allocation fails.
 
 These patches collapse the selection and allocation into a single
 loop, simplify the error handling because xfs_dir_ialloc() always
 returns ENOSPC if no AG was selected for inode allocation or we fail
 to allocate an inode in any AG, get rid of the xfs_dir_ialloc()
 wrapper, convert inode allocation to run entirely from a single
 perag instance, and then factor xfs_dialloc() into a much, much
 simpler loop which is easy to understand.
 
 Hence we end up with the same inode allocation logic, but it only
 needs two complete iterations at worst, makes AG selection and
 allocation atomic w.r.t. shrink and chops out over 100 lines of
 code from this hot code path.
 
 [Patch 22]
 
 Converts the unlink path to pass perags through it.
 
 There's more conversion work to be done, but this patchset gets
 through a large chunk of it in one hit. Most of the iterators are
 converted, so once this is solidified we can move on to converting
 these to active references for being able to free perags while the
 fs is still active.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmC3HUgUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h2yaw/+P0JzpI+6n06Ei00mjgE/Du/WhMLi
 0JQ93Grlj+miuGGT9DgGCiRpoZnefhEk+BH6JqoEw1DQ3T5ilmAzrHLUUHSQC3+S
 dv85sJduheQ6yHuoO+4MzkaSq6JWKe7E9gZwAsVyBul5aSjdmaJaQdPwYMTXSXo0
 5Uqq8ECFkMcaHVNjcBfasgR/fdyWy2Qe4PFTHTHdQpd+DNZ9UXgFKHW2og+1iry/
 zDIvdIppJULA09TvVcZuFjd/1NzHQ/fLj5PAzz8GwagB4nz2x3s78Zevmo5yW/jK
 3/+50vXa8ldhiHDYGTS3QXvS0xJRyqUyD47eyWOOiojZw735jEvAlCgjX6+0X1HC
 k3gCkQLv8l96fRkvUpgnLf/fjrUnlCuNBkm9d1Eq2Tied8dvLDtiEzoC6L05Nqob
 yd/nIUb1zwJFa9tsoheHhn0bblTGX1+zP0lbRJBje0LotpNO9DjGX5JoIK4GR7F8
 y1VojcdgRI14HlxUnbF3p8wmQByN+M2tnp6GSdv9BA65bjqi05Rj/steFdZHBV6x
 wiRs8Yh6BTvMwKgufHhRQHfRahjNHQ/T/vOE+zNbWqemS9wtEUDop+KvPhC36R/k
 o/cmr23cF8ESX2eChk7XM4On3VEYpcvp2zSFgrFqZYl6RWOwEis3Htvce3KuSTPp
 8Xq70te0gr2DVUU=
 =YNzW
 -----END PGP SIGNATURE-----

Merge tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2

* tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits)
  xfs: remove xfs_perag_t
  xfs: use perag through unlink processing
  xfs: clean up and simplify xfs_dialloc()
  xfs: inode allocation can use a single perag instance
  xfs: get rid of xfs_dir_ialloc()
  xfs: collapse AG selection for inode allocation
  xfs: simplify xfs_dialloc_select_ag() return values
  xfs: remove agno from btree cursor
  xfs: use perag for ialloc btree cursors
  xfs: convert allocbt cursors to use perags
  xfs: convert refcount btree cursor to use perags
  xfs: convert rmap btree cursor to using a perag
  xfs: add a perag to the btree cursor
  xfs: pass perags around in fsmap data dev functions
  xfs: push perags through the ag reservation callouts
  xfs: pass perags through to the busy extent code
  xfs: convert secondary superblock walk to use perags
  xfs: convert xfs_iwalk to use perag references
  xfs: convert raw ag walks to use for_each_perag
  xfs: make for_each_perag... a first class citizen
  ...
2021-06-08 09:13:13 -07:00
Dave Chinner 8bcac7448a xfs: merge xfs_buf_allocate_memory
It only has one caller and is now a simple function, so merge it
into the caller.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:48 +10:00
Christoph Hellwig 170041f715 xfs: cleanup error handling in xfs_buf_get_map
Use a single goto label for freeing the buffer and returning an
error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2021-06-07 11:50:47 +10:00
Dave Chinner 289ae7b48c xfs: get rid of xb_to_gfp()
Only used in one place, so just open code the logic in the macro.
Based on a patch from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:17 +10:00
Christoph Hellwig 934d1076bb xfs: simplify the b_page_count calculation
Ever since we stopped using the Linux page cache to back XFS buffers,
there is no need to take the start sector into account for
calculating the number of pages in a buffer, as the data always
start from the beginning of the buffer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to suit this series]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:00 +10:00
Christoph Hellwig 54cd3aa6f8 xfs: remove ->b_offset handling for page backed buffers
->b_offset can only be non-zero for _XBF_KMEM backed buffers, so
remove all code dealing with it for page backed buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to fit this patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:49:50 +10:00