Commit Graph

229 Commits

Author SHA1 Message Date
CKI Backport Bot c549212983 xfs: fix freeing speculative preallocations for preallocated files
JIRA: https://issues.redhat.com/browse/RHEL-56816

commit 610b29161b0aa9feb59b78dc867553274f17fb01
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Jun 19 10:32:43 2024 -0700

    xfs: fix freeing speculative preallocations for preallocated files

    xfs_can_free_eofblocks returns false for files that have persistent
    preallocations unless the force flag is passed and there are delayed
    blocks.  This means it won't free delalloc reservations for files
    with persistent preallocations unless the force flag is set, and it
    will also free the persistent preallocations if the force flag is
    set and the file happens to have delayed allocations.

    Both of these are bad, so do away with the force flag and always free
    only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
    or APPEND flags set.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-12-17 14:02:09 +00:00
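The behavior change described above can be sketched as a toy predicate (a hypothetical Python model with made-up flag names, not the actual kernel code):

```python
# Toy model of the decision change in the commit message above; the
# function names and parameters are illustrative, not kernel code.

def can_free_eofblocks_old(prealloc_or_append, has_delalloc, force):
    # Old behavior: PREALLOC/APPEND files are skipped unless the force
    # flag is set and delayed blocks exist -- and in that case freeing
    # also removed the persistent preallocation itself.
    if prealloc_or_append:
        return force and has_delalloc
    return True

def free_scope_new(prealloc_or_append):
    # New behavior: no force flag; PREALLOC/APPEND files only ever lose
    # post-EOF *delayed* allocations, never the persistent preallocation.
    return "delalloc only" if prealloc_or_append else "all post-EOF blocks"
```

The old predicate either leaked delalloc reservations (force clear) or freed too much (force set); the new scope rule avoids both.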
Brian Foster 5a43e91669 xfs: don't free cowblocks from under dirty pagecache on unshare
JIRA: https://issues.redhat.com/browse/RHEL-64959

commit 4390f019ad7866c3791c3d768d2ff185d89e8ebe
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Sep 6 07:40:51 2024 -0400

    xfs: don't free cowblocks from under dirty pagecache on unshare

    fallocate unshare mode explicitly breaks extent sharing. When a
    command completes, it checks the data fork for any remaining shared
    extents to determine whether the reflink inode flag and COW fork
    preallocation can be removed. This logic doesn't consider in-core
    pagecache and I/O state, however, which means we can unsafely remove
    COW fork blocks that are still needed under certain conditions.

    For example, consider the following command sequence:

    xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
            -c "pwrite 0 32k" -c "funshare 0 1k" <file>

    This allocates a data block at offset 0, shares it, and then
    overwrites it with a larger buffered write. The overwrite triggers
    COW fork preallocation, 32 blocks by default, which maps the entire
    32k write to delalloc in the COW fork. All but the shared block at
    offset 0 remain hole mapped in the data fork. The unshare command
    redirties and flushes the folio at offset 0, removing the only
    shared extent from the inode. Since the inode no longer maps shared
    extents, unshare purges the COW fork before the remaining 28k has
    been written back.

    This leaves dirty pagecache backed by holes, which writeback quietly
    skips, thus leaving clean, non-zeroed pagecache over holes in the
    file. To verify, fiemap shows holes in the first 32k of the file and
    reads return different data across a remount:

    $ xfs_io -c "fiemap -v" <file>
    <file>:
     EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
       ...
       1: [8..511]:        hole               504
       ...
    $ xfs_io -c "pread -v 4k 8" <file>
    00001000:  cd cd cd cd cd cd cd cd  ........
    $ umount <mnt>; mount <dev> <mnt>
    $ xfs_io -c "pread -v 4k 8" <file>
    00001000:  00 00 00 00 00 00 00 00  ........

    To avoid this problem, make unshare follow the same rules used for
    background cowblock scanning and never purge the COW fork for inodes
    with dirty pagecache or in-flight I/O.

    Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Carlos Maiolino <cem@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2024-11-12 09:52:39 -05:00
Brian Foster 07934c215b xfs: skip background cowblock trims on inodes open for write
JIRA: https://issues.redhat.com/browse/RHEL-64959

commit 90a71daaf73f5d39bb0cbb3c7ab6af942fe6233e
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Sep 3 08:47:13 2024 -0400

    xfs: skip background cowblock trims on inodes open for write

    The background blockgc scanner runs on a 5m interval by default and
    trims preallocation (post-eof and cow fork) from inodes that are
    otherwise idle. Idle effectively means that iolock can be acquired
    without blocking and that the inode has no dirty pagecache or I/O in
    flight.

    This simple mechanism and heuristic has worked fairly well for
    post-eof speculative preallocations. Support for reflink and COW
    fork preallocations came sometime later and plugged into the same
    mechanism, with similar heuristics. Some recent testing has shown
    that COW fork preallocation may be notably more sensitive to blockgc
    processing than post-eof preallocation, however.

    For example, consider an 8GB reflinked file with a COW extent size
    hint of 1MB. A worst case fully randomized overwrite of this file
    results in ~8k extents of an average size of ~1MB. If the same
    workload is interrupted a couple times for blockgc processing
    (assuming the file goes idle), the resulting extent count explodes
    to over 100k extents with an average size <100kB. This is
    significantly worse than ideal and essentially defeats the COW
    extent size hint mechanism.

    While this particular test is instrumented, it reflects a fairly
    reasonable pattern in practice where random I/Os might spread out
    over a large period of time with varying periods of (in)activity.
    For example, consider a cloned disk image file for a VM or container
    with long uptime and variable and bursty usage. A background blockgc
    scan that races and processes the image file when it happens to be
    clean and idle can have a significant effect on the future
    fragmentation level of the file, even when still in use.

    To help combat this, update the heuristic to skip cowblocks inodes
    that are currently opened for write access during non-sync blockgc
    scans. This allows COW fork preallocations to persist for as long as
    possible unless otherwise needed for functional purposes (i.e. a
    sync scan), the file is idle and closed, or the inode is being
    evicted from cache. While here, update the comments to help
    distinguish performance oriented heuristics from the logic that
    exists to maintain functional correctness.

    Suggested-by: Darrick Wong <djwong@kernel.org>
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Carlos Maiolino <cem@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2024-11-12 09:52:39 -05:00
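The updated heuristic can be summarized as a small decision function (an illustrative Python sketch with invented parameter names, loosely modeling the checks in the kernel's blockgc scan rather than reproducing them):

```python
def should_trim_cowblocks(sync_scan, open_for_write,
                          dirty_pagecache, io_in_flight):
    # Functional correctness: a sync scan must always be allowed to
    # process the inode.
    if sync_scan:
        return True
    # Performance heuristics for background scans: leave COW fork
    # preallocation alone while the file has dirty pagecache, I/O in
    # flight, or -- the new check added here -- is still open for write.
    if dirty_pagecache or io_in_flight:
        return False
    if open_for_write:
        return False
    return True
```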
Bill O'Donnell 846cf95c59 xfs: hide xfs_inode_is_allocated in scrub common code
JIRA: https://issues.redhat.com/browse/RHEL-57114

commit 0d2966345364ff1de74020ff280970a43e9849cc
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Thu Aug 10 07:48:12 2023 -0700

    xfs: hide xfs_inode_is_allocated in scrub common code

    This function is only used by online fsck, so let's move it there.
    In the next patch, we'll fix it to work properly and to require that the
    caller hold the AGI buffer locked.  No major changes aside from
    adjusting the signature a bit.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-10-15 10:46:29 -05:00
Lucas Zampieri 13f1b14fe0 Merge: xfs: load uncached unlinked inodes into memory on demand
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4514

JIRA: https://issues.redhat.com/browse/RHEL-7990

Signed-off-by: Pavel Reichl <preichl@redhat.com>

Approved-by: Bill O'Donnell <bodonnel@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-07-09 14:03:15 +00:00
Pavel Reichl 875ab81f58 xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
JIRA: https://issues.redhat.com/browse/RHEL-7990

Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit f12b96683d6976a3a07fdf3323277c79dbe8f6ab)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2024-06-19 23:41:18 +02:00
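The distinction the commit describes can be sketched with a toy model (the sentinel values below are hypothetical, chosen for illustration; the actual field and constants live in the XFS inode code):

```python
NULLAGINO = 0xFFFFFFFF   # "no previous inode": head of an AGI bucket
NOT_ON_LIST = 1 << 32    # hypothetical distinct marker: not on any list

class ToyInode:
    def __init__(self):
        self.i_nlink = 0
        # A freshly loaded inode starts out marked as not on the list.
        self.i_prev_unlinked = NOT_ON_LIST

    def on_unlinked_list(self):
        # i_nlink == 0 alone is not enough: nlink is updated under
        # ILOCK_EXCL only, while list membership changes also require
        # the AGI buffer lock, so the two can be briefly out of sync.
        return self.i_prev_unlinked != NOT_ON_LIST
```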
Bill O'Donnell 705fe576ae xfs: fix an inode lookup race in xchk_get_inode
JIRA: https://issues.redhat.com/browse/RHEL-25419

commit 302436c27c3fc61c1dab83f4c995dec12eb43161
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Apr 11 19:00:21 2023 -0700

    xfs: fix an inode lookup race in xchk_get_inode

    In commit d658e, we tried to improve the robustness of xchk_get_inode in
    the face of EINVAL returns from iget by calling xfs_imap to see if the
    inobt itself thinks that the inode is allocated.  Unfortunately, that
    commit didn't consider the possibility that the inode gets allocated
    after iget but before imap.  In this case, the imap call will succeed,
    but we turn that into a corruption error and tell userspace the inode is
    corrupt.

    Avoid this false corruption report by grabbing the AGI header and
    retrying the iget before calling imap.  If the iget succeeds, we can
    proceed with the usual scrub-by-handle code.  Fix all the incorrect
    comments too, since unreadable/corrupt inodes no longer result in EINVAL
    returns.

    Fixes: d658e72b4a ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-05 16:56:25 -05:00
Bill O'Donnell 28e7f5db46 xfs: use per-mount cpumask to track nonempty percpu inodegc lists
JIRA: https://issues.redhat.com/browse/RHEL-15844

commit 62334fab47621dd91ab30dd5bb6c43d78a8ec279
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Sep 11 08:39:03 2023 -0700

    xfs: use per-mount cpumask to track nonempty percpu inodegc lists

    Directly track which CPUs have contributed to the inodegc percpu lists
    instead of trusting the cpu online mask.  This eliminates a theoretical
    problem where the inodegc flush functions might fail to flush a CPU's
    inodes if that CPU happened to be dying at exactly the same time.  Most
    likely nobody's noticed this because the CPU dead hook moves the percpu
    inodegc list to another CPU and schedules that worker immediately.  But
    it's quite possible that this is a subtle race leading to UAF if the
    inodegc flush were part of an unmount.

    Further benefits: This reduces the overhead of the inodegc flush code
    slightly by allowing us to ignore CPUs that have empty lists.  Better
    yet, it reduces our dependence on the cpu online masks, which have been
    the cause of confusion and drama lately.

    Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-19 16:21:09 -06:00
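The idea can be sketched with a toy queue that keeps its own nonempty mask (illustrative Python, not the kernel's percpu/llist machinery):

```python
class ToyInodegc:
    # Track which per-CPU lists have been added to in a mask of our
    # own, instead of trusting an external "online CPUs" mask.
    def __init__(self, ncpus):
        self.lists = [[] for _ in range(ncpus)]
        self.nonempty_mask = 0

    def queue(self, cpu, inode):
        self.lists[cpu].append(inode)
        self.nonempty_mask |= 1 << cpu

    def flush(self):
        # Only visit CPUs whose bit is set; a CPU that went offline
        # after queueing is still flushed because its bit remains set.
        flushed = []
        mask, self.nonempty_mask = self.nonempty_mask, 0
        cpu = 0
        while mask:
            if mask & 1:
                flushed.extend(self.lists[cpu])
                self.lists[cpu].clear()
            mask >>= 1
            cpu += 1
        return flushed
```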
Bill O'Donnell 14fa9c73d2 xfs: check that per-cpu inodegc workers actually run on that cpu
JIRA: https://issues.redhat.com/browse/RHEL-15844

Conflicts: diff in xfs_super.c due to previous out of order patch

commit b37c4c8339cd394ea6b8b415026603320a185651
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue May 2 09:16:12 2023 +1000

    xfs: check that per-cpu inodegc workers actually run on that cpu

    Now that we've allegedly worked out the problem of the per-cpu inodegc
    workers being scheduled on the wrong cpu, let's put in a debugging knob
    to let us know if a worker ever gets mis-scheduled again.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-19 16:21:08 -06:00
Bill O'Donnell 19fab0b814 xfs: collect errors from inodegc for unlinked inode recovery
JIRA: https://issues.redhat.com/browse/RHEL-2002

Conflicts: context differences due to out of order patch application

commit d4d12c02bf5f768f1b423c7ae2909c5afdfe0d5f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 14:48:15 2023 +1000

    xfs: collect errors from inodegc for unlinked inode recovery

    Unlinked list recovery requires that errors removing the inode from
    the unlinked list be fed back to the main recovery loop. Now that
    we offload the unlinking to the inodegc work, we don't get errors
    being fed back when we trip over a corruption that prevents the
    inode from being removed from the unlinked list.

    This means we never clear the corrupt unlinked list bucket,
    resulting in runtime operations eventually tripping over it and
    shutting down.

    Fix this by collecting inodegc worker errors and feeding them
    back to the flush caller. This is largely best effort - the only
    context that really cares is log recovery, and it only flushes a
    single inode at a time so we don't need complex synchronised
    handling. Essentially the inodegc workers will capture the first
    error that occurs and the next flush will gather them and clear
    them. The flush itself will only report the first gathered error.

    In the cases where callers can return errors, propagate the
    collected inodegc flush error up the error handling chain.

    In the case of inode unlinked list recovery, there are several
    superfluous calls to flush queued unlinked inodes -
    xlog_recover_iunlink_bucket() guarantees that it has flushed the
    inodegc and collected errors before it returns. Hence nothing in the
    calling path needs to run a flush, even when an error is returned.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:27 -06:00
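The collection scheme can be modeled as follows (a simplified single-threaded Python sketch; the kernel version uses per-CPU state and real errno values):

```python
class ToyErrorCollector:
    def __init__(self, ncpus):
        self.errors = [0] * ncpus

    def worker_record(self, cpu, err):
        # Each worker captures only the first error it hits.
        if err and not self.errors[cpu]:
            self.errors[cpu] = err

    def flush(self):
        # The next flush gathers and clears all recorded errors and
        # reports only the first one gathered.
        first = 0
        for cpu in range(len(self.errors)):
            if self.errors[cpu] and not first:
                first = self.errors[cpu]
            self.errors[cpu] = 0
        return first
```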
Bill O'Donnell 06467729c8 xfs: fix xfs_inodegc_stop racing with mod_delayed_work
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 2254a7396a0ca6309854948ee1c0a33fa4268cec
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue May 2 09:16:14 2023 +1000

    xfs: fix xfs_inodegc_stop racing with mod_delayed_work

    syzbot reported this warning from the faux inodegc shrinker that tries
    to kick off inodegc work:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
    RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
    Call Trace:
     __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
     mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
     xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
     xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
     do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
     shrink_slab+0x175/0x660 mm/vmscan.c:1013
     shrink_one+0x502/0x810 mm/vmscan.c:5343
     shrink_many mm/vmscan.c:5394 [inline]
     lru_gen_shrink_node mm/vmscan.c:5511 [inline]
     shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
     kswapd_shrink_node mm/vmscan.c:7262 [inline]
     balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
     kswapd+0x677/0xd60 mm/vmscan.c:7712
     kthread+0x2e8/0x3a0 kernel/kthread.c:376
     ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

    This warning corresponds to this code in __queue_work:

            /*
             * For a draining wq, only works from the same workqueue are
             * allowed. The __WQ_DESTROYING helps to spot the issue that
             * queues a new work item to a wq after destroy_workqueue(wq).
             */
            if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
                         WARN_ON_ONCE(!is_chained_work(wq))))
                    return;

    For this to trip, we must have a thread draining the inodegc workqueue
    and a second thread trying to queue inodegc work to that workqueue.
    This can happen if freezing or a ro remount race with reclaim poking our
    faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
    file:

    Thread 0        Thread 1        Thread 2

    xfs_inodegc_stop

                                    xfs_inodegc_shrinker_scan
                                    xfs_is_inodegc_enabled
                                    <yes, will continue>

    xfs_clear_inodegc_enabled
    xfs_inodegc_queue_all
    <list empty, do not queue inodegc worker>

                    xfs_inodegc_queue
                    <add to list>
                    xfs_is_inodegc_enabled
                    <no, returns>

    drain_workqueue
    <set WQ_DRAINING>

                                    llist_empty
                                    <no, will queue list>
                                    mod_delayed_work_on(..., 0)
                                    __queue_work
                                    <sees WQ_DRAINING, kaboom>

    In other words, everything between the access to inodegc_enabled state
    and the decision to poke the inodegc workqueue requires some kind of
    coordination to avoid the WQ_DRAINING state.  We could perhaps introduce
    a lock here, but we could also try to eliminate WQ_DRAINING from the
    picture.

    We could replace the drain_workqueue call with a loop that flushes the
    workqueue and queues workers as long as there is at least one inode
    present in the per-cpu inodegc llists.  We've disabled inodegc at this
    point, so we know that the number of queued inodes will eventually hit
    zero as long as xfs_inodegc_start cannot reactivate the workers.

    There are four callers of xfs_inodegc_start.  Three of them come from the
    VFS with s_umount held: filesystem thawing, failed filesystem freezing,
    and the rw remount transition.  The fourth caller is mounting rw (no
    remount or freezing possible).

    There are three callers of xfs_inodegc_stop.  One is unmounting (no
    remount or thaw possible).  Two of them come from the VFS with s_umount
    held: fs freezing and ro remount transition.

    Hence, it is correct to replace the drain_workqueue call with a loop
    that drains the inodegc llists.

    Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:26 -06:00
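The replacement for drain_workqueue() can be sketched as a loop (an illustrative Python model; the kernel flushes the workqueue while any per-CPU llist is nonempty):

```python
def stop_inodegc(queues):
    # queues: per-CPU lists of pending inodes. inodegc has already been
    # disabled, so nothing new arrives once the lists empty out, but a
    # racing thread may still have added one last batch. Rather than
    # drain_workqueue() (which warns if anything queues work while
    # draining), keep flushing until every list is observed empty.
    processed = []
    while any(queues):
        for q in queues:
            processed.extend(q)
            q.clear()
    return processed
```

Because xfs_inodegc_start cannot reactivate the workers here, the queued-inode count is guaranteed to reach zero and the loop terminates.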
Bill O'Donnell 099de11f8b xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 03e0add80f4cf3f7393edb574eeb3a89a1db7758
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue May 2 09:16:05 2023 +1000

    xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately

    I've been noticing odd racing behavior in the inodegc code that could
    only be explained by one cpu adding an inode to its inactivation llist
    at the same time that another cpu is processing that cpu's llist.
    Preemption is disabled between get/put_cpu_ptr, so the only explanation
    is scheduler mayhem.  I inserted the following debug code into
    xfs_inodegc_worker (see the next patch):

            ASSERT(gc->cpu == smp_processor_id());

    This assertion tripped during overnight tests on the arm64 machines, but
    curiously not on x86_64.  I think we haven't observed any resource leaks
    here because the lockfree list code can handle simultaneous llist_add
    and llist_del_all functions operating on the same list.  However, the
    whole point of having percpu inodegc lists is to take advantage of warm
    memory caches by inactivating inodes on the last processor to touch the
    inode.

    The incorrect scheduling seems to occur after an inodegc worker is
    subjected to mod_delayed_work().  This wraps mod_delayed_work_on with
    WORK_CPU_UNBOUND specified as the cpu number.  Unbound allows for
    scheduling on any cpu, not necessarily the same one that scheduled the
    work.

    Because preemption is disabled for as long as we have the gc pointer, I
    think it's safe to use current_cpu() (aka smp_processor_id) to queue the
    delayed work item on the correct cpu.

    Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:26 -06:00
Bill O'Donnell e4f1581c61 xfs: convert xfs_imap() to take a perag
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 498f0adbcdb6a68403bfb9645a7555b789a7fee4
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:52 2023 +1100

    xfs: convert xfs_imap() to take a perag

    Callers have referenced perags but they don't pass them into
    xfs_imap(), so it takes its own reference. Fix that so we can change
    inode allocation over to using active references.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:21 -06:00
Bill O'Donnell 75f0e4d3c9 xfs: rework the perag trace points to be perag centric
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 368e2d09b41caa5b44a61bb518c362f46d6d615c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:52 2023 +1100

    xfs: rework the perag trace points to be perag centric

    So that they all output the same information in the traces to make
    debugging refcount issues easier.

    This means that all the lookup/drop functions no longer need to use
    the full memory barrier atomic operations (atomic*_return()) so
    will have less overhead when tracing is off. The set/clear tag
    tracepoints no longer abuse the reference count to pass the tag -
    the tag being cleared is obvious from the _RET_IP_ that is recorded
    in the trace point.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
Bill O'Donnell ea5491aa47 xfs: active perag reference counting
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit c4d5660afbdcd3f0fa3bbf563e059511fba8445f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:42 2023 +1100

    xfs: active perag reference counting

    We need to be able to dynamically remove instantiated AGs from
    memory safely, either for shrinking the filesystem or paging AG
    state in and out of memory (e.g. supporting millions of AGs). This
    means we need to be able to safely exclude operations from accessing
    perags while dynamic removal is in progress.

    To do this, introduce the concept of active and passive references.
    Active references are required for high level operations that make
    use of an AG for a given operation (e.g. allocation) and pin the
    perag in memory for the duration of the operation that is operating
    on the perag (e.g. transaction scope). This means we can fail to get
    an active reference to an AG, hence callers of the new active
    reference API must be able to handle lookup failure gracefully.

    Passive references are used in low level code, where we might need
    to access the perag structure for the purposes of completing high
    level operations. For example, buffers need to use passive
    references because:
    - we need to be able to do metadata IO during operations like grow
      and shrink transactions where high level active references to the
      AG have already been blocked
    - buffers need to pin the perag until they are reclaimed from
      memory, something that high level code has no direct control over.
    - unused cached buffers should not prevent a shrink from being
      started.

    Hence we have active references that will form exclusion barriers
    for operations to be performed on an AG, and passive references that
    will prevent reclaim of the perag until all objects with passive
    references have been reclaimed themselves.

    This patch introduces xfs_perag_grab()/xfs_perag_rele() as the API
    for active AG reference functionality. We also need to convert the
    for_each_perag*() iterators to use active references, which will
    start the process of converting high level code over to using active
    references. Conversion of non-iterator based code to active
    references will be done in followup patches.

    Note that the implementation using reference counting is really just
    a development vehicle for the API to ensure we don't have any leaks
    in the callers. Once we need to remove perag structures from memory
    dynamically, we will need a much more robust per-ag state transition
    mechanism for preventing new references from being taken while we
    wait for existing references to drain before removal from memory can
    occur....

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
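The active/passive split can be sketched as a toy reference counter (illustrative Python; the commit names xfs_perag_grab()/xfs_perag_rele() as the active-reference API, while the passive side here is a simplified stand-in):

```python
class ToyPerag:
    def __init__(self):
        self.active = 0
        self.passive = 0
        self.removing = False

    def grab(self):
        # Active reference: fails once removal has begun, so callers
        # must handle lookup failure gracefully.
        if self.removing:
            return False
        self.active += 1
        return True

    def rele(self):
        self.active -= 1

    def get(self):
        # Passive reference: always succeeds; it only delays freeing of
        # the structure itself (e.g. held by cached buffers).
        self.passive += 1

    def put(self):
        self.passive -= 1

    def begin_removal(self):
        # Refuse new active references; removal may proceed only once
        # existing active users have drained.
        self.removing = True
        return self.active == 0
```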
Bill O'Donnell f5e747f4cb xfs: Fix deadlock on xfs_inodegc_worker
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 4da112513c01d7d0acf1025b8764349d46e177d6
Author: Wu Guanghao <wuguanghao3@huawei.com>
Date:   Tue Dec 27 09:41:30 2022 -0800

    xfs: Fix deadlock on xfs_inodegc_worker

    We were running a test that deletes a large number of files when
    memory is low, and found a deadlock.

    [ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
    [ 1240.280450]        lock_acquire+0x197/0x460
    [ 1240.281548]        fs_reclaim_acquire.part.0+0x20/0x30
    [ 1240.282625]        kmem_cache_alloc+0x2b/0x940
    [ 1240.283816]        xfs_trans_alloc+0x8a/0x8b0
    [ 1240.284757]        xfs_inactive_ifree+0xe4/0x4e0
    [ 1240.285935]        xfs_inactive+0x4e9/0x8a0
    [ 1240.286836]        xfs_inodegc_worker+0x160/0x5e0
    [ 1240.287969]        process_one_work+0xa19/0x16b0
    [ 1240.289030]        worker_thread+0x9e/0x1050
    [ 1240.290131]        kthread+0x34f/0x460
    [ 1240.290999]        ret_from_fork+0x22/0x30
    [ 1240.291905]
    [ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
    [ 1240.293569]        check_prev_add+0x160/0x2490
    [ 1240.294473]        __lock_acquire+0x2c4d/0x5160
    [ 1240.295544]        lock_acquire+0x197/0x460
    [ 1240.296403]        __flush_work+0x6bc/0xa20
    [ 1240.297522]        xfs_inode_mark_reclaimable+0x6f0/0xdc0
    [ 1240.298649]        destroy_inode+0xc6/0x1b0
    [ 1240.299677]        dispose_list+0xe1/0x1d0
    [ 1240.300567]        prune_icache_sb+0xec/0x150
    [ 1240.301794]        super_cache_scan+0x2c9/0x480
    [ 1240.302776]        do_shrink_slab+0x3f0/0xaa0
    [ 1240.303671]        shrink_slab+0x170/0x660
    [ 1240.304601]        shrink_node+0x7f7/0x1df0
    [ 1240.305515]        balance_pgdat+0x766/0xf50
    [ 1240.306657]        kswapd+0x5bd/0xd20
    [ 1240.307551]        kthread+0x34f/0x460
    [ 1240.308346]        ret_from_fork+0x22/0x30
    [ 1240.309247]
    [ 1240.309247] other info that might help us debug this:
    [ 1240.309247]
    [ 1240.310944]  Possible unsafe locking scenario:
    [ 1240.310944]
    [ 1240.312379]        CPU0                    CPU1
    [ 1240.313363]        ----                    ----
    [ 1240.314433]   lock(fs_reclaim);
    [ 1240.315107]                                lock((work_completion)(&gc->work));
    [ 1240.316828]                                lock(fs_reclaim);
    [ 1240.318088]   lock((work_completion)(&gc->work));
    [ 1240.319203]
    [ 1240.319203]  *** DEADLOCK ***
    ...
    [ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
    [ 2438.432089] Call Trace:
    [ 2438.432562]  __schedule+0xa94/0x1d20
    [ 2438.435787]  schedule+0xbf/0x270
    [ 2438.436397]  schedule_timeout+0x6f8/0x8b0
    [ 2438.445126]  wait_for_completion+0x163/0x260
    [ 2438.448610]  __flush_work+0x4c4/0xa40
    [ 2438.455011]  xfs_inode_mark_reclaimable+0x6ef/0xda0
    [ 2438.456695]  destroy_inode+0xc6/0x1b0
    [ 2438.457375]  dispose_list+0xe1/0x1d0
    [ 2438.458834]  prune_icache_sb+0xe8/0x150
    [ 2438.461181]  super_cache_scan+0x2b3/0x470
    [ 2438.461950]  do_shrink_slab+0x3cf/0xa50
    [ 2438.462687]  shrink_slab+0x17d/0x660
    [ 2438.466392]  shrink_node+0x87e/0x1d40
    [ 2438.467894]  do_try_to_free_pages+0x364/0x1300
    [ 2438.471188]  try_to_free_pages+0x26c/0x5b0
    [ 2438.473567]  __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
    [ 2438.482577]  __alloc_pages+0x5db/0x710
    [ 2438.485231]  alloc_pages+0x100/0x200
    [ 2438.485923]  allocate_slab+0x2c0/0x380
    [ 2438.486623]  ___slab_alloc+0x41f/0x690
    [ 2438.490254]  __slab_alloc+0x54/0x70
    [ 2438.491692]  kmem_cache_alloc+0x23e/0x270
    [ 2438.492437]  xfs_trans_alloc+0x88/0x880
    [ 2438.493168]  xfs_inactive_ifree+0xe2/0x4e0
    [ 2438.496419]  xfs_inactive+0x4eb/0x8b0
    [ 2438.497123]  xfs_inodegc_worker+0x16b/0x5e0
    [ 2438.497918]  process_one_work+0xbf7/0x1a20
    [ 2438.500316]  worker_thread+0x8c/0x1060
    [ 2438.504938]  ret_from_fork+0x22/0x30

    When memory is insufficient, xfs_inodegc_worker can trigger memory
    reclamation while allocating memory; reclaim may then call flush_work()
    to wait for the work to complete. This causes a deadlock.

    So use memalloc_nofs_save() to avoid triggering memory reclamation in
    xfs_inodegc_worker.

    Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:18 -06:00
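The deadlock and the fix can be modeled in a few lines (a toy Python model; in the kernel, the fix brackets the worker body with memalloc_nofs_save()/memalloc_nofs_restore() so allocations in that scope skip filesystem reclaim):

```python
class ToyInodegcWorker:
    def __init__(self):
        self.in_worker = False

    def reclaim(self):
        # fs reclaim may flush_work() on pending inodegc work; if the
        # caller *is* the inodegc worker, it waits on itself forever.
        if self.in_worker:
            raise RuntimeError("deadlock: worker flushing itself")

    def alloc_transaction(self, nofs_scope):
        # Under memory pressure an allocation can enter fs reclaim --
        # unless a NOFS scope marks filesystem reclaim as forbidden.
        if not nofs_scope:
            self.reclaim()
        return "tp"

    def worker(self, nofs_scope):
        self.in_worker = True
        try:
            return self.alloc_transaction(nofs_scope)
        finally:
            self.in_worker = False
```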
Chris von Recklinghausen 1f619343f6 treewide: use get_random_u32() when possible
Conflicts:
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c - We already have
		ce28ab1380e8 ("drm/tests: Add back seed value information")
		so keep calls to kunit_info.
	drop changes to drivers/misc/habanalabs/gaudi2/gaudi2.c
		fs/ntfs3/fslog.c - files not in CS9
	net/sunrpc/auth_gss/gss_krb5_wrap.c - We already have
		7f675ca7757b ("SUNRPC: Improve Kerberos confounder generation")
		so code to change is gone.
	drivers/gpu/drm/i915/i915_gem_gtt.c
	drivers/gpu/drm/i915/selftests/i915_selftest.c
	drivers/gpu/drm/tests/drm_buddy_test.c
	drivers/gpu/drm/tests/drm_mm_test.c
		change added under
		4cb818386e ("Merge DRM changes from upstream v6.0.8..v6.1")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a251c17aa558d8e3128a528af5cf8b9d7caae4fd
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 17:43:22 2022 +0200

    treewide: use get_random_u32() when possible

    The prandom_u32() function has been a deprecated inline wrapper around
    get_random_u32() for several releases now, and compiles down to the
    exact same code. Replace the deprecated wrapper with a direct call to
    the real function. The same also applies to get_random_int(), which is
    just a wrapper around get_random_u32(). This was done as a basic find
    and replace.
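The "basic find and replace" can be sketched with sed on a demo file; the upstream change may well have been done with Coccinelle or by hand, so this is only an illustration of the mechanical substitution:

```shell
# Demo file standing in for a source tree (hypothetical content).
printf 'u32 a = prandom_u32();\nint b = get_random_int();\n' > demo.c

# Replace both deprecated wrappers with the real function.
sed -i -e 's/prandom_u32()/get_random_u32()/g' \
       -e 's/get_random_int()/get_random_u32()/g' demo.c

cat demo.c
```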

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
    Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
    Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Acked-by: Helge Deller <deller@gmx.de> # for parisc
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:03 -04:00
Bill O'Donnell 3504e82d36 xfs: fix incorrect i_nlink caused by inode racing
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 28b4b0596343d19d140da059eee0e5c2b5328731
Author: Long Li <leo.lilong@huawei.com>
Date:   Thu Nov 17 13:02:56 2022 -0800

    xfs: fix incorrect i_nlink caused by inode racing

    The following error occurred during the fsstress test:

    XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452

    The problem was that inode race condition causes incorrect i_nlink to be
    written to disk, and then it is read into memory. Consider the following
    call graph, inodes that are marked as both XFS_IFLUSHING and
    XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
    value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
    may be set to 1.

      xfsaild
          xfs_inode_item_push
              xfs_iflush_cluster
                  xfs_iflush
                      xfs_inode_to_disk

      xfs_iget
          xfs_iget_cache_hit
              xfs_iget_recycle
                  xfs_reinit_inode
                      inode_init_always

    xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
    inode state and can race with other RCU protected inode lookups. On the
    read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
    ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
    racing inode updates (during transactions) by that lock.

    Fixes: ff7bebeb91 ("xfs: refactor the inode recycling code") # goes further back than this
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:52 -05:00
Bill O'Donnell ba8109db31 xfs: don't leak memory when attr fork loading fails
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c78c2d0903183a41beb90c56a923e30f90fa91b9
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Jul 19 09:14:55 2022 -0700

    xfs: don't leak memory when attr fork loading fails

    I observed the following evidence of a memory leak while running xfs/399
    from the xfs fsck test suite (edited for brevity):

    XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
    XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
    CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
    RIP: 0010:assfail+0x46/0x4a [xfs]
    Call Trace:
     <TASK>
     xfs_ifork_zap_attr+0x7c/0xb0
     xfs_iformat_attr_fork+0x86/0x110
     xfs_inode_from_disk+0x41d/0x480
     xfs_iget+0x389/0xd70
     xfs_bulkstat_one_int+0x5b/0x540
     xfs_bulkstat_iwalk+0x1e/0x30
     xfs_iwalk_ag_recs+0xd1/0x160
     xfs_iwalk_run_callbacks+0xb9/0x180
     xfs_iwalk_ag+0x1d8/0x2e0
     xfs_iwalk+0x141/0x220
     xfs_bulkstat+0x105/0x180
     xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
     xfs_file_ioctl+0xa5f/0xef0
     __x64_sys_ioctl+0x82/0xa0
     do_syscall_64+0x2b/0x80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0

    This newly-added assertion checks that there aren't any incore data
    structures hanging off the incore fork when we're trying to reset its
    contents.  From the call trace, it is evident that iget was trying to
    construct an incore inode from the ondisk inode, but the attr fork
    verifier failed and we were trying to undo all the memory allocations
    that we had done earlier.

    The three assertions in xfs_ifork_zap_attr check that the caller has
    already called xfs_idestroy_fork, which clearly has not been done here.
    As the zap function then zeroes the pointers, we've effectively leaked
    the memory.

    The shortest change would have been to insert an extra call to
    xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
    call into _zap_attr, since all other callsites call _idestroy_fork
    immediately prior to calling _zap_attr.  IOWs, it eliminates one way to
    fail.

    Note: This change only applies cleanly to 2ed5b09b3e8f, since we just
    reworked the attr fork lifetime.  However, I think this memory leak has
    existed since 0f45a1b20c, since the chain xfs_iformat_attr_fork ->
    xfs_iformat_local -> xfs_init_local_fork will allocate
    ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
    xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
    hanging off i_afp.  The solution for older kernels I think is to add the
    missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.

    Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.

    Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
    Probably-Fixes: 0f45a1b20c ("xfs: improve local fork verification")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:48 -05:00
Bill O'Donnell 5b5b4424f2 xfs: add log item precommit operation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit fad743d7cd8bd92d03c09e71f29eace860f50415
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:47:26 2022 +1000

    xfs: add log item precommit operation

    For inodes that are dirty, we have an attached cluster buffer that
    we want to use to track the dirty inode through the AIL.
    Unfortunately, locking the cluster buffer and adding it to the
    transaction when the inode is first logged in a transaction leads to
    buffer lock ordering inversions.

    The specific problem is ordering against the AGI buffer. When
    modifying unlinked lists, the buffer lock order is AGI -> inode
    cluster buffer as the AGI buffer lock serialises all access to the
    unlinked lists. Unfortunately, functionality like xfs_droplink()
    logs the inode before calling xfs_iunlink(), as do various directory
    manipulation functions. The inode can be logged way down in the
    stack as far as the bmapi routines and hence, without a major
    rewrite of lots of APIs there's no way we can avoid the inode being
    logged by something until after the AGI has been logged.

    As we are going to be using ordered buffers for inode AIL tracking,
    there isn't a need to actually lock that buffer against modification
    as all the modifications are captured by logging the inode item
    itself. Hence we don't actually need to join the cluster buffer into
    the transaction until just before it is committed. This means we do
    not perturb any of the existing buffer lock orders in transactions,
    and the inode cluster buffer is always locked last in a transaction
    that doesn't otherwise touch inode cluster buffers.

    We do this by introducing a precommit log item method.  This commit
    just introduces the mechanism; the inode item implementation is in
    followup commits.

    The precommit items need to be sorted into consistent order as we
    may be locking multiple items here. Hence if we have two dirty
    inodes in cluster buffers A and B, and some other transaction has
    two separate dirty inodes in the same cluster buffers, locking them
    in different orders opens us up to ABBA deadlocks. Hence we sort the
    items on the transaction based on the presence of a sort log item
    method.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:45 -05:00
Bill O'Donnell 439ec50781 xfs: double link the unlinked inode list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2fd26cc07e9f8050e29bf314cbf1bcb64dbe088c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:46:43 2022 +1000

    xfs: double link the unlinked inode list

    Now we have forwards traversal via the incore inode in place, we now
    need to add back pointers to the incore inode to entirely replace
    the back reference cache. We use the same lookup semantics and
    constraints as for the forwards pointer lookups during unlinks, and
    so we can look up any inode in the unlinked list directly and update
    the list pointers, forwards or backwards, at any time.

    The only wrinkle in converting the unlinked list manipulations to
    use in-core previous pointers is that log recovery doesn't have the
    incore inode state built up so it can't just read in an inode and
    release it to finish off the unlink. Hence we need to modify the
    traversal in recovery to read one inode ahead before we
    release the inode at the head of the list. This populates the
    next->prev relationship sufficient to be able to replay the unlinked
    list and hence greatly simplify the runtime code.

    This recovery algorithm also requires that we actually remove inodes
    from the unlinked list one at a time as background inode
    inactivation will result in unlinked list removal racing with the
    building of the in-memory unlinked list state. We could serialise
    this by holding the AGI buffer lock when constructing the in memory
    state, but all that does is lockstep background processing with list
    building. It is much simpler to flush the inodegc immediately after
    releasing the inode so that it is unlinked immediately and there are
    no races present at all.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell 0036098801 xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit e45d7cb2356e6b59fe64da28324025cc6fcd3fbd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: use XFS_IFORK_Q to determine the presence of an xattr fork

    Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
    attribute fork but i_forkoff is zero.  This eliminates the ambiguity
    between i_forkoff and i_af.if_present, which should make it easier to
    understand the lifetime of attr forks.

    While we're at it, remove the if_present checks around calls to
    xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
    forks that have already been torn down.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell a2d362f29a xfs: make inode attribute forks a permanent part of struct xfs_inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: previous out-of-order application of 5625ea0 requires a minor adjustment to xfs_iomap.c

commit 2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: make inode attribute forks a permanent part of struct xfs_inode

    Syzkaller reported a UAF bug a while back:

    ==================================================================
    BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
    Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958

    CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
    5.15.0-0.30.3-20220406_1406 #3
    Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
    04/01/2014
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
     print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
     __kasan_report mm/kasan/report.c:442 [inline]
     kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
     xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
     xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
     xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
     __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
     cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
     security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
     dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
     dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
     do_truncate+0xc3/0x1e0 fs/open.c:56
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0
    RIP: 0033:0x7f7ef4bb753d
    Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
    89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
    01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
    RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
    RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
    RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
    R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
     </TASK>

    Allocated by task 2953:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:254 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3213 [inline]
     slab_alloc mm/slub.c:3221 [inline]
     kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
     kmem_cache_zalloc include/linux/slab.h:711 [inline]
     xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
     xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
     xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
     __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
     __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
     vfs_setxattr+0x154/0x33d fs/xattr.c:301
     setxattr+0x216/0x29f fs/xattr.c:575
     __do_sys_fsetxattr fs/xattr.c:632 [inline]
     __se_sys_fsetxattr fs/xattr.c:621 [inline]
     __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    Freed by task 2949:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:1700 [inline]
     slab_free_freelist_hook mm/slub.c:1726 [inline]
     slab_free mm/slub.c:3492 [inline]
     kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
     xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
     xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
     xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
     xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
     xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_removexattr+0x106/0x16a fs/xattr.c:468
     cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
     security_inode_killpriv+0x54/0xa1 security/security.c:1414
     setattr_prepare+0x1a6/0x897 fs/attr.c:146
     xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
     xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
     xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
     notify_change+0xae5/0x10a1 fs/attr.c:410
     do_truncate+0x134/0x1e0 fs/open.c:64
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    The buggy address belongs to the object at ffff88802cec9188
     which belongs to the cache xfs_ifork of size 40
    The buggy address is located 20 bytes inside of
     40-byte region [ffff88802cec9188, ffff88802cec91b0)
    The buggy address belongs to the page:
    page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
    index:0x0 pfn:0x2cec9
    flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
    raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
     ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
     ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
    >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                ^
     ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
     ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
    ==================================================================

    The root cause of this bug is the unlocked access to xfs_inode.i_afp
    from the getxattr code paths while trying to determine which ILOCK mode
    to use to stabilize the xattr data.  Unfortunately, the VFS does not
    acquire i_rwsem when vfs_getxattr (or listxattr) call into the
    filesystem, which means that getxattr can race with a removexattr that's
    tearing down the attr fork and crash:

    xfs_attr_set:                          xfs_attr_get:
    xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:

    xfs_idestroy_fork(ip->i_afp);
    kmem_cache_free(xfs_ifork_cache, ip->i_afp);

                                           if (ip->i_afp &&

    ip->i_afp = NULL;

                                               xfs_need_iread_extents(ip->i_afp))
                                           <KABOOM>

    ip->i_forkoff = 0;

    Regrettably, the VFS is much more lax about i_rwsem and getxattr than
    is immediately obvious -- not only does it not guarantee that we hold
    i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
    The getxattr system call won't acquire the lock before calling XFS, but
    the file capabilities code calls getxattr with and without i_rwsem held
    to determine if the "security.capabilities" xattr is set on the file.

    Fixing the VFS locking requires a treewide investigation into every code
    path that could touch an xattr and what i_rwsem state it expects or sets
    up.  That could take years or even prove impossible; fortunately, we
    can fix this UAF problem inside XFS.

    An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
    ensure that i_forkoff is always zeroed before i_afp is set to null and
    changed the read paths to use smp_rmb before accessing i_forkoff and
    i_afp, which avoided these UAF problems.  However, the patch author was
    too busy dealing with other problems in the meantime, and by the time he
    came back to this issue, the situation had changed a bit.

    On a modern system with selinux, each inode will always have at least
    one xattr for the selinux label, so it doesn't make much sense to keep
    incurring the extra pointer dereference.  Furthermore, Allison's
    upcoming parent pointer patchset will also cause nearly every inode in
    the filesystem to have extended attributes.  Therefore, make the inode
    attribute fork structure part of struct xfs_inode, at a cost of 40 more
    bytes.

    This patch adds a clunky if_present field where necessary to maintain
    the existing logic of xattr fork null pointer testing in the existing
    codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
    all goes away.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:42 -05:00
Bill O'Donnell 08529f7680 xfs: convert XFS_IFORK_PTR to a static inline helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 732436ef916b4f338d672ea56accfdb11e8d0732
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:05 2022 -0700

    xfs: convert XFS_IFORK_PTR to a static inline helper

    We're about to make this logic do a bit more, so convert the macro to a
    static inline function for better typechecking and fewer shouty macros.
    No functional changes here.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:42 -05:00
Bill O'Donnell 1c2b1203c8 xfs: use a separate frextents counter for rt extent reservations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2229276c5283264b8c2241c1ed972bbb136cab22
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Apr 12 06:49:42 2022 +1000

    xfs: use a separate frextents counter for rt extent reservations

    As mentioned in the previous commit, the kernel misuses sb_frextents in
    the incore mount to reflect both incore reservations made by running
    transactions as well as the actual count of free rt extents on disk.
    This results in the superblock being written to the log with an
    underestimate of the number of rt extents that are marked free in the
    rtbitmap.

    Teaching XFS to recompute frextents after log recovery avoids
    operational problems in the current mount, but it doesn't solve the
    problem of us writing undercounted frextents which are then recovered by
    an older kernel that doesn't have that fix.

    Create an incore percpu counter to mirror the ondisk frextents.  This
    new counter will track transaction reservations and the only time we
    will touch the incore super counter (i.e the one that gets logged) is
    when those transactions commit updates to the rt bitmap.  This is in
    contrast to the lazysbcount counters (e.g. fdblocks), where we know that
    log recovery will always fix any incorrect counter that we log.
    As a bonus, we only take m_sb_lock at transaction commit time.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:59 -05:00
Bill O'Donnell 29b562062c xfs: aborting inodes on shutdown may need buffer lock
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit d2d7c0473586d2f22e85d615275f34cf19f94447
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:21:59 2022 -0700

    xfs: aborting inodes on shutdown may need buffer lock

    Most buffer io list operations are run with the bp->b_lock held, but
    xfs_iflush_abort() can be called without the buffer lock being held
    resulting in inodes being removed from the buffer list while other
    list operations are occurring. This causes problems with corrupted
    bp->b_io_list inode lists during filesystem shutdown, leading to
    traversals that never end, double removals from the AIL, etc.

    Fix this by passing the buffer to xfs_iflush_abort() if we have
    it locked. If the inode is attached to the buffer, we're going to
    have to remove it from the buffer list and we'd have to get the
    buffer off the inode log item to do that anyway.

    If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
    then we can determine if the inode has a log item and if it is
    attached to a buffer before we do anything else. If it does have an
    attached buffer, we can lock it safely (because the inode has a
    reference to it) and then perform the inode abort.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:53 -05:00
Bill O'Donnell e773cac34b xfs: Fix comments mentioning xfs_ialloc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 132c460e49649685bf4b02ba43dea59062f797d9
Author: Yang Xu <xuyang2018.jy@fujitsu.com>
Date:   Tue Dec 21 09:38:19 2021 -0800

    xfs: Fix comments mentioning xfs_ialloc

    Since kernel commit 1abcf26101 ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
    xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments.

    Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:45 -05:00
Chris von Recklinghausen 8dced2b153 mm: shrinkers: provide shrinkers with names
Bugzilla: https://bugzilla.redhat.com/2160210

commit e33c267ab70de4249d22d7eab1cc7d68a889bac2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date:   Tue May 31 20:22:24 2022 -0700

    mm: shrinkers: provide shrinkers with names

    Currently shrinkers are anonymous objects.  For debugging purposes they
    can be identified by count/scan function names, but it's not always
    useful: e.g.  for superblock's shrinkers it's nice to have at least an
    idea of to which superblock the shrinker belongs.

    This commit adds names to shrinkers.  register_shrinker() and
    prealloc_shrinker() functions are extended to take a format and arguments
    to generate a name.

    In some cases it's not possible to determine a good name at the time when
    a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
    provided.

    The expected format is:
        <subsystem>-<shrinker_type>[:<instance>]-<id>
    For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

    After this change the shrinker debugfs directory looks like:
      $ cd /sys/kernel/debug/shrinker/
      $ ls
        dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
        mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
        mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
        rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
        sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
        sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
        sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
        sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
        sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
        sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
        sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
        sb-dax-11           sb-proc-45       sb-tmpfs-35
        sb-debugfs-7        sb-proc-46       sb-tmpfs-40

    [roman.gushchin@linux.dev: fix build warnings]
      Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
      Reported-by: kernel test robot <lkp@intel.com>
    Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
    Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Kent Overstreet <kent.overstreet@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:17 -04:00
Carlos Maiolino 25a40d32f8 xfs: rename _zone variables to _cache
Bugzilla: https://bugzilla.redhat.com/2125724

Conflicts:
	Small conflict at xfs_inode_alloc() due to out of order
	backport. Inode alloc using kmem_cache_alloc() has been
	converted to use alloc_inode_sb() before this patch.

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 182696fb021fc196e5cbe641565ca40fcf0f885a)
2022-10-21 12:50:46 +02:00
Brian Foster 8b883607b2 xfs: introduce xfs_inodegc_push()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git

commit 5e672cd69f0a534a445df4372141fd0d1d00901d
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jun 16 07:44:32 2022 -0700

    xfs: introduce xfs_inodegc_push()

    The current blocking mechanism for pushing the inodegc queue out to
    disk can result in systems becoming unusable when there is a long
    running inodegc operation. This is because the statfs()
    implementation currently issues a blocking flush of the inodegc
    queue and a significant number of common system utilities will call
    statfs() to discover something about the underlying filesystem.

    This can result in userspace operations getting stuck on inodegc
    progress, and when trying to remove a heavily reflinked file on slow
    storage with a full journal, this can result in delays measuring in
    hours.

    Avoid this problem by adding a "push" function that expedites the
    flushing of the inodegc queue, but doesn't wait for it to complete.

    Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
    mechanism so they don't block but still ensure that queued
    operations are expedited.

    Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
    Reported-by: Chris Dunlop <chris@onthe.net.au>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [djwong: fix _getquota_next to use _inodegc_push too]
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:38 -04:00
Brian Foster 7c898d13de xfs: bound maximum wait time for inodegc work
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git

commit 7cf2b0f9611b9971d663e1fc3206eeda3b902922
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jun 16 07:44:31 2022 -0700

    xfs: bound maximum wait time for inodegc work

    Currently inodegc work can sit queued on the per-cpu queue until
    the workqueue is either flushed or the queue reaches a depth that
    triggers work queuing (and later throttling). This means that we
    could queue work that waits for a long time for some other event to
    trigger flushing.

    Hence instead of just queueing work at a specific depth, use a
    delayed work that queues the work at a bound time. We can still
    schedule the work immediately at a given depth, but we no longer need
    to worry about leaving a number of items on the list that won't get
    processed until external events prevail.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:38 -04:00
Brian Foster 5b13242246 xfs: flush inodegc workqueue tasks before cancel
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 6191cf3ad59fda5901160633fef8e41b064a5246
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Jan 18 11:32:35 2022 -0800

    xfs: flush inodegc workqueue tasks before cancel

    The xfs_inodegc_stop() helper performs a high level flush of pending
    work on the percpu queues and then runs a cancel_work_sync() on each
    of the percpu work tasks to ensure all work has completed before
    returning.  While cancel_work_sync() waits for wq tasks to complete,
    it does not guarantee work tasks have started. This means that the
    _stop() helper can queue and instantly cancel a wq task without
    having completed the associated work. This can be observed by
    tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
    test:

            xfs_destroy_inode: ... ino 0x83 ...
            xfs_inode_set_need_inactive: ... ino 0x83 ...
            xfs_inodegc_stop: ...
            ...
            xfs_inodegc_start: ...
            xfs_inodegc_worker: ...
            xfs_inode_inactivating: ... ino 0x83 ...

    The first few lines show that the inode is removed and need inactive
    state set, but the inactivation work has not completed before the
    inodegc mechanism stops. The inactivation doesn't actually occur
    until the fs is unfrozen and the gc mechanism starts back up. Note
    that this test requires fsfreeze to reproduce because xfs_freeze
    indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

    When this occurs, the workqueue try_to_grab_pending() logic first
    tries to steal the pending bit, which does not succeed because the
    bit has been set by queue_work_on(). Subsequently, it checks for
    association of a pool workqueue from the work item under the pool
    lock. This association is set at the point a work item is queued and
    cleared when dequeued for processing. If the association exists, the
    work item is removed from the queue and cancel_work_sync() returns
    true. If the pwq association is cleared, the remove attempt assumes
    the task is busy and retries (eventually returning false to the
    caller after waiting for the work task to complete).

    To avoid this race, we can flush each work item explicitly before
    cancel. However, since the _queue_all() already schedules each
    underlying work item, the workqueue level helpers are sufficient to
    achieve the same ordering effect. E.g., the inodegc enabled flag
    prevents scheduling any further work in the _stop() case. Use the
    drain_workqueue() helper in this particular case to make the intent
    a bit more self explanatory.

    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:38 -04:00
Brian Foster 556fc6c4a0 xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 01728b44ef1b714756607be0210fbcf60c78efce
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:13 2022 -0700

    xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight

    I've been chasing a recent resurgence in generic/388 recovery
    failure and/or corruption events. The events have largely been
    uninitialised inode chunks being tripped over in log recovery
    such as:

     XFS (pmem1): User initiated shutdown received.
     pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
     XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
     XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
     XFS (pmem1): Unmounting Filesystem
     XFS (pmem1): Mounting V5 Filesystem
     XFS (pmem1): Starting recovery (logdev: internal)
     XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
     XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
     XFS (pmem1): Unmount and run xfs_repair
     XFS (pmem1): First 128 bytes of corrupted metadata buffer:
     00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
     XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
     XFS (pmem1): log mount/recovery failed: error -117
     XFS (pmem1): log mount failed

    There have been isolated random other issues, too - xfs_repair fails
    because it finds some corruption in symlink blocks, rmap
    inconsistencies, etc - but they are nowhere near as common as the
    uninitialised inode chunk failure.

    The problem has clearly happened at runtime before recovery has run;
    I can see the ICREATE log item in the log shortly before the
    actively recovered range of the log. This means the ICREATE was
    definitely created and written to the log, but for some reason the
    tail of the log has been moved past the ordered buffer log item that
    tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
    log moving past the ICREATE log item before the inode chunk buffer
    is written to disk.

    Tracing the fsstress processes that are running when the filesystem
    shut down immediately pin-pointed the problem:

    user shutdown marks xfs_mount as shutdown

             godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
    .....

    aild tries to push ordered inode cluster buffer

      xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
      xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae

    xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
    calls xfs_iflush_abort() to kill writeback of the inode.
    Inode is removed from AIL, drops cluster buffer reference.

      xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
      xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7

    .....

    All inodes on cluster buffer are aborted, then the cluster buffer
    itself is aborted and removed from the AIL *without writeback*:

    xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
       xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL

    The inode buffer was at 7/20344 when it was removed from the AIL.

       xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
       xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39

    .....

    Userspace is still running, doing stuff. an fsstress process runs
    syncfs() or sync() and we end up in sync_fs_one_sb() which issues
    a log force. This pushes on the CIL:

            fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
            fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
               <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f

    And the CIL fills up iclogs with pending changes. This picks up
    the current tail from the AIL:

               <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
               <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
               <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
               <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448

    And it moves the tail of the log to 7/21000 from 7/20344. This
    *moves the tail of the log beyond the ICREATE transaction* that was
    at 7/20344 and pinned by the inode cluster buffer that was cancelled
    above.

    ....

             godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
              godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
              godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/

    And finally the log itself is now shutdown, stopping all further
    writes to the log. But this is too late to prevent the corruption
    that moving the tail of the log forwards after we start cancelling
    writeback causes.

    The fundamental problem here is that we are using the wrong shutdown
    checks for log items. We've long conflated mount shutdown with log
    shutdown state, and I started separating that recently with the
    atomic shutdown state changes in commit b36d4651e165 ("xfs: make
    forced shutdown processing atomic"). The changes in that commit
    series are directly responsible for being able to diagnose this
    issue because it clearly separated mount shutdown from log shutdown.

    Essentially, once we start cancelling writeback of log items and
    removing them from the AIL because the filesystem is shut down, we
    *cannot* update the journal because we may have cancelled the items
    that pin the tail of the log. That moves the tail of the log
    forwards without having written the metadata back, hence we have
    corrupt in memory state and writing to the journal propagates that
    to the on-disk state.

    What commit b36d4651e165 makes clear is that log item state needs to
    change relative to log shutdown, not mount shutdown. IOWs, anything
    that aborts metadata writeback needs to check log shutdown state
    because log items directly affect log consistency. Having them check
    mount shutdown state introduces the above race condition where we
    cancel metadata writeback before the log shuts down.

    To fix this, this patch works through all log items and converts
    shutdown checks to use xlog_is_shutdown() rather than
    xfs_is_shutdown(), so that we don't start aborting metadata
    writeback before we shut off journal writes.

    AFAICT, this race condition is a zero day IO error handling bug in
    XFS that dates back to the introduction of XLOG_IO_ERROR,
    XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster 9723c70ba6 xfs: remove xfs_inew_wait
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 1090427bf18f9835b3ccbd36edf43f2509444e27
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Nov 24 10:06:02 2021 -0800

    xfs: remove xfs_inew_wait

    With the remove of xfs_dqrele_all_inodes, xfs_inew_wait and all the
    infrastructure used to wake the XFS_INEW bit waitqueue is unused.

    Reported-by: kernel test robot <lkp@intel.com>
    Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster d179379de4 xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 75c8c50fa16a23f8ac89ea74834ae8ddd1558d75
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:53 2021 -0700

    xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown

    Remove the shouty macro and instead use the inline function that
    matches other state/feature check wrapper naming. This conversion
    was done with sed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster a672539203 xfs: convert remaining mount flags to state flags
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2e973b2cd4cdb993be94cca4c33f532f1ed05316
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert remaining mount flags to state flags

    The remaining mount flags kept in m_flags are actually runtime state
    flags. These change dynamically, so they really should be updated
    atomically so we don't potentially lose an update due to racing
    modifications.

    Convert these remaining flags to be stored in m_opstate and use
    atomic bitops to set and clear the flags. This also adds a couple of
    simple wrappers for common state checks - read only and shutdown.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 6def1029c3 xfs: convert mount flags to features
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
Conflicts: Work around out of order backport in xfs_fs_fill_super().

commit 0560f31a09e523090d1ab2bfe21c69d028c2bdf2
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert mount flags to features

    Replace m_flags feature checks with xfs_has_<feature>() calls and
    rework the setup code to set flags in m_features.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster d54a790d1d xfs: replace xfs_sb_version checks with feature flag checks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 38c26bfd90e1999650d5ef40f90d721f05916643
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:37 2021 -0700

    xfs: replace xfs_sb_version checks with feature flag checks

    Convert the xfs_sb_version_hasfoo() to checks against
    mp->m_features. Checks of the superblock itself during disk
    operations (e.g. in the read/write verifiers and the to/from disk
    formatters) are not converted - they operate purely on the
    superblock state. Everything else should use the mount features.

    Large parts of this conversion were done with sed with commands like
    this:

    for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
            sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
    done

    With manual cleanups for things like "xfs_has_extflgbit" and other
    little inconsistencies in naming.

    The result is a lot less typing to check features and an XFS binary
    size reduced by a bit over 3kB:

    $ size -t fs/xfs/built-in.a
            text       data     bss     dec     hex filename
    before  1130866  311352     484 1442702  16038e (TOTALS)
    after   1127727  311352     484 1439563  15f74b (TOTALS)

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 8468848c15 xfs: remove support for untagged lookups in xfs_icwalk*
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit a437b9b488e36e41026888fc0aa20ec83f39a643
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Aug 13 09:16:52 2021 -0700

    xfs: remove support for untagged lookups in xfs_icwalk*

    With quotaoff not allowing disabling of accounting there is no need
    for untagged lookups in this code, so remove the dead leftovers.

    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    [djwong: convert to for_each_perag_tag]
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:31 -04:00
Brian Foster cfb8d9d215 xfs: throttle inode inactivation queuing on memory reclaim
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 40b1de007aca4f9ec4ee4322c29f026ebb60ac96
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:43 2021 -0700

    xfs: throttle inode inactivation queuing on memory reclaim

    Now that we defer inode inactivation, we've decoupled the process of
    unlinking or closing an inode from the process of inactivating it.  In
    theory this should lead to better throughput since we now inactivate the
    queued inodes in batches instead of one at a time.

    Unfortunately, one of the primary risks with this decoupling is the loss
    of rate control feedback between the frontend and background threads.
    In other words, a rm -rf /* thread can run the system out of memory if
    it can queue inodes for inactivation and jump to a new CPU faster than
    the background threads can actually clear the deferred work.  The
    workers can get scheduled off the CPU if they have to do IO, etc.

    To solve this problem, we configure a shrinker so that it will activate
    the /second/ time the shrinkers are called.  The custom shrinker will
    queue all percpu deferred inactivation workers immediately and set a
    flag to force frontend callers who are releasing a vfs inode to wait for
    the inactivation workers.

    On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
    most of the OOMing problem when deleting 10 million inodes.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:23 -04:00
Brian Foster 79c2de89dd xfs: use background worker pool when transactions can't get free space
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit e8d04c2abcebd66bdbacd53bb273d824d4e27080
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:42 2021 -0700

    xfs: use background worker pool when transactions can't get free space

    In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
    call xfs_blockgc_free_space with a NULL icwalk structure to try to free
    space.  Each frontend thread that encounters this situation starts its
    own walk of the inode cache to see if it can find anything, which is
    wasteful since we don't have any additional selection criteria.  For
    this one common case, create a function that reschedules all pending
    background work immediately and flushes the workqueue so that the scan
    can run in parallel.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:23 -04:00
Brian Foster af833493a4 xfs: don't run speculative preallocation gc when fs is frozen
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 6f6490914d9b712004ddad648e47b1bf22647978
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:42 2021 -0700

    xfs: don't run speculative preallocation gc when fs is frozen

    Now that we have the infrastructure to switch background workers on and
    off at will, fix the block gc worker code so that we don't actually run
    the worker when the filesystem is frozen, same as we do for deferred
    inactivation.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:23 -04:00
Brian Foster f3c1be5b6f xfs: inactivate inodes any time we try to free speculative preallocations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2eb665027b6528c1a8e9158c2f722a6ec0af359d
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:41 2021 -0700

    xfs: inactivate inodes any time we try to free speculative preallocations

    Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
    to try to free speculative preallocations when space is tight.  This
    means that file writes, transaction reservation failures, quota limit
    enforcement, and the EOFBLOCKS ioctl all call this function to free
    space when things are tight.

    Since inode inactivation is now a background task, this means that the
    filesystem can be hanging on to unlinked but not yet freed space.  Add
    this to the list of things that xfs_blockgc_free_* makes writer threads
    scan for when they cannot reserve space.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster c67c7fc08c xfs: queue inactivation immediately when free realtime extents are tight
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 65f03d8652b240aa66b99a07e3c423a51e967568
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:41 2021 -0700

    xfs: queue inactivation immediately when free realtime extents are tight

    Now that we have made the inactivation of unlinked inodes a background
    task to increase the throughput of file deletions, we need to be a
    little more careful about how long of a delay we can tolerate.

    Similar to the patch doing this for free space on the data device, if
    the file being inactivated is a realtime file and the realtime volume is
    running low on free extents, we want to run the worker ASAP so that the
    realtime allocator can make better decisions.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 9344ae67c6 xfs: queue inactivation immediately when quota is nearing enforcement
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 108523b8de676a45cef1f6c8566c444222b85de0
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:40 2021 -0700

    xfs: queue inactivation immediately when quota is nearing enforcement

    Now that we have made the inactivation of unlinked inodes a background
    task to increase the throughput of file deletions, we need to be a
    little more careful about how long of a delay we can tolerate.

    Specifically, if the dquots attached to the inode being inactivated are
    nearing any kind of enforcement boundary, we want to queue that
    inactivation work immediately so that users don't get EDQUOT/ENOSPC
    errors even after they deleted a bunch of files to stay within quota.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 8fbe28aff8 xfs: queue inactivation immediately when free space is tight
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 7d6f07d2c5ad9fce298889eeed317d512a2df8cd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:40 2021 -0700

    xfs: queue inactivation immediately when free space is tight

    Now that we have made the inactivation of unlinked inodes a background
    task to increase the throughput of file deletions, we need to be a
    little more careful about how long of a delay we can tolerate.

    On a mostly empty filesystem, the risk of the allocator making poor
    decisions due to fragmentation of the free space on account of a lengthy
    delay in background updates is minimal because there's plenty of space.
    However, if free space is tight, we want to deallocate unlinked inodes
    as quickly as possible to avoid fallocate ENOSPC and to give the
    allocator the best shot at optimal allocations for new writes.

    Therefore, queue the percpu worker immediately if the filesystem is more
    than 95% full.  This follows the same principle that XFS becomes less
    aggressive about speculative allocations and lazy cleanup (and more
    precise about accounting) when nearing full.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 552c0d6db7 xfs: per-cpu deferred inode inactivation queues
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit ab23a7768739a23d21d8a16ca37dff96b1ca957a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Aug 6 11:05:39 2021 -0700

    xfs: per-cpu deferred inode inactivation queues

    Move inode inactivation to background work contexts so that it no
    longer runs in the context that releases the final reference to an
    inode. This will allow process work that ends up blocking on
    inactivation to continue doing work while the filesystem processes
    the inactivation in the background.

    A typical demonstration of this is unlinking an inode with lots of
    extents. The extents are removed during inactivation, so this blocks
    the process that unlinked the inode from the directory structure. By
    moving the inactivation to the background process, the userspace
    application can keep working (e.g. unlinking the next inode in the
    directory) while the inactivation work on the previous inode is
    done by a different CPU.

    The implementation of the queue is relatively simple. We use a
    per-cpu lockless linked list (llist) to queue inodes for
    inactivation without requiring serialisation mechanisms, and a work
    item to allow the queue to be processed by a CPU bound worker
    thread. We also keep a count of the queue depth so that we can
    trigger work after a number of deferred inactivations have been
    queued.

    The use of a bound workqueue with a single work depth allows the
    workqueue to run one work item per CPU. We queue the work item on
    the CPU we are currently running on, and so this essentially gives
    us affine per-cpu worker threads for the per-cpu queues. This
    maintains the effective CPU affinity that occurs within XFS at the
    AG level due to all objects in a directory being local to an AG.
    Hence inactivation work tends to run on the same CPU that last
    accessed all the objects that inactivation accesses and this
    maintains hot CPU caches for unlink workloads.

    A depth of 32 inodes was chosen to match the number of inodes in an
    inode cluster buffer. This hopefully allows sequential
    allocation/unlink behaviours to defer inactivation of all the
    inodes in a single cluster buffer at a time, further helping
    maintain hot CPU and buffer cache accesses while running
    inactivations.

    A hard per-cpu queue throttle of 256 inodes has been set to avoid
    runaway queuing when inodes that take a long time to inactivate are
    being processed. For example, when unlinking inodes with large
    numbers of extents that can take a lot of processing to free.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [djwong: tweak comments and tracepoints, convert opflags to state bits]
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 1698a6765d xfs: detach dquots from inode if we don't need to inactivate it
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 62af7d54a0ec0b6f99d7d55ebeb9ecbb3371bc67
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:39 2021 -0700

    xfs: detach dquots from inode if we don't need to inactivate it

    If we don't need to inactivate an inode, we can detach the dquots and
    move on to reclamation.  This isn't strictly required here; it's a
    preparation patch for deferred inactivation per reviewer request[1] to
    move the creation of xfs_inode_needs_inactivation into a separate
    change.  Eventually this !need_inactive chunk will turn into the code
    path for inodes that skip xfs_inactive and go straight to memory
    reclaim.

    [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 92b1d83d7a xfs: move xfs_inactive call to xfs_inode_mark_reclaimable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit c6c2066db3963e519a7ff8f432fcec956f4d23b4
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:38 2021 -0700

    xfs: move xfs_inactive call to xfs_inode_mark_reclaimable

    Move the xfs_inactive call and all the other debugging checks and stats
    updates into xfs_inode_mark_reclaimable because most of those are
    implementation details about the inode cache.  This is preparation for
    deferred inactivation that is coming up.  We also move it around
    xfs_icache.c in preparation for deferred inactivation.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:21 -04:00
Brian Foster 35d314bd0f xfs: remove xfs_dqrele_all_inodes
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 777eb1fa857ec38afd518b3adc25cfac0f4af13b
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Aug 6 11:05:36 2021 -0700

    xfs: remove xfs_dqrele_all_inodes

    xfs_dqrele_all_inodes is unused now, remove it and all supporting code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:21 -04:00