221 Commits

Author SHA1 Message Date
Bill O'Donnell 16bfa41bdf xfs: repair inode fork block mapping data structures
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 8f71bede8efd820627ac05c19eac2758214bc896
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Dec 15 10:03:39 2023 -0800

    xfs: repair inode fork block mapping data structures

    Use the reverse-mapping btree information to rebuild an inode block map.
    Update the btree bulk loading code as necessary to support inode rooted
    btrees and fix some bitrot problems.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:07 -06:00
Bill O'Donnell 2863940279 xfs: use shifting and masking when converting rt extents, if possible
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit ef5a83b7e597038d1c734ddb4bc00638082c2bf1
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:40:11 2023 -0700

    xfs: use shifting and masking when converting rt extents, if possible

    Avoid the costs of integer division (32-bit and 64-bit) if the realtime
    extent size is a power of two.
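
The idea can be sketched in plain C as follows. This is a minimal illustration, not the kernel code: the struct `rt_geom` and the helpers `rt_geom_init`, `rtb_to_rtx` and `rtb_to_rtxoff` are hypothetical names chosen for this example. The geometry caches log2 of the extent size and a mask when the size is a power of two, and falls back to division otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry record; field names are illustrative only. */
struct rt_geom {
	uint32_t rextsize;	/* blocks per realtime extent */
	int	 rextlog;	/* log2(rextsize), or -1 if not a power of two */
	uint64_t rextmask;	/* rextsize - 1 when rextlog >= 0, else 0 */
};

/* Precompute the shift and mask once, when the geometry is set up. */
static struct rt_geom rt_geom_init(uint32_t rextsize)
{
	struct rt_geom g = { .rextsize = rextsize, .rextlog = -1, .rextmask = 0 };

	assert(rextsize > 0);
	if ((rextsize & (rextsize - 1)) == 0) {
		int log = 0;

		while ((UINT32_C(1) << log) < rextsize)
			log++;
		g.rextlog = log;
		g.rextmask = (uint64_t)rextsize - 1;
	}
	return g;
}

/* Block number -> extent number: shift if possible, divide otherwise. */
static uint64_t rtb_to_rtx(const struct rt_geom *g, uint64_t rtbno)
{
	if (g->rextlog >= 0)
		return rtbno >> g->rextlog;
	return rtbno / g->rextsize;
}

/* Block number -> offset within its extent: mask if possible. */
static uint64_t rtb_to_rtxoff(const struct rt_geom *g, uint64_t rtbno)
{
	if (g->rextlog >= 0)
		return rtbno & g->rextmask;
	return rtbno % g->rextsize;
}
```

Both paths return the same results for a power-of-two extent size; the shift and mask simply avoid the division instruction on the hot path.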

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:38 -06:00
Bill O'Donnell 520fae453c xfs: create a helper to convert extlen to rtextlen
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 2c2b981b737a519907429f62148bbd9e40e01132
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:35:23 2023 -0700

    xfs: create a helper to convert extlen to rtextlen

    Create a helper to compute the realtime extent (xfs_rtxlen_t) from an
    extent length (xfs_extlen_t) value.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:36 -06:00
Bill O'Donnell 19fab0b814 xfs: collect errors from inodegc for unlinked inode recovery
JIRA: https://issues.redhat.com/browse/RHEL-2002

Conflicts: context differences due to out of order patch application

commit d4d12c02bf5f768f1b423c7ae2909c5afdfe0d5f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 14:48:15 2023 +1000

    xfs: collect errors from inodegc for unlinked inode recovery

    Unlinked list recovery requires that errors removing the inode from
    the unlinked list get fed back to the main recovery loop. Now that
    we offload the unlinking to the inodegc work, we don't get errors
    being fed back when we trip over a corruption that prevents the
    inode from being removed from the unlinked list.

    This means we never clear the corrupt unlinked list bucket,
    resulting in runtime operations eventually tripping over it and
    shutting down.

    Fix this by collecting inodegc worker errors and feeding them
    back to the flush caller. This is largely best effort - the only
    context that really cares is log recovery, and it only flushes a
    single inode at a time so we don't need complex synchronised
    handling. Essentially the inodegc workers will capture the first
    error that occurs and the next flush will gather them and clear
    them. The flush itself will only report the first gathered error.
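
The "first error wins, flush gathers and clears" behaviour described above might look like the following single-threaded sketch. The names `gc_worker`, `gc_worker_record` and `gc_flush` are hypothetical and exist only for this illustration; the real inodegc code is per-CPU and concurrent, which this sketch deliberately ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_WORKERS 4	/* hypothetical worker count for the example */

/* Hypothetical per-worker state: only the first error is remembered. */
struct gc_worker {
	int error;
};

/* Called by a worker when an operation fails; later errors are dropped. */
static void gc_worker_record(struct gc_worker *w, int error)
{
	if (error && !w->error)
		w->error = error;
}

/*
 * Gather errors from all workers, clear them, and report only the first
 * one gathered. Returns 0 if no worker recorded an error.
 */
static int gc_flush(struct gc_worker *workers, size_t n)
{
	int first = 0;

	assert(workers != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (workers[i].error && !first)
			first = workers[i].error;
		workers[i].error = 0;
	}
	return first;
}
```

Because the flush clears every worker's slot, a second flush with no new failures returns 0, which matches the best-effort semantics the message describes.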

    In the cases where callers can return errors, propagate the
    collected inodegc flush error up the error handling chain.

    In the case of inode unlinked list recovery, there are several
    superfluous calls to flush queued unlinked inodes -
    xlog_recover_iunlink_bucket() guarantees that it has flushed the
    inodegc and collected errors before it returns. Hence nothing in the
    calling path needs to run a flush, even when an error is returned.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:27 -06:00
Bill O'Donnell 768fd4695a xfs: defered work could create precommits
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit cb042117488dbf0b3b38b05771639890fada9a52
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 04:07:27 2023 +1000

    xfs: defered work could create precommits

    To fix an AGI-AGF-inode cluster buffer deadlock, we need to move
    inode cluster buffer operations to the ->iop_precommit() method.
    However, this means that deferred operations can require precommits
    to be run on the final transaction that the deferred ops pass back
    to xfs_trans_commit() context. This will be exposed by attribute
    handling, in that the last changes to the inode in the attr set
    state machine "disappear" because the precommit operation is not run.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:26 -06:00
Bill O'Donnell f5803d353f xfs: don't assert fail on transaction cancel with deferred ops
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 55d5c3a386d74d3f374023c8fa386f524a9192e8
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:12:06 2023 +1100

    xfs: don't assert fail on transaction cancel with deferred ops

    We can error out of an allocation transaction when updating BMBT
    blocks when things go wrong. This can be a btree corruption, an
    unexpected ENOSPC, etc. In these cases, we already have deferred ops
    queued for the first allocation that has been done, and we just want
    to cancel out the transaction and shut down the filesystem on error.

    In fact, we do just that for production systems - the assert that we
    can't have a transaction with defer ops attached unless we are
    already shut down is bogus and gets in the way of debugging
    whatever issue is actually causing the transaction to be cancelled.

    Remove the assert because it is causing spurious test failures to
    hang test machines.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
Bill O'Donnell 21482343ad xfs: t_firstblock is tracking AGs not blocks
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 692b6cddeb65a5170c1e63d25b1ffb7822e80f7d
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:11:06 2023 +1100

    xfs: t_firstblock is tracking AGs not blocks

    The tp->t_firstblock field is now really tracking the highest AG we
    have locked, not the block number of the highest allocation we've
    made. Its purpose is to prevent AGF locking deadlocks, so rename it
    to "highest AG" and simplify the implementation to just track the
    agno rather than a fsbno.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
Bill O'Donnell 5b5b4424f2 xfs: add log item precommit operation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit fad743d7cd8bd92d03c09e71f29eace860f50415
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:47:26 2022 +1000

    xfs: add log item precommit operation

    For inodes that are dirty, we have an attached cluster buffer that
    we want to use to track the dirty inode through the AIL.
    Unfortunately, locking the cluster buffer and adding it to the
    transaction when the inode is first logged in a transaction leads to
    buffer lock ordering inversions.

    The specific problem is ordering against the AGI buffer. When
    modifying unlinked lists, the buffer lock order is AGI -> inode
    cluster buffer as the AGI buffer lock serialises all access to the
    unlinked lists. Unfortunately, functionality like xfs_droplink()
    logs the inode before calling xfs_iunlink(), as do various directory
    manipulation functions. The inode can be logged way down in the
    stack as far as the bmapi routines and hence, without a major
    rewrite of lots of APIs there's no way we can avoid the inode being
    logged by something until after the AGI has been logged.

    As we are going to be using ordered buffers for inode AIL tracking,
    there isn't a need to actually lock that buffer against modification
    as all the modifications are captured by logging the inode item
    itself. Hence we don't actually need to join the cluster buffer into
    the transaction until just before it is committed. This means we do
    not perturb any of the existing buffer lock orders in transactions,
    and the inode cluster buffer is always locked last in a transaction
    that doesn't otherwise touch inode cluster buffers.

    We do this by introducing a precommit log item method.  This commit
    just introduces the mechanism; the inode item implementation is in
    followup commits.

    The precommit items need to be sorted into consistent order as we
    may be locking multiple items here. Hence if we have two dirty
    inodes in cluster buffers A and B, and some other transaction has
    two separate dirty inodes in the same cluster buffers, locking them
    in different orders opens us up to ABBA deadlocks. Hence we sort the
    items on the transaction based on the presence of a sort log item
    method.
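
A minimal sketch of the ordering argument, assuming each item exposes a numeric sort key (for example, the address of the cluster buffer it will lock). The struct `item` and the functions `item_cmp` and `precommit_sort` are hypothetical names for this illustration and are not the kernel's interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical log item: sort_key identifies the buffer it will lock. */
struct item {
	uint64_t sort_key;
	int	 id;
};

/* Three-way comparison on the sort key, safe against overflow. */
static int item_cmp(const void *a, const void *b)
{
	const struct item *ia = a;
	const struct item *ib = b;

	return (ia->sort_key > ib->sort_key) - (ia->sort_key < ib->sort_key);
}

/*
 * Sort a transaction's items before any precommit locking happens, so
 * every transaction takes the underlying buffer locks in the same order.
 */
static void precommit_sort(struct item *items, size_t n)
{
	assert(items != NULL || n == 0);
	qsort(items, n, sizeof(*items), item_cmp);
}
```

If one transaction dirtied inodes in buffers A then B and another in B then A, both end up locking in key order after the sort, which removes the ABBA window.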

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:45 -05:00
Bill O'Donnell 709be3d0cf xfs: convert log vector chain to use list heads
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 169248536a2b28e4228ba63772936c1ba979c9c0
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:55:59 2022 +1000

    xfs: convert log vector chain to use list heads

    The next change is going to require sorting log vectors, which
    requires arbitrary rearrangement of the list. That cannot be done
    easily with a singly linked list, so convert the chain to list heads.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:37 -05:00
Bill O'Donnell f077f90354 xfs: report "max_resp" used for min log size computation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 918247ce541995dba05391cf14d6061cf0844866
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Apr 25 18:38:13 2022 -0700

    xfs: report "max_resp" used for min log size computation

    Move the tracepoint that computes the size of the transaction used to
    compute the minimum log size into xfs_log_get_max_trans_res so that we
    only have to compute this stuff once.

    Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db
    can call it to report the results of the userspace computation of the
    same value to diagnose mkfs/kernel misinteractions.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:11 -05:00
Bill O'Donnell e1aa62f1cd xfs: log tickets don't need log client id
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c7610dceed39d978ef1ee0f2ab5a3c8d2d54d120
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:34:33 2022 +1000

    xfs: log tickets don't need log client id

    We currently set the log ticket client ID when we reserve a
    transaction. This client ID is only ever written to the log by
    a CIL checkpoint or unmount records, and so anything using a high
    level transaction allocated through xfs_trans_alloc() does not need
    a log ticket client ID to be set.

    For the CIL checkpoint, the client ID written to the journal is
    always XFS_TRANSACTION, and for the unmount record it is always
    XFS_LOG, and nothing else writes to the log. All of these operations
    tell xlog_write() exactly what they need to write to the log (the
    optype) and build their own opheaders for start, commit and unmount
    records. Hence we no longer need to set the client id in either the
    log ticket or the xfs_trans.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:02 -05:00
Bill O'Donnell 1c2b1203c8 xfs: use a separate frextents counter for rt extent reservations
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2229276c5283264b8c2241c1ed972bbb136cab22
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Apr 12 06:49:42 2022 +1000

    xfs: use a separate frextents counter for rt extent reservations

    As mentioned in the previous commit, the kernel misuses sb_frextents in
    the incore mount to reflect both incore reservations made by running
    transactions as well as the actual count of free rt extents on disk.
    This results in the superblock being written to the log with an
    underestimate of the number of rt extents that are marked free in the
    rtbitmap.

    Teaching XFS to recompute frextents after log recovery avoids
    operational problems in the current mount, but it doesn't solve the
    problem of us writing undercounted frextents which are then recovered by
    an older kernel that doesn't have that fix.

    Create an incore percpu counter to mirror the ondisk frextents.  This
    new counter will track transaction reservations and the only time we
    will touch the incore super counter (i.e. the one that gets logged) is
    when those transactions commit updates to the rt bitmap.  This is in
    contrast to the lazysbcount counters (e.g. fdblocks), where we know that
    log recovery will always fix any incorrect counter that we log.
    As a bonus, we only take m_sb_lock at transaction commit time.
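
The split between the two counters can be sketched as below. This is an illustration under simplifying assumptions: a plain integer stands in for the percpu counter, there is no locking, and the names `rt_counters`, `rt_reserve`, `rt_commit` and `rt_cancel` are hypothetical. The point it shows is that reservations only move the incore counter, while the logged value changes only when a commit really allocates extents.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters described above. */
struct rt_counters {
	int64_t ondisk_frextents;	/* value that gets logged */
	int64_t incore_frextents;	/* also reflects live reservations */
};

/* Reserve extents for a transaction: only the incore counter moves. */
static int rt_reserve(struct rt_counters *c, int64_t nr)
{
	assert(nr >= 0);
	if (c->incore_frextents < nr)
		return -ENOSPC;
	c->incore_frextents -= nr;
	return 0;
}

/*
 * Commit: 'used' of the 'reserved' extents were actually allocated.
 * The logged counter drops by what was used; the unused part of the
 * reservation is handed back to the incore counter.
 */
static void rt_commit(struct rt_counters *c, int64_t reserved, int64_t used)
{
	assert(used >= 0 && used <= reserved);
	c->ondisk_frextents -= used;
	c->incore_frextents += reserved - used;
}

/* Cancel: the whole reservation returns; the logged counter is untouched. */
static void rt_cancel(struct rt_counters *c, int64_t reserved)
{
	assert(reserved >= 0);
	c->incore_frextents += reserved;
}
```

With this split, a superblock logged while reservations are outstanding still carries the true free count, because pending reservations never touch the logged field.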

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:59 -05:00
Bill O'Donnell dec75f5937 xfs: xfs_trans_commit() path must check for log shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 3c4cb76bce4380aee99c275b3920049350939e47
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:22:01 2022 -0700

    xfs: xfs_trans_commit() path must check for log shutdown

    If a shutdown races with xfs_trans_commit() and we have shut down the
    filesystem but not the log, we will still cancel the transaction.
    This can result in aborting dirty log items instead of committing and
    pinning them whilst the log is still running. Hence we can end up
    with dirty, unlogged metadata that isn't in the AIL in memory that
    can be flushed to disk via writeback clustering.

    This was discovered from a g/388 trace where an inode log item was
    having IO completed on it and it wasn't in the AIL, hence tripping
    asserts in xfs_ail_check(). Inode cluster writeback started long after
    the filesystem shutdown started, and long after the transaction
    containing the dirty inode was aborted and the log item marked
    XFS_LI_ABORTED. The inode was seen as dirty and unpinned, so it
    was flushed. IO completion tried to remove the inode from the AIL,
    at which point stuff went bad:

     XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
     XFS: Assertion failed: in_ail, file: fs/xfs/xfs_trans_ail.c, line: 67
     XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
     Workqueue: xfs-buf/pmem1 xfs_buf_ioend_work
     RIP: 0010:assfail+0x27/0x2d
     Call Trace:
      <TASK>
      xfs_ail_check+0xa8/0x180
      xfs_ail_delete_one+0x3b/0xf0
      xfs_buf_inode_iodone+0x329/0x3f0
      xfs_buf_ioend+0x1f8/0x530
      xfs_buf_ioend_work+0x15/0x20
      process_one_work+0x1ac/0x390
      worker_thread+0x56/0x3c0
      kthread+0xf6/0x120
      ret_from_fork+0x1f/0x30
      </TASK>

    xfs_trans_commit() needs to check log state for shutdown, not mount
    state. It cannot abort dirty log items while the log is still
    running as dirty items must remain pinned in memory until they are
    either committed to the journal or the log has shut down and they
    can be safely tossed away. Hence if the log has not shut down, the
    xfs_trans_commit() path must allow completed transactions to commit
    to the CIL and pin the dirty items even if a mount shutdown has
    started.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:54 -05:00
Bill O'Donnell 021b44b1d5 xfs: AIL should be log centric
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 8eda87211097195d96d7d12be37dd39d6a7c8b80
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:12 2022 -0700

    xfs: AIL should be log centric

    The AIL operates purely on log items, so it is a log centric
    subsystem. Divorce it from the xfs_mount and instead have it pass
    around xlog pointers.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:51 -05:00
Bill O'Donnell 9d13520b52 xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 47a6df7cd3174b91c6c862eae0b8d4e13591df52
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Dec 15 11:53:14 2021 -0800

    xfs: shut down filesystem if we xfs_trans_cancel with deferred work items

    While debugging some very strange rmap corruption reports in connection
    with the online directory repair code, I root-caused the error to the
    following incorrect sequence:

    <start repair transaction>
    <expand directory, causing a deferred rmap to be queued>
    <roll transaction>
    <cancel transaction>

    Obviously, we should have committed the transaction instead of
    cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
    have warned us that we were throwing away work items that we already
    committed to performing.  This is not correct, and we need to shut down
    the filesystem.

    Change xfs_trans_cancel to complain in the loudest manner if we're
    cancelling any transaction with deferred work items attached.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:44 -05:00
Carlos Maiolino 25a40d32f8 xfs: rename _zone variables to _cache
Bugzilla: https://bugzilla.redhat.com/2125724

Conflicts:
	Small conflict at xfs_inode_alloc() due to out of order
	backport. Inode alloc using kmem_cache_alloc() has been
	converted to use alloc_inode_sb() before this patch.

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 182696fb021fc196e5cbe641565ca40fcf0f885a)
2022-10-21 12:50:46 +02:00
Carlos Maiolino d912d565bb xfs: remove kmem_zone typedef
Bugzilla: https://bugzilla.redhat.com/2125724

Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit e7720afad068a6729d9cd3aaa08212f2f5a7ceff)
2022-10-21 12:50:46 +02:00
Carlos Maiolino 9dd6c5fba8 xfs: remove the xfs_dsb_t typedef
Bugzilla: https://bugzilla.redhat.com/2125724

Remove the few leftover instances of the xfs_dsb_t typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit ed67ebfd7c4061b4b505ac42eb00e08dd09f4d38)
2022-10-21 12:50:46 +02:00
Brian Foster f24fb058dd xfs: log items should have a xlog pointer, not a mount
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit d86142dd7c4e10e50bdb3679b405d748214b2c28
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:12 2022 -0700

    xfs: log items should have a xlog pointer, not a mount

    Log items belong to the log, not the xfs_mount. Convert the mount
    pointer in the log item to a xlog pointer in preparation for
    upcoming log centric changes to the log items.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster d179379de4 xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 75c8c50fa16a23f8ac89ea74834ae8ddd1558d75
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:53 2021 -0700

    xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown

    Remove the shouty macro and instead use the inline function that
    matches other state/feature check wrapper naming. This conversion
    was done with sed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster d54a790d1d xfs: replace xfs_sb_version checks with feature flag checks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 38c26bfd90e1999650d5ef40f90d721f05916643
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:37 2021 -0700

    xfs: replace xfs_sb_version checks with feature flag checks

    Convert the xfs_sb_version_hasfoo() to checks against
    mp->m_features. Checks of the superblock itself during disk
    operations (e.g. in the read/write verifiers and the to/from disk
    formatters) are not converted - they operate purely on the
    superblock state. Everything else should use the mount features.

    Large parts of this conversion were done with sed with commands like
    this:

    for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
            sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
    done

    With manual cleanups for things like "xfs_has_extflgbit" and other
    little inconsistencies in naming.

    The result is a lot less typing to check features and an XFS binary
    size reduced by a bit over 3kB:

    $ size -t fs/xfs/built-in.a
            text       data     bss     dec     hex filename
    before  1130866  311352     484 1442702  16038e (TOTALS)
    after   1127727  311352     484 1439563  15f74b (TOTALS)

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 2151f12a8d xfs: AIL needs asynchronous CIL forcing
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 0020a190cf3eac16995143db41b21b82bacdcbe3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:44 2021 -0700

    xfs: AIL needs asynchronous CIL forcing

    The AIL pushing is stalling on log forces when it comes across
    pinned items. This is happening on removal workloads where the AIL
    is dominated by stale items that are removed from AIL when the
    checkpoint that marks the items stale is committed to the journal.
    This results in relatively few items in the AIL, but those that are
    there are often pinned, as the directories that items are being
    removed from are still being logged.

    As a result, many push cycles through the CIL will first issue a
    blocking log force to unpin the items. This can take some time to
    complete, with tracing regularly showing push delays of half a
    second and sometimes up into the range of several seconds. Sequences
    like this aren't uncommon:

    ....
     399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 270ms delay>
     400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
     400.099623:  xfsaild: first lsn 0x11002f3600
     400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
    <wanted 50ms, got 500ms delay>
     400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
     400.589349:  xfsaild: first lsn 0x1100305000
     400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
    <wanted 50ms, got 460ms delay>
     400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
     400.950343:  xfsaild: first lsn 0x1100317c00
     400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 200ms delay>
     401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
     401.142334:  xfsaild: first lsn 0x110032e600
     401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
    <wanted 10ms, got 10ms delay>
     401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
     401.154328:  xfsaild: first lsn 0x1100353000
     401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 300ms delay>
     401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
     401.451526:  xfsaild: first lsn 0x1100353000
     401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
    <wanted 50ms, got 500ms delay>
     401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    ....

    In each of these cases, every AIL pass saw 101 log items stuck on
    the AIL (pinned) with very few other items being found. Each pass, a
    log force was issued, and delay between last/first is the sleep time
    + the sync log force time.

    Some of these 101 items pinned the tail of the log. The tail of the
    log does slowly creep forward (first lsn), but the problem is that
    the log is actually out of reservation space because it's been
    running so many transactions that create stale items which never
    reach the AIL but consume log space. Hence we have a largely empty AIL, with
    long term pins on items that pin the tail of the log that don't get
    pushed frequently enough to keep log space available.

    The problem is the hundreds of milliseconds that we block in the log
    force pushing the CIL out to disk. The AIL should not be stalled
    like this - it needs to run and flush items that are at the tail of
    the log with minimal latency. What we really need to do is trigger a
    log flush, but then not wait for it at all - we've already done our
    waiting for stuff to complete when we backed off prior to the log
    force being issued.

    Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
    still do a blocking flush of the CIL and that is what is causing the
    issue. Hence we need a new interface for the CIL to trigger an
    immediate background push of the CIL to get it moving faster but not
    to wait on that to occur. While the CIL is pushing, the AIL can also
    be pushing.

    We already have an internal interface to do this -
    xlog_cil_push_now() - but we need a wrapper for it to be used
    externally. xlog_cil_force_seq() can easily be extended to do what
    we need as it already implements the synchronous CIL push via
    xlog_cil_push_now(). Add the necessary flags and "push current
    sequence" semantics to xlog_cil_force_seq() and convert the AIL
    pushing to use it.

    One of the complexities here is that the CIL push does not guarantee
    that the commit record for the CIL checkpoint is written to disk.
    The current log force ensures this by submitting the current ACTIVE
    iclog that the commit record was written to. We need the CIL to
    actually write this commit record to disk for an async push to
    ensure that the checkpoint actually makes it to disk and unpins the
    pinned items in the checkpoint on completion. Hence we need to pass
    down to the CIL push that we are doing an async flush so that it can
    switch out the commit_iclog if necessary to get written to disk when
    the commit iclog is finally released.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster 156272e64a xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2039a272300b949c05888428877317b834c0b1fb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:01 2021 -0700

    xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()

    Make it less shouty and a static inline before adding more calls
    through the log code.

    Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
    to use xlog_is_shutdown(log) as well.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 79c2de89dd xfs: use background worker pool when transactions can't get free space
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit e8d04c2abcebd66bdbacd53bb273d824d4e27080
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:42 2021 -0700

    xfs: use background worker pool when transactions can't get free space

    In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
    call xfs_blockgc_free_space with a NULL icwalk structure to try to free
    space.  Each frontend thread that encounters this situation starts its
    own walk of the inode cache to see if it can find anything, which is
    wasteful since we don't have any additional selection criteria.  For
    this one common case, create a function that reschedules all pending
    background work immediately and flushes the workqueue so that the scan
    can run in parallel.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:23 -04:00
Andrey Albershteyn d28d9b8e65 xfs: reserve quota for dir expansion when linking/unlinking files
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2106569

commit 871b9316e7a778ff97bdc34fdb2f2977f616651d
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Feb 25 16:18:41 2022 -0800

xfs: reserve quota for dir expansion when linking/unlinking files

XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory.  This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.

The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.

Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.

To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory.  If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode.  This should prevent
quota usage overruns with the least restriction in functionality.
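
The fallback described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the demo_* names, the quota structure, and the choice of -EDQUOT as the rejection code are all hypothetical, and none of this is the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a quota with a hard limit, counted in blocks. */
struct demo_quota {
	unsigned long	used;
	unsigned long	hard_limit;
};

/* Try to reserve nblks against the quota; fail with -EDQUOT at the limit. */
static int demo_quota_reserve(struct demo_quota *q, unsigned long nblks)
{
	if (q->used + nblks > q->hard_limit)
		return -EDQUOT;
	q->used += nblks;
	return 0;
}

/*
 * Reserve space for a possible directory expansion.  If the reservation
 * fails, fall back to reservationless mode (*nospace = true): the caller
 * may proceed only if the operation does not need a new block.
 */
static int demo_alloc_dir(struct demo_quota *q, unsigned long resblks,
			  bool *nospace)
{
	int error = demo_quota_reserve(q, resblks);

	*nospace = false;
	if (error == -EDQUOT || error == -ENOSPC) {
		*nospace = true;	/* continue without a reservation */
		return 0;
	}
	return error;
}

/* A link that must add a directory block is rejected without a reservation. */
static int demo_link(struct demo_quota *q, unsigned long resblks,
		     bool needs_new_block)
{
	bool nospace;
	int error = demo_alloc_dir(q, resblks, &nospace);

	if (error)
		return error;
	if (nospace && needs_new_block)
		return -EDQUOT;
	return 0;
}
```

In this model a link near the hard limit still succeeds when the entry fits in the existing directory blocks, and fails only when a new block would be needed.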

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
2022-07-27 10:52:13 +02:00
Dave Chinner 5f9b4b0de8 xfs: xfs_log_force_lsn isn't passed a LSN
In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This led me to xfs_log_force_lsn() and to looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.
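
To illustrate why the two values should not share a name or a type, here is a small sketch. The demo_* types, the struct-wrapping trick, and the lookup table are assumptions made for illustration and are not the types this patch introduces; the cycle/block packing of the LSN is likewise a simplified model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical wrapper types.  Wrapping each value in its own struct makes
 * the compiler reject code that passes a CIL sequence where an LSN is
 * expected, which plain integer typedefs would not.
 */
typedef struct { uint64_t seq; } demo_csn_t;	/* CIL context sequence number */
typedef struct { int64_t lsn; } demo_lsn_t;	/* log sequence number */

/* Model an LSN as a cycle number (high 32 bits) plus a block number (low 32 bits). */
static demo_lsn_t demo_make_lsn(uint32_t cycle, uint32_t block)
{
	demo_lsn_t l = { .lsn = (int64_t)(((uint64_t)cycle << 32) | block) };

	return l;
}

static uint32_t demo_lsn_cycle(demo_lsn_t l)
{
	return (uint32_t)((uint64_t)l.lsn >> 32);
}

static uint32_t demo_lsn_block(demo_lsn_t l)
{
	return (uint32_t)l.lsn;
}

/*
 * Toy lookup standing in for the CIL: map a sequence number to the commit
 * LSN recorded for that context.  The table contents are made up.
 */
static demo_lsn_t demo_cil_seq_to_commit_lsn(demo_csn_t csn)
{
	static const struct { uint64_t seq; uint32_t cycle, block; } ctx[] = {
		{ 1, 7, 128 }, { 2, 7, 512 }, { 3, 8, 64 },
	};
	size_t i;

	for (i = 0; i < sizeof(ctx) / sizeof(ctx[0]); i++)
		if (ctx[i].seq == csn.seq)
			return demo_make_lsn(ctx[i].cycle, ctx[i].block);
	return demo_make_lsn(0, 0);	/* unknown sequence */
}
```

With distinct types, a caller holding a demo_csn_t has to go through the conversion before it can ask for anything keyed by LSN.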

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21 10:12:33 -07:00
Dave Chinner 6543990a16 xfs: update superblock counters correctly for !lazysbcount
Keep the mount superblock counters up to date for !lazysbcount
filesystems so that when we log the superblock they do not need
updating in any way because they are already correct.

The problem was found with the following reproducer reported by Zorro:
1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
2. mount $dev $mnt
3. fsstress -d $mnt -p 100 -n 1000 (the I/O load may need adjusting)
4. umount $mnt
5. xfs_repair -n $dev
and I've seen no problem with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29 07:44:18 -07:00
Darrick J. Wong 1aec7c3d05 xfs: remove obsolete AGF counter debugging
In commit f8f2835a9c we changed the behavior of XFS to use EFIs to
remove blocks from an overfilled AGFL because there were complaints
about transaction overruns that stemmed from trying to free multiple
blocks in a single transaction.

Unfortunately, that commit missed a subtlety in the debug-mode
transaction accounting when a realtime volume is attached.  If a
realtime file undergoes a data fork mapping change such that realtime
extents are allocated (or freed) in the same transaction that a data
device block is also allocated (or freed), we can trip a debugging
assertion.  This can happen (for example) if a realtime extent is
allocated and it is necessary to reshape the bmbt to hold the new
mapping.

When we go to allocate a bmbt block from an AG, the first thing the data
device block allocator does is ensure that the freelist is the proper
length.  If the freelist is too long, it will trim the freelist to the
proper length.

In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
record the decrement in the AG free list count.  Prior to f8f28 we would
put the free block back in the free space btrees in the same
transaction, which calls xfs_trans_agblocks_delta() to record the
increment in the AG free block count.  Since AGFL blocks are included in
the global free block count (fdblocks), there is no corresponding
fdblocks update, so the AGFL free satisfies the following condition in
xfs_trans_apply_sb_deltas:

	/*
	 * Check that superblock mods match the mods made to AGF counters.
	 */
	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
		tp->t_ag_btree_delta));

The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
the number of blocks that were allocated.

After commit f8f28 we defer the block freeing to the next chained
transaction, which means that the calls to xfs_trans_agflist_delta and
xfs_trans_agblocks_delta occur in separate transactions.  The (first)
transaction that shortens the free list trips on the comparison, which
has now become:

(X + 0) == ((X) + -1 + 0)

because we haven't freed the AGFL block yet; we've only logged an
intention to free it.  When the second transaction (the deferred free)
commits, it will evaluate the expression as:

(0 + 0) == (1 + 0 + 0)

and trip over that in turn.
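
The three evaluations above can be checked mechanically. The sketch below copies the condition from the ASSERT and plugs in the deltas for each case; the demo_* names and the struct are hypothetical, and X is passed in as an arbitrary block count.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-transaction deltas, named after the fields in the quoted ASSERT. */
struct demo_deltas {
	long	fdblocks;	/* t_fdblocks_delta */
	long	res_fdblocks;	/* t_res_fdblocks_delta */
	long	ag_freeblks;	/* t_ag_freeblks_delta */
	long	ag_flist;	/* t_ag_flist_delta */
	long	ag_btree;	/* t_ag_btree_delta */
};

/* The condition the debug ASSERT checks. */
static bool demo_sb_mods_match(const struct demo_deltas *d)
{
	return (d->fdblocks + d->res_fdblocks) ==
	       (d->ag_freeblks + d->ag_flist + d->ag_btree);
}

/* Before f8f28: AGFL trim and block free happen in the same transaction. */
static struct demo_deltas demo_same_tx(long x)
{
	return (struct demo_deltas){
		.fdblocks = x, .ag_freeblks = x + 1, .ag_flist = -1,
	};
}

/* After f8f28, first transaction: the list shrinks, the free is only logged. */
static struct demo_deltas demo_first_tx(long x)
{
	return (struct demo_deltas){
		.fdblocks = x, .ag_freeblks = x, .ag_flist = -1,
	};
}

/* After f8f28, second transaction: the deferred free lands on its own. */
static struct demo_deltas demo_second_tx(void)
{
	return (struct demo_deltas){ .ag_freeblks = 1 };
}
```

Only the first case balances; both halves of the split operation fail the check for any value of X.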

At this point, the astute reader may note that the two commits tagged by
this patch have been in the kernel for a long time but haven't generated
any bug reports.  How is it that the author became aware of this bug?

This originally surfaced as an intermittent failure when I was testing
realtime rmap, but a different bug report by Zorro Lang reveals the same
assertion occurring on !lazysbcount filesystems.

The common factor to both reports (and why this problem wasn't
previously reported) becomes apparent if we consider when
xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():

	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
		xfs_trans_apply_sb_deltas(tp);

With a modern lazysbcount filesystem, transactions update only the
percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
xfs_trans_apply_sb_deltas is rarely called.

However, updates to the count of free realtime extents are not part of
lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
removing data fork mappings to realtime files; similarly,
XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.

Dave mentioned in response to an earlier version of this patch:

"IIUC, what you are saying is that this debug code is simply not
exercised in normal testing and hasn't been for the past decade?  And it
still won't be exercised on anything other than realtime device testing?

"...it was debugging code from 1994 that was largely turned into dead
code when lazysbcounters were introduced in 2007. Hence I'm not sure it
holds any value anymore."

This debugging code isn't especially helpful - you can modify the
flcount on one AG and the freeblks of another AG, and it won't trigger.
Add the fact that nobody noticed for a decade, and let's just get rid of
it (and start testing realtime :P).

This bug was found by running generic/051 on either a V4 filesystem
lacking lazysbcount; or a V5 filesystem with a realtime volume.

Cc: bfoster@redhat.com, zlang@redhat.com
Fixes: f8f2835a9c ("xfs: defer agfl block frees when dfops is available")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29 07:44:18 -07:00
Christoph Hellwig 6e73a545f9 xfs: move the di_nblocks field to struct xfs_inode
In preparation for removing the historic icinode struct, move the nblocks
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Gao Xiang fb2fc17201 xfs: support shrinking unused space in the last AG
As the first step of shrinking, this attempts to enable shrinking
unused space in the last allocation group by fixing up the free space
btrees, AGI, and AGF and adjusting the superblock. A helper,
xfs_ag_shrink_space(), is used to fix up the last AG.

This can be all done in one transaction for now, so I think no
additional protection is needed.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25 16:47:52 -07:00
Dave Chinner 5825bea052 xfs: __percpu_counter_compare() inode count debug too expensive
- 21.92% __xfs_trans_commit
     - 21.62% xfs_log_commit_cil
	- 11.69% xfs_trans_unreserve_and_mod_sb
	   - 11.58% __percpu_counter_compare
	      - 11.45% __percpu_counter_sum
		 - 10.29% _raw_spin_lock_irqsave
		    - 10.28% do_raw_spin_lock
			 __pv_queued_spin_lock_slowpath

We debated just getting rid of it last time this came up and
there was no real objection to removing it. Now it's the biggest
scalability limitation for debug kernels even on smallish machines,
so let's just get rid of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25 16:47:52 -07:00
Dave Chinner 756b1c3433 xfs: use current->journal_info for detecting transaction recursion
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.
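
As a rough userspace analogy of the per-task check, the sketch below uses a thread-local pointer in place of current->journal_info. The demo_* names are hypothetical and C11 _Thread_local is an assumption of this example; it is not the kernel mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_trans {
	int	id;
};

/* Per-thread slot playing the role of current->journal_info. */
static _Thread_local struct demo_trans *demo_journal_info;

/* True if this thread already owns a transaction. */
static bool demo_in_transaction(void)
{
	return demo_journal_info != NULL;
}

/* Record the transaction as owned by this thread; refuse to nest. */
static bool demo_trans_set_context(struct demo_trans *tp)
{
	if (demo_in_transaction())
		return false;
	demo_journal_info = tp;
	return true;
}

/* Clear ownership when the transaction commits or is cancelled. */
static void demo_trans_clear_context(struct demo_trans *tp)
{
	assert(demo_journal_info == tp);
	demo_journal_info = NULL;
}
```

Because the marker is an explicit pointer owned by the filesystem code, the check no longer depends on a memory allocation flag that other subsystems may also set.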

[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]

Fixes: 9070733b4e ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25 08:07:04 -08:00
Darrick J. Wong 9febcda6f8 xfs: don't nest transactions when scanning for eofblocks
Brian Foster reported a lockdep warning on xfs/167:

============================================
WARNING: possible recursive locking detected
5.11.0-rc4 #35 Tainted: G        W I
--------------------------------------------
fsstress/17733 is trying to acquire lock:
ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs]

but task is already holding lock:
ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs]

stack backtrace:
CPU: 38 PID: 17733 Comm: fsstress Tainted: G        W I       5.11.0-rc4 #35
Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
Call Trace:
 dump_stack+0x8b/0xb0
 __lock_acquire.cold+0x159/0x2ab
 lock_acquire+0x116/0x370
 xfs_trans_alloc+0x1ad/0x310 [xfs]
 xfs_free_eofblocks+0x104/0x1d0 [xfs]
 xfs_blockgc_scan_inode+0x24/0x60 [xfs]
 xfs_inode_walk_ag+0x202/0x4b0 [xfs]
 xfs_inode_walk+0x66/0xc0 [xfs]
 xfs_trans_alloc+0x160/0x310 [xfs]
 xfs_trans_alloc_inode+0x5f/0x160 [xfs]
 xfs_alloc_file_space+0x105/0x300 [xfs]
 xfs_file_fallocate+0x270/0x460 [xfs]
 vfs_fallocate+0x14d/0x3d0
 __x64_sys_fallocate+0x3e/0x70
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The cause of this is the new code that spurs a scan to garbage collect
speculative preallocations if we fail to reserve enough blocks while
allocating a transaction.  While the warning itself is a fairly benign
lockdep complaint, it does expose a potential livelock if the rwsem
behavior ever changes with regard to nesting read locks when someone's
waiting for a write lock.

Fix this by freeing the transaction and jumping back to xfs_trans_alloc
like this patch in the V4 submission[1].

[1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/
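
The shape of the fix (cancel first, scan, then start over) can be sketched as follows. Everything here is a simplified userspace model with hypothetical demo_* names; the counters exist only to show that the scan never runs while the caller's transaction is still open.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: free blocks plus reclaimable speculative preallocations. */
struct demo_fs {
	long	free_blocks;
	long	speculative_blocks;
	int	open_transactions;	/* transactions currently open */
	int	max_nested;		/* high-water mark of the above */
};

static void demo_trans_open(struct demo_fs *fs)
{
	fs->open_transactions++;
	if (fs->open_transactions > fs->max_nested)
		fs->max_nested = fs->open_transactions;
}

static int demo_reserve(struct demo_fs *fs, long blocks)
{
	if (fs->free_blocks < blocks)
		return -ENOSPC;
	fs->free_blocks -= blocks;
	return 0;
}

/* Garbage-collect speculative preallocations; this opens its own transaction. */
static void demo_blockgc(struct demo_fs *fs)
{
	demo_trans_open(fs);
	fs->free_blocks += fs->speculative_blocks;
	fs->speculative_blocks = 0;
	fs->open_transactions--;
}

/*
 * Allocate a transaction with a block reservation.  On -ENOSPC, cancel the
 * transaction first, then run the scan, then retry once from the top.
 */
static int demo_trans_alloc(struct demo_fs *fs, long blocks)
{
	bool want_retry = true;
	int error;

retry:
	demo_trans_open(fs);
	error = demo_reserve(fs, blocks);
	if (error == -ENOSPC && want_retry) {
		fs->open_transactions--;	/* cancel before scanning */
		demo_blockgc(fs);
		want_retry = false;
		goto retry;
	}
	if (error)
		fs->open_transactions--;	/* cancel on final failure */
	return error;
}
```

If the scan were called before the cancel, max_nested would reach 2, which is the nesting that lockdep complained about.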

Fixes: a1a7d05a05 ("xfs: flush speculative space allocations when we run out of space")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25 08:07:04 -08:00
Darrick J. Wong a1a7d05a05 xfs: flush speculative space allocations when we run out of space
If a fs modification (creation, file write, reflink, etc.) is unable to
reserve enough space to handle the modification, try clearing whatever
space the filesystem might have been hanging onto in the hopes of
speeding up the filesystem.  The flushing behavior will become
particularly important when we add deferred inode inactivation because
that will increase the amount of space that isn't actively tied to user
data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 758303d144 xfs: flush eof/cowblocks if we can't reserve quota for chown
If a file user, group, or project change is unable to reserve enough
quota to handle the modification, try clearing whatever space the
filesystem might have been hanging onto in the hopes of speeding up the
filesystem.  The flushing behavior will become particularly important
when we add deferred inode inactivation because that will increase the
amount of space that isn't actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong c237dd7c70 xfs: flush eof/cowblocks if we can't reserve quota for inode creation
If an inode creation is unable to reserve enough quota to handle the
modification, try clearing whatever space the filesystem might have been
hanging onto in the hopes of speeding up the filesystem.  The flushing
behavior will become particularly important when we add deferred inode
inactivation because that will increase the amount of space that isn't
actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 766aabd599 xfs: flush eof/cowblocks if we can't reserve quota for file blocks
If a fs modification (data write, reflink, xattr set, fallocate, etc.)
is unable to reserve enough quota to handle the modification, try
clearing whatever space the filesystem might have been hanging onto in
the hopes of speeding up the filesystem.  The flushing behavior will
become particularly important when we add deferred inode inactivation
because that will increase the amount of space that isn't actively tied
to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 5c615f0feb xfs: remove xfs_qm_vop_chown_reserve
Now that the only caller of this function is xfs_trans_alloc_ichange,
just open-code the meat of _chown_reserve in that caller.  Drop the
(redundant) [ugp]id checks because xfs has a 1:1 relationship between
quota ids and incore dquots.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 7317a03df7 xfs: refactor inode ownership change transaction/inode/quota allocation idiom
For file ownership (uid, gid, prid) changes, create a new helper
xfs_trans_alloc_ichange that allocates a transaction and reserves the
appropriate amount of quota against that transaction in preparation for a
change of user, group, or project id.  Replace all the open-coded idioms
with a single call to this helper so that we can contain the retry loops
in the next patchset.

This changes the locking behavior for ichange transactions slightly.
Since tr_ichange does not have a permanent reservation and cannot roll,
we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked
automatically at commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f2f7b9ff62 xfs: refactor inode creation transaction/inode/quota allocation idiom
For file creation, create a new helper xfs_trans_alloc_icreate that
allocates a transaction and reserves the appropriate amount of quota
against that transaction.  Replace all the open-coded idioms with a
single call to this helper so that we can contain the retry loops in the
next patchset.

This changes the locking behavior for non-tempfile creation slightly, in
that we now make the quota reservation without holding the directory
ILOCK.  While the dquots chosen for inode creation are based on the
directory state at a given point in time, the directory ILOCK was
released as soon as the dquot references are picked up.  Hence it was
never necessary to hold the directory ILOCK for the quota reservation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 3de4eb106f xfs: allow reservation of rtblocks with xfs_trans_alloc_inode
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 3a1af6c317 xfs: refactor common transaction/inode/quota allocation idiom
Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transaction.  Then replace all the open-coded
idioms with a single call to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Dave Chinner e82226138b xfs: remove xfs_buf_t typedef
Prepare for kernel xfs_buf alignment by getting rid of the
xfs_buf_t typedef from userspace.

[darrick: This patch is a port of a userspace patch removing the
xfs_buf_t typedef in preparation to make the userspace xfs_buf code
behave more like its kernel counterpart.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-12-16 16:07:34 -08:00
Kaixu Xia d6b8fc6c7a xfs: do the assert for all the log done items in xfs_trans_cancel
We should do the assert for all the log intent-done items if they appear
here. This patch detects intent-done items by the fact that their item ops
don't have iop_unpin and iop_push methods and also moves the helper
xlog_item_is_intent to xfs_trans.h.
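
A sketch of the detection rule described above, using a cut-down method table. The demo_* structures and the two sample ops tables are hypothetical; only the "neither method present" test reflects the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_log_item;

/* Hypothetical subset of a log item's method table. */
struct demo_item_ops {
	void	(*iop_unpin)(struct demo_log_item *lip);
	void	(*iop_push)(struct demo_log_item *lip);
};

struct demo_log_item {
	const struct demo_item_ops	*li_ops;
};

/* An item whose ops provide neither method is treated as intent-done. */
static bool demo_item_is_intent_done(const struct demo_log_item *lip)
{
	return lip->li_ops->iop_unpin == NULL &&
	       lip->li_ops->iop_push == NULL;
}

static void demo_noop(struct demo_log_item *lip)
{
	(void)lip;
}

/* An ordinary item implements both methods; a done item implements neither. */
static const struct demo_item_ops demo_inode_ops = {
	.iop_unpin	= demo_noop,
	.iop_push	= demo_noop,
};
static const struct demo_item_ops demo_done_ops = {
	.iop_unpin	= NULL,
	.iop_push	= NULL,
};
```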

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-25 11:34:07 -07:00
Christoph Hellwig cead0b10f5 xfs: simplify xfs_trans_getsb
Remove the mp argument as this function is only called in transaction
context, and open code xfs_getsb given that the function already accesses
the buffer pointer in the mount point directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15 20:52:39 -07:00
Carlos Maiolino 32a2b11f46 xfs: Remove kmem_zone_zalloc() usage
Use kmem_cache_zalloc() directly.

The exception is xlog_ticket_alloc(), which will be dealt with in the
next patch for readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-07-28 20:24:14 -07:00
Brian Foster f74681ba20 xfs: preserve rmapbt swapext block reservation from freed blocks
The rmapbt extent swap algorithm remaps individual extents between
the source inode and the target to trigger reverse mapping metadata
updates. If either inode straddles a format or other bmap allocation
boundary, the individual unmap and map cycles can trigger repeated
bmap block allocations and frees as the extent count bounces back
and forth across the boundary. While net block usage is bound across
the swap operation, this behavior can prematurely exhaust the
transaction block reservation because it continuously drains as the
transaction rolls. Each allocation accounts against the reservation
and each free returns to global free space on transaction roll.

The previous workaround to this problem attempted to detect this
boundary condition and provide surplus block reservation to
accommodate it. This is insufficient because more remaps can occur
than implied by the extent counts; if start offset boundaries are
not aligned between the two inodes, for example.

To address this problem more generically and dynamically, add a
transaction accounting mode that returns freed blocks to the
transaction reservation instead of the superblock counters on
transaction roll and use it when the rmapbt based algorithm is
active. This allows the chain of remap transactions to preserve the
block reservation based on its own frees and prevent premature
exhaustion regardless of the remap pattern. Note that this is only
safe for superblocks with lazy sb accounting, but the latter is
required for v5 supers and the rmap feature depends on v5.
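
The effect of the new accounting mode can be seen in a toy model. The demo_* names and the res_fdblks flag are hypothetical stand-ins, and the "bounce" loop is a deliberately simplified picture of an extent count crossing a format boundary back and forth.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_fs {
	long	free_blocks;
};

struct demo_trans {
	struct demo_fs	*fs;
	long		blk_res;	/* remaining block reservation */
	long		freed;		/* blocks freed since the last roll */
	bool		res_fdblks;	/* hypothetical: keep frees in the reservation */
};

/* Allocation always draws down the reservation; false once it is exhausted. */
static bool demo_alloc_block(struct demo_trans *tp)
{
	if (tp->blk_res == 0)
		return false;
	tp->blk_res--;
	return true;
}

static void demo_free_block(struct demo_trans *tp)
{
	tp->freed++;
}

/* Roll: decide where the blocks freed in this step go. */
static void demo_roll(struct demo_trans *tp)
{
	if (tp->res_fdblks)
		tp->blk_res += tp->freed;		/* back into the reservation */
	else
		tp->fs->free_blocks += tp->freed;	/* back to global free space */
	tp->freed = 0;
}

/* Each cycle allocates one block and frees one; returns cycles completed. */
static int demo_bounce(struct demo_trans *tp, int cycles)
{
	int done = 0;

	while (done < cycles && demo_alloc_block(tp)) {
		demo_free_block(tp);
		demo_roll(tp);
		done++;
	}
	return done;
}
```

With a reservation of 3 blocks, the default mode stops after 3 cycles even though net usage is zero, while the new mode can keep going for as many cycles as the remap needs.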

Fixes: b3fed43482 ("xfs: account format bouncing into rmapbt swapext tx reservation")
Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06 10:46:56 -07:00
Dave Chinner b41b46c20c xfs: remove the m_active_trans counter
It's a global atomic counter, and we are hitting it at a rate of
half a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. We don't actually
need it anymore - it used to be required because the VFS freeze code
could not track/prevent filesystem transactions that were running,
but that problem no longer exists.

Hence to remove the counter, we simply have to ensure that nothing
calls xfs_sync_sb() while we are trying to quiesce the filesystem.
That only happens if the log worker is still running when we call
xfs_quiesce_attr(). The log worker is cancelled at the end of
xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
early here and then we can remove the counter altogether.

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

		unpatched	patched
machine A:	3m16s		2m00s
machine B:	4m04s		4m05s

Create rates:
		unpatched	patched
machine A:	282k+/-31k	468k+/-21k
machine B:	231k+/-8k	233k+/-11k

Concurrent rm of same 50 million inodes:

		unpatched	patched
machine A:	6m42s		2m33s
machine B:	4m47s		4m47s

The transaction rate on the fast machine went from just under
300k/sec to 700k/sec, which indicates just how much of a bottleneck
this atomic counter was.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27 08:49:25 -07:00
Dave Chinner f18c9a9030 xfs: reduce free inode accounting overhead
Shaokun Zhang reported that XFS was using substantial CPU time in
percpu_counter_sum(), called from xfs_mod_ifree(), when running a
single-threaded benchmark on a high CPU count (128p) machine. The issue
is that the filesystem is empty when the benchmark runs, so inode
allocation is running with a very low inode free count.

With the percpu counter batching, this means comparisons when the
counter is less than 128 * 256 = 32768 use the slow path of adding
up all the counters across the CPUs, and this is expensive on high
CPU count machines.
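
The threshold arithmetic can be written out as a small model. This is a sketch of the idea behind a batched per-CPU counter comparison, not the kernel function; the demo_ name is hypothetical, and reading 128 as the CPU count and 256 as the batch size is an assumption about which factor is which.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Model of a batched per-CPU counter comparison.  The cheap global count
 * can be off by up to batch * ncpus, so it is only trusted when it is
 * further than that from the value being compared against.  Otherwise
 * the caller has to take the slow path and sum every CPU's counter.
 */
static bool demo_compare_needs_sum(long long approx_count, long long rhs,
				   int batch, int ncpus)
{
	return llabs(approx_count - rhs) <= (long long)batch * ncpus;
}
```

On a freshly made filesystem the free inode count stays well under 32768, so in this model every comparison against zero lands on the slow path.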

The summing in xfs_mod_ifree() is only used to fire an assert if an
underrun occurs. The error is ignored by the higher level code.
Hence this is really just debug code and we don't need to run it
on production kernels, nor do we need such debug checks to return
error values just to trigger an assert.

Finally, xfs_mod_icount/xfs_mod_ifree are only called from
xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
directly call the percpu_counter_add/percpu_counter_compare
functions. The compare functions are now run only on debug builds as
they are internal to ASSERT() checks and so only compiled in when
ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
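
To show why putting the comparison inside the assertion removes its cost from production builds, here is a minimal sketch. DEMO_DEBUG and DEMO_ASSERT are hypothetical stand-ins for the kernel config options and ASSERT(), and the non-debug definition used here is one common way to compile a check out, not necessarily the one XFS uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the "expensive" check actually runs. */
static int demo_slow_checks;

static bool demo_expensive_check(long value)
{
	demo_slow_checks++;
	return value >= 0;
}

/* Hypothetical debug switch standing in for the kernel config options. */
#ifdef DEMO_DEBUG
#define DEMO_ASSERT(expr)	assert(expr)
#else
/* sizeof does not evaluate its operand, so the check is never executed. */
#define DEMO_ASSERT(expr)	((void)sizeof(expr))
#endif

/* Update the counter; the underrun check exists only in debug builds. */
static void demo_mod_counter(long *counter, long delta)
{
	*counter += delta;
	DEMO_ASSERT(demo_expensive_check(*counter));
}
```

Built without DEMO_DEBUG, demo_mod_counter() performs only the addition and demo_slow_checks stays at zero.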

Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27 08:49:25 -07:00
Dave Chinner dc3ffbb140 xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()
From: Dave Chinner <dchinner@redhat.com>

The error handling in xfs_trans_unreserve_and_mod_sb() is largely
incorrect - rolling back the changes in the transaction if only one
counter underruns makes all the other counters incorrect. We still
allow the change to proceed and commit the transaction, except
now we have multiple incorrect counters instead of a single
underflow.

Further, we don't actually report the error to the caller, so this
is completely silent except on debug kernels that will assert on
failure before we even get to the rollback code.  Hence this error
handling is broken, untested, and largely unnecessary complexity.

Just remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27 08:49:25 -07:00