Commit Graph

446 Commits

Author SHA1 Message Date
Bill O'Donnell c4c12defcd xfs: restrict when we try to align cow fork delalloc to cowextsz hints
JIRA: https://issues.redhat.com/browse/RHEL-68860

commit 288e1f693f04e66be99f27e7cbe4a45936a66745
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Jun 19 10:32:44 2024 -0700

    xfs: restrict when we try to align cow fork delalloc to cowextsz hints

    xfs/205 produces the following failure when always_cow is enabled:

      --- a/tests/xfs/205.out       2024-02-28 16:20:24.437887970 -0800
      +++ b/tests/xfs/205.out.bad   2024-06-03 21:13:40.584000000 -0700
      @@ -1,4 +1,5 @@
       QA output created by 205
       *** one file
      +   !!! disk full (expected)
       *** one file, a few bytes at a time
       *** done

    This is the result of overly aggressive attempts to align cow fork
    delalloc reservations to the CoW extent size hint.  Looking at the trace
    data, we're trying to append a single fsblock to the "fred" file.
    Trying to create a speculative post-eof reservation fails because
    there's not enough space.

    We then set @prealloc_blocks to zero and try again, but the cowextsz
    alignment code triggers, which expands our request for a 1-fsblock
    reservation into a 39-block reservation.  There's not enough space for
    that, so the whole write fails with ENOSPC even though there's
    sufficient space in the filesystem to allocate the single block that we
    need to land the write.

    There are two things wrong here -- first, we shouldn't be attempting
    speculative preallocations beyond what was requested when we're low on
    space.  Second, if we've already computed a posteof preallocation, we
    shouldn't bother trying to align that to the cowextsize hint.

    Fix both of these problems by adding a flag that only enables the
    expansion of the delalloc reservation to the cowextsize if we're doing a
    non-extending write, and only if we're not doing an ENOSPC retry.  This
    requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.

    I probably should have caught this six years ago when 6ca30729c2 was
    being reviewed, but oh well.  Update the comments to reflect what the
    code does now.

    Fixes: 6ca30729c2 ("xfs: bmap code cleanup")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2025-01-10 16:38:25 -06:00
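
The rule described in c4c12defcd above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `struct resv`, `reserve_cow_delalloc`, and its boolean parameters are hypothetical names invented for this example, and the real alignment code in the kernel handles many more cases. It shows the hint widening a reservation only for a non-extending write that is not an ENOSPC retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a delalloc reservation, in filesystem blocks. */
struct resv {
	uint64_t off;	/* first block of the reservation */
	uint64_t len;	/* number of blocks reserved */
};

/*
 * Widen [off, off + len) to CoW extent size hint boundaries, but only for
 * a non-extending write that is not an ENOSPC retry.  Otherwise reserve
 * exactly what was asked for.
 */
static struct resv
reserve_cow_delalloc(uint64_t off, uint64_t len, uint64_t cowextsz,
		     bool extending_write, bool enospc_retry)
{
	struct resv r = { .off = off, .len = len };
	bool use_hint = cowextsz > 0 && !extending_write && !enospc_retry;

	assert(len > 0);
	if (use_hint) {
		uint64_t start = off - (off % cowextsz);	/* round down */
		uint64_t end = off + len;

		if (end % cowextsz)				/* round up */
			end += cowextsz - (end % cowextsz);
		r.off = start;
		r.len = end - start;
	}
	return r;
}
```

With a hint of 32 blocks, a 1-block request at offset 40 grows to 32 blocks on the first attempt but stays at 1 block on the ENOSPC retry, which is what lets the small write land when space is tight.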
Bill O'Donnell 03cd6113f0 xfs: fix backwards logic in xfs_bmap_alloc_account
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit d61b40bf15ce453f3aa71f6b423938e239e7f8f8
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Jan 8 18:17:34 2024 -0800

    xfs: fix backwards logic in xfs_bmap_alloc_account

    We're only allocating from the realtime device if the inode is marked
    for realtime and we're /not/ allocating into the attr fork.

    Fixes: 58643460546d ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations")
    Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:20 -06:00
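
The condition stated in 03cd6113f0 above can be written as a one-line predicate. This is a hedged sketch, not the kernel code: `enum fork` and `allocates_from_rt` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fork selector for this example. */
enum fork { DATA_FORK, ATTR_FORK };

/*
 * An allocation comes from the realtime device only when the inode is
 * marked realtime AND the target is not the attr fork.
 */
static bool
allocates_from_rt(bool inode_is_realtime, enum fork whichfork)
{
	return inode_is_realtime && whichfork != ATTR_FORK;
}
```

The "backwards" variant would drop the negation and treat attr fork allocations on realtime inodes as realtime allocations.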
Bill O'Donnell 86c0442471 xfs: make if_data a void pointer
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 6e145f943bd86be47e54101fa5939f9ed0cb73e5
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Dec 20 07:34:55 2023 +0100

    xfs: make if_data a void pointer

    The xfs_ifork structure currently has a union of the if_root void pointer
    and the if_data char pointer.  In either case it is an opaque pointer
    that depends on the fork format.  Replace the union with a single if_data
    void pointer as that is what almost all callers want.  Only the symlink
    NULL termination code in xfs_init_local_fork actually needs a new local
    variable now.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:17 -06:00
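
To illustrate why 86c0442471 above says only the symlink NUL termination needs a local variable, here is a minimal userspace sketch. The `struct ifork` below is a simplified stand-in, and `init_local_symlink` is a hypothetical helper, not the kernel's xfs_init_local_fork.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the fork structure: one opaque pointer. */
struct ifork {
	void	*if_data;	/* meaning depends on the fork format */
	size_t	if_bytes;
};

/*
 * Copy a local-format symlink target and NUL-terminate it.  Indexing
 * requires a char pointer, so a local variable is used before the
 * buffer is stored in the void pointer.  Returns 0, or -1 on allocation
 * failure.  src must point to at least size bytes.
 */
static int
init_local_symlink(struct ifork *ifp, const void *src, size_t size)
{
	char *new_data = malloc(size + 1);

	if (!new_data)
		return -1;
	memcpy(new_data, src, size);
	new_data[size] = '\0';
	ifp->if_data = new_data;
	ifp->if_bytes = size;
	return 0;
}
```

Callers that only pass the pointer along need no cast at all, which is the reason a single void pointer suits most of them.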
Bill O'Donnell 55fbfb062a xfs: indicate if xfs_bmap_adjacent changed ap->blkno
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 676544c27e710aee7f8357f57abd348d98b1ccd4
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Dec 18 05:57:24 2023 +0100

    xfs: indicate if xfs_bmap_adjacent changed ap->blkno

    Add a return value to xfs_bmap_adjacent to indicate if it did change
    ap->blkno or not.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:13 -06:00
Bill O'Donnell b97b0b0701 xfs: also use xfs_bmap_btalloc_accounting for RT allocations
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 58643460546da1dc61593fc6fd78762798b4534f
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Dec 18 05:57:20 2023 +0100

    xfs: also use xfs_bmap_btalloc_accounting for RT allocations

    Make xfs_bmap_btalloc_accounting more generic by handling the RT quota
    reservations and then also use it from xfs_bmap_rtalloc instead of
    open coding the accounting logic there.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:12 -06:00
Bill O'Donnell 2e13a658c5 xfs: remove the xfs_alloc_arg argument to xfs_bmap_btalloc_accounting
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit eef519d746bbfb90cbad4077c2d39d7a359c3282
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Dec 18 05:57:19 2023 +0100

    xfs: remove the xfs_alloc_arg argument to xfs_bmap_btalloc_accounting

    xfs_bmap_btalloc_accounting only uses the len field from args, but that
    has just been propagated to ap->length field by the caller.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:12 -06:00
Bill O'Donnell 20b00ecd81 xfs: create a new inode fork block unmap helper
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit a59eb5fc21b2a6dc160ee6cdf77f20bc186a88fd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Dec 15 10:03:43 2023 -0800

    xfs: create a new inode fork block unmap helper

    Create a new helper to unmap blocks from an inode's fork.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:09 -06:00
Bill O'Donnell 070bdf384b xfs: zap broken inode forks
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit e744cef206055954517648070d2b3aaa3d2515ba
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Dec 15 10:03:37 2023 -0800

    xfs: zap broken inode forks

    Determine if inode fork damage is responsible for the inode being unable
    to pass the ifork verifiers in xfs_iget and zap the fork contents if
    this is true.  Once this is done the fork will be empty but we'll be
    able to construct an in-core inode, and a subsequent call to the inode
    fork repair ioctl will search the rmapbt to rebuild the records that
    were in the fork.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:06 -06:00
Bill O'Donnell 488519dd5a xfs: pass the defer ops directly to xfs_defer_add
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 603ce8ab12094a2d9483c79a7541335e258a5328
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Dec 13 10:06:33 2023 +0100

    xfs: pass the defer ops directly to xfs_defer_add

    Pass a pointer to the xfs_defer_op_type structure to xfs_defer_add and
    remove the indirection through the xfs_defer_ops_type enum and a global
    table of all possible operations.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:00 -06:00
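
The interface change in 488519dd5a above, passing the operations structure directly instead of an enum that indexes a global table, can be sketched as follows. All names here (`struct defer_op_type`, `defer_add`, `defer_finish`, `extent_free_defer_type`) are simplified assumptions for illustration and do not mirror the kernel signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations structure for one kind of deferred work. */
struct defer_op_type {
	const char *name;
	int (*finish_item)(void *item);
};

/* One queued piece of work, holding its ops pointer directly. */
struct deferred {
	const struct defer_op_type *ops;
	void *item;
};

/* Example finish handler: counts how often it ran. */
static int
extent_free_finish(void *item)
{
	*(int *)item += 1;
	return 0;
}

static const struct defer_op_type extent_free_defer_type = {
	.name = "extent_free",
	.finish_item = extent_free_finish,
};

/* The caller hands over the ops pointer; no enum, no lookup table. */
static void
defer_add(struct deferred *d, void *item, const struct defer_op_type *ops)
{
	d->ops = ops;
	d->item = item;
}

static int
defer_finish(struct deferred *d)
{
	return d->ops->finish_item(d->item);
}
```

A design consequence is that each operation type can be defined next to the code that uses it, with no central table to keep in sync.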
Bill O'Donnell 8ba9e90cd1 xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit e6af9c98cbf0164a619d95572136bfb54d482dd6
Author: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Date:   Tue Dec 5 13:58:58 2023 +0800

    xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real

    In the case of returning -ENOSPC, ensure *logflagsp is initialized to 0.
    Otherwise the caller, __xfs_bunmapi, will write an uninitialized
    tmp_logflags value into the xfs log, which might cause unpredictable
    errors during log recovery.

    Also, remove the flags variable and set *logflagsp directly, which
    should make the code more robust in the long run.

    Fixes: 1b24b633aa ("xfs: move some more code into xfs_bmap_del_extent_real")
    Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:57 -06:00
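
The pattern behind 8ba9e90cd1 above, giving an output parameter a defined value before any early return, can be sketched in a few lines. `del_extent` and `LOG_CORE` are hypothetical names for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define LOG_CORE 0x1	/* hypothetical log flag for this sketch */

/*
 * Set *logflagsp before any early return so the caller never reads an
 * uninitialized value, including on the -ENOSPC path.
 */
static int
del_extent(bool out_of_space, int *logflagsp)
{
	*logflagsp = 0;

	if (out_of_space)
		return -ENOSPC;

	*logflagsp |= LOG_CORE;
	return 0;
}
```

Writing through the pointer directly, rather than through a local that is copied out at the end, removes the chance of an exit path that skips the copy.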
Bill O'Donnell 059e57c35e xfs: remove __xfs_free_extent_later
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 4c88fef3af4a51c2cdba6a28237e98da4873e8dc
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Dec 6 18:40:57 2023 -0800

    xfs: remove __xfs_free_extent_later

    xfs_free_extent_later is a trivial helper, so remove it to reduce the
    amount of thinking required to understand the deferred freeing
    interface.  This will make it easier to introduce automatic reaping of
    speculative allocations in the next patch.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:54 -06:00
Bill O'Donnell d97dd3291d xfs: convert do_div calls to xfs_rtb_to_rtx helper calls
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 055641248f649b52620a5fe8774bea253690e057
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:37:47 2023 -0700

    xfs: convert do_div calls to xfs_rtb_to_rtx helper calls

    Convert these calls to use the helpers, and clean up all these places
    where the same variable can have different units depending on where it
    is in the function.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:37 -06:00
Bill O'Donnell 747c7f6fdf xfs: create helpers to convert rt block numbers to rt extent numbers
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 5dc3a80d46a450481df7f7e9fe673ba3eb4514c3
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:37:07 2023 -0700

    xfs: create helpers to convert rt block numbers to rt extent numbers

    Create helpers to do unit conversions of rt block numbers to rt extent
    numbers.  There are three variations -- one to compute the rt extent
    number from an rt block number; one to compute the offset of an rt block
    within an rt extent; and one to extract both.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:37 -06:00
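
The three conversion variants described in 747c7f6fdf above can be sketched as plain integer arithmetic. This is an assumption-laden model: the function names below are hypothetical, and they take the rt extent size (in blocks) as an explicit parameter rather than reading it from a mount structure.

```c
#include <assert.h>
#include <stdint.h>

/* rt extent number that contains rt block rtbno */
static uint64_t
rtb_to_rtx(uint64_t rtbno, uint32_t rextsize)
{
	assert(rextsize > 0);
	return rtbno / rextsize;
}

/* offset (in blocks) of rtbno within its rt extent */
static uint32_t
rtb_to_rtxoff(uint64_t rtbno, uint32_t rextsize)
{
	assert(rextsize > 0);
	return (uint32_t)(rtbno % rextsize);
}

/* both at once: returns the rt extent number, stores the offset in *off */
static uint64_t
rtb_to_rtxrem(uint64_t rtbno, uint32_t rextsize, uint32_t *off)
{
	assert(rextsize > 0);
	*off = (uint32_t)(rtbno % rextsize);
	return rtbno / rextsize;
}
```

With 4 blocks per rt extent, rt block 10 lies in rt extent 2 at offset 2. Naming the unit in each helper is what prevents one variable from silently changing units partway through a function.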
Bill O'Donnell 0a4322cf76 xfs: create a helper to compute leftovers of realtime extents
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 68db60bf01c131c09bbe35adf43bd957a4c124bc
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:34:39 2023 -0700

    xfs: create a helper to compute leftovers of realtime extents

    Create a helper to compute the misalignment between a file extent
    (xfs_extlen_t) and a realtime extent.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:36 -06:00
Bill O'Donnell 13f201e90e xfs: rename xfs_verify_rtext to xfs_verify_rtbext
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 3d2b6d034f0feb7741b313f978a2fe45e917e1be
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:31:22 2023 -0700

    xfs: rename xfs_verify_rtext to xfs_verify_rtbext

    This helper function validates that a range of *blocks* in the
    realtime section is completely contained within the realtime section.
    It does /not/ validate ranges of *rtextents*.  Rename the function to
    avoid suggesting that it does, and change the type of the @len parameter
    since xfs_rtblock_t is a position unit, not a length unit.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:35 -06:00
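
A block-range check of the kind 13f201e90e above describes might look like the sketch below. The typedefs and function names are hypothetical, and the size of the realtime section is passed in directly as `rblocks`. Separate typedefs for position and length mirror the point the commit makes about units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t rtblock_t;	/* position of a block in the rt section */
typedef uint64_t filblks_t;	/* length, in blocks */

/* Is this single block inside a realtime section of rblocks blocks? */
static bool
verify_rtbno(uint64_t rblocks, rtblock_t rtbno)
{
	return rtbno < rblocks;
}

/* Is the whole block range [rtbno, rtbno + len) inside the section? */
static bool
verify_rtbext(uint64_t rblocks, rtblock_t rtbno, filblks_t len)
{
	if (rtbno + len <= rtbno)	/* zero length or wraparound */
		return false;
	if (!verify_rtbno(rblocks, rtbno))
		return false;
	return verify_rtbno(rblocks, rtbno + len - 1);
}
```

Note that every quantity here is counted in blocks; nothing in the check knows about rt extents, which is why a name that mentions blocks is the less misleading one.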
Bill O'Donnell 1aef19b26a xfs: move the xfs_rtbitmap.c declarations to xfs_rtbitmap.h
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 13928113fc5b5e79c91796290a99ed991ac0efe2
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:21:47 2023 -0700

    xfs: move the xfs_rtbitmap.c declarations to xfs_rtbitmap.h

    Move all the declarations for functionality in xfs_rtbitmap.c into a
    separate xfs_rtbitmap.h header file.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:34 -06:00
Bill O'Donnell 22edee70ec xfs: hoist freeing of rt data fork extent mappings
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 6c664484337b37fa0cf6e958f4019623e30d40f7
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:16:22 2023 -0700

    xfs: hoist freeing of rt data fork extent mappings

    Currently, xfs_bmap_del_extent_real contains a bunch of code to convert
    the physical extent of a data fork mapping for a realtime file into rt
    extents and pass that to the rt extent freeing function.  Since the
    details of this aren't needed when CONFIG_XFS_REALTIME=n, move it to
    xfs_rtbitmap.c to reduce code size when realtime isn't enabled.

    This will (one day) enable realtime EFIs to reuse the same
    unit-converting call with less code duplication.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:33 -06:00
Bill O'Donnell ceb975e543 xfs: fix units conversion error in xfs_bmap_del_extent_delay
JIRA: https://issues.redhat.com/browse/RHEL-25419

commit ddd98076d5c075c8a6c49d9e6e8ee12844137f23
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Oct 16 09:21:37 2023 -0700

    xfs: fix units conversion error in xfs_bmap_del_extent_delay

    The unit conversions in this function do not make sense.  First we
    convert a block count to bytes, then divide that bytes value by
    rextsize, which is in blocks, to get an rt extent count.  You can't
    divide bytes by blocks to get a (possibly multiblock) extent value.

    Fortunately nobody uses delalloc on the rt volume so this hasn't
    mattered.

    Fixes: fa5c836ca8 ("xfs: refactor xfs_bunmapi_cow")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-06 10:32:56 -05:00
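
The unit mismatch described in ceb975e543 above is easy to see with concrete numbers. The two functions below are a hypothetical illustration, not the kernel code: one reproduces the mistaken bytes-divided-by-blocks computation, the other divides blocks by blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Mistaken form: converts to bytes first, then divides by a block count. */
static uint64_t
rtx_count_wrong(uint64_t blocks, uint32_t blocksize, uint32_t rextsize)
{
	return (blocks * blocksize) / rextsize;
}

/* Consistent units: blocks divided by blocks-per-rt-extent. */
static uint64_t
rtx_count(uint64_t blocks, uint32_t rextsize)
{
	return blocks / rextsize;
}
```

For 8 blocks of 4096 bytes with 4 blocks per rt extent, the consistent form gives 2 rt extents while the mistaken form gives 8192.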
Bill O'Donnell 53919fd386 xfs: use deferred frees for btree block freeing
JIRA: https://issues.redhat.com/browse/RHEL-25419

Conflicts: context

commit b742d7b4f0e03df25c2a772adcded35044b625ca
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Jun 28 11:04:32 2023 -0700

    xfs: use deferred frees for btree block freeing

    Btrees that aren't freespace management trees use the normal extent
    allocation and freeing routines for their blocks. Hence when a btree
    block is freed, a direct call to xfs_free_extent() is made and the
    extent is immediately freed. This puts the entire free space
    management btrees under this path, so we are stacking btrees on
    btrees in the call stack. The inobt, finobt and refcount btrees
    all do this.

    However, the bmap btree does not do this - it calls
    xfs_free_extent_later() to defer the extent free operation via an
    XEFI and hence it gets processed in deferred operation processing
    during the commit of the primary transaction (i.e. via intent
    chaining).

    We need to change xfs_free_extent() to behave in a non-blocking
    manner so that we can avoid deadlocks with busy extents near ENOSPC
    in transactions that free multiple extents. Inserting or removing a
    record from a btree can cause a multi-level tree merge operation and
    that will free multiple blocks from the btree in a single
    transaction. i.e. we can call xfs_free_extent() multiple times, and
    hence the btree manipulation transaction is vulnerable to this busy
    extent deadlock vector.

    To fix this, convert all the remaining callers of xfs_free_extent()
    to use xfs_free_extent_later() to queue XEFIs and hence defer
    processing of the extent frees to a context that can be safely
    restarted if a deadlock condition is detected.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-06 10:32:52 -05:00
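
The difference between freeing immediately and queueing the free, as described in 53919fd386 above, can be modelled with a tiny queue. This is a userspace sketch under loud assumptions: `struct defer_queue`, `free_extent_later`, and `defer_finish` are hypothetical simplifications, and no intent logging or restart handling is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8	/* arbitrary capacity for this sketch */

struct extent {
	uint64_t start;
	uint32_t len;
};

struct defer_queue {
	struct extent pending[MAX_PENDING];
	size_t nr_pending;
	uint64_t blocks_freed;	/* total blocks released so far */
};

/* Queue an extent free instead of performing it right away. */
static int
free_extent_later(struct defer_queue *dq, uint64_t start, uint32_t len)
{
	if (dq->nr_pending == MAX_PENDING)
		return -1;
	dq->pending[dq->nr_pending++] =
		(struct extent){ .start = start, .len = len };
	return 0;
}

/* Process all queued frees in order; returns how many were handled. */
static size_t
defer_finish(struct defer_queue *dq)
{
	size_t done = dq->nr_pending;

	for (size_t i = 0; i < done; i++)
		dq->blocks_freed += dq->pending[i].len;
	dq->nr_pending = 0;
	return done;
}
```

A btree merge that releases several blocks would call the queueing function once per block, and all of the actual frees would then happen in the later processing step.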
Bill O'Donnell a8cc7b7360 xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
JIRA: https://issues.redhat.com/browse/RHEL-25419

commit c95356ca884885db702670e24933ee7f2b9f1754
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Apr 12 15:49:10 2023 +1000

    xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done

    While fuzzing the data fork extent count on a btree-format directory
    with xfs/375, I observed the following (excerpted) splat:

    XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
    Call Trace:
     <TASK>
     xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
     __x64_sys_ioctl+0x82/0xa0
     do_syscall_64+0x2b/0x80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0

    The cause of this is a race condition in xfs_ilock_data_map_shared,
    which performs an unlocked access to the data fork to guess which lock
    mode it needs:

    Thread 0                          Thread 1

    xfs_need_iread_extents
    <observe no iext tree>
    xfs_ilock(..., ILOCK_EXCL)
    xfs_iread_extents
    <observe no iext tree>
    <check ILOCK_EXCL>
    <load bmbt extents into iext>
    <notice iext size doesn't
     match nextents>
                                      xfs_need_iread_extents
                                      <observe iext tree>
                                      xfs_ilock(..., ILOCK_SHARED)
    <tear down iext tree>
    xfs_iunlock(..., ILOCK_EXCL)
                                      xfs_iread_extents
                                      <observe no iext tree>
                                      <check ILOCK_EXCL>
                                      *BOOM*

    Fix this race by adding a flag to the xfs_ifork structure to indicate
    that we have not yet read in the extent records and changing the
    predicate to look at the flag state, not if_height.  The memory barrier
    ensures that the flag will not be set until the very end of the
    function.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-06 10:32:48 -05:00
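
The fix in a8cc7b7360 above relies on a flag that changes state only once loading has fully succeeded. The following is a minimal userspace sketch using C11 atomics in place of kernel barriers; the structure layout and the names `ifork_init`, `map_lock_mode`, and `read_extents` are assumptions for illustration, and the locking itself is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum lock_mode { LOCK_SHARED, LOCK_EXCL };

struct ifork {
	atomic_bool needextents;	/* true until extents are fully loaded */
	int nextents;			/* record count the inode claims */
	int loaded;			/* records currently in memory */
};

static void
ifork_init(struct ifork *ifp, int nextents)
{
	atomic_init(&ifp->needextents, true);
	ifp->nextents = nextents;
	ifp->loaded = 0;
}

/* Pick the lock mode from the flag, not from the in-memory tree shape. */
static enum lock_mode
map_lock_mode(struct ifork *ifp)
{
	return atomic_load_explicit(&ifp->needextents, memory_order_acquire)
		? LOCK_EXCL : LOCK_SHARED;
}

/* Load extents; clear the flag only as the very last step on success. */
static int
read_extents(struct ifork *ifp, int records_found)
{
	if (!atomic_load_explicit(&ifp->needextents, memory_order_acquire))
		return 0;
	ifp->loaded = records_found;
	if (ifp->loaded != ifp->nextents) {
		ifp->loaded = 0;	/* tear down; the flag stays set */
		return -1;
	}
	atomic_store_explicit(&ifp->needextents, false, memory_order_release);
	return 0;
}
```

Because a half-built and then torn-down tree never changes the flag, a second thread cannot observe a transient tree and choose the shared lock by mistake.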
Bill O'Donnell f3724d2a82 xfs: complain about bad file mapping records in the ondisk bmbt
JIRA: https://issues.redhat.com/browse/RHEL-25419

commit 6a3bd8fcf9afb47c703cb268f30f60aa2e7af86a
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Apr 11 19:00:05 2023 -0700

    xfs: complain about bad file mapping records in the ondisk bmbt

    Similar to what we've just done for the other btrees, create a function
    to log corrupt bmbt records and call it whenever we encounter a bad
    record in the ondisk btree.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-05 16:56:18 -05:00
Bill O'Donnell 72f3d97fa2 xfs: give xfs_bmap_intent its own perag reference
JIRA: https://issues.redhat.com/browse/RHEL-25419

commit 774a99b47b588bf0bd9f65d3b241d5bba0b2fcb0
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue Apr 11 18:59:53 2023 -0700

    xfs: give xfs_bmap_intent its own perag reference

    Give the xfs_bmap_intent an active reference to the perag structure
    data.  This reference will be used to enable scrub intent draining
    functionality in subsequent patches.  Later, shrink will use these
    passive references to know if an AG is quiesced or not.

    The reason why we take a passive ref for a file mapping operation is
    simple: we're committing to some sort of action involving space in an
    AG, so we want to indicate our interest in that AG.  The space is
    already allocated, so we need to be able to operate on AGs that are
    offline or being shrunk.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-06-05 16:56:13 -05:00
Bill O'Donnell 253e8028aa xfs: validate block number being freed before adding to xefi
JIRA: https://issues.redhat.com/browse/RHEL-2002

Conflicts: diffs in xfs_alloc.c due to out of order patch application

commit 7dfee17b13e5024c5c0ab1911859ded4182de3e5
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 14:48:15 2023 +1000

    xfs: validate block number being freed before adding to xefi

    Bad things happen in deferred extent freeing operations if they are
    passed a bad block number in the xefi. This can come from a bogus
    agno/agbno pair from deferred agfl freeing, or just a bad fsbno
    being passed to __xfs_free_extent_later(). Either way, it's very
    difficult to diagnose where a null perag oops in EFI creation
    is coming from when the operation that queued the xefi has already
    been completed and there's no longer any trace of it around....

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:26 -06:00
Bill O'Donnell b42e76eb96 xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit b82a5c42a5fa7e79426ed047ced3f8482bb66fbc
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Tue May 2 09:14:27 2023 +1000

    xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof

    xfs/170 on a filesystem with su=128k,sw=4 produces this splat:

    BUG: kernel NULL pointer dereference, address: 0000000000000010
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] PREEMPT SMP
    CPU: 1 PID: 4022907 Comm: dd Tainted: G        W          6.3.0-xfsx #2 6ebeeffbe9577d32
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
    RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
    RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
    RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
    RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
    R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
    R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
    FS:  00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
    Call Trace:
     <TASK>
     xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     iomap_iter+0x132/0x2f0
     __iomap_dio_rw+0x2f8/0x840
     iomap_dio_rw+0xe/0x30
     xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
     vfs_write+0x2eb/0x410
     ksys_write+0x65/0xe0
     do_syscall_64+0x2b/0x80

    This crash occurs under the "out_low_space" label.  We grabbed a perag
    reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
    afterwards args->pag is NULL.  Fix the second function not to clobber
    args->pag if the caller had passed one in.

    Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:25 -06:00
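
The ownership rule behind b42e76eb96 above is that a callee should release and clear only a reference it took itself. A minimal sketch follows, assuming a hypothetical refcounted `struct perag` and helper names that are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted per-AG structure. */
struct perag {
	int refs;
};

struct alloc_args {
	struct perag *pag;
};

static void perag_get(struct perag *pag) { pag->refs++; }
static void perag_put(struct perag *pag) { pag->refs--; }

/*
 * Take a reference only if the caller did not supply one, and drop and
 * clear only that self-acquired reference on the way out.
 */
static void
alloc_at_eof(struct alloc_args *args, struct perag *target)
{
	bool caller_pag = args->pag != NULL;

	if (!caller_pag) {
		perag_get(target);
		args->pag = target;
	}

	/* ... the exact-block allocation attempt would happen here ... */

	if (!caller_pag) {
		perag_put(args->pag);
		args->pag = NULL;
	}
}
```

A caller that passed in its own reference still finds `args->pag` set afterwards and can release it safely on its low-space path.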
Bill O'Donnell 897f5de500 xfs: fix livelock in delayed allocation at ENOSPC
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 9419092fb2630c30e4ffeb9ef61007ef0c61827a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 27 09:02:11 2023 +1000

    xfs: fix livelock in delayed allocation at ENOSPC

    On a filesystem with a non-zero stripe unit and a large sequential
    write, delayed allocation will set a minimum allocation length of
    the stripe unit. If allocation fails because there are no extents
    long enough for an aligned minlen allocation, it is supposed to
    fall back to unaligned allocation which allows single block extents
    to be allocated.

    When the allocator code was rewritten in the 6.3 cycle, this
    fallback was broken - the old code used args->fsbno as both the
    allocation target and the allocation result, the new code passes the
    target as a separate parameter. The conversion didn't handle the
    aligned->unaligned fallback path correctly - it reset args->fsbno to
    the target fsbno on failure which broke allocation failure detection
    in the high level code and so it never fell back to unaligned
    allocations.

    This resulted in a loop in writeback trying to allocate an aligned
    block, getting a false positive success, trying to insert the result
    in the BMBT. This did nothing because the extent already was in the
    BMBT (merge results in an unchanged extent) and so it returned the
    prior extent to the conversion code as the current iomap.

    Because the iomap returned didn't cover the offset we tried to map,
    xfs_convert_blocks() then retries the allocation, which fails in the
    same way and now we have a livelock.

    Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
    Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:25 -06:00
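
The aligned-to-unaligned fallback that 897f5de500 above restores depends on a failed attempt being reported unambiguously. The sketch below models that with a sentinel return value; `NULLBLOCK`, `struct free_extent`, `alloc_minlen`, and `alloc_with_fallback` are all hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLBLOCK UINT64_MAX	/* sentinel: no block was allocated */

struct free_extent {
	uint64_t start;
	uint32_t len;
};

/* First free extent of at least minlen blocks, or NULLBLOCK on failure. */
static uint64_t
alloc_minlen(const struct free_extent *fe, size_t n, uint32_t minlen)
{
	for (size_t i = 0; i < n; i++)
		if (fe[i].len >= minlen)
			return fe[i].start;
	return NULLBLOCK;
}

/*
 * Try a stripe-unit-sized allocation first.  Only because a failure is
 * reported as NULLBLOCK can the caller detect it and retry with a
 * single-block minimum.
 */
static uint64_t
alloc_with_fallback(const struct free_extent *fe, size_t n,
		    uint32_t stripe_unit)
{
	uint64_t bno = alloc_minlen(fe, n, stripe_unit);

	if (bno == NULLBLOCK)
		bno = alloc_minlen(fe, n, 1);
	return bno;
}
```

If the failed first attempt returned the target block instead of the sentinel, the fallback branch would never run, which is the shape of the livelock the commit describes.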
Bill O'Donnell c66c020d08 xfs: return a referenced perag from filestreams allocator
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit f8f1ed1ab3babad46b25e2dbe8de43b33fe7aaa6
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:56 2023 +1100

    xfs: return a referenced perag from filestreams allocator

    Now that the filestreams AG selection tracks active perags, we need
    to return an active perag to the core allocator code. This is
    because the file allocations that the filestreams code will run are
    AG-specific allocations and so need to pin the AG until the
    allocations complete.

    We cannot rely on the filestreams item reference to do this - the
    filestreams association can be torn down at any time, hence we
    need to have a separate reference for the allocation process to pin
    the AG after it has been selected.

    This means there is some perag juggling in allocation failure
    fallback paths as they will do all AG scans in the case the AG
    specific allocation fails. Hence we need to track the perag
    reference that the filestream allocator returned to make sure we
    don't leak it on repeated allocation failure.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:24 -06:00
Bill O'Donnell de2455a1d6 xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 8f7747ad8c52cde585b9456f6dbd1984af7b97bc
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:55 2023 +1100

    xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c

    xfs_bmap_btalloc_filestreams() calls two filestreams functions to
    select the AG to allocate from. Both those functions end up in
    the same selection function that iterates all AGs multiple times.
    Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4
    times just to select the initial AG to allocate in.

    Move the AG selection to fs/xfs/xfs_filestreams.c as a single
    interface so that the inefficient AG iteration is contained
    entirely within the filestreams code. This will allow the
    implementation to be simplified and made more efficient in future
    patches.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:23 -06:00
Bill O'Donnell 5770720a06 xfs: use xfs_bmap_longest_free_extent() in filestreams
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 05cf492a8d01f48d4b8d8f0b93f2d75de7349f12
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:55 2023 +1100

    xfs: use xfs_bmap_longest_free_extent() in filestreams

    The code in xfs_bmap_longest_free_extent() is open coded in
    xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and
    call it from the filestreams code instead.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:23 -06:00
Bill O'Donnell a48463610a xfs: get rid of notinit from xfs_bmap_longest_free_extent
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 6b637ad0c7be85ecb795697ea51051039b753da2
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:55 2023 +1100

    xfs: get rid of notinit from xfs_bmap_longest_free_extent

    It is only set if reading the AGF gets an EAGAIN error. Just return
    the EAGAIN error and handle that error in the callers.

    This means we can remove the not_init parameter from
    xfs_bmap_select_minlen(), too, because the use of not_init there is
    pessimistic. If we can't read the agf, it won't increase blen.

    The only time we actually care whether we checked all the AGFs for
    contiguous free space is when the best length is less than the
    minimum allocation length. If not_init is set, then we ignore blen
    and set the minimum alloc length to the absolute minimum, not the
    best length we know already is present.

    However, if blen is less than the minimum we're going to ignore it
    anyway, regardless of whether we scanned all the AGFs or not.  Hence
    not_init can go away, because we only use blen if it is good from
    the scanned AGs; otherwise we ignore it altogether and use minlen.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:23 -06:00
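The minimum-length selection argued for in the commit message above can be sketched as a small userspace model. This is a hedged illustration, not the kernel's xfs_bmap_select_minlen(): the function name select_minlen() and its parameters are hypothetical, and only the first branch (ignore blen when it is below the minimum) is taken directly from the message; the other two clamps are assumptions about how a caller would bound the request.

```c
#include <assert.h>

/*
 * Hypothetical model: pick the minimum allocation length given the
 * longest free extent seen so far (blen) and the caller's bounds.
 * No "not_init" flag is needed: a blen below minlen is ignored whether
 * or not every AGF was scanned.
 */
static unsigned int select_minlen(unsigned int blen, unsigned int minlen,
				  unsigned int maxlen)
{
	assert(minlen <= maxlen);

	if (blen < minlen)
		return minlen;	/* blen is unusable, fall back to minlen */
	if (blen < maxlen)
		return blen;	/* assumed clamp: ask for what we know exists */
	return maxlen;		/* assumed clamp: never ask for more than maxlen */
}
```

With bounds of 4 and 16 blocks, a blen of 2 would be ignored and 4 used instead, while a blen of 8 would be used as-is.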
Bill O'Donnell b549aae861 xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 89563e7dc099343bf7792515452e1a24005d98a6
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:54 2023 +1100

    xfs: factor out filestreams from xfs_bmap_btalloc_nullfb

    There are many if (filestreams) {} else {} branches in this function.
    Split it out into a filestreams specific function so that we can
    then work directly on cleaning up the filestreams code without
    impacting the rest of the allocation algorithms.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:23 -06:00
Bill O'Donnell 7fba5130ec xfs: fold xfs_alloc_ag_vextent() into callers
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 230e8fe8462ffda0849ea40b61dcf9f233854076
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:54 2023 +1100

    xfs: fold xfs_alloc_ag_vextent() into callers

    We don't need the multiplexing xfs_alloc_ag_vextent() provided
    anymore - we can just call the exact/near/size variants directly.
    This allows us to remove args->type completely and stop using
    args->fsbno as an input to the allocator algorithms.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell c3a9573941 xfs: introduce xfs_alloc_vextent_exact_bno()
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 5f36b2ce79f254dd00cdc88374271df7ce843d56
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:54 2023 +1100

    xfs: introduce xfs_alloc_vextent_exact_bno()

    Two of the callers to xfs_alloc_vextent_this_ag() actually want
    exact block number allocation, not anywhere-in-ag allocation. Split
    this out from _this_ag() as a first class citizen so no external
    extent allocation code needs to care about args->type anymore.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell 8ed52fa327 xfs: introduce xfs_alloc_vextent_near_bno()
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit db4710fd12248e5d4c3842520cd13f034136576b
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:54 2023 +1100

    xfs: introduce xfs_alloc_vextent_near_bno()

    The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
    allocations. We can replace that function with a new
    xfs_alloc_vextent_near_bno() function that does this explicitly.

    We also multiplex NEAR_BNO allocations through
    xfs_alloc_vextent_this_ag via args->type. Replace all of these with
    direct calls to xfs_alloc_vextent_near_bno(), too.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell 1dae8c9676 xfs: use xfs_alloc_vextent_start_bno() where appropriate
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 2a7f6d41d8b72412228ede538bdf0e81bf9738f4
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:53 2023 +1100

    xfs: use xfs_alloc_vextent_start_bno() where appropriate

    Change obvious callers of single AG allocation to use
    xfs_alloc_vextent_start_bno(). Callers no longer need to specify
    XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
    removed.

    While doing this, also pass the allocation target fsb as a parameter
    rather than encoding it in args->fsbno.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell 2a92353426 xfs: use xfs_alloc_vextent_first_ag() where appropriate
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 319c9e874ac8721acdb6583e3459ef595e5ed0a6
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:53 2023 +1100

    xfs: use xfs_alloc_vextent_first_ag() where appropriate

    Change obvious callers of single AG allocation to use
    xfs_alloc_vextent_first_ag(). This gets rid of
    XFS_ALLOCTYPE_FIRST_AG as the type used within
    xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
    can remove the setting of args->type from all the callers of
    _first_ag() and remove the alloctype.

    While doing this, pass the allocation target fsb as a parameter
    rather than encoding it in args->fsbno. This starts the process
    of making args->fsbno an output only variable rather than
    input/output.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell 3d84804c33 xfs: factor xfs_bmap_btalloc()
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 85843327094f9de9cf0129cd9a3a43128c6f5ac8
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:53 2023 +1100

    xfs: factor xfs_bmap_btalloc()

    There are several different contexts xfs_bmap_btalloc() handles, and
    large chunks of the code execute independent allocation contexts.
    Try to untangle this mess a bit.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:22 -06:00
Bill O'Donnell f393aca22e xfs: use xfs_alloc_vextent_this_ag() where appropriate
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 74c36a8689d3d8ca9d9e96759c9bbf337e049097
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:53 2023 +1100

    xfs: use xfs_alloc_vextent_this_ag() where appropriate

    Change obvious callers of single AG allocation to use
    xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
    callers, too, so that callers with active references don't need
    to do new lookups just for an allocation in a context that already
    has a perag reference.

    The only remaining caller that does single AG allocation through
    xfs_alloc_vextent() is xfs_bmap_btalloc() with
    XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
    it can be converted cleanly.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:21 -06:00
Bill O'Donnell 52017e5a79 xfs: introduce xfs_for_each_perag_wrap()
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 76257a15873ccce817e0c4441f6bb66fb8f8201c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:53 2023 +1100

    xfs: introduce xfs_for_each_perag_wrap()

    In several places we iterate every AG from a specific start agno and
    wrap back to the first AG when we reach the end of the filesystem to
    continue searching. We don't have a primitive for this iteration
    yet, so add one for conversion of these algorithms to per-ag based
    iteration.

    The filestream AG select code is a mess, and this initially makes it
    worse. The per-ag selection needs to be driven completely into the
    filestream code to clean this up and it will be done in a future
    patch that makes the filestream allocator use active per-ag
    references correctly.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:21 -06:00
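The wrap-around iteration order that the xfs_for_each_perag_wrap() commit above describes can be modelled in userspace as follows. This is a minimal sketch under stated assumptions: wrap_order() and its parameters are hypothetical names, and the kernel primitive iterates perag structures with reference counting, which this model leaves out entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: record the order in which AGs are visited when
 * starting at start_agno and wrapping back to AG 0 after the last AG.
 * 'out' must have room for agcount entries. Returns the number of AGs
 * visited.
 */
static size_t wrap_order(unsigned int start_agno, unsigned int agcount,
			 unsigned int *out)
{
	size_t n = 0;
	unsigned int agno;

	assert(out != NULL && agcount > 0 && start_agno < agcount);

	/* First pass: from the start AG to the end of the filesystem. */
	for (agno = start_agno; agno < agcount; agno++)
		out[n++] = agno;

	/* Wrapped pass: from AG 0 up to, but excluding, the start AG. */
	for (agno = 0; agno < start_agno; agno++)
		out[n++] = agno;

	return n;
}
```

Every AG is visited exactly once, so a caller that finds nothing after the full loop knows it has searched the whole filesystem.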
Bill O'Donnell 6ee6b421b0 xfs: perags need atomic operational state
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 7ac2ff8bb3713c7cb43564c04384af2ee7cc1f8d
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 13 09:14:52 2023 +1100

    xfs: perags need atomic operational state

    We currently don't have any flags or operational state in the
    xfs_perag except for the pagf_init and pagi_init flags. And the
    agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
    flags, too.

    For controlling per-ag operations, we are going to need some atomic
    state flags. Hence add an opstate field similar to what we already
    have in the mount and log, and convert all these state flags across
    to atomic bit operations.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:21 -06:00
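The idea of folding separate per-AG flags into one atomically updated opstate word, as the commit above proposes, can be sketched with C11 atomics. This is a userspace illustration only: struct perag_model, the OPSTATE_* bit numbers, and the helper names are hypothetical, and the kernel uses its own atomic bit operations rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel's names and values may differ. */
enum {
	OPSTATE_AGF_INIT = 0,
	OPSTATE_AGI_INIT = 1,
	OPSTATE_AGFL_NEEDS_RESET = 2,
};

struct perag_model {
	atomic_ulong opstate;	/* one bit per state flag */
};

static void opstate_init(struct perag_model *pag)
{
	atomic_init(&pag->opstate, 0UL);
}

static void opstate_set(struct perag_model *pag, unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	atomic_fetch_or(&pag->opstate, 1UL << bit);
}

static void opstate_clear(struct perag_model *pag, unsigned int bit)
{
	assert(bit < 8 * sizeof(unsigned long));
	atomic_fetch_and(&pag->opstate, ~(1UL << bit));
}

static bool opstate_test(struct perag_model *pag, unsigned int bit)
{
	return (atomic_load(&pag->opstate) & (1UL << bit)) != 0;
}

/* Set the bit and report whether it was already set beforehand. */
static bool opstate_test_and_set(struct perag_model *pag, unsigned int bit)
{
	unsigned long old = atomic_fetch_or(&pag->opstate, 1UL << bit);

	return (old & (1UL << bit)) != 0;
}
```

Because each update is a single atomic read-modify-write, two threads changing different flags cannot lose each other's updates, which separate non-atomic flag fields sharing a word do not guarantee.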
Bill O'Donnell 21482343ad xfs: t_firstblock is tracking AGs not blocks
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 692b6cddeb65a5170c1e63d25b1ffb7822e80f7d
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:11:06 2023 +1100

    xfs: t_firstblock is tracking AGs not blocks

    The tp->t_firstblock field is now really tracking the highest AG we
    have locked, not the block number of the highest allocation we've
    made. Its purpose is to prevent AGF locking deadlocks, so rename it
    to "highest AG" and simplify the implementation to just track the
    agno rather than a fsbno.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
Bill O'Donnell 12b711e27a xfs: drop firstblock constraints from allocation setup
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 36b6ad2d9cb81b0d52ae1598286ca5809cd39003
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:10:06 2023 +1100

    xfs: drop firstblock constraints from allocation setup

    Now that xfs_alloc_vextent() does all the AGF deadlock prevention
    filtering for multiple allocations in a single transaction, we no
    longer need the allocation setup code to care about what AGs we
    might already have locked.

    Hence we can remove all the "nullfb" conditional logic in places
    like xfs_bmap_btalloc() and instead have them focus simply on
    setting up locality constraints. If the allocation fails due to
    AGF lock filtering in xfs_alloc_vextent, then we just fall back as
    we normally do to more relaxed allocation constraints.

    As a result, any allocation that allows AG scanning (i.e. not
    confined to a single AG) and does not force a worst case full
    filesystem scan will now be able to attempt allocation from AGs
    lower than that defined by tp->t_firstblock. This is because
    xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
    low space algorithms to at least -try- to get space from AGs lower
    than the one that we have currently locked and allocated from. This
    is a significant improvement in the low space allocation algorithm.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
Bill O'Donnell 3bec77d14d xfs: fix low space alloc deadlock
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 1dd0510f6d4b85616a36aabb9be38389467122d9
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:07:06 2023 +1100

    xfs: fix low space alloc deadlock

    I've recently encountered an ABBA deadlock with g/476. The upcoming
    changes seem to make this much easier to hit, but the underlying
    problem is a pre-existing one.

    Essentially, if we select an AG for allocation, then lock the AGF
    and then fail to allocate for some reason (e.g. minimum length
    requirements cannot be satisfied), then we drop out of the
    allocation with the AGF still locked.

    The caller then modifies the allocation constraints - usually
    loosening them up - and tries again. This can result in trying to
    access AGFs that are lower than the AGF we already have locked from
    the failed attempt. e.g. the failed attempt skipped several AGs
    before failing, so we have locked an AG higher than the start AG.
    Retrying the allocation from the start AG then causes us to violate
    AGF lock ordering and this can lead to deadlocks.

    The deadlock exists even if allocation succeeds - we can do
    followup allocations in the same transaction for BMBT blocks that
    aren't guaranteed to be in the same AG as the original, and can move
    into higher AGs. Hence we really need to move the tp->t_firstblock
    tracking down into xfs_alloc_vextent() where it can be set when we
    exit with a locked AG.

    xfs_alloc_vextent() can also check there if the requested
    allocation falls within the allowed range of AGs set by
    tp->t_firstblock. If we can't allocate within the range set, we have
    to fail the allocation. If we are allowed to do non-blocking AGF
    locking, we can ignore the AG locking order limitations as we can
    use try-locks for the first iteration over requested AG range.

    This invalidates a set of post allocation asserts that check that
    the allocation is always above tp->t_firstblock if it is set.
    Because we can use try-locks to avoid the deadlock in some
    circumstances, having a pre-existing locked AGF doesn't always
    prevent allocation from lower order AGFs. Hence those ASSERTs need
    to be removed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:20 -06:00
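The locking rule in the commit message above (blocking AGF locks only in ascending AG order, try-locks permitted for lower AGs) can be illustrated with a single-threaded model. This is a simplified sketch, not the kernel code: struct tx_model, lock_agf(), and the 'contended' parameter, which stands in for another transaction holding the lock, are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_AG	(-1)

struct tx_model {
	int highest_agno;	/* highest AG locked so far, or NO_AG */
};

/* Waiting for a lock is only deadlock-safe in ascending AG order. */
static bool may_block_on(const struct tx_model *tx, int agno)
{
	return tx->highest_agno == NO_AG || agno >= tx->highest_agno;
}

/*
 * Try to take the AGF lock for agno. If the lock is contended and
 * waiting would violate the lock order, behave like a failed try-lock
 * and return false instead of blocking.
 */
static bool lock_agf(struct tx_model *tx, int agno, bool contended)
{
	assert(tx != 0 && agno >= 0);

	if (contended && !may_block_on(tx, agno))
		return false;

	/* Uncontended, or safe to wait: the lock is taken. */
	if (agno > tx->highest_agno)
		tx->highest_agno = agno;
	return true;
}
```

In this model a transaction holding AG 3 can still obtain AG 1 when nobody else holds it, which is why asserts demanding that every allocation lands at or above the tracked AG no longer hold.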
Bill O'Donnell f589b806a9 xfs: pass the xfs_bmbt_irec directly through the log intent code
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit ddccb81b26ec021ae1f3366aa996cc4c68dd75ce
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Feb 1 09:50:53 2023 -0800

    xfs: pass the xfs_bmbt_irec directly through the log intent code

    Instead of repeatedly boxing and unboxing the incore extent mapping
    structure as it passes through the BUI code, pass the pointer directly
    through.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:19 -06:00
Bill O'Donnell acfc3ab59a xfs: invalidate xfs_bufs when allocating cow extents
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit ddfdd530e43fcb3f7a0a69966e5f6c33497b4ae3
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Thu Dec 1 09:36:16 2022 -0800

    xfs: invalidate xfs_bufs when allocating cow extents

    While investigating test failures in xfs/17[1-3] in alwayscow mode, I
    noticed through code inspection that xfs_bmap_alloc_userdata isn't
    setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
    fork.  COW staging extents should be flagged as USERDATA, since user
    data are persisted to these blocks before being remapped into a file.

    This mis-classification has a few impacts on the behavior of the system.
    First, the filestreams allocator is supposed to keep allocating from a
    chosen AG until it runs out of space in that AG.  However, it only does
    that for USERDATA allocations, which means that COW allocations aren't
    tied to the filestreams AG.  Fortunately, few people use filestreams, so
    nobody's noticed.

    A more serious problem is that xfs_alloc_ag_vextent_small looks for a
    buffer to invalidate *if* the USERDATA flag is set and the AG is so full
    that the allocation had to come from the AGFL because the cntbt is
    empty.  The consequences of not invalidating the buffer are severe --
    if the AIL incorrectly checkpoints a buffer that is now being used to
    store user data, that action will clobber the user's written data.

    Fix filestreams and yet another data corruption vector by flagging COW
    allocations as USERDATA.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:18 -06:00
Bill O'Donnell f82d4529ed xfs: clean up "%Ld/%Lu" which doesn't meet C standard
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 78b0f58bdfef45aa9f3c7fbbd9b4d41abad6d85f
Author: Zeng Heng <zengheng4@huawei.com>
Date:   Mon Sep 19 06:47:14 2022 +1000

    xfs: clean up "%Ld/%Lu" which doesn't meet C standard

    The "%Ld" specifier, which represents long long,
    doesn't conform to the C language standard and, worse,
    is easily mistaken for "%ld", which represents
    long. So replace "%Ld" with "%lld".

    Do the same with "%Lu".

    Signed-off-by: Zeng Heng <zengheng4@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-06 19:27:41 -06:00
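The standard length modifiers that the format-string cleanup above switches to can be shown with a short userspace example. This is only an illustration: format_counts() is a hypothetical helper, and it uses snprintf() from the C library rather than the kernel's printk machinery.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a signed and an unsigned 64-bit quantity
 * with the standard "ll" length modifier instead of the non-standard "L".
 * Returns the value snprintf() returns.
 */
static int format_counts(char *buf, size_t len, long long delta,
			 unsigned long long total)
{
	assert(buf != NULL && len > 0);

	/* "%lld" pairs with long long, "%llu" with unsigned long long. */
	return snprintf(buf, len, "%lld/%llu", delta, total);
}
```

In standard C the "L" modifier is defined only for long double conversions, so "%Ld" relies on an extension even where it happens to work.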
Bill O'Donnell 44b97f1577 xfs: block reservation too large for minleft allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2224502

commit d5753847b216db0e553e8065aa825cfe497ad143
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Feb 11 04:09:06 2023 +1100

    xfs: block reservation too large for minleft allocation

    When we enter xfs_bmbt_alloc_block() without having first allocated
    a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
    are doing something like unwritten extent conversion, the transaction
    block reservation is used as the minleft value.

    This works for operations like unwritten extent conversion, but it
    assumes that the block reservation is only for a BMBT split. This is
    not always true, and sometimes results in larger than necessary
    minleft values being set. We only actually need enough space for a
    btree split, something we already handle correctly in
    xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.

    We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
    calculate the number of blocks a BMBT split on this inode is going to
    require, not use the transaction block reservation that contains the
    maximum number of blocks this transaction may consume in it...

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-07-27 14:42:02 -05:00
Bill O'Donnell 3219617b1b xfs: replace inode fork size macros with functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c01147d929899f02a0a8b15e406d12784768ca72
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:07 2022 -0700

    xfs: replace inode fork size macros with functions

    Replace the shouty macros here with typechecked helper functions.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell f77675b5d0 xfs: replace XFS_IFORK_Q with a proper predicate function
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 932b42c66cb5d0ca9800b128415b4ad6b1952b3e
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: replace XFS_IFORK_Q with a proper predicate function

    Replace this shouty macro with a real C function that has a more
    descriptive name.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
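The difference between a "shouty" macro and a typechecked predicate function, which the commit above trades one for the other, can be sketched as follows. All names here (struct inode_model, INODE_MODEL_FORK_Q, inode_model_has_attr_fork) are hypothetical stand-ins and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct inode_model {
	unsigned char forkoff;	/* 0 means the inode has no attr fork */
};

/*
 * Macro style: textual substitution, so it compiles for any pointer
 * whose target happens to have a member named 'forkoff'.
 */
#define INODE_MODEL_FORK_Q(ip)	((ip)->forkoff != 0)

/*
 * Function style: the compiler checks that the argument really is a
 * struct inode_model pointer, and the name says what is being asked.
 */
static inline bool inode_model_has_attr_fork(const struct inode_model *ip)
{
	assert(ip != NULL);
	return ip->forkoff > 0;
}
```

Both forms give the same answer for a valid inode; the function form adds type checking and a descriptive name at no runtime cost once inlined.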
Bill O'Donnell 0036098801 xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit e45d7cb2356e6b59fe64da28324025cc6fcd3fbd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: use XFS_IFORK_Q to determine the presence of an xattr fork

    Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
    attribute fork but i_forkoff is zero.  This eliminates the ambiguity
    between i_forkoff and i_af.if_present, which should make it easier to
    understand the lifetime of attr forks.

    While we're at it, remove the if_present checks around calls to
    xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
    forks that have already been torn down.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell a2d362f29a xfs: make inode attribute forks a permanent part of struct xfs_inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: previous out of order application of 5625ea0 requires minor adjustment to xfs_iomap.c

commit 2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: make inode attribute forks a permanent part of struct xfs_inode

    Syzkaller reported a UAF bug a while back:

    ==================================================================
    BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
    Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958

    CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
    5.15.0-0.30.3-20220406_1406 #3
    Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
    04/01/2014
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
     print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
     __kasan_report mm/kasan/report.c:442 [inline]
     kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
     xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
     xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
     xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
     __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
     cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
     security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
     dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
     dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
     do_truncate+0xc3/0x1e0 fs/open.c:56
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0
    RIP: 0033:0x7f7ef4bb753d
    Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
    89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
    01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
    RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
    RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
    RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
    R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
     </TASK>

    Allocated by task 2953:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:254 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3213 [inline]
     slab_alloc mm/slub.c:3221 [inline]
     kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
     kmem_cache_zalloc include/linux/slab.h:711 [inline]
     xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
     xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
     xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
     __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
     __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
     vfs_setxattr+0x154/0x33d fs/xattr.c:301
     setxattr+0x216/0x29f fs/xattr.c:575
     __do_sys_fsetxattr fs/xattr.c:632 [inline]
     __se_sys_fsetxattr fs/xattr.c:621 [inline]
     __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    Freed by task 2949:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:1700 [inline]
     slab_free_freelist_hook mm/slub.c:1726 [inline]
     slab_free mm/slub.c:3492 [inline]
     kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
     xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
     xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
     xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
     xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
     xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_removexattr+0x106/0x16a fs/xattr.c:468
     cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
     security_inode_killpriv+0x54/0xa1 security/security.c:1414
     setattr_prepare+0x1a6/0x897 fs/attr.c:146
     xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
     xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
     xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
     notify_change+0xae5/0x10a1 fs/attr.c:410
     do_truncate+0x134/0x1e0 fs/open.c:64
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    The buggy address belongs to the object at ffff88802cec9188
     which belongs to the cache xfs_ifork of size 40
    The buggy address is located 20 bytes inside of
     40-byte region [ffff88802cec9188, ffff88802cec91b0)
    The buggy address belongs to the page:
    page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
    index:0x0 pfn:0x2cec9
    flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
    raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
     ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
     ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
    >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                ^
     ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
     ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
    ==================================================================

    The root cause of this bug is the unlocked access to xfs_inode.i_afp
    from the getxattr code paths while trying to determine which ILOCK mode
    to use to stabilize the xattr data.  Unfortunately, the VFS does not
    acquire i_rwsem when vfs_getxattr (or listxattr) call into the
    filesystem, which means that getxattr can race with a removexattr that's
    tearing down the attr fork and crash:

    xfs_attr_set:                          xfs_attr_get:
    xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:

    xfs_idestroy_fork(ip->i_afp);
    kmem_cache_free(xfs_ifork_cache, ip->i_afp);

                                           if (ip->i_afp &&

    ip->i_afp = NULL;

                                               xfs_need_iread_extents(ip->i_afp))
                                           <KABOOM>

    ip->i_forkoff = 0;

    Regrettably, the VFS is much more lax about i_rwsem and getxattr than
    is immediately obvious -- not only does it not guarantee that we hold
    i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
    The getxattr system call won't acquire the lock before calling XFS, but
    the file capabilities code calls getxattr with and without i_rwsem held
    to determine if the "security.capabilities" xattr is set on the file.

    Fixing the VFS locking requires a treewide investigation into every code
    path that could touch an xattr and what i_rwsem state it expects or sets
    up.  That could take years or even prove impossible; fortunately, we
    can fix this UAF problem inside XFS.

    An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
    ensure that i_forkoff is always zeroed before i_afp is set to null and
    changed the read paths to use smp_rmb before accessing i_forkoff and
    i_afp, which avoided these UAF problems.  However, the patch author was
    too busy dealing with other problems in the meantime, and by the time he
    came back to this issue, the situation had changed a bit.

    On a modern system with selinux, each inode will always have at least
    one xattr for the selinux label, so it doesn't make much sense to keep
    incurring the extra pointer dereference.  Furthermore, Allison's
    upcoming parent pointer patchset will also cause nearly every inode in
    the filesystem to have extended attributes.  Therefore, make the inode
    attribute fork structure part of struct xfs_inode, at a cost of 40 more
    bytes.

    This patch adds a clunky if_present field where necessary to maintain
    the existing logic of xattr fork null pointer testing in the existing
    codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
    all goes away.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:42 -05:00
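The lifetime change described in the commit above, embedding the attr fork in the inode instead of pointing at a separately freed object, can be sketched in userspace. This is a hedged model, not the kernel structures: struct fork_model, struct inode_model, and the helpers are hypothetical, presence is derived from a zero or non-zero fork offset as the message says the follow-up XFS_IFORK_Q change does, and the sketch is single-threaded, so it shows only the lifetime aspect and not the locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fork_model {
	unsigned int nextents;
};

struct inode_model {
	unsigned char forkoff;	/* 0 means "no attr fork" */
	struct fork_model df;	/* data fork */
	struct fork_model af;	/* attr fork, embedded: lives as long as the inode */
};

static bool has_attr_fork(const struct inode_model *ip)
{
	assert(ip != NULL);
	return ip->forkoff > 0;
}

/* Return the attr fork, or NULL if the inode currently has none. */
static struct fork_model *attr_fork(struct inode_model *ip)
{
	return has_attr_fork(ip) ? &ip->af : NULL;
}

/*
 * Tear down the attr fork. Nothing is freed, so a racing reader that
 * still holds &ip->af points at valid inode memory, not a freed object.
 */
static void remove_attr_fork(struct inode_model *ip)
{
	assert(ip != NULL);
	ip->af.nextents = 0;
	ip->forkoff = 0;
}
```

The cost of this design is that every inode carries the fork structure whether or not it has extended attributes, which the message argues is acceptable because nearly every inode has at least one xattr.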