Commit Graph

506 Commits

Author SHA1 Message Date
Bill O'Donnell b282be6864 xfs: ensure submit buffers on LSN boundaries in error handlers
JIRA: https://issues.redhat.com/browse/RHEL-68860

Conflicts: out-of-order application required a line numbering change.

commit e4c3b72a6ea93ed9c1815c74312eee9305638852
Author: Long Li <leo.lilong@huawei.com>
Date:   Wed Jan 17 20:31:26 2024 +0800

    xfs: ensure submit buffers on LSN boundaries in error handlers

    While performing the IO fault injection test, I caught the following data
    corruption report:

     XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
     CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
     Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
     Call Trace:
      <TASK>
      dump_stack_lvl+0x50/0x70
      xfs_corruption_error+0x134/0x150
      xfs_free_ag_extent+0x7d3/0x1130
      __xfs_free_extent+0x201/0x3c0
      xfs_trans_free_extent+0x29b/0xa10
      xfs_extent_free_finish_item+0x2a/0xb0
      xfs_defer_finish_noroll+0x8d1/0x1b40
      xfs_defer_finish+0x21/0x200
      xfs_itruncate_extents_flags+0x1cb/0x650
      xfs_free_eofblocks+0x18f/0x250
      xfs_inactive+0x485/0x570
      xfs_inodegc_worker+0x207/0x530
      process_scheduled_works+0x24a/0xe10
      worker_thread+0x5ac/0xc60
      kthread+0x2cd/0x3c0
      ret_from_fork+0x4a/0x80
      ret_from_fork_asm+0x11/0x20
      </TASK>
     XFS (dm-0): Corruption detected. Unmount and run xfs_repair

    After analyzing the disk image, it was found that the corruption was
    triggered by the fact that an extent was recorded in both the inode
    datafork and the AGF btree blocks. After a long time of reproduction
    and analysis, we found that the reason for the free space btree
    corruption was that the AGF btree was not recovered correctly.

    Consider the following situation: Checkpoint A and Checkpoint B are in
    the same record and share the same start LSN1, and buf items of the
    same object (an AGF btree block) are included in both Checkpoint A and
    Checkpoint B. If the buf item in Checkpoint A has been recovered and
    updates the metadata LSN permanently, then the buf item in Checkpoint B
    cannot be recovered, because log recovery skips items with a metadata
    LSN >= the current LSN of the recovery item. If there is still an inode
    item in Checkpoint B that records Extent X, Extent X will be recorded
    in both the inode datafork and the AGF btree block after Checkpoint B
    is recovered. Such a transaction can be seen when allocating an extent
    for an inode bmap: it records both the addition of the extent to the
    inode extent list and the removal of the extent from the AGF.

      |------------Record (LSN1)------------------|---Record (LSN2)---|
      |-------Checkpoint A----------|----------Checkpoint B-----------|
      |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
      |     Extent X is freed       |     Extent X is allocated       |

    After commit 12818d24db ("xfs: rework log recovery to submit buffers
    on LSN boundaries") was introduced, we submit buffers on LSN boundaries
    during log recovery. The above problem can be avoided on normal paths,
    but it is not guaranteed on abnormal paths. Consider the following
    process: if an error is encountered after recovering the buf item in
    Checkpoint A and before recovering the buf item in Checkpoint B,
    buffers that have already been added to the buffer_list will still be
    submitted, which violates the rule of submitting on LSN boundaries. So
    the buf item in Checkpoint B cannot be recovered on the next mount,
    because the current LSN of the transaction equals the metadata LSN on
    disk. The detailed process of the problem is as follows.

    First Mount:

      xlog_do_recovery_pass
        error = xlog_recover_process
          xlog_recover_process_data
            xlog_recover_process_ophdr
              xlog_recovery_process_trans
                ...
                  /* recover buf item in Checkpoint A */
                  xlog_recover_buf_commit_pass2
                    xlog_recover_do_reg_buffer
                    /* add buffer of agf btree block to buffer_list */
                    xfs_buf_delwri_queue(bp, buffer_list)
                ...
                ==> Encounter read IO error and return
        /* submit buffers regardless of error */
        if (!list_empty(&buffer_list))
          xfs_buf_delwri_submit(&buffer_list);

        <buf items of agf btree block in Checkpoint A recovery success>

    Second Mount:

      xlog_do_recovery_pass
        error = xlog_recover_process
          xlog_recover_process_data
            xlog_recover_process_ophdr
              xlog_recovery_process_trans
                ...
                  /* recover buf item in Checkpoint B */
                  xlog_recover_buf_commit_pass2
                    /* buffer of agf btree block wouldn't added to
                       buffer_list due to lsn equal to current_lsn */
                    if (XFS_LSN_CMP(lsn, current_lsn) >= 0)
                      goto out_release

        <buf items of agf btree block in Checkpoint B wouldn't recovery>

    In order to make sure that buffers are submitted on LSN boundaries in
    the abnormal paths, we need to check the error status before submitting
    the buffers that have been added from the last record processed. If an
    error status exists, the buffers in the buffer_list should not be
    written to disk.

    Canceling the buffers in the buffer_list directly isn't correct. Unlike
    any other place where a write list is canceled, these buffers have been
    initialized by xfs_buf_item_init() during recovery and are held by buf
    items; the buf items will not be released in xfs_buf_delwri_cancel(),
    so it's not easy to solve that way.

    If the filesystem has been shut down, then delwri list submission will
    error out all buffers on the list via IO submission/completion and do
    all the correct cleanup automatically. So shutting down the filesystem
    prevents the buffers in the buffer_list from being written to disk.
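
    A minimal sketch of the resulting submit path (illustrative only,
    based on the description above, not the exact diff):

      /*
       * On error, shut the log down first so that delwri submission
       * errors out the queued buffers instead of writing a partial
       * LSN boundary to disk.
       */
      if (!list_empty(&buffer_list)) {
              if (error)
                      xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
              xfs_buf_delwri_submit(&buffer_list);
      }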

    Fixes: 50d5c8d8e9 ("xfs: check LSN ordering for v5 superblocks during recovery")
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2025-01-10 16:38:27 -06:00
Bill O'Donnell 642f6fa085 xfs: pass the defer ops instead of type to xfs_defer_start_recovery
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit dc22af64368291a86fb6b7eb2adab21c815836b7
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Dec 14 06:16:32 2023 +0100

    xfs: pass the defer ops instead of type to xfs_defer_start_recovery

    xfs_defer_start_recovery is only called from xlog_recover_intent_item,
    and the callers of that all have the actual xfs_defer_ops_type operation
    vector at hand.  Pass that directly instead of looking it up from the
    defer_op_types table.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:59 -06:00
Bill O'Donnell 2990c8ef58 xfs: move ->iop_recover to xfs_defer_op_type
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit db7ccc0bac2add5a41b66578e376b49328fc99d0
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Nov 22 13:39:25 2023 -0800

    xfs: move ->iop_recover to xfs_defer_op_type

    Finish off the series by moving the intent item recovery function
    pointer to the xfs_defer_op_type struct, since this is really a deferred
    work function now.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:49 -06:00
Bill O'Donnell f2a94e0149 xfs: use xfs_defer_finish_one to finish recovered work items
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit e5f1a5146ec35f3ed5d7f5ac7807a10c0062b6b8
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Nov 22 11:25:45 2023 -0800

    xfs: use xfs_defer_finish_one to finish recovered work items

    Get rid of the open-coded calls to xfs_defer_finish_one.  This also
    means that the recovery transaction takes care of cleaning up the dfp,
    and we have solved (I hope) all the ownership issues in recovery.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:48 -06:00
Bill O'Donnell 75128f30ea xfs: transfer recovered intent item ownership in ->iop_recover
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit deb4cd8ba87f17b12c72b3827820d9c703e9fd95
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Nov 22 10:47:10 2023 -0800

    xfs: transfer recovered intent item ownership in ->iop_recover

    Now that we pass the xfs_defer_pending object into the intent item
    recovery functions, we know exactly when ownership of the sole refcount
    passes from the recovery context to the intent done item.  At that
    point, we need to null out dfp_intent so that the recovery mechanism
    won't release it.  This should fix the UAF problem reported by Long Li.

    Note that we still want to recreate the full deferred work state.  That
    will be addressed in the next patches.

    Fixes: 2e76f188fd ("xfs: cancel intents immediately if process_intents fails")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:47 -06:00
Bill O'Donnell d3b894d85d xfs: pass the xfs_defer_pending object to iop_recover
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit a050acdfa8003a44eae4558fddafc7afb1aef458
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Nov 22 10:38:10 2023 -0800

    xfs: pass the xfs_defer_pending object to iop_recover

    Now that log intent item recovery recreates the xfs_defer_pending state,
    we should pass that into the ->iop_recover routines so that the intent
    item can finish the recreation work.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:47 -06:00
Bill O'Donnell ec6910b028 xfs: use xfs_defer_pending objects to recover intent items
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit 03f7767c9f6120ac933378fdec3bfd78bf07bc11
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Nov 22 10:23:23 2023 -0800

    xfs: use xfs_defer_pending objects to recover intent items

    One thing I never quite got around to doing is porting the log intent
    item recovery code to reconstruct the deferred pending work state.  As a
    result, each intent item open codes xfs_defer_finish_one in its recovery
    method, because that's what the EFI code did before xfs_defer.c even
    existed.

    This is a gross thing to have left unfixed -- if an EFI cannot proceed
    due to busy extents, we end up creating separate new EFIs for each
    unfinished work item, which is a change in behavior from what runtime
    would have done.

    Worse yet, Long Li pointed out that there's a UAF in the recovery code.
    The ->commit_pass2 function adds the intent item to the AIL and drops
    the refcount.  The one remaining refcount is now owned by the recovery
    mechanism (aka the log intent items in the AIL) with the intent of
    giving the refcount to the intent done item in the ->iop_recover
    function.

    However, if something fails later in recovery, xlog_recover_finish will
    walk the recovered intent items in the AIL and release them.  If the CIL
    hasn't been pushed before that point (which is possible since we don't
    force the log until later) then the intent done release will try to free
    its associated intent, which has already been freed.

    This patch starts to address this mess by having the ->commit_pass2
    functions recreate the xfs_defer_pending state.  The next few patches
    will fix the recovery functions.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:25:47 -06:00
Bill O'Donnell 164c7ed57d xfs: abort intent items when recovery intents fail
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit f8f9d952e42dd49ae534f61f2fa7ca0876cb9848
Author: Long Li <leo.lilong@huawei.com>
Date:   Mon Jul 31 20:46:18 2023 +0800

    xfs: abort intent items when recovery intents fail

    When recovering intents, we capture newly created intent items as part of
    committing recovered intent items.  If intent recovery fails at a later
    point, we forget to remove those newly created intent items from the AIL
    and hang:

        [root@localhost ~]# cat /proc/539/stack
        [<0>] xfs_ail_push_all_sync+0x174/0x230
        [<0>] xfs_unmount_flush_inodes+0x8d/0xd0
        [<0>] xfs_mountfs+0x15f7/0x1e70
        [<0>] xfs_fs_fill_super+0x10ec/0x1b20
        [<0>] get_tree_bdev+0x3c8/0x730
        [<0>] vfs_get_tree+0x89/0x2c0
        [<0>] path_mount+0xecf/0x1800
        [<0>] do_mount+0xf3/0x110
        [<0>] __x64_sys_mount+0x154/0x1f0
        [<0>] do_syscall_64+0x39/0x80
        [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    When newly created intent items fail to commit via transaction, intent
    recovery hasn't created done items for these newly created intent items,
    so the capture structure is the sole owner of the captured intent items.
    We must release them explicitly or else they leak:

    unreferenced object 0xffff888016719108 (size 432):
      comm "mount", pid 529, jiffies 4294706839 (age 144.463s)
      hex dump (first 32 bytes):
        08 91 71 16 80 88 ff ff 08 91 71 16 80 88 ff ff  ..q.......q.....
        18 91 71 16 80 88 ff ff 18 91 71 16 80 88 ff ff  ..q.......q.....
      backtrace:
        [<ffffffff8230c68f>] xfs_efi_init+0x18f/0x1d0
        [<ffffffff8230c720>] xfs_extent_free_create_intent+0x50/0x150
        [<ffffffff821b671a>] xfs_defer_create_intents+0x16a/0x340
        [<ffffffff821bac3e>] xfs_defer_ops_capture_and_commit+0x8e/0xad0
        [<ffffffff82322bb9>] xfs_cui_item_recover+0x819/0x980
        [<ffffffff823289b6>] xlog_recover_process_intents+0x246/0xb70
        [<ffffffff8233249a>] xlog_recover_finish+0x8a/0x9a0
        [<ffffffff822eeafb>] xfs_log_mount_finish+0x2bb/0x4a0
        [<ffffffff822c0f4f>] xfs_mountfs+0x14bf/0x1e70
        [<ffffffff822d1f80>] xfs_fs_fill_super+0x10d0/0x1b20
        [<ffffffff81a21fa2>] get_tree_bdev+0x3d2/0x6d0
        [<ffffffff81a1ee09>] vfs_get_tree+0x89/0x2c0
        [<ffffffff81a9f35f>] path_mount+0xecf/0x1800
        [<ffffffff81a9fd83>] do_mount+0xf3/0x110
        [<ffffffff81aa00e4>] __x64_sys_mount+0x154/0x1f0
        [<ffffffff83968739>] do_syscall_64+0x39/0x80

    Fix the problem above by aborting intent items that don't have a done
    item when intent recovery fails.

    Fixes: e6fff81e48 ("xfs: proper replay of deferred ops queued during log recovery")
    Signed-off-by: Long Li <leo.lilong@huawei.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:44 -06:00
Bill O'Donnell 6c8f0df398 xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
JIRA: https://issues.redhat.com/browse/RHEL-57114

commit 8b010acb3154b669e52f0eef4a6d925e3cc1db2f
Author: Wang Jianchao <jianchwa@outlook.com>
Date:   Wed Sep 13 09:38:01 2023 +0800

    xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail

    In our production environment, we find that mounting a 500M /boot
    that was unmounted cleanly takes ~6s. One cause is that ffs() is
    used by xlog_write_log_records() to decide the buffer size. It can
    easily cause a lot of small IO when xlog_clear_stale_blocks() needs
    to wrap around the end of the log area and the log head block is not
    a power of two. Things are similar in xlog_find_verify_cycle().

    The code is able to handle bigger buffers very well, so we can use
    roundup_pow_of_two() to replace ffs() directly and avoid small and
    synchronous IOs.
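
    A minimal sketch of the sizing change (illustrative; variable names
    from the surrounding code are assumed):

      /* old: ffs() keys off the lowest set bit, so an odd block count
       * degrades to many small, synchronous writes */
      bufblks = 1 << ffs(blocks);

      /* new: one power-of-two buffer large enough to cover the span */
      bufblks = roundup_pow_of_two(blocks);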

    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Wang Jianchao <wangjc136@midea.com>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-10-15 10:46:33 -05:00
Lucas Zampieri bfdd109754 Merge: CVE-2024-41014: xfs: add bounds checking to xlog_recover_process_data
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4835

JIRA: https://issues.redhat.com/browse/RHEL-50862  
CVE: CVE-2024-41014

```
xfs: add bounds checking to xlog_recover_process_data

There is a lack of verification of the space occupied by fixed members
of xlog_op_header in the xlog_recover_process_data.

We can create a crafted image to trigger an out of bounds read by
following these steps:
    1) Mount an image of xfs, and do some file operations to leave records
    2) Before umounting, copy the image for subsequent steps to simulate
       abnormal exit. Because umount will ensure that tail_blk and
       head_blk are the same, which will result in the inability to enter
       xlog_recover_process_data
    3) Write a tool to parse and modify the copied image in step 2
    4) Make the end of the xlog_op_header entries only 1 byte away from
       xlog_rec_header->h_size
    5) xlog_rec_header->h_num_logops++
    6) Modify xlog_rec_header->h_crc

Fix:
Add a check to make sure there is sufficient space to access fixed members
of xlog_op_header.

Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
(cherry picked from commit fb63435b7c7dc112b1ae1baea5486e0a6e27b196)
```

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>

Approved-by: Eric Sandeen <esandeen@redhat.com>
Approved-by: Andrey Albershteyn <aalbersh@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-08-13 12:44:39 +00:00
CKI Backport Bot 66642cc409 xfs: add bounds checking to xlog_recover_process_data
JIRA: https://issues.redhat.com/browse/RHEL-50862
CVE: CVE-2024-41014

commit fb63435b7c7dc112b1ae1baea5486e0a6e27b196
Author: lei lu <llfamsec@gmail.com>
Date:   Mon Jun 3 17:46:08 2024 +0800

    xfs: add bounds checking to xlog_recover_process_data

    There is a lack of verification of the space occupied by fixed members
    of xlog_op_header in the xlog_recover_process_data.

    We can create a crafted image to trigger an out of bounds read by
    following these steps:
        1) Mount an image of xfs, and do some file operations to leave records
        2) Before umounting, copy the image for subsequent steps to simulate
           abnormal exit. Because umount will ensure that tail_blk and
           head_blk are the same, which will result in the inability to enter
           xlog_recover_process_data
        3) Write a tool to parse and modify the copied image in step 2
        4) Make the end of the xlog_op_header entries only 1 byte away from
           xlog_rec_header->h_size
        5) xlog_rec_header->h_num_logops++
        6) Modify xlog_rec_header->h_crc

    Fix:
    Add a check to make sure there is sufficient space to access fixed members
    of xlog_op_header.
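
    A sketch of the kind of check added (illustrative; the surrounding
    variable names are assumed):

      /* the fixed part of the op header must fit in the record data */
      if (end - dp < sizeof(struct xlog_op_header))
              return -EFSCORRUPTED;
      ohead = (struct xlog_op_header *)dp;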

    Signed-off-by: lei lu <llfamsec@gmail.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-07-29 16:59:35 +00:00
Bill O'Donnell 2a8af76e78 xfs: fix log recovery buffer allocation for the legacy h_size fixup
JIRA: https://issues.redhat.com/browse/RHEL-46479

CVE: CVE-2024-39472

Conflicts: diffs from upstream since patches from the same series
are deemed unnecessary and skipped here. Also the entire series from
dchinner from Jan 16, 2024, beginning with 10634530f7 (xfs: convert
kmem_zalloc() to kzalloc()), and ending with 2c1e31ed5c88
(xfs: place intent recovery under NOFS allocation context) includes
some dependencies with conversions from kmem_free() to kfree(), etc,
that are unnecessary for this fix patch.

commit 45cf976008ddef4a9c9a30310c9b4fb2a9a6602a
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Apr 30 06:07:55 2024 +0200

    xfs: fix log recovery buffer allocation for the legacy h_size fixup

    Commit a70f9fe52d ("xfs: detect and handle invalid iclog size set by
    mkfs") added a fixup for incorrect h_size values used for the initial
    umount record in old xfsprogs versions.  Later commit 0c771b99d6
    ("xfs: clean up calculation of LR header blocks") cleaned up the log
    recovery buffer calculation, but stopped using the fixed-up h_size
    value to size the log recovery buffer, which can lead to an out of
    bounds access when the incorrect h_size does not come from the old
    mkfs tool, but from a fuzzer.

    Fix this by open coding xlog_logrec_hblks and taking the fixed h_size
    into account for this calculation.
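
    A sketch of the open-coded sizing (illustrative; based on the
    description, not the exact diff):

      /* size the LR header blocks from the fixed-up h_size, not the
       * value read off disk */
      hblks = 1;
      if (h_size > XLOG_HEADER_CYCLE_SIZE)
              hblks = DIV_ROUND_UP(h_size, XLOG_HEADER_CYCLE_SIZE);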

    Fixes: 0c771b99d6 ("xfs: clean up calculation of LR header blocks")
    Reported-by: Sam Sun <samsun1006219@gmail.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-07-15 15:39:39 -05:00
Bill O'Donnell 19fab0b814 xfs: collect errors from inodegc for unlinked inode recovery
JIRA: https://issues.redhat.com/browse/RHEL-2002

Conflicts: context differences due to out of order patch application

commit d4d12c02bf5f768f1b423c7ae2909c5afdfe0d5f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 14:48:15 2023 +1000

    xfs: collect errors from inodegc for unlinked inode recovery

    Unlinked list recovery requires that errors removing an inode from
    the unlinked list get fed back to the main recovery loop. Now that
    we offload the unlinking to the inodegc work, we don't get errors
    fed back when we trip over a corruption that prevents the inode
    from being removed from the unlinked list.

    This means we never clear the corrupt unlinked list bucket,
    resulting in runtime operations eventually tripping over it and
    shutting down.

    Fix this by collecting inodegc worker errors and feeding them
    back to the flush caller. This is largely best effort - the only
    context that really cares is log recovery, and it only flushes a
    single inode at a time so we don't need complex synchronised
    handling. Essentially the inodegc workers will capture the first
    error that occurs and the next flush will gather them and clear
    them. The flush itself will only report the first gathered error.

    In the cases where callers can return errors, propagate the
    collected inodegc flush error up the error handling chain.

    In the case of inode unlinked list recovery, there are several
    superfluous calls to flush queued unlinked inodes -
    xlog_recover_iunlink_bucket() guarantees that it has flushed the
    inodegc and collected errors before it returns. Hence nothing in the
    calling path needs to run a flush, even when an error is returned.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:27 -06:00
Bill O'Donnell 454ef639aa xfs: avoid a UAF when log intent item recovery fails
JIRA: https://issues.redhat.com/browse/RHEL-2002

commit 97cf79677ecb50a38517253ae2fd705849a7e51a
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sun Oct 16 17:54:40 2022 -0700

    xfs: avoid a UAF when log intent item recovery fails

    KASAN reported a UAF bug when I was running xfs/235:

     BUG: KASAN: use-after-free in xlog_recover_process_intents+0xa77/0xae0 [xfs]
     Read of size 8 at addr ffff88804391b360 by task mount/5680

     CPU: 2 PID: 5680 Comm: mount Not tainted 6.0.0-xfsx #6.0.0 77e7b52a4943a975441e5ac90a5ad7748b7867f6
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
     Call Trace:
      <TASK>
      dump_stack_lvl+0x34/0x44
      print_report.cold+0x2cc/0x682
      kasan_report+0xa3/0x120
      xlog_recover_process_intents+0xa77/0xae0 [xfs fb841c7180aad3f8359438576e27867f5795667e]
      xlog_recover_finish+0x7d/0x970 [xfs fb841c7180aad3f8359438576e27867f5795667e]
      xfs_log_mount_finish+0x2d7/0x5d0 [xfs fb841c7180aad3f8359438576e27867f5795667e]
      xfs_mountfs+0x11d4/0x1d10 [xfs fb841c7180aad3f8359438576e27867f5795667e]
      xfs_fs_fill_super+0x13d5/0x1a80 [xfs fb841c7180aad3f8359438576e27867f5795667e]
      get_tree_bdev+0x3da/0x6e0
      vfs_get_tree+0x7d/0x240
      path_mount+0xdd3/0x17d0
      __x64_sys_mount+0x1fa/0x270
      do_syscall_64+0x2b/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
     RIP: 0033:0x7ff5bc069eae
     Code: 48 8b 0d 85 1f 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0f 00 f7 d8 64 89 01 48
     RSP: 002b:00007ffe433fd448 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
     RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff5bc069eae
     RDX: 00005575d7213290 RSI: 00005575d72132d0 RDI: 00005575d72132b0
     RBP: 00005575d7212fd0 R08: 00005575d7213230 R09: 00005575d7213fe0
     R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
     R13: 00005575d7213290 R14: 00005575d72132b0 R15: 00005575d7212fd0
      </TASK>

     Allocated by task 5680:
      kasan_save_stack+0x1e/0x40
      __kasan_slab_alloc+0x66/0x80
      kmem_cache_alloc+0x152/0x320
      xfs_rui_init+0x17a/0x1b0 [xfs]
      xlog_recover_rui_commit_pass2+0xb9/0x2e0 [xfs]
      xlog_recover_items_pass2+0xe9/0x220 [xfs]
      xlog_recover_commit_trans+0x673/0x900 [xfs]
      xlog_recovery_process_trans+0xbe/0x130 [xfs]
      xlog_recover_process_data+0x103/0x2a0 [xfs]
      xlog_do_recovery_pass+0x548/0xc60 [xfs]
      xlog_do_log_recovery+0x62/0xc0 [xfs]
      xlog_do_recover+0x73/0x480 [xfs]
      xlog_recover+0x229/0x460 [xfs]
      xfs_log_mount+0x284/0x640 [xfs]
      xfs_mountfs+0xf8b/0x1d10 [xfs]
      xfs_fs_fill_super+0x13d5/0x1a80 [xfs]
      get_tree_bdev+0x3da/0x6e0
      vfs_get_tree+0x7d/0x240
      path_mount+0xdd3/0x17d0
      __x64_sys_mount+0x1fa/0x270
      do_syscall_64+0x2b/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0

     Freed by task 5680:
      kasan_save_stack+0x1e/0x40
      kasan_set_track+0x21/0x30
      kasan_set_free_info+0x20/0x30
      ____kasan_slab_free+0x144/0x1b0
      slab_free_freelist_hook+0xab/0x180
      kmem_cache_free+0x1f1/0x410
      xfs_rud_item_release+0x33/0x80 [xfs]
      xfs_trans_free_items+0xc3/0x220 [xfs]
      xfs_trans_cancel+0x1fa/0x590 [xfs]
      xfs_rui_item_recover+0x913/0xd60 [xfs]
      xlog_recover_process_intents+0x24e/0xae0 [xfs]
      xlog_recover_finish+0x7d/0x970 [xfs]
      xfs_log_mount_finish+0x2d7/0x5d0 [xfs]
      xfs_mountfs+0x11d4/0x1d10 [xfs]
      xfs_fs_fill_super+0x13d5/0x1a80 [xfs]
      get_tree_bdev+0x3da/0x6e0
      vfs_get_tree+0x7d/0x240
      path_mount+0xdd3/0x17d0
      __x64_sys_mount+0x1fa/0x270
      do_syscall_64+0x2b/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0

     The buggy address belongs to the object at ffff88804391b300
      which belongs to the cache xfs_rui_item of size 688
     The buggy address is located 96 bytes inside of
      688-byte region [ffff88804391b300, ffff88804391b5b0)

     The buggy address belongs to the physical page:
     page:ffffea00010e4600 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888043919320 pfn:0x43918
     head:ffffea00010e4600 order:2 compound_mapcount:0 compound_pincount:0
     flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
     raw: 04fff80000010200 0000000000000000 dead000000000122 ffff88807f0eadc0
     raw: ffff888043919320 0000000080140010 00000001ffffffff 0000000000000000
     page dumped because: kasan: bad access detected

     Memory state around the buggy address:
      ffff88804391b200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ffff88804391b280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     >ffff88804391b300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                            ^
      ffff88804391b380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ffff88804391b400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ==================================================================

    The test fuzzes an rmap btree block and starts writer threads to induce
    a filesystem shutdown on the corrupt block.  When the filesystem is
    remounted, recovery will try to replay the committed rmap intent item,
    but the corruption problem causes the recovery transaction to fail.
    Cancelling the transaction frees the RUD, which frees the RUI that we
    recovered.

    When we return to xlog_recover_process_intents, @lip is now a dangling
    pointer, and we cannot use it to find the iop_recover method for the
    tracepoint.  Hence we must store the item ops before calling
    ->iop_recover if we want to give it to the tracepoint so that the trace
    data will tell us exactly which intent item failed.
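
    A sketch of the resulting pattern (illustrative; the tracepoint
    arguments are assumed):

      /*
       * Save the item ops before ->iop_recover; if recovery fails the
       * intent item may already have been freed, so @lip must not be
       * dereferenced afterwards.
       */
      const struct xfs_item_ops *ops = lip->li_ops;

      error = ops->iop_recover(lip, &capture_list);
      if (error)
              trace_xlog_intent_recovery_failed(log->l_mp, ops, error);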

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-06 19:42:17 -06:00
Bill O'Donnell 439ec50781 xfs: double link the unlinked inode list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2fd26cc07e9f8050e29bf314cbf1bcb64dbe088c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:46:43 2022 +1000

    xfs: double link the unlinked inode list

    Now we have forwards traversal via the incore inode in place, we now
    need to add back pointers to the incore inode to entirely replace
    the back reference cache. We use the same lookup semantics and
    constraints as for the forwards pointer lookups during unlinks, and
    so we can look up any inode in the unlinked list directly and update
    the list pointers, forwards or backwards, at any time.

    The only wrinkle in converting the unlinked list manipulations to
    use in-core previous pointers is that log recovery doesn't have the
    incore inode state built up so it can't just read in an inode and
    release it to finish off the unlink. Hence we need to modify the
    traversal in recovery to read one inode ahead before we
    release the inode at the head of the list. This populates the
    next->prev relationship sufficient to be able to replay the unlinked
    list and hence greatly simplify the runtime code.

    This recovery algorithm also requires that we actually remove inodes
    from the unlinked list one at a time as background inode
    inactivation will result in unlinked list removal racing with the
    building of the in-memory unlinked list state. We could serialise
    this by holding the AGI buffer lock when constructing the in memory
    state, but all that does is lockstep background processing with list
    building. It is much simpler to flush the inodegc immediately after
    releasing the inode so that it is unlinked immediately and there are
    no races present at all.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell 0a26a83d3a xfs: refactor xlog_recover_process_iunlinks()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 04755d2e5821b3afbaadd09fe5df58d04de36484
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:42:39 2022 +1000

    xfs: refactor xlog_recover_process_iunlinks()

    For upcoming changes to the way inode unlinked list processing is
    done, the structure of recovery needs to change slightly. We also
    really need to untangle the messy error handling in list recovery
    so that actions like emptying the bucket on inode lookup failure
    are associated with the bucket list walk failing, not failing
    to look up the inode.

    Refactor the recovery code now to keep the re-organisation separate
    from the algorithm changes.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell 959addd052 xfs: track the iunlink list pointer in the xfs_inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 4fcc94d653270fcc7800dbaf3b11f78cb462b293
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:38:54 2022 +1000

    xfs: track the iunlink list pointer in the xfs_inode

    Having direct access to the i_next_unlinked pointer in unlinked
    inodes greatly simplifies the processing of inodes on the unlinked
    list. We no longer need to look up the inode buffer just to find
    next inode in the list if the xfs_inode is in memory. These
    improvements will be realised over upcoming patches as other
    dependencies on the inode buffer for unlinked list processing are
    removed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell a27ea962ac xfs: Pre-calculate per-AG agbno geometry
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 0800169e3e2c97a033e8b7f3d1e6c689e0d71a19
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 19:13:02 2022 +1000

    xfs: Pre-calculate per-AG agbno geometry

    There is a lot of overhead in functions like xfs_verify_agbno() that
    repeatedly calculate the geometry limits of an AG. These can be
    pre-calculated as they are static and the verification context has
    a per-ag context it can quickly reference.

    In the case of xfs_verify_agbno(), we now always have a perag
    context handy, so we can store the AG length and the minimum valid
    block in the AG in the perag. This means we don't have to calculate
    it on every call and it can be inlined in callers if we move it
    to xfs_ag.h.
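
    A sketch of the resulting inlined check (illustrative; based on the
    description above, perag field names assumed):

      static inline bool
      xfs_verify_agbno(struct xfs_perag *pag, xfs_agblock_t agbno)
      {
              if (agbno >= pag->block_count)
                      return false;
              if (agbno <= pag->min_block)
                      return false;
              return true;
      }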

    Move xfs_ag_block_count() to xfs_ag.c because it's really a
    per-ag function and not an XFS type function. We need a little
    bit of rework that is specific to xfs_initialise_perag() to allow
    growfs to calculate the new perag sizes before we've updated the
    primary superblock during the grow (chicken/egg situation).

    Note that we leave the original xfs_verify_agbno in place in
    xfs_types.c as a static function as other callers in that file do
    not have per-ag contexts so still need to go the long way. It's been
    renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
    and an agbno to differentiate it from new function.

    Future commits will make similar changes for other per-ag geometry
    validation functions.

    Further:

    $ size --totals fs/xfs/built-in.a
               text    data     bss     dec     hex filename
    before  1483006  329588     572 1813166  1baaae (TOTALS)
    after   1482185  329588     572 1812345  1ba779 (TOTALS)

    This rework reduces the binary size by ~820 bytes, indicating
    that much less work is being done to bounds check the agbno values
    against on per-ag geometry information.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:40 -05:00
Bill O'Donnell 7048b0f4f2 xfs: pass perag to xfs_read_agi
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: fix one line error due to out of order rhel patch 13e2b274

commit 61021deb1faa5b2b913bf0ad76e2769276160b04
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 19:07:47 2022 +1000

    xfs: pass perag to xfs_read_agi

    We have the perag in most places we call xfs_read_agi, so pass the
    perag instead of a mount/agno pair.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:39 -05:00
Bill O'Donnell e208e0be52 xfs: convert buf_cancel_table allocation to kmalloc_array
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 910bbdf2f4d7df46781bc9b723048f5ebed3d0d7
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri May 27 10:27:19 2022 +1000

    xfs: convert buf_cancel_table allocation to kmalloc_array

    While we're messing around with how recovery allocates and frees the
    buffer cancellation table, convert the allocation to use kmalloc_array
    instead of the old kmem_alloc APIs, and make it handle a null return,
    even though that's not likely.
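
    A minimal sketch of the converted allocation (illustrative; table
    size and field names assumed):

      log->l_buf_cancel_table = kmalloc_array(XLOG_BC_TABLE_SIZE,
                      sizeof(struct list_head), GFP_KERNEL);
      if (!log->l_buf_cancel_table)
              return -ENOMEM;
      for (i = 0; i < XLOG_BC_TABLE_SIZE; i++)
              INIT_LIST_HEAD(&log->l_buf_cancel_table[i]);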

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:31 -05:00
Bill O'Donnell a245138396 xfs: refactor buffer cancellation table allocation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2723234923b3294dbcf6019c288c87465e927ed4
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri May 27 10:26:17 2022 +1000

    xfs: refactor buffer cancellation table allocation

    Move the code that allocates and frees the buffer cancellation tables
    used by log recovery into the file that actually uses the tables.  This
    is a precursor to some cleanups and a memory leak fix.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:30 -05:00
Bill O'Donnell 7d7d1f5774 xfs: Remove dead code
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit e62c720817597f259b81f1ff004eb042293bf046
Author: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Date:   Sun May 22 16:46:57 2022 +1000

    xfs: Remove dead code

    Remove the entire xlog_recover_check_summary() function; it is dead
    code and has been for 12 years.

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:28 -05:00
Bill O'Donnell db4b5bf1ae xfs: Set up infrastructure for log attribute replay
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit fd920008784ead369e79c2be2f8d9cc736e306ca
Author: Allison Henderson <allison.henderson@oracle.com>
Date:   Wed May 4 12:41:02 2022 +1000

    xfs: Set up infrastructure for log attribute replay

    Currently attributes are modified directly across one or more
    transactions. But they are not logged or replayed in the event of an
    error. The goal of log attr replay is to enable logging and replaying
    of attribute operations using the existing delayed operations
    infrastructure.  This will later enable the attributes to become part of
    larger multi part operations that also must first be recorded to the
    log.  This is mostly of interest in the scheme of parent pointers which
    would need to maintain an attribute containing parent inode information
    any time an inode is moved, created, or removed.  Parent pointers would
    then be of interest to any feature that would need to quickly derive an
    inode path from the mount point. Online scrub, nfs lookups and fs grow
    or shrink operations are all features that could take advantage of this.

    This patch adds two new log item types for setting or removing
    attributes as deferred operations.  The xfs_attri_log_item will log an
    intent to set or remove an attribute.  The corresponding
    xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
    freed once the transaction is done.  Both log items use a generic
    xfs_attr_log_format structure that contains the attribute name, value,
    flags, inode, and an op_flag that indicates if the operations is a set
    or remove.

    [dchinner: added extra little bits needed for intent whiteouts]

    Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:16 -05:00
Bill O'Donnell 35f4bdef44 xfs: log shutdown triggers should only shut down the log
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit b5f17bec1213a3ed2f4d79ad4c566e00cabe2a9b
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:22:01 2022 -0700

    xfs: log shutdown triggers should only shut down the log

    We've got a mess on our hands.

    1. xfs_trans_commit() cannot cancel transactions because the mount is
    shut down - that causes dirty, aborted, unlogged log items to sit
    unpinned in memory and potentially get written to disk before the
    log is shut down. Hence xfs_trans_commit() can only abort
    transactions when xlog_is_shutdown() is true.

    2. xfs_force_shutdown() is used in places to cause the current
    modification to be aborted via xfs_trans_commit() because it may be
    impractical or impossible to cancel the transaction directly, and
    hence xfs_trans_commit() must cancel transactions when
    xfs_is_shutdown() is true in this situation. But we can't do that
    because of #1.

    3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
    to shut down the mount and then the log from log IO completion.

    4. xfs_force_shutdown() can result in a log force being issued,
    which has to wait for log IO completion before it will mark the log
    as shut down. If #3 races with some other shutdown trigger that runs
    a log force, we rely on xfs_force_shutdown() silently ignoring #3
    and avoiding shutting down the log until the failed log force
    completes.

    5. To ensure #2 always works, we have to ensure that
    xfs_force_shutdown() does not return until the log is shut down.
    But in the case of #4, this will result in a deadlock because the
    log IO completion will block waiting for a log force to complete,
    which is blocked waiting for log IO to complete....

    So the very first thing we have to do here to untangle this mess is
    dissociate log shutdown triggers from mount shutdowns. We already
    have xlog_forced_shutdown, which will atomically transition the log
    to a shutdown state. Due to internal asserts it cannot be called
    multiple times, but was done simply because the only place that
    could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
    and that could only call it once and once only.  So the first thing
    we do is remove the asserts.

    We then convert all the internal log shutdown triggers to call
    xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
    allows the log shutdown triggers to shut down the log without
    needing to care about mount based shutdown constraints. This means
    we shut down the log independently of the mount and the mount may
    not notice this until its next attempt to read or modify metadata.
    At that point (e.g. xfs_trans_commit()) it will see that the log is
    shutdown, error out and shutdown the mount.

    To ensure that all the unmount behaviours and asserts track
    correctly as a result of a log shutdown, propagate the shutdown up
    to the mount if it is not already set. This keeps the mount and log
    state in sync, and saves a huge amount of hassle where code fails
    because of a log shutdown but only checks for mount shutdowns and
    hence ends up doing the wrong thing. Cleaning up that mess is
    an exercise for another day.

    This enables us to address the other problems noted above in
    followup patches.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:53 -05:00
Bill O'Donnell 79973eeacb xfs: shutdown in intent recovery has non-intent items in the AIL
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit ab9c81ef321f90dd208b1d4809c196c2794e4b15
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:22:00 2022 -0700

    xfs: shutdown in intent recovery has non-intent items in the AIL

    generic/388 triggered a failure in RUI recovery due to a corrupted
    btree record and the system then locked up hard due to a subsequent
    assert failure while holding a spinlock cancelling intents:

     XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
     XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
     XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
     Call Trace:
      <TASK>
      xlog_recover_cancel_intents.isra.0+0xd1/0x120
      xlog_recover_finish+0xb9/0x110
      xfs_log_mount_finish+0x15a/0x1e0
      xfs_mountfs+0x540/0x910
      xfs_fs_fill_super+0x476/0x830
      get_tree_bdev+0x171/0x270
      ? xfs_init_fs_context+0x1e0/0x1e0
      xfs_fs_get_tree+0x15/0x20
      vfs_get_tree+0x24/0xc0
      path_mount+0x304/0xba0
      ? putname+0x55/0x60
      __x64_sys_mount+0x108/0x140
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae

    Essentially, there's dirty metadata in the AIL from intent recovery
    transactions, so when we go to cancel the remaining intents we assume
    that all objects after the first non-intent log item in the AIL are
    not intents.

    This is not true. Intent recovery can log new intents to continue
    the operations the original intent could not complete in a single
    transaction. The new intents are committed before they are deferred,
    which means if the CIL commits in the background they will get
    inserted into the AIL at the head.

    Hence if we shut down the filesystem while processing intent
    recovery, the AIL may have new intents active at the current head.
    Hence this check:

                    /*
                     * We're done when we see something other than an intent.
                     * There should be no intents left in the AIL now.
                     */
                    if (!xlog_item_is_intent(lip)) {
    #ifdef DEBUG
                            for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
                                    ASSERT(!xlog_item_is_intent(lip));
    #endif
                            break;
                    }

    in both xlog_recover_process_intents() and
    xlog_recover_cancel_intents() is simply not valid. It was valid back
    when we only had EFI/EFD intents and didn't chain intents, but it
    hasn't been valid ever since intent recovery could create and commit
    new intents.

    Given that crashing the mount task like this pretty much prevents
    diagnosing what went wrong that led to the initial failure that
    triggered intent cancellation, just remove the checks altogether.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:53 -05:00
Bill O'Donnell 96ac087c0d xfs: Remove redundant assignment of mp
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit f4901a182d33d05a3b7020e2af97c635f6c47959
Author: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Date:   Wed Jan 5 11:12:37 2022 -0800

    xfs: Remove redundant assignment of mp

    mp is being initialized to log->l_mp but this is never read
    as record is overwritten later on. Remove the redundant
    assignment.

    Cleans up the following clang-analyzer warning:

    fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
    its initialization is never read [clang-analyzer-deadcode.DeadStores].

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:46 -05:00
Bill O'Donnell 9e0bb79551 xfs: only run COW extent recovery when there are no live extents
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 7993f1a431bc5271369d359941485a9340658ac3
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Dec 15 11:52:23 2021 -0800

    xfs: only run COW extent recovery when there are no live extents

    As part of multiple customer escalations due to file data corruption
    after copy on write operations, I wrote some fstests that use fsstress
    to hammer on COW to shake things loose.  Regrettably, I caught some
    filesystem shutdowns due to incorrect rmap operations with the following
    loop:

    mount <filesystem>                              # (0)
    fsstress <run only readonly ops> &              # (1)
    while true; do
            fsstress <run all ops>
            mount -o remount,ro                     # (2)
            fsstress <run only readonly ops>
            mount -o remount,rw                     # (3)
    done

    When (2) happens, notice that (1) is still running.  xfs_remount_ro will
    call xfs_blockgc_stop to walk the inode cache to free all the COW
    extents, but the blockgc mechanism races with (1)'s reader threads to
    take IOLOCKs and loses, which means that it doesn't clean them all out.
    Call such a file (A).

    When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
    walks the ondisk refcount btree and frees any COW extent that it finds.
    This function does not check the inode cache, which means that the
    incore COW fork of inode (A) is now inconsistent with the ondisk
    metadata.  If one of those former COW extents is allocated and mapped
    into another
    file (B) and someone triggers a COW to the stale reservation in (A), A's
    dirty data will be written into (B) and once that's done, those blocks
    will be transferred to (A)'s data fork without bumping the refcount.

    The results are catastrophic -- file (B) and the refcount btree are now
    corrupt.  In the first patch, we fixed the race condition in (2) so that
    (A) will always flush the COW fork.  In this second patch, we move the
    _recover_cow call to the initial mount call in (0) for safety.

    As mentioned previously, xfs_reflink_recover_cow walks the refcount
    btree looking for COW staging extents, and frees them.  This was
    intended to be run at mount time (when we know there are no live inodes)
    to clean up any leftover staging events that may have been left behind
    during an unclean shutdown.  As a time "optimization" for readonly
    mounts, we deferred this to the ro->rw transition, not realizing that
    any failure to clean all COW forks during a rw->ro transition would
    result in catastrophic corruption.

    Therefore, remove this optimization and only run the recovery routine
    when we're guaranteed not to have any COW staging extents anywhere,
    which means we always run this at mount time.  While we're at it, move
    the callsite to xfs_log_mount_finish because any refcount btree
    expansion (however unlikely given that we're removing records from the
    right side of the index) must be fed by a per-AG reservation, which
    doesn't exist in its current location.

    Fixes: 174edb0e46 ("xfs: store in-progress CoW allocations in the refcount btree")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:44 -05:00
Frantisek Hrbata 64e5412cb6 Merge: XFS update to v5.16
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1508

Bugzilla: https://bugzilla.redhat.com/2125724

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>

Approved-by: Bill O'Donnell <bodonnel@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Eric Sandeen <esandeen@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-23 02:46:01 -05:00
Carlos Maiolino 854effbe4c xfs: port the defer ops capture and continue to resource capture
Bugzilla: https://bugzilla.redhat.com/2125724

When log recovery tries to recover a transaction that had log intent
items attached to it, it has to save certain parts of the transaction
state (reservation, dfops chain, inodes with no automatic unlock) so
that it can finish single-stepping the recovered transactions before
finishing the chains.

This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
functions.  Right now they open-code this functionality, so let's port
this to the formalized resource capture structure that we introduced in
the previous patch.  This enables us to hold up to two inodes and two
buffers during log recovery, the same way we do for regular runtime.

With this patch applied, we'll be ready to support atomic extent swap
which holds two inodes; and logged xattrs which holds one inode and one
xattr leaf buffer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 512edfac85d243ed6a5a5f42f513ebb7c2d32863)
2022-10-21 12:50:46 +02:00
Ming Lei f0231c3baa fs/xfs: Use the enum req_op and blk_opf_t types
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511

commit d03025aef8676e826b69f8e3ec9bb59a5ad0c31d
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Thu Jul 14 11:07:28 2022 -0700

    fs/xfs: Use the enum req_op and blk_opf_t types

    Improve static type checking by using the enum req_op type for variables
    that represent a request operation and the new blk_opf_t type for the
    combination of a request operation with request flags.

    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20220714180729.1065367-63-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-10-12 09:20:22 +08:00
Brian Foster 13e2b27442 xfs: flush inode gc workqueue before clearing agi bucket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
Conflicts: Pass mp directly due to nonexistent pag. The pag
	reference comes in subsequent refactoring patches.

commit 04a98a036cf8b810dda172a9dcfcbd783bf63655
Author: Zhang Yi <yi.zhang@huawei.com>
Date:   Thu Jul 14 11:36:36 2022 +1000

    xfs: flush inode gc workqueue before clearing agi bucket

    In the procedure of recovering AGI unlinked lists, if something bad
    happens on one of the unlinked inodes in the bucket list, we would call
    xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket
    list, not just the unlinked inodes after the bad one. If we have
    already added some inodes to the gc workqueue before the bad inode in
    the list, we could get the error below when freeing those inodes, and
    finally fail to complete the log recovery procedure.

     XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
     fs/xfs/xfs_inode.c.  Caller xfs_ifree+0xb0/0x360 [xfs]

    The problem is that xlog_recover_clear_agi_bucket() clears the bucket
    list, so the gc worker's xfs_verify_agino() check on the agino fails.
    Fix this by flushing the workqueue before clearing the bucket.
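
    Purely to illustrate the ordering (a simplified sketch; the actual
    patch uses the filesystem's own inodegc flush helper and sits inside
    the larger recovery path):

    /* Sketch only: make queued inodegc work see the still-intact
     * bucket list before we zap it. */
    static void clear_bucket_sketch(struct xfs_mount *mp,
    				    xfs_agnumber_t agno, int bucket)
    {
    	flush_workqueue(mp->m_inodegc_wq);		/* drain pending inactivations */
    	xlog_recover_clear_agi_bucket(mp, agno, bucket);	/* then clear the bucket */
    }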

    Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
    Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:38 -04:00
Brian Foster e406884b42 xfs: introduce xfs_sb_is_v5 helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit d6837c1aab42e70141fd3875ba05eb69ffb220f0
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:56 2021 -0700

    xfs: introduce xfs_sb_is_v5 helper

    Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
    checks everywhere, add a simple wrapper to encapsulate this and make
    the code easier to read.

    This allows us to remove the xfs_sb_version_has_v3inode() wrapper
    which is only used in xfs_format.h now and is just a version number
    check.

    There are a couple of places where we should be checking the mount
    feature bits rather than the superblock version (e.g. remount), so
    those are converted to use xfs_has_crc(mp) instead.
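
    The helper is simple enough that its shape can be reconstructed from
    this description (shown here as a sketch, not quoted from the diff):

    static inline bool xfs_sb_is_v5(struct xfs_sb *sbp)
    {
    	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
    }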

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:35 -04:00
Brian Foster a672539203 xfs: convert remaining mount flags to state flags
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2e973b2cd4cdb993be94cca4c33f532f1ed05316
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert remaining mount flags to state flags

    The remaining mount flags kept in m_flags are actually runtime state
    flags. These change dynamically, so they really should be updated
    atomically so we don't potentially lose an update due to racing
    modifications.

    Convert these remaining flags to be stored in m_opstate and use
    atomic bitops to set and clear the flags. This also adds a couple of
    simple wrappers for common state checks - read only and shutdown.
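
    As a sketch of the pattern (the bit number and wrapper names are
    illustrative, not the exact ones added by the patch):

    #define XFS_OPSTATE_READONLY	0	/* example bit number */

    static inline bool xfs_is_readonly(struct xfs_mount *mp)
    {
    	return test_bit(XFS_OPSTATE_READONLY, &mp->m_opstate);
    }

    static inline void xfs_set_readonly(struct xfs_mount *mp)
    {
    	set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate);	/* atomic update */
    }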

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster d54a790d1d xfs: replace xfs_sb_version checks with feature flag checks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 38c26bfd90e1999650d5ef40f90d721f05916643
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:37 2021 -0700

    xfs: replace xfs_sb_version checks with feature flag checks

    Convert the xfs_sb_version_hasfoo() to checks against
    mp->m_features. Checks of the superblock itself during disk
    operations (e.g. in the read/write verifiers and the to/from disk
    formatters) are not converted - they operate purely on the
    superblock state. Everything else should use the mount features.

    Large parts of this conversion were done with sed with commands like
    this:

    for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
            sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
    done

    With manual cleanups for things like "xfs_has_extflgbit" and other
    little inconsistencies in naming.
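
    For example, a typical call site changes shape like this (feature
    name chosen for illustration; the called function is a placeholder):

    /* before */
    if (xfs_sb_version_hasreflink(&mp->m_sb))
    	xfs_reflink_do_something();

    /* after */
    if (xfs_has_reflink(mp))
    	xfs_reflink_do_something();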

    The result is a lot less typing to check features and an XFS binary
    size reduced by a bit over 3kB:

    $ size -t fs/xfs/built-in.a
            text       data     bss     dec     hex filename
    before  1130866  311352     484 1442702  16038e (TOTALS)
    after   1127727  311352     484 1439563  15f74b (TOTALS)

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 0cb6373dde xfs: reflect sb features in xfs_mount
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit a1d86e8dec8c1325d301c9d5594bb794bc428fc3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:26 2021 -0700

    xfs: reflect sb features in xfs_mount

    Currently, on-disk feature checks require decoding the superblock
    fields and so can be non-trivial. We have almost 400 individual
    feature checks in the XFS code, so this is a significant
    amount of code. To reduce runtime check overhead, pre-process all
    the version flags into a features field in the xfs_mount at mount
    time so we can convert all the feature checks to a simple flag
    check.

    There is also a need to convert the dynamic feature flags to update
    the m_features field. This is required for attr, attr2 and quota
    features. New xfs_mount based wrappers are added for this.
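
    A rough sketch of what this looks like (reconstructed from the
    description; the bit values and macro plumbing in the tree differ):

    #define XFS_FEAT_CRC		(1ULL << 0)	/* illustrative bit values */
    #define XFS_FEAT_REFLINK	(1ULL << 1)

    static inline bool xfs_has_reflink(struct xfs_mount *mp)
    {
    	return mp->m_features & XFS_FEAT_REFLINK;	/* one flag test at runtime */
    }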

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 93c5f50b4c xfs: convert log flags to an operational state field
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit e1d06e5f668a403f48538f0d6b163edfd4342adf
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:02 2021 -0700

    xfs: convert log flags to an operational state field

    log->l_flags doesn't actually contain "flags" as such; it contains
    operational state information that can change at runtime. The
    shutdown state, at least, should be an atomic bit, because it is
    read without holding locks in many places, so using atomic bitops
    for the state field modifications makes sense.

    This allows us to use things like test_and_set_bit() on state
    changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the
    state when we aren't holding locks.
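
    For instance, a warn-once state change can be made race-free with
    test_and_set_bit() (a sketch only; the real call sites and message
    differ):

    static void warn_tail_pushing_sketch(struct xlog *log)
    {
    	/* only the first caller to set the bit emits the warning */
    	if (!test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate))
    		xfs_warn(log->l_mp, "sketch: log tail threshold exceeded");
    }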

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:26 -04:00
Brian Foster b275d24234 xfs: move recovery needed state updates to xfs_log_mount_finish
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit fd67d8a07208ab06560287b7b9334c2d50b7d6d7
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:02 2021 -0700

    xfs: move recovery needed state updates to xfs_log_mount_finish

    xfs_log_mount_finish() needs to know if recovery is needed or not to
    make decisions on whether to flush the log and AIL.  Move the
    handling of the NEED_RECOVERY state out to this function rather than
    needing a temporary variable to store this state over the call to
    xlog_recover_finish().

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 156272e64a xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2039a272300b949c05888428877317b834c0b1fb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:01 2021 -0700

    xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()

    Make it less shouty and a static inline before adding more calls
    through the log code.

    Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
    to use xlog_is_shutdown(log) as well.
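
    The new predicate has roughly this shape (a sketch; the field it
    tests is later moved into the atomic opstate word by other patches
    in this series):

    static inline bool xlog_is_shutdown(struct xlog *log)
    {
    	return log->l_flags & XLOG_IO_ERROR;
    }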

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 63abcca3f1 xfs: refactor xfs_iget calls from log intent recovery
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 4bc619833f738f4fa8d931a71610795ebf5cec0e
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sun Aug 8 08:27:13 2021 -0700

    xfs: refactor xfs_iget calls from log intent recovery

    Hoist the code from xfs_bui_item_recover that igets an inode and marks
    it as being part of log intent recovery.  The next patch will want a
    common function.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 8207eb529d xfs: allow setting and clearing of log incompat feature flags
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 908ce71e54f8265fa909200410d6c50ab9a2d302
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sun Aug 8 08:27:12 2021 -0700

    xfs: allow setting and clearing of log incompat feature flags

    Log incompat feature flags in the superblock exist for one purpose: to
    protect the contents of a dirty log from replay on a kernel that isn't
    prepared to handle those dirty contents.  This means that they can be
    cleared if (a) we know the log is clean and (b) we know that there
    aren't any other threads in the system that might be setting or relying
    upon a log incompat flag.

    Therefore, clear the log incompat flags when we've finished recovering
    the log, when we're unmounting cleanly, remounting read-only, or
    freezing; and provide a function so that subsequent patches can start
    using this.
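
    As a very rough sketch of the rule (helper and parameter names below
    are hypothetical, and the real code also has to write the updated
    superblock back):

    static bool can_clear_log_incompat_sketch(struct xfs_mount *mp,
    					      bool log_clean, bool flag_users)
    {
    	if (!log_clean)		/* (a) a dirty log may still need the flag */
    		return false;
    	if (flag_users)		/* (b) someone may be setting or relying on it */
    		return false;
    	return xfs_sb_has_incompat_log_feature(&mp->m_sb,
    					       XFS_SB_FEAT_INCOMPAT_LOG_ALL);
    }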

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:24 -04:00
Brian Foster 8bf8cc906b xfs: replace kmem_alloc_large() with kvmalloc()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit d634525db63e9e946c3229fb93c8d9b763afbaf3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Aug 9 10:10:01 2021 -0700

    xfs: replace kmem_alloc_large() with kvmalloc()

    There is no reason for this wrapper to exist anymore. All the places
    that use KM_NOFS allocation are within transaction contexts and
    hence covered by memalloc_nofs_save/restore scopes. We therefore no
    longer need any special handling of vmalloc for large IOs, so
    special-casing this code isn't necessary.
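
    The call sites end up looking roughly like this (illustrative; the
    function name is a placeholder):

    static void *alloc_large_buffer_sketch(size_t size)
    {
    	/* was: kmem_alloc_large(size, KM_NOFS); GFP_NOFS now comes from
    	 * the surrounding memalloc_nofs_save() transaction scope */
    	return kvmalloc(size, GFP_KERNEL);
    }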

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:24 -04:00
Brian Foster 7fe76aa101 xfs: remove kmem_alloc_io()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 98fe2c3cef21b784e2efd1d9d891430d95b4f073
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Aug 9 10:10:01 2021 -0700

    xfs: remove kmem_alloc_io()

    Since commit 59bb47985c ("mm, sl[aou]b: guarantee natural alignment
    for kmalloc(power-of-two)"), the core slab code guarantees slab
    alignment sufficient for IO purposes in all situations (i.e. a
    minimum of 512-byte alignment for heap allocations of >= 512 bytes),
    so we no longer need the workaround in the XFS code to provide this
    guarantee.

    Replace the use of kmem_alloc_io() with kmem_alloc() or
    kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
    interface altogether.
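
    Illustratively (placeholder function; the real conversions touch the
    log and buffer cache allocation sites):

    static void *alloc_io_buffer_sketch(size_t size)
    {
    	/* was: kmem_alloc_io(size, align_mask, KM_NOFS); the slab
    	 * alignment guarantee quoted above makes plain kmem_alloc()
    	 * sufficient for IO-sized (>= 512 byte) allocations */
    	return kmem_alloc(size, KM_NOFS);
    }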

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:24 -04:00
Brian Foster 552c0d6db7 xfs: per-cpu deferred inode inactivation queues
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit ab23a7768739a23d21d8a16ca37dff96b1ca957a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Aug 6 11:05:39 2021 -0700

    xfs: per-cpu deferred inode inactivation queues

    Move inode inactivation to background work contexts so that it no
    longer runs in the context that releases the final reference to an
    inode. This allows processes that would otherwise block on
    inactivation to continue doing work while the filesystem processes
    the inactivation in the background.

    A typical demonstration of this is unlinking an inode with lots of
    extents. The extents are removed during inactivation, so this blocks
    the process that unlinked the inode from the directory structure. By
    moving the inactivation to the background process, the userspace
    application can keep working (e.g. unlinking the next inode in the
    directory) while the inactivation work on the previous inode is
    done by a different CPU.

    The implementation of the queue is relatively simple. We use a
    per-cpu lockless linked list (llist) to queue inodes for
    inactivation without requiring serialisation mechanisms, and a work
    item to allow the queue to be processed by a CPU bound worker
    thread. We also keep a count of the queue depth so that we can
    trigger work after a number of deferred inactivations have been
    queued.

    The use of a bound workqueue with a single work depth allows the
    workqueue to run one work item per CPU. We queue the work item on
    the CPU we are currently running on, and so this essentially gives
    us affine per-cpu worker threads for the per-cpu queues. This
    maintains the effective CPU affinity that occurs within XFS at the
    AG level due to all objects in a directory being local to an AG.
    Hence inactivation work tends to run on the same CPU that last
    accessed all the objects that inactivation accesses and this
    maintains hot CPU caches for unlink workloads.

    A depth of 32 inodes was chosen to match the number of inodes in an
    inode cluster buffer. This hopefully allows sequential
    allocation/unlink behaviours to defer inactivation of all the
    inodes in a single cluster buffer at a time, further helping
    maintain hot CPU and buffer cache accesses while running
    inactivations.

    A hard per-cpu queue throttle of 256 inodes has been set to avoid
    runaway queuing when inodes that take a long time to inactivate are
    being processed, for example when unlinking inodes with large
    numbers of extents that take a lot of processing to free.
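
    A compressed sketch of the queueing side described above (all names
    are illustrative; error handling, the 256-inode throttle, and the
    worker itself are omitted):

    struct inodegc_queue_sketch {
    	struct llist_head	list;	/* lockless per-cpu list of inodes */
    	struct work_struct	work;	/* run by a CPU-bound worker */
    	unsigned int		items;	/* current queue depth */
    };

    #define INODEGC_BATCH	32	/* one inode cluster buffer's worth */

    static void queue_inactivation_sketch(struct workqueue_struct *wq,
    				struct inodegc_queue_sketch __percpu *pcp,
    				struct llist_node *inode_entry)
    {
    	struct inodegc_queue_sketch *gc = get_cpu_ptr(pcp);

    	llist_add(inode_entry, &gc->list);
    	if (++gc->items >= INODEGC_BATCH)
    		queue_work_on(smp_processor_id(), wq, &gc->work);
    	put_cpu_ptr(pcp);
    }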

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [djwong: tweak comments and tracepoints, convert opflags to state bits]
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Rafael Aquini e2e7fe38b6 mm: Add kvrealloc()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit de2860f4636256836450c6543be744a50118fc66
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Aug 9 10:10:00 2021 -0700

    mm: Add kvrealloc()

    During log recovery of an XFS filesystem with 64kB directory
    buffers, rebuilding a buffer split across two log records results
    in a memory allocation warning from krealloc like this:

    xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
    XFS (dm-0): Unmounting Filesystem
    XFS (dm-0): Mounting V5 Filesystem
    XFS (dm-0): Starting recovery (logdev: internal)
    ------------[ cut here ]------------
    WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
    .....
    RIP: 0010:get_page_from_freelist+0xdee/0xe40
    Call Trace:
     ? complete+0x3f/0x50
     __alloc_pages+0x16f/0x300
     alloc_pages+0x87/0x110
     kmalloc_order+0x2c/0x90
     kmalloc_order_trace+0x1d/0x90
     __kmalloc_track_caller+0x215/0x270
     ? xlog_recover_add_to_cont_trans+0x63/0x1f0
     krealloc+0x54/0xb0
     xlog_recover_add_to_cont_trans+0x63/0x1f0
     xlog_recovery_process_trans+0xc1/0xd0
     xlog_recover_process_ophdr+0x86/0x130
     xlog_recover_process_data+0x9f/0x160
     xlog_recover_process+0xa2/0x120
     xlog_do_recovery_pass+0x40b/0x7d0
     ? __irq_work_queue_local+0x4f/0x60
     ? irq_work_queue+0x3a/0x50
     xlog_do_log_recovery+0x70/0x150
     xlog_do_recover+0x38/0x1d0
     xlog_recover+0xd8/0x170
     xfs_log_mount+0x181/0x300
     xfs_mountfs+0x4a1/0x9b0
     xfs_fs_fill_super+0x3c0/0x7b0
     get_tree_bdev+0x171/0x270
     ? suffix_kstrtoint.constprop.0+0xf0/0xf0
     xfs_fs_get_tree+0x15/0x20
     vfs_get_tree+0x24/0xc0
     path_mount+0x2f5/0xaf0
     __x64_sys_mount+0x108/0x140
     do_syscall_64+0x3a/0x70
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Essentially, we are taking a multi-order allocation from kmem_alloc()
    (which has an open-coded no-fail, no-warn loop) and then
    reallocating it out to 64kB using krealloc(__GFP_NOFAIL), which
    then triggers the above warning.

    This is a regression caused by converting this code from an open
    coded no fail/no warn reallocation loop to using __GFP_NOFAIL.

    What we actually need here is kvrealloc(), so that if contiguous
    page allocation fails we fall back to vmalloc() and we don't
    get nasty warnings happening in XFS.
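
    The call-site change implied here looks roughly like the following
    (a sketch; the four-argument kvrealloc() form shown is the one this
    commit introduces):

    static void *grow_cont_buffer_sketch(void *old_ptr, size_t old_len,
    				     size_t add_len)
    {
    	/* falls back to vmalloc() if a contiguous allocation fails,
    	 * instead of tripping the page allocator warning above */
    	return kvrealloc(old_ptr, old_len, old_len + add_len, GFP_KERNEL);
    }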

    Fixes: 771915c4f6 ("xfs: remove kmem_realloc()")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:25 -05:00
Darrick J. Wong 4e6b8270c8 xfs: force the log offline when log intent item recovery fails
If any part of log intent item recovery fails, we should shut down the
log immediately to stop the log from writing a clean unmount record to
disk, because the metadata is not consistent.  The inability to cancel a
dirty transaction catches most of these cases, but there are a few
things that have slipped through the cracks, such as ENOSPC from a
transaction allocation, or runtime errors that result in cancellation of
a non-dirty transaction.

This solves some weird behaviors reported by customers where a system
goes down, the first mount fails, the second succeeds, but then the fs
goes down later because of inconsistent metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-06-21 10:14:24 -07:00
Dave Chinner 934933c3ee xfs: convert raw ag walks to use for_each_perag
Convert the raw walks to an iterator, pulling the current AG out of
pag->pag_agno instead of the loop iterator variable.
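
The converted walks end up shaped like this (a sketch; the loop body is a
placeholder function):

    static void walk_all_ags_sketch(struct xfs_mount *mp)
    {
    	struct xfs_perag	*pag;
    	xfs_agnumber_t		agno;

    	for_each_perag(mp, agno, pag) {
    		/* the AG number comes from the perag itself */
    		do_something_with_ag_sketch(mp, pag->pag_agno);
    	}
    }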

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 9bbafc7191 xfs: move xfs_perag_get/put to xfs_ag.[ch]
They are AG functions, not superblock functions, so move them to the
appropriate location.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Christoph Hellwig 9b3beb028f xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinode
The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs.  So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode.  The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Christoph Hellwig af9dcddef6 xfs: split xfs_imap_to_bp
Split looking up the dinode from xfs_imap_to_bp, which can be
significantly simplified as a result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:02 -07:00
Bhaskar Chowdhury bd24a4f5f7 xfs: Rudimentary typo fixes
s/filesytem/filesystem/
s/instrumention/instrumentation/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25 16:47:52 -07:00