Commit Graph

291 Commits

Author SHA1 Message Date
Patrick Talbert fdb3eab93b Merge: XFS: Update #3 for RHEL9.6 (upstream v6.7-6.8)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5785

XFS: update #3 for RHEL9.6. Backport upstream v6.7-6.8, including fix patches posted after v6.8.

JIRA: https://issues.redhat.com/browse/RHEL-65728

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5633

Omitted-fix:  a18a69bbec083 ("xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock")

Missing several dependency patches that merged upstream in Aug 2024, well beyond this update
(e.g. 7996f10ce6c xfs: factor out a xfs_growfs_rt_bmblock helper).

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Eric Sandeen <esandeen@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2024-12-30 07:30:09 -05:00
Rado Vrbovsky 6c6c3d9cfe Merge: XFS: Update #2 for RHEL9.6 (upstream v6.6-6.7)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5633

XFS: update for RHEL9.6. Backport upstream v6.6-6.7, including fix patches posted after v6.7.

JIRA: https://issues.redhat.com/browse/RHEL-62760

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5188

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Eric Sandeen <esandeen@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-25 13:17:37 +00:00
Bill O'Donnell 8436f40296 xfs: set inode sick state flags when we zap either ondisk fork
JIRA: https://issues.redhat.com/browse/RHEL-65728

commit d9041681dd2f5334529a68868c9266631c384de4
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Dec 15 10:03:35 2023 -0800

    xfs: set inode sick state flags when we zap either ondisk fork

    In a few patches, we'll add some online repair code that tries to
    massage the ondisk inode record just enough to get it to pass the inode
    verifiers so that we can continue with more file repairs.  Part of that
    massaging can include zapping the ondisk forks to clear errors.  After
    that point, the bmap fork repair functions will rebuild the zapped
    forks.

    Christoph asked for stronger protections against online repair zapping a
    fork to get the inode to load vs. other threads trying to access the
    partially repaired file.  Do this by adding a special "[DA]FORK_ZAPPED"
    inode health flag whenever repair zaps a fork, and sprinkling checks for
    that flag into the various file operations for things that don't like
    handling an unexpected zero-extents fork.

    In practice xfs_scrub will scrub and fix the forks almost immediately
    after zapping them, so the window is very small.  However, if a crash or
    unmount should occur, we can still detect these zapped inode forks by
    looking for a zero-extents fork when data was expected.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-20 11:26:05 -06:00
Bill O'Donnell 9d385687d1 xfs: respect the stable writes flag on the RT device
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 9c04138414c00ae61421f36ada002712c4bac94a
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Oct 25 16:10:20 2023 +0200

    xfs: respect the stable writes flag on the RT device

    Update the per-folio stable writes flag depending on which device an
    inode resides on.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231025141020.192413-5-hch@lst.de
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:46 -06:00
Bill O'Donnell 7024e5c267 xfs: allow read IO and FICLONE to run concurrently
JIRA: https://issues.redhat.com/browse/RHEL-62760

commit 14a537983b228cb050ceca3a5b743d01315dc4aa
Author: Catherine Hoang <catherine.hoang@oracle.com>
Date:   Tue Oct 17 13:12:08 2023 -0700

    xfs: allow read IO and FICLONE to run concurrently

    One of our VM cluster management products needs to snapshot KVM image
    files so that they can be restored in case of failure. Snapshotting is
    done by redirecting VM disk writes to a sidecar file and using reflink
    on the disk image, specifically the FICLONE ioctl as used by
    "cp --reflink". Reflink locks the source and destination files while it
    operates, which means that reads from the main vm disk image are blocked,
    causing the vm to stall. When an image file is heavily fragmented, the
    copy process could take several minutes. Some of the vm image files have
    50-100 million extent records, and duplicating that much metadata locks
    the file for 30 minutes or more. Having activities suspended for such
    a long time in a cluster node could result in node eviction.

    Clone operations and read IO do not change any data in the source file,
    so they should be able to run concurrently. Demote the exclusive locks
    taken by FICLONE to shared locks to allow reads while cloning. While a
    clone is in progress, writes will take the IOLOCK_EXCL, so they block
    until the clone completes.
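
The locking change described above can be pictured with a toy compatibility table. This is only a sketch, not the XFS implementation: the `Mode` enum, the `BEFORE`/`AFTER` tables and `blocked_by_clone` are hypothetical names, and the model assumes reads take the IOLOCK in shared mode.

```python
from enum import Enum


class Mode(Enum):
    SHARED = "shared"
    EXCL = "excl"


def compatible(held, requested):
    """Two lock holders can coexist only when both modes are shared."""
    return held is Mode.SHARED and requested is Mode.SHARED


# Hypothetical per-operation lock modes on the source file's IOLOCK.
# BEFORE models FICLONE taking the lock exclusively; AFTER models the
# demotion to a shared lock described in the commit message.
BEFORE = {"read": Mode.SHARED, "write": Mode.EXCL, "clone": Mode.EXCL}
AFTER = {"read": Mode.SHARED, "write": Mode.EXCL, "clone": Mode.SHARED}


def blocked_by_clone(op, table):
    """Return True if `op` must wait while a clone holds the lock."""
    return not compatible(table["clone"], table[op])
```

In this model, `blocked_by_clone("read", BEFORE)` is true while `blocked_by_clone("read", AFTER)` is false, and writes remain blocked in both tables.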

    Link: https://lore.kernel.org/linux-xfs/8911B94D-DD29-4D6E-B5BC-32EAF1866245@oracle.com/
    Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
    Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:44 -06:00
Ian Kent 2171c567b5 fs: port inode_init_owner() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream does not have upstream commit 3db1de0e582c3 ("f2fs:
	change the current atomic write way") so there is no call to
	f2fs_get_tmpfile() in f2fs_ioc_start_atomic_write() to change.
	The above patch also adds the definition of f2fs_get_tmpfile()
	to fs/f2fs/f2fs.h so it's not there to change resulting in a
	hunk reject for fs/f2fs/f2fs.h.
	Upstream commit 787caf1bdcd9f ("f2fs: fix to enable compress for
	newly created file if extension matches") is not present in CentOS
	Stream, resulting in a number of rejects against fs/f2fs/namei.c;
	manually apply these changes.
	Dropped hunks for ntfs3 because the source is not present in
	the CentOS Stream source tree.
	CentOS Stream commit 892da692fa ("shmem: support idmapped
	mounts for tmpfs") causes a reject in mm/shmem.c; manually
	apply the hunk (note: taking account of these changes at the times
	they are needed will result in an updated mm/shmem.c once this
	series is completed).
	Update to add incremental changes needed due to CentOS Stream
	commit 469e1d13f6 ("shmem: quota support").

commit f2d40141d5d90b882e2c35b226f9244a63b82b6e
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:25 2023 +0100

    fs: port inode_init_owner() to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:26 +08:00
Pavel Reichl fb7e91a8db xfs: make inode unlinked bucket recovery work with quotacheck
JIRA: https://issues.redhat.com/browse/RHEL-7990

Teach quotacheck to reload the unlinked inode lists when walking the
inode table.  This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit 49813a21ed57895b73ec4ed3b99d4beec931496f)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2024-06-19 23:41:45 +02:00
Pavel Reichl c5942acf7c xfs: reload entire unlinked bucket lists
JIRA: https://issues.redhat.com/browse/RHEL-7990

The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality.  It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list.  For example,
if at mount time the ondisk iunlink bucket looks like this:

AGI -> 7 -> 22 -> 3 -> NULL

None of these three inodes are cached in memory.  Now let's say that
someone tries to open inode 3 by handle.  We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.
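
The walk in this example can be sketched as a toy model. It is an illustration under stated assumptions, not the kernel code: `reload_bucket`, the `ondisk_next` map and the dictionary used as the incore cache are all hypothetical, and `None` stands in for the end-of-list marker.

```python
NULLAGINO = None  # hypothetical end-of-list marker for this model


def reload_bucket(head, ondisk_next, incore):
    """Walk the on-disk chain from the bucket head, loading every inode
    into the incore cache and recording its predecessor."""
    prev = NULLAGINO
    agino = head
    while agino is not NULLAGINO:
        incore[agino] = {"prev": prev, "next": ondisk_next[agino]}
        prev = agino
        agino = ondisk_next[agino]
    return incore


# AGI -> 7 -> 22 -> 3 -> NULL, with nothing cached yet.
ondisk_next = {7: 22, 22: 3, 3: NULLAGINO}
incore = reload_bucket(7, ondisk_next, {})
```

After the walk, inodes 7 and 22 are loaded as well as inode 3, and the back pointer of inode 3 refers to 22, matching the scenario in the text.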

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit 83771c50e42b92de6740a63e152c96c052d37736)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2024-06-19 23:41:36 +02:00
Pavel Reichl 875ab81f58 xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
JIRA: https://issues.redhat.com/browse/RHEL-7990

Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) require both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit f12b96683d6976a3a07fdf3323277c79dbe8f6ab)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2024-06-19 23:41:18 +02:00
Bill O'Donnell 19fab0b814 xfs: collect errors from inodegc for unlinked inode recovery
JIRA: https://issues.redhat.com/browse/RHEL-2002

Conflicts: context differences due to out of order patch application

commit d4d12c02bf5f768f1b423c7ae2909c5afdfe0d5f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jun 5 14:48:15 2023 +1000

    xfs: collect errors from inodegc for unlinked inode recovery

    Unlinked list recovery requires that errors removing the inode from
    the unlinked list get fed back to the main recovery loop. Now that
    we offload the unlinking to the inodegc work, we don't get errors
    being fed back when we trip over a corruption that prevents the
    inode from being removed from the unlinked list.

    This means we never clear the corrupt unlinked list bucket,
    resulting in runtime operations eventually tripping over it and
    shutting down.

    Fix this by collecting inodegc worker errors and feeding them
    back to the flush caller. This is largely best effort - the only
    context that really cares is log recovery, and it only flushes a
    single inode at a time so we don't need complex synchronised
    handling. Essentially the inodegc workers will capture the first
    error that occurs and the next flush will gather them and clear
    them. The flush itself will only report the first gathered error.

    In the cases where callers can return errors, propagate the
    collected inodegc flush error up the error handling chain.

    In the case of inode unlinked list recovery, there are several
    superfluous calls to flush queued unlinked inodes -
    xlog_recover_iunlink_bucket() guarantees that it has flushed the
    inodegc and collected errors before it returns. Hence nothing in the
    calling path needs to run a flush, even when an error is returned.
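
The "capture the first error, gather and clear on flush" behaviour described above can be modelled in a few lines. This is a simplified, single-threaded sketch; the `InodeGC` class, its method names and the error constants are hypothetical stand-ins, not the kernel's data structures.

```python
EFSCORRUPTED = 117  # illustrative value (EUCLEAN on Linux)
EIO = 5


class InodeGC:
    """Toy model of background inactivation with best-effort error feedback."""

    def __init__(self):
        self.first_error = 0
        self.queue = []

    def queue_work(self, work):
        """Queue a callable that returns 0 on success or a negative errno."""
        self.queue.append(work)

    def run_workers(self):
        for work in self.queue:
            err = work()
            # Only the first error seen since the last flush is kept.
            if err and not self.first_error:
                self.first_error = err
        self.queue.clear()

    def flush(self):
        """Run pending work, then report and clear the captured error."""
        self.run_workers()
        err, self.first_error = self.first_error, 0
        return err
```

A caller such as log recovery would flush after each inode and propagate a non-zero return value; a second flush with no new work returns 0 because the error has already been gathered and cleared.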

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-10 07:22:27 -06:00
Bill O'Donnell 2118b9b94d xfs: add dax dedupe support
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730

commit 13f9e267fdbba30820ce3999338b7d8fe7d6bf77
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date:   Fri Jun 3 13:37:38 2022 +0800

    xfs: add dax dedupe support

    Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files that
    are going to be deduped.  After that, call the compare range function only
    when both files are DAX or both are not.

    Link: https://lkml.kernel.org/r/20220603053738.1218681-15-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Dan Williams <dan.j.wiliams@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Ritesh Harjani <riteshh@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-06-16 10:35:47 -05:00
Bill O'Donnell 439ec50781 xfs: double link the unlinked inode list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 2fd26cc07e9f8050e29bf314cbf1bcb64dbe088c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:46:43 2022 +1000

    xfs: double link the unlinked inode list

    Now we have forwards traversal via the incore inode in place, we now
    need to add back pointers to the incore inode to entirely replace
    the back reference cache. We use the same lookup semantics and
    constraints as for the forwards pointer lookups during unlinks, and
    so we can look up any inode in the unlinked list directly and update
    the list pointers, forwards or backwards, at any time.

    The only wrinkle in converting the unlinked list manipulations to
    use in-core previous pointers is that log recovery doesn't have the
    incore inode state built up so it can't just read in an inode and
    release it to finish off the unlink. Hence we need to modify the
    traversal in recovery to read one inode ahead before we
    release the inode at the head of the list. This populates the
    next->prev relationship sufficient to be able to replay the unlinked
    list and hence greatly simplify the runtime code.

    This recovery algorithm also requires that we actually remove inodes
    from the unlinked list one at a time as background inode
    inactivation will result in unlinked list removal racing with the
    building of the in-memory unlinked list state. We could serialise
    this by holding the AGI buffer lock when constructing the in memory
    state, but all that does is lockstep background processing with list
    building. It is much simpler to flush the inodegc immediately after
    releasing the inode so that it is unlinked immediately and there are
    no races present at all.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell 959addd052 xfs: track the iunlink list pointer in the xfs_inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 4fcc94d653270fcc7800dbaf3b11f78cb462b293
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 14 11:38:54 2022 +1000

    xfs: track the iunlink list pointer in the xfs_inode

    Having direct access to the i_next_unlinked pointer in unlinked
    inodes greatly simplifies the processing of inodes on the unlinked
    list. We no longer need to look up the inode buffer just to find
    the next inode in the list if the xfs_inode is in memory. These
    improvements will be realised over upcoming patches as other
    dependencies on the inode buffer for unlinked list processing are
    removed.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:44 -05:00
Bill O'Donnell 3219617b1b xfs: replace inode fork size macros with functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c01147d929899f02a0a8b15e406d12784768ca72
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:07 2022 -0700

    xfs: replace inode fork size macros with functions

    Replace the shouty macros here with typechecked helper functions.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell f77675b5d0 xfs: replace XFS_IFORK_Q with a proper predicate function
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 932b42c66cb5d0ca9800b128415b4ad6b1952b3e
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: replace XFS_IFORK_Q with a proper predicate function

    Replace this shouty macro with a real C function that has a more
    descriptive name.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell 0036098801 xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit e45d7cb2356e6b59fe64da28324025cc6fcd3fbd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: use XFS_IFORK_Q to determine the presence of an xattr fork

    Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
    attribute fork but i_forkoff is zero.  This eliminates the ambiguity
    between i_forkoff and i_af.if_present, which should make it easier to
    understand the lifetime of attr forks.

    While we're at it, remove the if_present checks around calls to
    xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
    forks that have already been torn down.
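
The lookup rule described above can be sketched as a toy model. It is not the kernel function: the `Inode` dataclass, the `DATA_FORK`/`ATTR_FORK` constants and the helper names are simplified, hypothetical stand-ins chosen to mirror the text.

```python
from dataclasses import dataclass, field

DATA_FORK, ATTR_FORK = 0, 1  # hypothetical fork selectors


@dataclass
class Inode:
    forkoff: int = 0                          # zero means "no attr fork"
    df: dict = field(default_factory=dict)    # data fork, always present
    af: dict = field(default_factory=dict)    # attr fork storage, embedded


def inode_has_attr_fork(ip):
    """Single source of truth: the attr fork exists iff forkoff is non-zero."""
    return ip.forkoff > 0


def ifork_ptr(ip, whichfork):
    """Return the requested fork, or None for an absent attr fork."""
    if whichfork == DATA_FORK:
        return ip.df
    if whichfork == ATTR_FORK:
        return ip.af if inode_has_attr_fork(ip) else None
    raise ValueError("unknown fork")
```

Because the attr fork's presence is derived from `forkoff` alone, there is no second flag that could disagree with it, which is the ambiguity the commit message says it removes.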

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:43 -05:00
Bill O'Donnell a2d362f29a xfs: make inode attribute forks a permanent part of struct xfs_inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: previous out-of-order application of 5625ea0 requires a minor adjustment to xfs_iomap.c

commit 2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:06 2022 -0700

    xfs: make inode attribute forks a permanent part of struct xfs_inode

    Syzkaller reported a UAF bug a while back:

    ==================================================================
    BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
    Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958

    CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
    5.15.0-0.30.3-20220406_1406 #3
    Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
    04/01/2014
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
     print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
     __kasan_report mm/kasan/report.c:442 [inline]
     kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
     xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
     xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
     xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
     __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
     cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
     security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
     dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
     dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
     do_truncate+0xc3/0x1e0 fs/open.c:56
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0
    RIP: 0033:0x7f7ef4bb753d
    Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
    89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
    01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
    RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
    RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
    RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
    R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
     </TASK>

    Allocated by task 2953:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:46 [inline]
     set_alloc_info mm/kasan/common.c:434 [inline]
     __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
     kasan_slab_alloc include/linux/kasan.h:254 [inline]
     slab_post_alloc_hook mm/slab.h:519 [inline]
     slab_alloc_node mm/slub.c:3213 [inline]
     slab_alloc mm/slub.c:3221 [inline]
     kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
     kmem_cache_zalloc include/linux/slab.h:711 [inline]
     xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
     xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
     xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
     __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
     __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
     vfs_setxattr+0x154/0x33d fs/xattr.c:301
     setxattr+0x216/0x29f fs/xattr.c:575
     __do_sys_fsetxattr fs/xattr.c:632 [inline]
     __se_sys_fsetxattr fs/xattr.c:621 [inline]
     __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    Freed by task 2949:
     kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
     kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free mm/kasan/common.c:328 [inline]
     __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:1700 [inline]
     slab_free_freelist_hook mm/slub.c:1726 [inline]
     slab_free mm/slub.c:3492 [inline]
     kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
     xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
     xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
     xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
     xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
     xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
     xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
     __vfs_removexattr+0x106/0x16a fs/xattr.c:468
     cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
     security_inode_killpriv+0x54/0xa1 security/security.c:1414
     setattr_prepare+0x1a6/0x897 fs/attr.c:146
     xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
     xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
     xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
     notify_change+0xae5/0x10a1 fs/attr.c:410
     do_truncate+0x134/0x1e0 fs/open.c:64
     handle_truncate fs/namei.c:3084 [inline]
     do_open fs/namei.c:3432 [inline]
     path_openat+0x30ab/0x396d fs/namei.c:3561
     do_filp_open+0x1c4/0x290 fs/namei.c:3588
     do_sys_openat2+0x60d/0x98c fs/open.c:1212
     do_sys_open+0xcf/0x13c fs/open.c:1228
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0x0

    The buggy address belongs to the object at ffff88802cec9188
     which belongs to the cache xfs_ifork of size 40
    The buggy address is located 20 bytes inside of
     40-byte region [ffff88802cec9188, ffff88802cec91b0)
    The buggy address belongs to the page:
    page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
    index:0x0 pfn:0x2cec9
    flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
    raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
    raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
     ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
     ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
    >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                ^
     ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
     ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
    ==================================================================

    The root cause of this bug is the unlocked access to xfs_inode.i_afp
    from the getxattr code paths while trying to determine which ILOCK mode
    to use to stabilize the xattr data.  Unfortunately, the VFS does not
    acquire i_rwsem when vfs_getxattr (or listxattr) call into the
    filesystem, which means that getxattr can race with a removexattr that's
    tearing down the attr fork and crash:

    xfs_attr_set:                          xfs_attr_get:
    xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:

    xfs_idestroy_fork(ip->i_afp);
    kmem_cache_free(xfs_ifork_cache, ip->i_afp);

                                           if (ip->i_afp &&

    ip->i_afp = NULL;

                                               xfs_need_iread_extents(ip->i_afp))
                                           <KABOOM>

    ip->i_forkoff = 0;

    Regrettably, the VFS is much more lax about i_rwsem and getxattr than
    is immediately obvious -- not only does it not guarantee that we hold
    i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
    The getxattr system call won't acquire the lock before calling XFS, but
    the file capabilities code calls getxattr with and without i_rwsem held
    to determine if the "security.capabilities" xattr is set on the file.

    Fixing the VFS locking requires a treewide investigation into every code
    path that could touch an xattr and what i_rwsem state it expects or sets
    up.  That could take years or even prove impossible; fortunately, we
    can fix this UAF problem inside XFS.

    An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
    ensure that i_forkoff is always zeroed before i_afp is set to null and
    changed the read paths to use smp_rmb before accessing i_forkoff and
    i_afp, which avoided these UAF problems.  However, the patch author was
    too busy dealing with other problems in the meantime, and by the time he
    came back to this issue, the situation had changed a bit.

    On a modern system with selinux, each inode will always have at least
    one xattr for the selinux label, so it doesn't make much sense to keep
    incurring the extra pointer dereference.  Furthermore, Allison's
    upcoming parent pointer patchset will also cause nearly every inode in
    the filesystem to have extended attributes.  Therefore, make the inode
    attribute fork structure part of struct xfs_inode, at a cost of 40 more
    bytes.

    This patch adds a clunky if_present field where necessary to maintain
    the existing logic of xattr fork null pointer testing in the existing
    codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
    all goes away.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:42 -05:00
Bill O'Donnell 08529f7680 xfs: convert XFS_IFORK_PTR to a static inline helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 732436ef916b4f338d672ea56accfdb11e8d0732
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Sat Jul 9 10:56:05 2022 -0700

    xfs: convert XFS_IFORK_PTR to a static inline helper

    We're about to make this logic do a bit more, so convert the macro to a
    static inline function for better typechecking and fewer shouty macros.
    No functional changes here.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:42 -05:00
Bill O'Donnell 4e0101a18b xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 9b7d16e34bbebc0398b1dd4f2d64ae6793fdc5ea
Author: Chandan Babu R <chandan.babu@oracle.com>
Date:   Tue Nov 16 09:04:43 2021 +0000

    xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers

    This commit adds the new per-inode flag XFS_DIFLAG2_NREXT64 to indicate that
    an inode supports 64-bit extent counters. This flag is also enabled by default
    on newly created inodes when the corresponding filesystem has large extent
    counter feature bit (i.e. XFS_FEAT_NREXT64) set.
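
The default described above amounts to a small flag computation, sketched here as a toy model. The bit positions and the helper names `new_inode_flags2` and `has_large_extent_counts` are placeholders chosen for illustration and should not be read as the on-disk values or the helpers the patch adds.

```python
# Placeholder bit values for illustration only.
XFS_FEAT_NREXT64 = 1 << 0       # filesystem-wide feature bit
XFS_DIFLAG2_NREXT64 = 1 << 4    # per-inode flag


def new_inode_flags2(fs_features, flags2=0):
    """Set the per-inode flag on a new inode when the filesystem has the
    large extent counter feature enabled."""
    if fs_features & XFS_FEAT_NREXT64:
        flags2 |= XFS_DIFLAG2_NREXT64
    return flags2


def has_large_extent_counts(flags2):
    """Report whether an inode uses 64-bit extent counters."""
    return bool(flags2 & XFS_DIFLAG2_NREXT64)
```

In this model an inode created on a filesystem without the feature bit keeps the flag clear, so existing inodes and older filesystems are unaffected.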

    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:57 -05:00
Bill O'Donnell ac51bc7be1 xfs: constify the name argument to various directory functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 996b2329b20a89963fa577d495cf057dd7bf129c
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Mar 9 10:16:09 2022 -0800

    xfs: constify the name argument to various directory functions

    Various directory functions do not modify their @name parameter,
    so mark it const to make that clear.  This will enable us to mark
    the global xfs_name_dotdot variable as const to prevent mischief.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:50 -05:00
Jeff Moyer 8304a71fb0 xfs: convert inode lock flags to unsigned.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit a103375307ade71f3394889310ba37abb23c1c21
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:47:16 2022 +1000

    xfs: convert inode lock flags to unsigned.
    
    5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
    fields to be unsigned.
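
As an illustration of why unsigned constants matter here: `1 << 31` shifts into the sign bit of a signed `int`, which is undefined behaviour in C, while `1u << 31` is well defined. The flag names and bit positions below are assumptions made for the sketch, not the XFS lock flags.

```c
#include <assert.h>

/* Illustrative flag names; the bit positions are assumptions. */
#define LOCK_EXCL    (1u << 0)
#define LOCK_SHARED  (1u << 1)
#define LOCK_TOPBIT  (1u << 31)  /* (1 << 31) would overflow a signed int */

/* Holds only where unsigned int is at least 32 bits wide. */
_Static_assert(LOCK_TOPBIT == 0x80000000u, "top bit fits in unsigned int");

/* Flags live in an unsigned field, so the constants are unsigned too. */
static inline unsigned int lock_flags_add(unsigned int flags, unsigned int f)
{
	return flags | f;
}

static inline int lock_flags_test(unsigned int flags, unsigned int f)
{
	return (flags & f) != 0;
}
```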
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:51:02 -04:00
Chris von Recklinghausen 841abeceba xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
Bugzilla: https://bugzilla.redhat.com/2160210

commit b39a04636fd7454911b80e7b5ab2a66b011a8145
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Jan 31 13:20:10 2022 -0800

    xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c

    The operations that xfs_update_prealloc_flags() performs are now
    unique to xfs_fs_map_blocks(), so move xfs_update_prealloc_flags()
    to be a static function in xfs_pnfs.c and cut out all the
    other functionality that it doesn't use anymore.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:46 -04:00
Carlos Maiolino 25a40d32f8 xfs: rename _zone variables to _cache
Bugzilla: https://bugzilla.redhat.com/2125724

Conflicts:
	Small conflict at xfs_inode_alloc() due to out of order
	backport. Inode alloc using kmem_cache_alloc() has been
	converted to use alloc_inode_sb() before this patch.

Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 182696fb021fc196e5cbe641565ca40fcf0f885a)
2022-10-21 12:50:46 +02:00
Carlos Maiolino d912d565bb xfs: remove kmem_zone typedef
Bugzilla: https://bugzilla.redhat.com/2125724

Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit e7720afad068a6729d9cd3aaa08212f2f5a7ceff)
2022-10-21 12:50:46 +02:00
Brian Foster 9723c70ba6 xfs: remove xfs_inew_wait
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 1090427bf18f9835b3ccbd36edf43f2509444e27
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Nov 24 10:06:02 2021 -0800

    xfs: remove xfs_inew_wait

    With the removal of xfs_dqrele_all_inodes, xfs_inew_wait and all the
    infrastructure used to wake the XFS_INEW bit waitqueue are unused.

    Reported-by: kernel test robot <lkp@intel.com>
    Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster 6def1029c3 xfs: convert mount flags to features
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
Conflicts: Work around out of order backport in xfs_fs_fill_super().

commit 0560f31a09e523090d1ab2bfe21c69d028c2bdf2
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert mount flags to features

    Replace m_flags feature checks with xfs_has_<feature>() calls and
    rework the setup code to set flags in m_features.
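
The pattern of generated `xfs_has_<feature>()` predicates over a single feature mask can be sketched roughly as follows. The `struct mount_sketch` type, the feature bits and the helper names are hypothetical; only the general shape (one bitmask, one inline predicate per feature) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits and mount structure. */
#define FEAT_REFLINK  (1ULL << 0)
#define FEAT_BIGTIME  (1ULL << 1)

struct mount_sketch { uint64_t m_features; };

/* Generate one has_<name>() predicate per feature bit. */
#define DEFINE_HAS_FEATURE(name, bit) \
static inline bool has_##name(const struct mount_sketch *mp) \
{ \
	return (mp->m_features & (bit)) != 0; \
}

DEFINE_HAS_FEATURE(reflink, FEAT_REFLINK)
DEFINE_HAS_FEATURE(bigtime, FEAT_BIGTIME)

/* Setup code sets bits in m_features instead of open-coded flag fields. */
static inline void add_feature(struct mount_sketch *mp, uint64_t bit)
{
	mp->m_features |= bit;
}
```

Callers then write `has_reflink(mp)` rather than testing a flags word directly, which keeps the bit layout private to one header.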

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster 552c0d6db7 xfs: per-cpu deferred inode inactivation queues
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit ab23a7768739a23d21d8a16ca37dff96b1ca957a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Aug 6 11:05:39 2021 -0700

    xfs: per-cpu deferred inode inactivation queues

    Move inode inactivation to background work contexts so that it no
    longer runs in the context that releases the final reference to an
    inode. This will allow process work that ends up blocking on
    inactivation to continue doing work while the filesystem processes
    the inactivation in the background.

    A typical demonstration of this is unlinking an inode with lots of
    extents. The extents are removed during inactivation, so this blocks
    the process that unlinked the inode from the directory structure. By
    moving the inactivation to the background process, the userspace
    application can keep working (e.g. unlinking the next inode in the
    directory) while the inactivation work on the previous inode is
    done by a different CPU.

    The implementation of the queue is relatively simple. We use a
    per-cpu lockless linked list (llist) to queue inodes for
    inactivation without requiring serialisation mechanisms, and a work
    item to allow the queue to be processed by a CPU bound worker
    thread. We also keep a count of the queue depth so that we can
    trigger work after a number of deferred inactivations have been
    queued.

    The use of a bound workqueue with a single work depth allows the
    workqueue to run one work item per CPU. We queue the work item on
    the CPU we are currently running on, and so this essentially gives
    us affine per-cpu worker threads for the per-cpu queues. This
    maintains the effective CPU affinity that occurs within XFS at the
    AG level due to all objects in a directory being local to an AG.
    Hence inactivation work tends to run on the same CPU that last
    accessed all the objects that inactivation accesses and this
    maintains hot CPU caches for unlink workloads.

    A depth of 32 inodes was chosen to match the number of inodes in an
    inode cluster buffer. This hopefully allows sequential
    allocation/unlink behaviours to defer inactivation of all the
    inodes in a single cluster buffer at a time, further helping
    maintain hot CPU and buffer cache accesses while running
    inactivations.

    A hard per-cpu queue throttle of 256 inodes has been set to avoid
    runaway queuing when inodes that take a long time to inactivate are
    being processed, for example when unlinking inodes with large
    numbers of extents that can take a lot of processing to free.
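
The queueing scheme described above (lockless list, depth counter, work trigger at 32, hard throttle at 256) can be modelled in userspace with C11 atomics. This is a simplified sketch of one CPU's queue, not the kernel implementation: it does not use the kernel llist or workqueue APIs, all names are hypothetical, and the use of `>=` at the two thresholds is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_DEPTH     32   /* queue depth that schedules the worker */
#define THROTTLE_DEPTH  256  /* queue depth at which the caller should wait */

struct node { struct node *next; };

struct inactive_queue {
	_Atomic(struct node *) head;  /* lockless LIFO list */
	atomic_uint            items; /* current queue depth */
};

enum queue_action { QUEUE_ONLY, QUEUE_KICK_WORKER, QUEUE_THROTTLE };

/* Push one inode without taking a lock, then report what the caller should do. */
static enum queue_action queue_inactive(struct inactive_queue *q, struct node *n)
{
	struct node *old = atomic_load(&q->head);
	unsigned int depth;

	do {
		n->next = old;  /* old is refreshed by a failed compare-exchange */
	} while (!atomic_compare_exchange_weak(&q->head, &old, n));

	depth = atomic_fetch_add(&q->items, 1) + 1;
	if (depth >= THROTTLE_DEPTH)
		return QUEUE_THROTTLE;
	if (depth >= BATCH_DEPTH)
		return QUEUE_KICK_WORKER;
	return QUEUE_ONLY;
}

/* Worker side: detach the whole list in one exchange and count the entries. */
static unsigned int drain_inactive(struct inactive_queue *q)
{
	struct node *n = atomic_exchange(&q->head, NULL);
	unsigned int done = 0;

	atomic_store(&q->items, 0);
	for (; n; n = n->next)
		done++;  /* a real worker would inactivate the inode here */
	return done;
}
```

In this model the 32nd push returns `QUEUE_KICK_WORKER` and the 256th returns `QUEUE_THROTTLE`; a real implementation would keep one such queue per CPU and run the drain from a CPU-bound work item.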

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [djwong: tweak comments and tracepoints, convert opflags to state bits]
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 1698a6765d xfs: detach dquots from inode if we don't need to inactivate it
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 62af7d54a0ec0b6f99d7d55ebeb9ecbb3371bc67
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Fri Aug 6 11:05:39 2021 -0700

    xfs: detach dquots from inode if we don't need to inactivate it

    If we don't need to inactivate an inode, we can detach the dquots and
    move on to reclamation.  This isn't strictly required here; it's a
    preparation patch for deferred inactivation per reviewer request[1] to
    move the creation of xfs_inode_needs_inactivation into a separate
    change.  Eventually this !need_inactive chunk will turn into the code
    path for inodes that skip xfs_inactive and go straight to memory
    reclaim.

    [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:22 -04:00
Brian Foster 2271d50477 xfs: Convert to use invalidate_lock
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2433480a7e1d0c057442b284c336cfaa61523117
Author: Jan Kara <jack@suse.cz>
Date:   Mon Apr 12 18:56:24 2021 +0200

    xfs: Convert to use invalidate_lock

    Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
    purpose of invalidate_lock is exactly the same. Note that the locking in
    __xfs_filemap_fault() slightly changes as filemap_fault() already takes
    invalidate_lock.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    CC: <linux-xfs@vger.kernel.org>
    CC: "Darrick J. Wong" <djwong@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:20 -04:00
Brian Foster a232ad8b72 xfs: Refactor xfs_isilocked()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit e31cbde7ecdcfdf22eac6fd37e63548adacc4ede
Author: Pavel Reichl <preichl@redhat.com>
Date:   Fri Oct 16 04:10:02 2020 +0200

    xfs: Refactor xfs_isilocked()

    Introduce a new __xfs_rwsem_islocked predicate to encapsulate checking
    the state of a rw_semaphore, then refactor xfs_isilocked to use it.

    Signed-off-by: Pavel Reichl <preichl@redhat.com>
    Suggested-by: Dave Chinner <dchinner@redhat.com>
    Suggested-by: Eric Sandeen <sandeen@redhat.com>
    Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Jan Kara <jack@suse.cz>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:20 -04:00
Pavel Reichl 54b50bb602 xfs: remove XFS_PREALLOC_SYNC
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2085722
Tested: xfstests
Upstream Status: upstream

Callers can achieve the same thing by calling xfs_log_force_inode()
after making their modifications. There is no need for
xfs_update_prealloc_flags() to do this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit 472c6e46f589c26057596dcba160712a5b3e02c5)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2022-07-20 16:01:29 +02:00
Dave Chinner b652afd937 xfs: get rid of xfs_dir_ialloc()
This is just a simple wrapper around the per-ag inode allocation
that doesn't need to exist. The internal mechanism to select and
allocate within an AG does not need to be exposed outside
xfs_ialloc.c, and it being exposed simply makes it harder to follow
the code and simplify it.

This is simplified by internalising xfs_dialloc_select_ag() and
xfs_dialloc_ag() into a single xfs_dialloc() function and then
xfs_dir_ialloc() can go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Christoph Hellwig e98d5e882b xfs: move the di_crtime field to struct xfs_inode
Move the crtime field from struct xfs_icdinode into struct xfs_inode and
remove the now entirely unused struct xfs_icdinode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:05 -07:00
Christoph Hellwig 3e09ab8fdc xfs: move the di_flags2 field to struct xfs_inode
In preparation of removing the historic icinode struct, move the flags2
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:05 -07:00
Christoph Hellwig db07349da2 xfs: move the di_flags field to struct xfs_inode
In preparation of removing the historic icinode struct, move the flags
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:05 -07:00
Christoph Hellwig 7821ea302d xfs: move the di_forkoff field to struct xfs_inode
In preparation of removing the historic icinode struct, move the
forkoff field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:05 -07:00
Christoph Hellwig ee7b83fd36 xfs: use a union for i_cowextsize and i_flushiter
The i_cowextsize field is only used for v3 inodes, and the i_flushiter
field is only used for v1/v2 inodes.  Use a union to pack the inode a
little better after adding a few missing guards around their usage.
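
A small sketch of the packing idea: two fields that are never valid at the same time share storage, and every access is guarded by the inode version. The structure and helper names are hypothetical and the field widths are assumptions; the real struct xfs_inode has many more members.

```c
#include <assert.h>
#include <stdint.h>

struct inode_sketch {
	uint8_t version;  /* 1, 2 or 3 */
	union {
		uint32_t cowextsize; /* meaningful for v3 inodes only */
		uint16_t flushiter;  /* meaningful for v1/v2 inodes only */
	};
};

static inline int has_v3_fields(const struct inode_sketch *ip)
{
	return ip->version >= 3;
}

/* Guarded read: only v3 inodes carry a CoW extent size hint. */
static inline uint32_t get_cowextsize(const struct inode_sketch *ip)
{
	assert(has_v3_fields(ip));
	return ip->cowextsize;
}

/* Guarded update: only v1/v2 inodes track a flush iteration counter. */
static inline void bump_flushiter(struct inode_sketch *ip)
{
	if (!has_v3_fields(ip))
		ip->flushiter++;
}
```

Without the guards, bumping the flush counter on a v3 inode would silently corrupt its CoW extent size hint, which is why the guards have to be in place before the fields are merged.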

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:05 -07:00
Christoph Hellwig 965e0a1ad2 xfs: move the di_flushiter field to struct xfs_inode
In preparation of removing the historic icinode struct, move the
flushiter field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:04 -07:00
Christoph Hellwig b33ce57d3e xfs: move the di_cowextsize field to struct xfs_inode
In preparation of removing the historic icinode struct, move the
cowextsize field into the containing xfs_inode structure.  Also
switch to use the xfs_extlen_t instead of a uint32_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:04 -07:00
Christoph Hellwig 031474c28a xfs: move the di_extsize field to struct xfs_inode
In preparation of removing the historic icinode struct, move the extsize
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:04 -07:00
Christoph Hellwig 6e73a545f9 xfs: move the di_nblocks field to struct xfs_inode
In preparation of removing the historic icinode struct, move the nblocks
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Christoph Hellwig 13d2c10b05 xfs: move the di_size field to struct xfs_inode
In preparation of removing the historic icinode struct, move the on-disk
size field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Christoph Hellwig ceaf603c70 xfs: move the di_projid field to struct xfs_inode
In preparation of removing the historic icinode struct, move the projid
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Christoph Hellwig 9b3beb028f xfs: remove the di_dmevmask and di_dmstate fields from struct xfs_icdinode
The legacy DMAPI fields were never set by upstream Linux XFS, and have no
way to be read using the kernel APIs.  So instead of bloating the in-core
inode for them just copy them from the on-disk inode into the log when
logging the inode.  The only caveat is that we need to make sure to zero
the fields for newly read or deleted inodes, which is solved using a new
flag in the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-07 14:37:03 -07:00
Dave Chinner e6a688c332 xfs: initialise attr fork on inode create
When we allocate a new inode, we often need to add an attribute to
the inode as part of the create. This can happen as a result of
needing to add default ACLs or security labels before the inode is
made visible to userspace.

This is highly inefficient right now. We do the create transaction
to allocate the inode, then we do an "add attr fork" transaction to
modify the just created empty inode to set the inode fork offset to
allow attributes to be stored, then we go and do the attribute
creation.

This means 3 transactions instead of 1 to allocate an inode, and
this greatly increases the load on the CIL commit code, resulting in
excessive contention on the CIL spin locks and performance
degradation:

 18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
  3.57%  [kernel]                [k] do_raw_spin_lock
  2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
  2.48%  [kernel]                [k] memcpy
  2.34%  [kernel]                [k] xfs_log_commit_cil

The typical profile resulting from running fsmark on a selinux enabled
filesystem adds this overhead to the create path:

  - 15.30% xfs_init_security
     - 15.23% security_inode_init_security
	- 13.05% xfs_initxattrs
	   - 12.94% xfs_attr_set
	      - 6.75% xfs_bmap_add_attrfork
		 - 5.51% xfs_trans_commit
		    - 5.48% __xfs_trans_commit
		       - 5.35% xfs_log_commit_cil
			  - 3.86% _raw_spin_lock
			     - do_raw_spin_lock
				  __pv_queued_spin_lock_slowpath
		 - 0.70% xfs_trans_alloc
		      0.52% xfs_trans_reserve
	      - 5.41% xfs_attr_set_args
		 - 5.39% xfs_attr_set_shortform.constprop.0
		    - 4.46% xfs_trans_commit
		       - 4.46% __xfs_trans_commit
			  - 4.33% xfs_log_commit_cil
			     - 2.74% _raw_spin_lock
				- do_raw_spin_lock
				     __pv_queued_spin_lock_slowpath
			       0.60% xfs_inode_item_format
		      0.90% xfs_attr_try_sf_addname
	- 1.99% selinux_inode_init_security
	   - 1.02% security_sid_to_context_force
	      - 1.00% security_sid_to_context_core
		 - 0.92% sidtab_entry_to_string
		    - 0.90% sidtab_sid2str_get
			 0.59% sidtab_sid2str_put.part.0
	   - 0.82% selinux_determine_inode_label
	      - 0.77% security_transition_sid
		   0.70% security_compute_sid.part.0

And fsmark creation rate performance drops by ~25%. The key point to
note here is that half the additional overhead comes from adding the
attribute fork to the newly created inode. That's crazy, considering
we can do this same thing at inode create time with a couple of
lines of code and no extra overhead.

So, if we know we are going to add an attribute immediately after
creating the inode, let's just initialise the attribute fork inside
the create transaction and chop that whole chunk of code out of
the create fast path. This completely removes the performance
drop caused by enabling SELinux, and the profile looks like:

     - 8.99% xfs_init_security
         - 9.00% security_inode_init_security
            - 6.43% xfs_initxattrs
               - 6.37% xfs_attr_set
                  - 5.45% xfs_attr_set_args
                     - 5.42% xfs_attr_set_shortform.constprop.0
                        - 4.51% xfs_trans_commit
                           - 4.54% __xfs_trans_commit
                              - 4.59% xfs_log_commit_cil
                                 - 2.67% _raw_spin_lock
                                    - 3.28% do_raw_spin_lock
                                         3.08% __pv_queued_spin_lock_slowpath
                                   0.66% xfs_inode_item_format
                        - 0.90% xfs_attr_try_sf_addname
                  - 0.60% xfs_trans_alloc
            - 2.35% selinux_inode_init_security
               - 1.25% security_sid_to_context_force
                  - 1.21% security_sid_to_context_core
                     - 1.19% sidtab_entry_to_string
                        - 1.20% sidtab_sid2str_get
                           - 0.86% sidtab_sid2str_put.part.0
                              - 0.62% _raw_spin_lock_irqsave
                                 - 0.77% do_raw_spin_lock
                                      __pv_queued_spin_lock_slowpath
               - 0.84% selinux_determine_inode_label
                  - 0.83% security_transition_sid
                       0.86% security_compute_sid.part.0

Which indicates the XFS overhead of creating the selinux xattr has
been halved. This doesn't fix the CIL lock contention problem, just
means it's not a limiting factor for this workload. Lock contention
in the security subsystems is going to be an issue soon, though...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix compilation error when CONFIG_SECURITY=n]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-25 16:47:51 -07:00
Darrick J. Wong 383e32b0d0 xfs: prevent metadata files from being inactivated
Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic.  This means that they should
never be subjected to automatic post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.

In other words, xfs_inactive() should leave these files alone.  Add the
necessary predicate functions to make this happen.  This adds a second
layer of prevention for the kind of fs corruption that was fixed by
commit f4c32e87de.  If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.

Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h.

Followup-to: f4c32e87de ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-03-25 16:47:50 -07:00
Christoph Hellwig f736d93d76 xfs: support idmapped mounts
Enable idmapped mounts for xfs. This basically just means passing down
the user_namespace argument from the VFS methods down to where it is
passed to the relevant helpers.

Note that full-filesystem bulkstat is not supported from inside idmapped
mounts as it is an administrative operation that acts on the whole file
system. The limitation is not applied to the bulkstat single operation
that just operates on a single inode.

Link: https://lore.kernel.org/r/20210121131959.646623-40-christian.brauner@ubuntu.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Dave Chinner 1abcf26101 xfs: move on-disk inode allocation out of xfs_ialloc()
xfs_ialloc() will then only address in-core inode allocation.
Also, rename xfs_ialloc() to xfs_dir_ialloc_init() in order to
keep everything in xfs_inode.c under the same namespace.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:24 -08:00
Darrick J. Wong f93e5436f0 xfs: widen ondisk inode timestamps to deal with y2038+
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit
counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the
32-bit unix time epoch).  This enables us to handle dates up to 2486,
which solves the y2038 problem.
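
The encoding can be worked through in a few lines of C. This is a sketch of the arithmetic only, with hypothetical helper names: the epoch offset of 2^31 seconds follows from the "minimum time in the 32-bit unix time epoch" wording above, and the maximum falls out of dividing the largest 64-bit value by one billion.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC       1000000000ULL
/* Seconds between the Dec 1901 epoch and the Unix epoch: 2^31. */
#define BIGTIME_EPOCH_OFF  (2147483648LL)

/* Unix (seconds, nanoseconds) -> unsigned nanosecond counter since Dec 1901. */
static inline uint64_t unix_to_bigtime(int64_t sec, uint32_t nsec)
{
	assert(sec >= -BIGTIME_EPOCH_OFF);
	return (uint64_t)(sec + BIGTIME_EPOCH_OFF) * NSEC_PER_SEC + nsec;
}

/* Unsigned nanosecond counter -> Unix (seconds, nanoseconds). */
static inline void bigtime_to_unix(uint64_t t, int64_t *sec, uint32_t *nsec)
{
	*sec = (int64_t)(t / NSEC_PER_SEC) - BIGTIME_EPOCH_OFF;
	*nsec = (uint32_t)(t % NSEC_PER_SEC);
}
```

The largest counter value decodes to Unix second 18446744073 - 2147483648 = 16299260425, which is roughly 516.5 years after 1970 and therefore lands in 2486, matching the range stated above.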

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-15 20:52:41 -07:00
Dave Chinner 718ecc5035 xfs: xfs_iflock is no longer a completion
With the recent rework of the inode cluster flushing, we no longer
ever wait on the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.

Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() calls to test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls with clear operations.
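
A userspace sketch of the resulting pattern, using C11 atomics in place of the kernel's own primitives. The names and the bit position are hypothetical; the point is only that a non-blocking test-and-set plus a clear is enough once nobody waits on the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IFLUSHING (1u << 0)  /* hypothetical bit position */

struct inode_state { atomic_uint flags; };

/* Test-and-set: returns true only for the caller that set the flag. */
static inline bool iflushing_tryset(struct inode_state *ip)
{
	return !(atomic_fetch_or(&ip->flags, IFLUSHING) & IFLUSHING);
}

/* Clear the flag when the flush completes; there are no waiters to wake. */
static inline void iflushing_clear(struct inode_state *ip)
{
	atomic_fetch_and(&ip->flags, ~IFLUSHING);
}

/* Plain state check, replacing any "is it locked" query. */
static inline bool is_flushing(struct inode_state *ip)
{
	return (atomic_load(&ip->flags) & IFLUSHING) != 0;
}
```

A second `iflushing_tryset()` on an inode that is already flushing simply returns false, so the caller skips it instead of blocking.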

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-06 18:05:51 -07:00