Commit Graph

235 Commits

Author SHA1 Message Date
Jeff Moyer 5331d880be Add do_ftruncate that truncates a struct file
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: Slight context difference due to out-of-order backport of
  commit abf08576afe3 ("fs: port vfs_*() helpers to struct mnt_idmap")
commit 5f0d594c602f870e3a3872f7ea42bf846a1d26cf
Author: Tony Solomonik <tony.solomonik@gmail.com>
Date:   Fri Feb 2 14:17:23 2024 +0200

    Add do_ftruncate that truncates a struct file
    
    do_sys_ftruncate receives a file descriptor, fget()s the struct file, and
    finally truncates the file.
    
    do_ftruncate allows for passing in a file directly, with the caller
    already holding a reference to it.
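
The split described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: `toy_file`, `toy_fdtable`, `toy_fget`, `toy_fput`, `toy_do_ftruncate`, and `toy_do_sys_ftruncate` are hypothetical names invented for the example. It shows a descriptor-based entry point that resolves the file and takes a reference, then delegates to a file-based helper that a caller already holding a reference could use directly.

```c
#include <stddef.h>

#define TOY_MAX_FDS 4

/* Hypothetical stand-in for struct file: a size and a reference count. */
struct toy_file {
    long size;
    int refcount;
};

/* Hypothetical descriptor table mapping small integers to files. */
struct toy_fdtable {
    struct toy_file *files[TOY_MAX_FDS];
};

/* Look up fd and take a reference; returns NULL for a bad descriptor. */
static struct toy_file *toy_fget(struct toy_fdtable *t, int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || t->files[fd] == NULL)
        return NULL;
    t->files[fd]->refcount++;
    return t->files[fd];
}

/* Drop a reference taken by toy_fget(). */
static void toy_fput(struct toy_file *f)
{
    f->refcount--;
}

/* File-based helper: the caller already holds a reference to f. */
static int toy_do_ftruncate(struct toy_file *f, long length)
{
    if (length < 0)
        return -1;
    f->size = length;
    return 0;
}

/* Descriptor-based wrapper: resolve the fd, delegate, drop the reference. */
static int toy_do_sys_ftruncate(struct toy_fdtable *t, int fd, long length)
{
    struct toy_file *f = toy_fget(t, fd);
    int ret;

    if (f == NULL)
        return -1;
    ret = toy_do_ftruncate(f, length);
    toy_fput(f);
    return ret;
}
```

In this model a caller that already owns a `struct toy_file *` skips the lookup and calls `toy_do_ftruncate()` directly, which is the use case the commit message describes.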
    
    Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:34:44 -05:00
Ian Kent 9cd7658c81 nfs: use vfs setgid helper
JIRA: https://issues.redhat.com/browse/RHEL-33888
Upstream status: Linus

commit 4f704d9a8352f5c0a8fcdb6213b934630342bd44
Author: Christian Brauner <brauner@kernel.org>
Date:   Tue Mar 14 12:51:10 2023 +0100

    nfs: use vfs setgid helper

    We've aligned setgid behavior over multiple kernel releases. The details
    can be found in the following two merge messages:
    cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2'")
    426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0'")
    Consistent setgid stripping behavior is now encapsulated in the
    setattr_should_drop_sgid() helper which is used by all filesystems that
    strip setgid bits outside of vfs proper. Switch nfs to rely on this
    helper as well. Without this patch the setgid stripping tests in
    xfstests will fail.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 14:20:01 +08:00
Ian Kent 69f3621dc7 fs: move mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 3707d84c13670bf09b4a9a4dc6733326d8344b31
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:33 2023 +0100

    fs: move mnt_idmap

    Now that we converted everything to just rely on struct mnt_idmap move it all
    into a separate file. This ensures that no code can poke around in struct
    mnt_idmap without any dedicated helpers and makes it easier to extend it in the
    future. Filesystems will now not be able to conflate mount and filesystem
    idmappings as they are two distinct types and require distinct helpers that
    cannot be used interchangeably. We are now also able to extend struct mnt_idmap
    as we see fit.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 14:19:56 +08:00
Ian Kent edf17476c7 fs: port privilege checking helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.

commit 9452e93e6dae862d7aeff2b11236d79bde6f9b66
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:27 2023 +0100

    fs: port privilege checking helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:31 +08:00
Ian Kent 304ec491ee fs: port ->permission() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream commit 48fa94aacd ("ceph: fscrypt_auth handling
	for ceph") is present which causes fuzz 2 in hunk #1 in
	fs/ceph/super.h.
	Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for
	unreachable debug statements") is not present causing fuzz 2
	in hunk #1 against fs/exportfs/expfs.c.
	Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq
	into nameidata") is not present causing a fuzz 1 for hunk #14
	against fs/namei.c.
	CentOS Stream c4f3dd0731 ("nfsd: handle failure to collect
	pre/post-op attrs more sanely") is present and causes rejects
	for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get()
	from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes
	hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c.
	CentOS Stream commit 55177e4b83 ("ovl: mark xwhiteouts directory
	with overlay.opaque='x'") and commit d17b324bb6 ("ovl: use
	ovl_numlower() and ovl_lowerstack() accessors") change the first
	and third hunks of fs/overlayfs/namei.c causing them to fail,
	manually apply.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") causes fuzz 2 in hunk #5 of
	fs/overlayfs/overlayfs.h
	CentOS Stream commit 355a9c490a ("ovl: Add an alternative
	type of whiteout") changes ovl_cache_update_ino() to
	ovl_cache_update() in fs/overlayfs/readdir.c, make the change
	manually.
	Upstream commit 217af7e2f4deb ("apparmor: refactor profile
	rules and attachments") is not in CentOS Stream causing hunk #1
	to fail to apply so manually apply the change.

commit 4609e1f18e19c3b302e1eb4858334bca1532f780
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:22 2023 +0100

    fs: port ->permission() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:20 +08:00
Ian Kent a050a48e12 may_linkat(): constify path
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 8996682b10ff4de4f6f36fc81211f0a1c0437495
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu Aug 4 12:53:46 2022 -0400

    may_linkat(): constify path

    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:19 +08:00
Ian Kent d584d976a2 acl: conver higher-level helpers to rely on mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 5a6f52d20ce3cd6d30103a27f18edff337da191b
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Oct 28 09:56:20 2022 +0200

    acl: conver higher-level helpers to rely on mnt_idmap

    Convert an initial portion to rely on struct mnt_idmap by converting the
    high level xattr helpers.

    Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:49 +08:00
Ian Kent dc1f3bea48 attr: use consistent sgid stripping checks
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:37 2022 +0200

    attr: use consistent sgid stripping checks

    Currently setgid stripping in file_remove_privs()'s should_remove_suid()
    helper is inconsistent with other parts of the vfs. Specifically, it only
    raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
    inode isn't in the caller's groups and the caller isn't privileged over the
    inode although we require this already in setattr_prepare() and
    setattr_copy() and so all filesystems implement this requirement implicitly
    because they have to use setattr_{prepare,copy}() anyway.

    But the inconsistency shows up in setgid stripping bugs for overlayfs in
    xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
    generic/687). For example, we test whether suid and setgid stripping works
    correctly when performing various write-like operations as an unprivileged
    user (fallocate, reflink, write, etc.):

    echo "Test 1 - qa_user, non-exec file $verb"
    setup_testfile
    chmod a+rws $junk_file
    commit_and_check "$qa_user" "$verb" 64k 64k

    The test basically creates a file with 6666 permissions. While the file has
    the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
    regular filesystem like xfs what will happen is:

    sys_fallocate()
    -> vfs_fallocate()
       -> xfs_file_fallocate()
          -> file_modified()
             -> __file_remove_privs()
                -> dentry_needs_remove_privs()
                   -> should_remove_suid()
                -> __remove_privs()
                   newattrs.ia_valid = ATTR_FORCE | kill;
                   -> notify_change()
                      -> setattr_copy()

    In should_remove_suid() we can see that ATTR_KILL_SUID is raised
    unconditionally because the file in the test has S_ISUID set.

    But we also see that ATTR_KILL_SGID won't be set because while the file
    is S_ISGID it is not S_IXGRP (see above) which is a condition for
    ATTR_KILL_SGID being raised.

    So by the time we call notify_change() we have attr->ia_valid set to
    ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
    ATTR_KILL_SUID is set and does:

    ia_valid = attr->ia_valid |= ATTR_MODE;
    attr->ia_mode = (inode->i_mode & ~S_ISUID);

    which means that when we call setattr_copy() later we will definitely
    update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

    Now we call into the filesystem's ->setattr() inode operation which will
    end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

    if (ia_valid & ATTR_MODE) {
            umode_t mode = attr->ia_mode;
            vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
            if (!vfsgid_in_group_p(vfsgid) &&
                !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                    mode &= ~S_ISGID;
            inode->i_mode = mode;
    }

    and since the caller in the test is neither capable nor in the group of the
    inode the S_ISGID bit is stripped.

    But assume the file isn't suid. Then ATTR_KILL_SUID won't be raised, which
    has the consequence that neither the setgid nor the suid bit is stripped,
    even though setgid should be stripped because the inode isn't in the
    caller's groups and the caller isn't privileged over the inode.

    If overlayfs is in the mix things become a bit more complicated and the bug
    shows up more clearly. When e.g., ovl_setattr() is hit from
    ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
    ATTR_KILL_SGID might be raised but because the check in notify_change() is
    questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
    stripped the S_ISGID bit isn't removed even though it should be stripped:

    sys_fallocate()
    -> vfs_fallocate()
       -> ovl_fallocate()
          -> file_remove_privs()
             -> dentry_needs_remove_privs()
                -> should_remove_suid()
             -> __remove_privs()
                newattrs.ia_valid = ATTR_FORCE | kill;
                -> notify_change()
                   -> ovl_setattr()
                      // TAKE ON MOUNTER'S CREDS
                      -> ovl_do_notify_change()
                         -> notify_change()
                      // GIVE UP MOUNTER'S CREDS
         // TAKE ON MOUNTER'S CREDS
         -> vfs_fallocate()
            -> xfs_file_fallocate()
               -> file_modified()
                  -> __file_remove_privs()
                     -> dentry_needs_remove_privs()
                        -> should_remove_suid()
                     -> __remove_privs()
                        newattrs.ia_valid = ATTR_FORCE | kill;
                        -> notify_change()

    The fix for all of this is to make file_remove_privs()'s
    should_remove_suid() helper perform the same checks as we already
    require in setattr_prepare() and setattr_copy() and have notify_change()
    stop pointlessly requiring S_IXGRP again. It doesn't make any sense in the
    first place because the caller must calculate the flags via
    should_remove_suid() anyway which would raise ATTR_KILL_SGID.

    While we're at it we move should_remove_suid() from inode.c to attr.c
    where it belongs with the rest of the iattr helpers. Especially since it
    returns ATTR_KILL_S{G,U}ID flags. We also rename it to
    setattr_should_drop_suidgid() to better reflect that it indicates both
    setuid and setgid bit removal and also that it returns attr flags.

    Running xfstests with this doesn't report any regressions. We should really
    try and use consistent checks.
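
The difference between the old and the new check can be sketched as a small user-space model. This is a simplified illustration based only on the description above, not the kernel functions: the `MODEL_KILL_*` values and both function names are hypothetical, the caller's group membership and privilege are collapsed into a single boolean, and any further capability or file-type checks the kernel applies are left out.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Illustrative flag values for this model only (not the kernel's). */
#define MODEL_KILL_SUID 0x1
#define MODEL_KILL_SGID 0x2

/* Old rule as described above: setgid is dropped only when S_IXGRP is set. */
static int model_old_should_remove_suid(mode_t mode)
{
    int kill = 0;

    if (mode & S_ISUID)
        kill |= MODEL_KILL_SUID;
    if ((mode & S_ISGID) && (mode & S_IXGRP))
        kill |= MODEL_KILL_SGID;
    return kill;
}

/*
 * New rule: setgid is also dropped when S_IXGRP is clear and the caller is
 * neither in the inode's group nor privileged over the inode.
 */
static int model_new_should_drop_suidgid(mode_t mode, bool in_group_or_capable)
{
    int kill = 0;

    if (mode & S_ISUID)
        kill |= MODEL_KILL_SUID;
    if (mode & S_ISGID) {
        if ((mode & S_IXGRP) || !in_group_or_capable)
            kill |= MODEL_KILL_SGID;
    }
    return kill;
}
```

For the mode 6666 test file and an unprivileged caller outside the inode's group, the old model returns only `MODEL_KILL_SUID`, while the new model returns both flags. For mode 2666 the old model returns nothing, which corresponds to the case where the setgid bit was left in place.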

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:34 +08:00
Ian Kent 94eb87da65 attr: add setattr_should_drop_sgid()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 72ae017c5451860443a16fb2a8c243bff3e396b8
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:36 2022 +0200

    attr: add setattr_should_drop_sgid()

    The current setgid stripping logic during write and ownership change
    operations is inconsistent and strewn over multiple places. In order to
    consolidate it and make it more consistent we'll add a new helper
    setattr_should_drop_sgid(). The function retains the old behavior where
    we remove the S_ISGID bit unconditionally when S_IXGRP is set but also
    when it isn't set and the caller is neither in the group of the inode
    nor privileged over the inode.

    We will use this helper both in write operation permission removal such
    as file_remove_privs() as well as in ownership change operations.

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:26 +08:00
Ian Kent d8268a324b attr: add in_group_or_capable()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: CentOS Stream has commit 42bfe37a25 ("fs: add ctime
	accessors infrastructure") so adjust the context to match.

commit 11c2a8700cdcabf9b639b7204a1e38e2a0b6798e
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:34 2022 +0200

    attr: add in_group_or_capable()

    In setattr_{copy,prepare}() we need to perform the same permission
    checks to determine whether we need to drop the setgid bit or not.
    Instead of open-coding it twice add a simple helper that encapsulates the
    logic. We will reuse this helper to make dropping the setgid bit during
    write operations more consistent in a follow up patch.
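
A minimal user-space sketch of such a helper follows. The names `model_caller`, `model_in_group`, `model_in_group_or_capable`, and `model_strip_sgid_if_needed` are hypothetical, and the `cap_fsetid` field merely stands in for the kernel's privilege check over the inode. The point is that the group-or-privileged test lives in one place and each call site reuses it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical description of the calling task for this model. */
struct model_caller {
    const gid_t *groups;   /* groups the caller belongs to */
    size_t ngroups;
    bool cap_fsetid;       /* stands in for being privileged over the inode */
};

/* Return true if gid is one of the caller's groups. */
static bool model_in_group(const struct model_caller *c, gid_t gid)
{
    for (size_t i = 0; i < c->ngroups; i++) {
        if (c->groups[i] == gid)
            return true;
    }
    return false;
}

/* The shared check: group member or privileged. */
static bool model_in_group_or_capable(const struct model_caller *c,
                                      gid_t inode_gid)
{
    return model_in_group(c, inode_gid) || c->cap_fsetid;
}

/* Example call site: clear setgid from a new mode when the check fails. */
static mode_t model_strip_sgid_if_needed(const struct model_caller *c,
                                         gid_t inode_gid, mode_t mode)
{
    if (!model_in_group_or_capable(c, inode_gid))
        mode &= ~(mode_t)S_ISGID;
    return mode;
}
```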

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:24 +08:00
Ian Kent 8c7e81cebd xattr: use posix acl api
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 318e66856ddec05384f32d60b5598128289f4e7b
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Sep 22 17:17:22 2022 +0200

    xattr: use posix acl api

    In previous patches we built a new posix api solely around get and set
    inode operations. Now that we have all the pieces in place we can switch
    the system calls and the vfs over to only rely on this api when
    interacting with posix acls. This finally removes all type unsafety and
    type conversion issues explained in detail in [1] that we aim to get rid
    of.

    With the new posix acl api we immediately translate into an appropriate
    kernel internal struct posix_acl format both when getting and setting
    posix acls. This is a stark contrast to before, where we hacked unsafe raw
    values into the uapi struct that was stored in a void pointer, relying
    on and having filesystems and security modules hack around in the uapi
    struct as well.

    Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:09 +08:00
Ian Kent 1567cdbcd9 internal: add may_write_xattr()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 56851bc9b9f072dd738f25ed29c0d5abe9f2908b
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Sep 29 10:47:36 2022 +0200

    internal: add may_write_xattr()

    Split out the generic checks whether an inode allows writing xattrs. Since
    security.* and system.* xattrs don't have any restrictions and we're going
    to split out posix acls into a dedicated api we will use this helper to
    check whether we can write posix acls.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:11:53 +08:00
Jeff Moyer 6625cdcc16 file: remove pointless wrapper
JIRA: https://issues.redhat.com/browse/RHEL-27755
Conflicts: RHEL is missing commit ed192c59f869 ("file: mostly
eliminate spurious relocking in __range_close"), which causes some
context differences.

commit 24fa3ae9467f49dd9698fd884f2c6b13cc8ea12d
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Nov 30 13:49:08 2023 +0100

    file: remove pointless wrapper
    
    Only io_uring uses __close_fd_get_file(). All it does is hide
    current->files but io_uring accesses files_struct directly right now
    anyway so it's a bit pointless. Just rename pick_file() to
    file_close_fd_locked() and let io_uring use it. Add a lockdep assert in
    there that we expect the caller to hold file_lock while we're at it.
    
    Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-2-e73ca6f4ea83@kernel.org
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 10:12:34 -04:00
Ming Lei 5c2c363ba8 fs: move sb_init_dio_done_wq out of direct-io.c
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 439bc39b3cf0014b1b75075812f7ef0f8baa9674
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Jan 25 07:58:38 2023 +0100

    fs: move sb_init_dio_done_wq out of direct-io.c

    sb_init_dio_done_wq is also used by the iomap code, so move it to
    super.c in preparation for building direct-io.c conditionally.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Eric Biggers <ebiggers@google.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230125065839.191256-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 10:06:22 +08:00
Ming Lei e749b02780 fs: remove emergency_thaw_bdev
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 4a8b719f95c0dcd15fb7a04b806ad8139fa7c850
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Aug 1 19:21:56 2023 +0200

    fs: remove emergency_thaw_bdev

    Fold emergency_thaw_bdev into its only caller, to prepare for buffer.c
    to be built only when buffer_head support is enabled.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20230801172201.1923299-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:52:44 +08:00
Ming Lei 46e90a4fe5 super: make locking naming consistent
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit d8ce82efdece373b570f35acc8a29487b2087b84
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Aug 18 16:00:49 2023 +0200

    super: make locking naming consistent

    Make the naming consistent with the earlier introduced
    super_lock_{read,write}() helpers.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Message-Id: <20230818-vfs-super-fixes-v3-v3-2-9f0b1876e46b@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:46:43 +08:00
Ming Lei c3745bf638 fs: simplify invalidate_inodes
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit e127b9bccdb04e5fc4444431de37309a68aedafa
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Aug 11 12:08:28 2023 +0200

    fs: simplify invalidate_inodes

    kill_dirty has been true for a long time, so hard code it and
    remove the unused return value.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Message-Id: <20230811100828.1897174-18-hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:46:43 +08:00
Ming Lei 58e6b9cfbf dentry.h: trim externs
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 0d486510f86eb8162022ed61e6dc424a10909a10
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Nov 10 15:22:40 2023 -0500

    dentry.h: trim externs

    d_instantiate_unique() had been gone for 7 years; __d_lookup...()
    and shrink_dcache_for_umount() are fs/internal.h fodder.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:46:41 +08:00
Chris von Recklinghausen a62c96b4b2 don't use __kernel_write() on kmap_local_page()
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 06bbaa6dc53cb72040db952053432541acb9adc7
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Sep 26 11:59:14 2022 -0400

    [coredump] don't use __kernel_write() on kmap_local_page()

    passing kmap_local_page() result to __kernel_write() is unsafe -
    random ->write_iter() might (and 9p one does) get unhappy when
    passed ITER_KVEC with pointer that came from kmap_local_page().

    Fix by providing a variant of __kernel_write() that takes an iov_iter
    from caller (__kernel_write() becomes a trivial wrapper) and adding
    dump_emit_page() that parallels dump_emit(), except that instead of
    __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source.

    Fixes: 3159ed5779 "fs/coredump: use kmap_local_page()"
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:08 -04:00
Jeff Moyer 8a92fcb818 fs: export rw_verify_area()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 871129332d74c9e94bd110932ac4445833995639
Author: Omar Sandoval <osandov@fb.com>
Date:   Wed Sep 4 12:13:25 2019 -0700

    fs: export rw_verify_area()
    
    I'm adding btrfs ioctls to read and write compressed data, and rather
    than duplicating the checks in rw_verify_area(), let's just export it.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 04:47:02 -04:00
Jan Stancek f302196b1b Merge: io_uring: update to v5.19
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2190

Sync up our io_uring code with upstream v5.19, but do not enable it.  The goal is to be bug-for-bug compatible with this version of the code.  I'll post further MRs that will sync to later releases, and then a final MR with remaining fixes.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490

Omitted-fix: df6d3422d3ee ("io_uring/kbuf: fix not advancing READV kbuf ring")
	Fixes will be pulled in by later merge requests
Omitted-fix: 9d94c04c0db0 ("io_uring/filetable: fix file reference underflow")
	Fixes will be pulled in by later merge requests
Omitted-fix: 48ba08374e77 ("io_uring: fix size calculation when registering buf ring")
	Fixes will be pulled in by later merge requests
Omitted-fix: 36632d062975 ("io_uring: Replace 0-length array with flexible array")
	Fixes will be pulled in by later merge requests
Omitted-fix: 336d28a8f380 ("io_uring: recycle kbuf recycle on tw requeue")
	Fixes will be pulled in by later merge requests
Omitted-fix: 91482864768a ("io_uring: fix multishot accept request leaks")
	Fixes will be pulled in by later merge requests
Omitted-fix: dd9373402280 ("Smack: Provide read control for io_uring_cmd")
	Fixes will be pulled in by later merge requests
Omitted-fix: f4d653dcaa4e ("selinux: implement the security_uring_cmd() LSM hook")
	Fixes will be pulled in by later merge requests
Omitted-fix: 2a5840124009 ("lsm,io_uring: add LSM hooks for the new uring_cmd file op")
	Fixes will be pulled in by later merge requests
Omitted-fix: 3b8fdd1dc35e ("io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128")
	Fixes will be pulled in by later merge requests
Omitted-fix: 00927931cb63 ("io_uring: fix fdinfo sqe offsets calculation")
	Fixes will be pulled in by later merge requests
Omitted-fix: 9d2789ac9d60 ("block/io_uring: pass in issue_flags for uring_cmd task_work handling")
	Fixes will be pulled in by later merge requests
Omitted-fix: 02a4d923e440 ("io_uring/rsrc: fix null-ptr-deref in io_file_bitmap_get()")
	Fixes will be pulled in by later merge requests

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-13 07:46:41 +02:00
Chris von Recklinghausen 0fd18f7da9 fs/buffer: Convert __block_write_begin_int() to take a folio
Bugzilla: https://bugzilla.redhat.com/2160210

commit d1bd0b4ebfe0521964e6937195bd2f76866660c7
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Nov 3 14:05:47 2021 -0400

    fs/buffer: Convert __block_write_begin_int() to take a folio

    There are no plans to convert buffer_head infrastructure to use large
    folios, but __block_write_begin_int() is called from iomap, and it's
    more convenient and less error-prone if we pass in a folio from iomap.
    It also has a nice saving of almost 200 bytes of code from removing
    repeated calls to compound_head().

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:43 -04:00
Jeff Moyer 38f8af8bb4 Unify the primitives for file descriptor closing
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490

commit 6319194ec57b0452dcda4589d24c4e7db299c5bf
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Thu May 12 17:08:03 2022 -0400

    Unify the primitives for file descriptor closing
    
    Currently we have 3 primitives for removing an opened file from descriptor
    table - pick_file(), __close_fd_get_file() and close_fd_get_file().  Their
    calling conventions are rather odd and there's a code duplication for no
    good reason.  They can be unified -
    
    1) have __range_close() cap max_fd in the very beginning; that way
    we don't need a separate way for pick_file() to report being past the end
    of descriptor table.
    
    2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
    than returning it via struct file ** argument.  Don't bother with
    (bogus) return value - nobody wants that -ENOENT.
    
    3) make pick_file() return NULL on unopened descriptor - the only caller
    that used to care about the distinction between descriptor past the end
    of descriptor table and finding NULL in descriptor table doesn't give
    a damn after (1).
    
    4) lift ->files_lock out of pick_file()
    
    That actually simplifies the callers, as well as the primitives themselves.
    Code duplication is also gone...
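
Points (1) to (3) can be sketched with a toy descriptor table in user space. This is a hedged illustration, not the kernel implementation: `model_file`, `model_fdtable`, `model_pick_file`, and `model_range_close` are hypothetical names, and the locking from point (4) is omitted entirely. The picker returns the file or NULL directly, and the range walker caps `max_fd` once up front, so no separate "past the end" result is needed.

```c
#include <stddef.h>

#define MODEL_MAX_FDS 4

/* Hypothetical stand-in for struct file. */
struct model_file {
    int id;
};

/* Hypothetical fixed-size descriptor table. */
struct model_fdtable {
    struct model_file *fd[MODEL_MAX_FDS];
};

/*
 * Remove and return the file installed at fd. Returns NULL both for an
 * unopened slot and for a descriptor outside the table; callers in this
 * model do not need to tell the two cases apart.
 */
static struct model_file *model_pick_file(struct model_fdtable *t,
                                          unsigned int fd)
{
    struct model_file *f;

    if (fd >= MODEL_MAX_FDS)
        return NULL;
    f = t->fd[fd];
    t->fd[fd] = NULL;
    return f;
}

/* Close every open descriptor in [fd, max_fd]; returns how many were open. */
static unsigned int model_range_close(struct model_fdtable *t,
                                      unsigned int fd, unsigned int max_fd)
{
    unsigned int closed = 0;

    /* Cap the upper bound at the very beginning. */
    if (max_fd >= MODEL_MAX_FDS)
        max_fd = MODEL_MAX_FDS - 1;

    for (; fd <= max_fd; fd++) {
        if (model_pick_file(t, fd) != NULL)
            closed++;
    }
    return closed;
}
```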
    
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-16 16:32:34 -04:00
Jeff Moyer 88e5e1f8d2 fs: split off do_getxattr from getxattr
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490

commit c975cad931570004b5f51248424a2a696fb65630
Author: Stefan Roesch <shr@fb.com>
Date:   Sun Apr 24 18:13:50 2022 -0600

    fs: split off do_getxattr from getxattr
    
    This splits off do_getxattr function from the getxattr function. This will
    allow io_uring to call it from its io worker.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20220323154420.3301504-3-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-16 08:25:43 -04:00
Jeff Moyer c38214410c fs: split off setxattr_copy and do_setxattr function from setxattr
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2123490
Conflicts: RHEL does not have 705191b03d50 ("fs: fix acl
translation"), which added an argument to
posix_acl_fix_xattr_from_user.  Keep the old calling convention.

commit 1a91794ce8481a293c5ef432feb440aee1455619
Author: Stefan Roesch <shr@fb.com>
Date:   Sun Apr 24 18:10:46 2022 -0600

    fs: split off setxattr_copy and do_setxattr function from setxattr
    
    This splits off the setup part of the function setxattr into its own
    dedicated function called setxattr_copy. In addition it also exposes a new
    function called do_setxattr for making the setxattr call.
    
    This makes it possible to call these two functions from io_uring in the
    processing of an xattr request.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20220323154420.3301504-2-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-03-16 08:24:43 -04:00
Jeff Moyer 7758b47b36 io-uring: Make statx API stable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2113073

commit 1b6fe6e0dfecf8c82a64fb87148ad9333fa2f62e
Author: Stefan Roesch <shr@fb.com>
Date:   Fri Feb 25 10:53:26 2022 -0800

    io-uring: Make statx API stable
    
    One of the key architectural tenets is to keep the parameters for
    io-uring stable. After the call has been submitted, their values can
    be changed. Unfortunately this is not the case for the current statx
    implementation.
    
    IO-Uring change:
    This change replaces the const char * filename pointer in the io_statx
    structure with a struct filename *. In addition it also creates the
    filename object during the prepare phase.
    
    With this change, the opcode also needs to invoke cleanup, so the
    filename object gets freed after processing the request.
    
    fs change:
    This replaces the const char* __user filename parameter in the two
    functions do_statx and vfs_statx with a struct filename *. In addition
    to be able to correctly construct a filename object a new helper
    function getname_statx_lookup_flags is introduced. The function makes
    sure that do_statx and vfs_statx are invoked with the correct lookup flags.
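
The idea of taking a private copy of the filename at prepare time and freeing it in a cleanup step can be sketched in user space as follows. All names here (`toy_statx_req`, `toy_statx_prep()`, `toy_statx_issue()`, `toy_statx_cleanup()`) are hypothetical; the real code uses struct filename and the io_uring prep/cleanup hooks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request object: it owns its copy of the path. */
struct toy_statx_req {
    char *filename;   /* private copy, made at prepare time */
};

/* Prepare: copy the caller's string so later changes cannot affect us. */
static int toy_statx_prep(struct toy_statx_req *req, const char *user_path)
{
    size_t len = strlen(user_path) + 1;

    req->filename = malloc(len);
    if (!req->filename)
        return -1;
    memcpy(req->filename, user_path, len);
    return 0;
}

/* Issue: only the private copy is consulted, never the caller's buffer. */
static const char *toy_statx_issue(const struct toy_statx_req *req)
{
    return req->filename;
}

/* Cleanup: the request now has something to free after processing. */
static void toy_statx_cleanup(struct toy_statx_req *req)
{
    free(req->filename);
    req->filename = NULL;
}
```

Once prepare has returned, the caller may overwrite or reuse its buffer; the request still sees the path it was submitted with.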
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-11-08 09:31:48 -05:00
Frantisek Hrbata 49fc72059c Merge: iomap update to v5.16
Merge conflicts:
-----------------
fs/iomap/buffered-io.c
        - to_iomap_page()
        HEAD(!1370) contains 4b86405d81 ("iomap: Convert to_iomap_page to take a folio")
        which is missing in !1417. Resolved in favor of HEAD(!1370)
fs/iomap/direct-io.c
        - iomap_dio_bio_iter()
        Keep changes from !1417, but remove definition of align variable, because this was removed
        in HEAD(!1407) by 73d48cec18 ("iomap: add support for dma aligned direct-io")

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1417

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130933

Update iomap code to upstream v5.16

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>

Approved-by: Andreas Gruenbacher <agruenba@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-08 02:23:49 -05:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Chris von Recklinghausen 3a3ee2bede vfs: keep inodes with page cache off the inode shrinker LRU
Conflicts:
	mm/filemap.c - We already have
		452e9e6992fe ("filemap: Add filemap_remove_folio and __filemap_remove_folio")
		so just add the spin_lock call.
	mm/truncate.c - The backport of
		51dcbdac28d4 ("mm: Convert find_lock_entries() to use a folio_batch")
		listed the lack of this patch as a conflict. Keep the
		'fbatch->nr = j;' line.
	mm/vmscan.c - We already have
		be7c07d60e13 ("mm/vmscan: Convert __remove_mapping() to take a folio")
		so change a couple of lines from 'if (!PageSwapCache(page))'
		to 'if (!folio_test_swapcache(folio))'

Bugzilla: https://bugzilla.redhat.com/2120352

commit 51b8c1fe250d1bd70c1722dc3c414f5cff2d7cca
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Mon Nov 8 18:31:24 2021 -0800

    vfs: keep inodes with page cache off the inode shrinker LRU

    Historically (pre-2.5), the inode shrinker used to reclaim only empty
    inodes and skip over those that still contained page cache.  This caused
    problems on highmem hosts: struct inode could fill lowmem zones
    before the cache was getting reclaimed in the highmem zones.

    To address this, the inode shrinker started to strip page cache to
    facilitate reclaiming lowmem.  However, this comes with its own set of
    problems: the shrinkers may drop actively used page cache just because
    the inodes are not currently open or dirty - think working with a large
    git tree.  It further doesn't respect cgroup memory protection settings
    and can cause priority inversions between containers.

    Nowadays, the page cache also holds non-resident info for evicted cache
    pages in order to detect refaults.  We've come to rely heavily on this
    data inside reclaim for protecting the cache workingset and driving swap
    behavior.  We also use it to quantify and report workload health through
    psi.  The latter in turn is used for fleet health monitoring, as well as
    driving automated memory sizing of workloads and containers, proactive
    reclaim and memory offloading schemes.

    The consequence of dropping page cache prematurely is that we're seeing
    subtle and not-so-subtle failures in all of the above-mentioned
    scenarios, with the workload generally entering unexpected thrashing
    states while losing the ability to reliably detect it.

    To fix this on non-highmem systems at least, going back to rotating
    inodes on the LRU isn't feasible.  We've tried (commit a76cf1a474
    ("mm: don't reclaim inodes with many attached pages")) and failed
    (commit 69056ee6a8 ("Revert "mm: don't reclaim inodes with many
    attached pages"")).

    The issue is mostly that shrinker pools attract pressure based on their
    size, and when objects get skipped the shrinkers remember this as
    deferred reclaim work.  This accumulates excessive pressure on the
    remaining inodes, and we can quickly eat into heavily used ones, or
    dirty ones that require IO to reclaim, when there potentially is plenty
    of cold, clean cache around still.

    Instead, this patch keeps populated inodes off the inode LRU in the
    first place - just like an open file or dirty state would.  An otherwise
    clean and unused inode then gets queued when the last cache entry
    disappears.  This solves the problem without reintroducing the reclaim
    issues, and generally is a bit more scalable than having to wade through
    potentially hundreds of thousands of busy inodes.
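
The rule described in the paragraph above (a populated inode stays off the LRU, and a clean, unused inode is queued when its last cache entry goes away) can be modelled with a small sketch. The `toy_inode` type and helpers are hypothetical, locking is ignored entirely, and the toy removes an inode from the LRU eagerly on cache addition, whereas the commit message says that case is rechecked in the shrinker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the LRU eligibility rule. */
struct toy_inode {
    unsigned int refcount;    /* open files and other users */
    unsigned long nr_cached;  /* page cache entries attached */
    bool dirty;
    bool on_lru;
};

/* Only an unused, clean inode with no cache entries may sit on the LRU. */
static bool toy_inode_reclaimable(const struct toy_inode *inode)
{
    return inode->refcount == 0 && !inode->dirty && inode->nr_cached == 0;
}

static void toy_inode_update_lru(struct toy_inode *inode)
{
    inode->on_lru = toy_inode_reclaimable(inode);
}

/* Adding a cache entry keeps (or takes) the inode off the LRU. */
static void toy_cache_add(struct toy_inode *inode)
{
    inode->nr_cached++;
    toy_inode_update_lru(inode);
}

/* Removing the last cache entry queues an otherwise idle inode. */
static void toy_cache_remove(struct toy_inode *inode)
{
    assert(inode->nr_cached > 0);
    inode->nr_cached--;
    toy_inode_update_lru(inode);
}
```

Because eligibility is decided when state changes, the shrinker in this model never has to walk past busy inodes to find reclaimable ones.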

    Locking is a bit tricky because the locks protecting the inode state
    (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
    irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
    serialized through i_lock, taken before the i_pages lock, to make sure
    depopulated inodes are queued reliably.  Additions may race with
    deletions, but we'll check again in the shrinker.  If additions race
    with the shrinker itself, we're protected by the i_lock: if find_inode()
    or iput() win, the shrinker will bail on the elevated i_count or
    I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
    will set I_FREEING and inhibit further igets(), which will cause the
    other side to create a new instance of the inode instead.

    Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Carlos Maiolino 06017b9485 fs: mark the iomap argument to __block_write_begin_int const
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130933

__block_write_begin_int never modifies the passed in iomap, so mark it
const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit 6d49cc8545e9e9e9e5a14e75fd044f049bd6077e)

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2022-10-03 15:41:31 +02:00
Jeff Moyer 1fbefb3b5d io_uring: add support for IORING_OP_LINKAT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

commit cf30da90bc3a26911d369f199411f38b701394de
Author: Dmitry Kadashev <dkadashev@gmail.com>
Date:   Thu Jul 8 13:34:47 2021 +0700

    io_uring: add support for IORING_OP_LINKAT
    
    IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
    arguments.
    
    In some internal places 'hardlink' is used instead of 'link' to avoid
    confusion with the SQE links. Name 'link' conflicts with the existing
    'link' member of io_kiocb.
    
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
    Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
    [axboe: add splice_fd_in check]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-07-15 14:58:36 -04:00
Jeff Moyer be79cf9740 io_uring: add support for IORING_OP_SYMLINKAT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

commit 7a8721f84fcb3b2946a92380b6fc311e017ff02c
Author: Dmitry Kadashev <dkadashev@gmail.com>
Date:   Thu Jul 8 13:34:46 2021 +0700

    io_uring: add support for IORING_OP_SYMLINKAT
    
    IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
    and arguments.
    
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
    Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
    [axboe: add splice_fd_in check]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-07-15 14:58:36 -04:00
Jeff Moyer 0d9e019bc5 namei: update do_*() helpers to return ints
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

commit 45f30dab395730aa3b3da14d9f19ea0d7d43db53
Author: Dmitry Kadashev <dkadashev@gmail.com>
Date:   Thu Jul 8 13:34:44 2021 +0700

    namei: update do_*() helpers to return ints
    
    Update the following to return int rather than long, for uniformity with
    the rest of the do_* helpers in namei.c:
    
    * do_rmdir()
    * do_unlinkat()
    * do_mkdirat()
    * do_mknodat()
    * do_symlinkat()
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
    Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-07-15 14:58:36 -04:00
Jeff Moyer 5c9ba391d7 namei: make do_mkdirat() take struct filename
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

commit 584d3226d665214dc1c498045c253529acdd3134
Author: Dmitry Kadashev <dkadashev@gmail.com>
Date:   Thu Jul 8 13:34:39 2021 +0700

    namei: make do_mkdirat() take struct filename
    
    Pass in the struct filename pointers instead of the user string, and
    update the three callers to do the same. This is heavily based on
    commit dbea8d345177 ("fs: make do_renameat2() take struct filename").
    
    This behaves like do_unlinkat() and do_renameat2().
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-07-15 14:58:35 -04:00
Ming Lei 204b92f755 block: simplify the block device syncing code
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit 1e03a36bdff4709c1bbf0f57f60ae3f776d51adf
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Oct 19 08:25:30 2021 +0200

    block: simplify the block device syncing code

    Get rid of the indirections and just provide a sync_bdevs
    helper for the generic sync code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:45:28 +08:00
Ming Lei 93522f6059 block: remove __sync_blockdev
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit 70164eb6ccb76ab679b016b4b60123bf4ec6c162
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Oct 19 08:25:25 2021 +0200

    block: remove __sync_blockdev

    Instead offer a new sync_blockdev_nowait helper for the !wait case.
    This new helper is exported as it will grow modular callers in a bit.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:45:28 +08:00
Ming Lei f6352118db block: move fs/block_dev.c to block/bdev.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit 0dca4462ed0681649fdcd5700a6ddfbaa65fa178
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Sep 7 16:13:03 2021 +0200

    block: move fs/block_dev.c to block/bdev.c

    Move it together with the rest of the block layer.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:38:56 +08:00
Paul Gortmaker 1e7107c5ef cgroup1: fix leaked context root causing sporadic NULL deref in LTP
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests.  Things like:

   BUG: kernel NULL pointer dereference, address: 0000000000000008
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: 0000 [#1] PREEMPT SMP PTI
   CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
   RIP: 0010:kernfs_sop_show_path+0x1b/0x60

...or these others:

   RIP: 0010:do_mkdirat+0x6a/0xf0
   RIP: 0010:d_alloc_parallel+0x98/0x510
   RIP: 0010:do_readlinkat+0x86/0x120

There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference.  I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.

In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:

   --------------
           struct cgroup_subsys *ss;
   -       struct dentry *dentry;

   [...]

   -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
   -                                CGROUP_SUPER_MAGIC, ns);

   [...]

   -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   -               struct super_block *sb = dentry->d_sb;
   -               dput(dentry);
   +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
   +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   +               struct super_block *sb = fc->root->d_sb;
   +               dput(fc->root);
                   deactivate_locked_super(sb);
                   msleep(10);
                   return restart_syscall();
           }
   --------------

In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case.   With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.

A fix would be to not leave the stale reference in fc->root as follows:

   --------------
                  dput(fc->root);
  +               fc->root = NULL;
                  deactivate_locked_super(sb);
   --------------

...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
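
The stale-pointer problem can be shown with a user-space sketch. Everything here is hypothetical (`toy_dentry`, `toy_fs_context` and the helpers are not kernel APIs), and it assumes that tearing down the context drops whatever root pointer is still set, which is what turns the leftover pointer into a second, unbalanced put.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object and the context that points at it. */
struct toy_dentry { int refs; };
struct toy_fs_context { struct toy_dentry *root; };

static void toy_dput(struct toy_dentry *d)
{
    if (d)
        d->refs--;
}

/* Buggy shape: the reference is dropped but the pointer is left behind. */
static void toy_drop_leaky(struct toy_fs_context *fc)
{
    toy_dput(fc->root);
}

/* Fixed shape: drop the reference and clear the pointer together. */
static void toy_drop_and_clear(struct toy_fs_context *fc)
{
    toy_dput(fc->root);
    fc->root = NULL;
}

/* Assumed teardown: puts whatever root is still recorded in the context. */
static void toy_put_context(struct toy_fs_context *fc)
{
    toy_dput(fc->root);
    fc->root = NULL;
}
```

With the leaky variant the later teardown drops the reference a second time; with the clearing variant the teardown sees NULL and does nothing.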

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org      # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-07-21 06:39:20 -10:00
Al Viro ffb37ca3bd switch file_open_root() to struct path
... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:56:43 -04:00
Linus Torvalds 7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapped mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystems that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently change ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown(2), changing ownership of large sets of files is
     instantaneous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single mount_setattr
     syscall will be sufficient to change the ownership of all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is common with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow changing ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD: To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime, has been up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.
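
  As a rough illustration of what "mapping the ids of vfs objects"
  means, the sketch below translates an id through a single mapping
  extent in both directions. It is not code from this series: the
  struct, the function names, and the INVALID_ID sentinel are all
  hypothetical stand-ins, and a real user namespace can hold several
  extents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one idmapping extent; not a kernel type. */
struct idmap_extent {
	uint32_t from;   /* first id as stored by the filesystem */
	uint32_t to;     /* first id as seen through the mount */
	uint32_t count;  /* number of consecutive ids covered */
};

/* Sentinel chosen for this sketch to signal "no mapping". */
#define INVALID_ID ((uint32_t)-1)

/* Map an on-disk id to the id visible through the idmapped mount. */
uint32_t map_id_through_mount(const struct idmap_extent *ext, uint32_t id)
{
	assert(ext != NULL);
	if (id < ext->from || id - ext->from >= ext->count)
		return INVALID_ID;
	return ext->to + (id - ext->from);
}

/* Inverse direction: id used on the mount -> id written to the filesystem. */
uint32_t map_id_to_filesystem(const struct idmap_extent *ext, uint32_t id)
{
	assert(ext != NULL);
	if (id < ext->to || id - ext->to >= ext->count)
		return INVALID_ID;
	return ext->from + (id - ext->to);
}
```

  With an extent of { 1000, 1001, 1 }, a file owned by uid 1000 on
  disk shows up as owned by uid 1001 through the mount, and a file
  created by uid 1001 through the mount is stored as uid 1000.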

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor referring to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.
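
  The sequence implied by these conditions can be sketched from
  userspace as follows: clone the source as a detached mount, idmap
  it, then attach it. This is a minimal sketch, not the tool or the
  selftests from this series. The helper names and the local
  struct mount_attr_v0 are hypothetical, raw syscall(2) is used on
  the assumption that libc wrappers may be missing, and the #ifndef
  fallbacks carry the values from the v5.12 uapi headers. userns_fd
  is assumed to be an open file descriptor for a user namespace, for
  example one obtained from /proc/<pid>/ns/user.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers; values match the v5.12 uapi. */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Local copy of the uapi struct mount_attr layout (hypothetical name). */
struct mount_attr_v0 {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/* Fill the attribute struct that requests an idmapping from userns_fd. */
void idmap_attr_init(struct mount_attr_v0 *attr, int userns_fd)
{
	memset(attr, 0, sizeof(*attr));
	attr->attr_set = MOUNT_ATTR_IDMAP;
	attr->userns_fd = (uint64_t)userns_fd;
}

/*
 * Clone `source` as a detached mount, idmap it, and attach it at `target`.
 * Returns 0 on success or -1 with errno set.
 */
int idmapped_bind_mount(const char *source, const char *target, int userns_fd)
{
	struct mount_attr_v0 attr;
	int saved_errno;
	int tree_fd;
	long ret;

	/* Detached/anonymous mount: never visible in the filesystem so far. */
	tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, source,
			       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree_fd < 0)
		return -1;

	/* Mark the detached mount with the user namespace. */
	idmap_attr_init(&attr, userns_fd);
	ret = syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		      &attr, sizeof(attr));

	/* Only attach the mount once it has been idmapped. */
	if (ret == 0)
		ret = syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, target,
			      MOVE_MOUNT_F_EMPTY_PATH);

	saved_errno = errno;
	close(tree_fd);
	errno = saved_errno;
	return ret == 0 ? 0 : -1;
}
```

  Doing the idmapping before move_mount() mirrors the requirement
  that the mount must still be detached when it is idmapped.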

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystems are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstests suite can be used to verify
  that a port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows creating idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary in the
  following weeks should you decide to pull this.

  Here's an example of a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Linus Torvalds 5bbb336ba7 for-5.12/io_uring-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtYbYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppeWD/4xKhzBCGZWOkdycaaPhsUTOjNNIPmCBhlz
 QQj4KFSEuJNKACUg53Ak0oECJTaH5976kjKkKs7Z+hzmkEwboLBI4erkcT9MGC3M
 mPx349qBq9X3sYaFrUJF3h0sjRr+wa60nWQ01oVH8HkfI4bCNCHoqo5jDvMPWsYT
 ksFbUm8YWEZmi0K2yXFWXuJIN2bVBd72a8CrvtF3ksdEMYxbWWTOAcrhYJ4H5/U7
 BQjWIxiIVsAoJohcXWq/Swh8cgvgb5uJVpNUU8VEFob/jI3Gc3YojIToISB6soUL
 DNhDJLeyZjuXfE1Ej+ySas9bpdG4LgxzsDBl9lFl9EQkSo1c3h/lEx85aeixAZla
 QfjTOVUabzdPzvZ9H1yDQISxjVLy2PotnhVMy/rSSrnDKlowtNB9iEzd6cpzFzxU
 fxomz1d6+w8rZY9jaRIAcMNa6bEOuYmcP9V8rIzGeg3Mm3jqL7H/JgJu5s2YbjpN
 InmTNu4cwLeTO65DzqVxF8UGbZ2tHbMm5pNeVBYxuY1adRgJFlIOP5kYlNlyiY+D
 Bt41CRuK3hqpYfXh7nSK8U4BKEhMikTCS0W4aKL5EzLZ20rxjgTlaHZiOBqd9vep
 1tqNjPIvL2jWfF+5shwAZbupj3WKbuVqi4S2jXljv+Wkmk4ZVLSX3fQZv2I7JTHM
 I2qa59PB4A==
 =8MX/
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Highlights from this cycle are things like request recycling and
  task_work optimizations, which net us anywhere from 10-20% of speedups
  on workloads that mostly are inline.

  This work was originally done to put io_uring under memcg, which adds
  considerable overhead. But it's a really nice win as well. Also worth
  highlighting is the LOOKUP_CACHED work in the VFS, and using it in
  io_uring. Greatly speeds up the fast path for file opens.

  Summary:

   - Put io_uring under memcg protection. We accounted just the rings
     themselves under rlimit memlock before, now we account everything.

   - Request cache recycling, persistent across invocations (Pavel, me)

   - First part of a cleanup/improvement to buffer registration (Bijan)

   - SQPOLL fixes (Hao)

   - File registration NULL pointer fixup (Dan)

   - LOOKUP_CACHED support for io_uring

   - Disable /proc/thread-self/ for io_uring, like we do for /proc/self

   - Add Pavel to the io_uring MAINTAINERS entry

   - Tons of code cleanups and optimizations (Pavel)

   - Support for skip entries in file registration (Noah)"

* tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits)
  io_uring: tctx->task_lock should be IRQ safe
  proc: don't allow async path resolution of /proc/thread-self components
  io_uring: kill cached requests from exiting task closing the ring
  io_uring: add helper to free all request caches
  io_uring: allow task match to be passed to io_req_cache_free()
  io-wq: clear out worker ->fs and ->files
  io_uring: optimise io_init_req() flags setting
  io_uring: clean io_req_find_next() fast check
  io_uring: don't check PF_EXITING from syscall
  io_uring: don't split out consume out of SQE get
  io_uring: save ctx put/get for task_work submit
  io_uring: don't duplicate io_req_task_queue()
  io_uring: optimise SQPOLL mm/files grabbing
  io_uring: optimise out unlikely link queue
  io_uring: take compl state from submit state
  io_uring: inline io_complete_rw_common()
  io_uring: move res check out of io_rw_reissue()
  io_uring: simplify iopoll reissuing
  io_uring: clean up io_req_free_batch_finish()
  io_uring: move submit side state closer in the ring
  ...
2021-02-21 11:10:39 -08:00
Jens Axboe 53dec2ea74 fs: provide locked helper variant of close_fd_get_file()
Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Al Viro b964bf53e5 teach sendfile(2) to handle send-to-pipe directly
no point going through the intermediate pipe

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-25 23:29:36 -05:00
Christian Brauner ba73d98745 namei: handle idmapped mounts in may_*() helpers
The may_follow_link(), may_linkat(), may_lookup(), may_open(),
may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
helpers determine whether the caller is privileged enough to perform the
associated operations. Let them handle idmapped mounts by mapping the
inode or fsids according to the mount's user namespace. Afterwards the
checks are identical to non-idmapped inodes. The patch takes care to
retrieve the mount's user namespace right before performing permission
checks and passing it down into the filesystem so the user namespace
can't change in between by someone idmapping a mount that is currently
not idmapped. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Linus Torvalds ac7ac4618c for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
 wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
 N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
 e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
 fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
 IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
 Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
 Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
 GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
 Q1Xqs6xwww==
 =zo4w
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Another series of killing more code than what is being added, again
  thanks to Christoph's relentless cleanups and tech debt tackling.

  This contains:

   - blk-iocost improvements (Baolin Wang)

   - part0 iostat fix (Jeffle Xu)

   - Disable iopoll for split bios (Jeffle Xu)

   - block tracepoint cleanups (Christoph Hellwig)

   - Merging of struct block_device and hd_struct (Christoph Hellwig)

   - Rework/cleanup of how block device sizes are updated (Christoph
     Hellwig)

   - Simplification of gendisk lookup and removal of block device
     aliasing (Christoph Hellwig)

   - Block device ioctl cleanups (Christoph Hellwig)

   - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

   - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

   - sbitmap improvements (Pavel Begunkov)

   - Hybrid polling fix (Pavel Begunkov)

   - bvec iteration improvements (Pavel Begunkov)

   - Zone revalidation fixes (Damien Le Moal)

   - blk-throttle limit fix (Yu Kuai)

   - Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...
2020-12-16 12:57:51 -08:00
Jens Axboe e886663cfd fs: make do_renameat2() take struct filename
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same.

This behaves like do_unlinkat(), which also takes a filename struct and
puts it when it is done. Converting callers is then trivial.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Christoph Hellwig 4e7b5671c6 block: remove i_bdev
Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case).  This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de>		[bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig 60b498852b fs: remove get_super_thawed and get_super_exclusive_thawed
Just open code the wait in the only caller of both functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:38 -07:00
Christoph Hellwig 028abd9222 fs: remove compat_sys_mount
compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-22 23:45:57 -04:00
Linus Torvalds e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00