385 Commits

Author SHA1 Message Date
Ryan Sullivan a9f783f1d6 security: Create file_truncate hook from path_truncate hook
JIRA: https://issues.redhat.com/browse/RHEL-8810

Changes:
- Remove 0 as the second argument passed to call_int_hook() in
security_file_truncate()

Conflicts:
- Outdated code removed that is now handled in do_ftruncate() rather than
do_sys_ftruncate()

Like path_truncate, the file_truncate hook also restricts file
truncation, but is called in the cases where truncation is attempted
on an already-opened file.

This is required in a subsequent commit to handle ftruncate()
operations differently to truncate() operations.

Fixes: 3a16fa673c11 ("landlock: Support file truncation")

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20221018182216.301684-2-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
(cherry picked from commit 3350607dc5637be2563f484dcfe2fed456f3d4ff)
Signed-off-by: Ryan Sullivan <rysulliv@redhat.com>
2025-02-07 17:05:30 -05:00
Jeff Moyer 5331d880be Add do_ftruncate that truncates a struct file
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: Slight context difference due to out-of-order backport of
  commit abf08576afe3 ("fs: port vfs_*() helpers to struct mnt_idmap")
commit 5f0d594c602f870e3a3872f7ea42bf846a1d26cf
Author: Tony Solomonik <tony.solomonik@gmail.com>
Date:   Fri Feb 2 14:17:23 2024 +0200

    Add do_ftruncate that truncates a struct file
    
    do_sys_ftruncate receives a file descriptor, fgets the struct file, and
    finally actually truncates the file.
    
    do_ftruncate allows for passing in a file directly, with the caller
    already holding a reference to it.
    
    Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:34:44 -05:00
Ian Kent 252894f3c0 fs: port vfs{g,u}id helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.

commit 4d7ca4090184c153f8ccb1a68ca5cf136dac108b
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:32 2023 +0100

    fs: port vfs{g,u}id helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 12:36:04 +08:00
Ian Kent edf17476c7 fs: port privilege checking helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.

commit 9452e93e6dae862d7aeff2b11236d79bde6f9b66
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:27 2023 +0100

    fs: port privilege checking helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:31 +08:00
Ian Kent 304ec491ee fs: port ->permission() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream commit 48fa94aacd ("ceph: fscrypt_auth handling
	for ceph") is present which causes fuzz 2 in hunk #1 in
	fs/ceph/super.h.
	Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for
	unreachable debug statements") is not present causing fuzz 2
	in hunk #1 against fs/exportfs/expfs.c.
	Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq
	into nameidata") is not present causing a fuzz 1 for hunk #14
	against fs/namei.c.
	CentOS Stream c4f3dd0731 ("nfsd: handle failure to collect
	pre/post-op attrs more sanely") is present and causes a rejects
	for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get()
	from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes
	hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c.
	CentOS Stream commit 55177e4b83 ("ovl: mark xwhiteouts directory
	with overlay.opaque='x'") and commit d17b324bb6 ("ovl: use
	ovl_numlower() and ovl_lowerstack() accessors") change the first
	and third hunks of fs/overlayfs/namei.c causing them to fail,
	manually apply.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") causes fuzz 2 in hunk #5 of
	fs/overlayfs/overlayfs.h
	CentOS Stream commit 355a9c490a ("ovl: Add an alternative
	type of whiteout") changes ovl_cache_update_ino() to
	ovl_cache_update() in fs/overlayfs/readdir.c, make the change
	manually.
	Upstream commit 217af7e2f4deb ("apparmor: refactor profile
	rules and attachments") is not in CentOS Stream causing hunk #1
	to fail to apply so manually apply the change.

commit 4609e1f18e19c3b302e1eb4858334bca1532f780
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:22 2023 +0100

    fs: port ->permission() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:20 +08:00
Ian Kent f7a70a9fc1 fs: port vfs_*() helpers to struct mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: There was a whitespace difference possibly due to CentOS
	Stream commit c912400e45 ("fs: Fix description of
	vfs_tmpfile()")
	CentOS Stream commit c4f3dd0731 ("nfsd: handle failure to
	collect pre/post-op attrs more sanely") is present which caused
	a hunk reject in fs/nfsd/nfs3proc.c and two hunks to be rejected
	in fs/nfsd/vfs.c the hunks were manually applied.
	Upstream commit 79b05beaa5c34 ("af_unix: Acquire/Release
	per-netns hash table's locks.") is not present in CentOS Stream
	fixed the conflict manually.
	Dropped ksmbd hunks, ksmbd source is not present.
	Upstream commit 3350607dc5637 ("security: Create file_truncate
	hook from path_truncate hook") is not present in CentOS Stream.

commit abf08576afe31506b812c8c1be9714f78613f300
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:10 2023 +0100

    fs: port vfs_*() helpers to struct mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:51 +08:00
Ian Kent dc1f3bea48 attr: use consistent sgid stripping checks
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:37 2022 +0200

    attr: use consistent sgid stripping checks

    Currently setgid stripping in file_remove_privs()'s should_remove_suid()
    helper is inconsistent with other parts of the vfs. Specifically, it only
    raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
    inode isn't in the caller's groups and the caller isn't privileged over the
    inode although we require this already in setattr_prepare() and
    setattr_copy() and so all filesystems implement this requirement implicitly
    because they have to use setattr_{prepare,copy}() anyway.

    But the inconsistency shows up in setgid stripping bugs for overlayfs in
    xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
    generic/687). For example, we test whether suid and setgid stripping works
    correctly when performing various write-like operations as an unprivileged
    user (fallocate, reflink, write, etc.):

    echo "Test 1 - qa_user, non-exec file $verb"
    setup_testfile
    chmod a+rws $junk_file
    commit_and_check "$qa_user" "$verb" 64k 64k

    The test basically creates a file with 6666 permissions. While the file has
    the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
    regular filesystem like xfs what will happen is:

    sys_fallocate()
    -> vfs_fallocate()
       -> xfs_file_fallocate()
          -> file_modified()
             -> __file_remove_privs()
                -> dentry_needs_remove_privs()
                   -> should_remove_suid()
                -> __remove_privs()
                   newattrs.ia_valid = ATTR_FORCE | kill;
                   -> notify_change()
                      -> setattr_copy()

    In should_remove_suid() we can see that ATTR_KILL_SUID is raised
    unconditionally because the file in the test has S_ISUID set.

    But we also see that ATTR_KILL_SGID won't be set because while the file
    is S_ISGID it is not S_IXGRP (see above) which is a condition for
    ATTR_KILL_SGID being raised.

    So by the time we call notify_change() we have attr->ia_valid set to
    ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
    ATTR_KILL_SUID is set and does:

    ia_valid = attr->ia_valid |= ATTR_MODE
    attr->ia_mode = (inode->i_mode & ~S_ISUID);

    which means that when we call setattr_copy() later we will definitely
    update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

    Now we call into the filesystem's ->setattr() inode operation which will
    end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

    if (ia_valid & ATTR_MODE) {
            umode_t mode = attr->ia_mode;
            vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
            if (!vfsgid_in_group_p(vfsgid) &&
                !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                    mode &= ~S_ISGID;
            inode->i_mode = mode;
    }

    and since the caller in the test is neither capable nor in the group of the
    inode the S_ISGID bit is stripped.

    But assume the file isn't suid. Then ATTR_KILL_SUID won't be raised, which
    has the consequence that neither the setgid nor the suid bits are stripped,
    even though the setgid bit should be stripped because the inode isn't in
    the caller's groups and the caller isn't privileged over the inode.

    If overlayfs is in the mix things become a bit more complicated and the bug
    shows up more clearly. When e.g., ovl_setattr() is hit from
    ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
    ATTR_KILL_SGID might be raised but because the check in notify_change() is
    questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
    stripped the S_ISGID bit isn't removed even though it should be stripped:

    sys_fallocate()
    -> vfs_fallocate()
       -> ovl_fallocate()
          -> file_remove_privs()
             -> dentry_needs_remove_privs()
                -> should_remove_suid()
             -> __remove_privs()
                newattrs.ia_valid = ATTR_FORCE | kill;
                -> notify_change()
                   -> ovl_setattr()
                      // TAKE ON MOUNTER'S CREDS
                      -> ovl_do_notify_change()
                         -> notify_change()
                      // GIVE UP MOUNTER'S CREDS
         // TAKE ON MOUNTER'S CREDS
         -> vfs_fallocate()
            -> xfs_file_fallocate()
               -> file_modified()
                  -> __file_remove_privs()
                     -> dentry_needs_remove_privs()
                        -> should_remove_suid()
                     -> __remove_privs()
                        newattrs.ia_valid = ATTR_FORCE | kill;
                        -> notify_change()

    The fix for all of this is to make file_remove_privs()'s
    should_remove_suid() helper to perform the same checks as we already
    require in setattr_prepare() and setattr_copy() and have notify_change()
    not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
    first place because the caller must calculate the flags via
    should_remove_suid() anyway which would raise ATTR_KILL_SGID.

    While we're at it we move should_remove_suid() from inode.c to attr.c
    where it belongs with the rest of the iattr helpers. Especially since it
    returns ATTR_KILL_S{G,U}ID flags. We also rename it to
    setattr_should_drop_suidgid() to better reflect that it indicates both
    setuid and setgid bit removal and also that it returns attr flags.

    Running xfstests with this doesn't report any regressions. We should really
    try and use consistent checks.

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:34 +08:00
Ian Kent 9d81504f35 open: always initialize ownership fields
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit f52d74b190f8d10ec01cd5774eca77c2186c8ab7
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Mon Sep 19 20:05:12 2022 +0900

    open: always initialize ownership fields

    At the beginning of the merge window we introduced the vfs{g,u}id_t types in
    b27c82e12965 ("attr: port attribute changes to new types") and changed
    various codepaths over including chown_common().

    During that change we forgot to account for the case where the passed
    ownership value is -1. In this case the ownership fields in struct iattr
    aren't initialized but we rely on them being initialized by the time we
    generate the ownership to pass down to the LSMs. All the major LSMs
    don't care about the ownership values at all. Only Tomoyo uses them and
    so it took a while for syzbot to unearth this issue.

    Fix this by initializing the ownership fields and do it within the
    retry_deleg block. While notify_change() doesn't alter the ownership
    fields currently we shouldn't rely on it.

    Since no kernel has been released with these changes this does not
    need to be backported to any stable kernels.

    [Christian Brauner (Microsoft) <brauner@kernel.org>]
    * rewrote commit message
    * use INVALID_VFS{G,U}ID macros

    Fixes: b27c82e12965 ("attr: port attribute changes to new types") # mainline only
    Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:11:12 +08:00
Ian Kent 8763195146 attr: port attribute changes to new types
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflict: Hunk 2 of fs/f2fs/file.c failed to apply but the source looked
	identical and required manual application.
	Hunks 2 and 3 failed to apply to fs/attr.c due to CentOS Stream
	commit 33c38120a3 ("fs: account for group membership") having
	already been applied requiring manual application.
	Update to add incremental changes needed due to CentOS Stream
	("shmem: quota support").

commit b27c82e1296572cfa3997e58db3118a33915f85c
Author: Christian Brauner <brauner@kernel.org>
Date:   Tue Jun 21 16:14:54 2022 +0200

    attr: port attribute changes to new types

    Now that we introduced new infrastructure to increase the type safety
    for filesystems supporting idmapped mounts port the first part of the
    vfs over to them.

    This ports the attribute changes codepaths to rely on the new better
    helpers using a dedicated type.

    Before this change we used to take a shortcut and place the actual
    values that would be written to inode->i_{g,u}id into struct iattr. This
    had the advantage that we moved idmappings mostly out of the picture
    early on but it made reasoning about changes more difficult than it
    should be.

    The filesystem was never explicitly told that it dealt with an idmapped
    mount. The transition to the value that needed to be stored in
    inode->i_{g,u}id appeared way too early and increased the probability of
    bugs in various codepaths.

    We now place the same value in struct iattr no matter if this is an
    idmapped mount or not. The vfs will only deal with type safe
    vfs{g,u}id_t. This makes it massively safer to perform permission checks
    as the type will tell us what checks we need to perform and what helpers
    we need to use.

    Filesystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
    inode->i_{g,u}id since they are different types. Instead they need to
    use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
    vfs{g,u}id into the filesystem.

    The other nice effect is that filesystems like overlayfs don't need to
    care about idmappings explicitly anymore and can simply set up struct
    iattr accordingly directly.

    Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
    Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Aleksa Sarai <cyphar@cyphar.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:10:59 +08:00
CKI Backport Bot f11747ac19 ftruncate: pass a signed offset
JIRA: https://issues.redhat.com/browse/RHEL-51605
CVE: CVE-2024-42084

commit 4b8e88e563b5f666446d002ad0dc1e6e8e7102b0
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed Jun 19 11:34:09 2024 +0200

    ftruncate: pass a signed offset

    The old ftruncate() syscall, using the 32-bit off_t, misses a sign
    extension when called in compat mode on 64-bit architectures.  As a
    result, passing a negative length accidentally succeeds in truncating
    to a file size between 2GiB and 4GiB.

    Changing the type of the compat syscall to the signed compat_off_t
    changes the behavior so it instead returns -EINVAL.

    The native entry point, the truncate() syscall and the corresponding
    loff_t based variants are all correct already and do not suffer
    from this mistake.

    Fixes: 3f6d078d4a ("fix compat truncate/ftruncate")
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-07-30 17:17:40 +00:00
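The missing sign extension described in the commit above is easy to reproduce in user space. The sketch below is an illustration only: widen_unsigned(), widen_signed() and toy_check_length() are hypothetical helpers, not kernel functions. They show how the same 32-bit pattern becomes either a large positive length or a negative one depending on the type it is widened from.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * The sketch assumes the usual two's-complement wrapping when converting an
 * out-of-range unsigned value to int32_t; check that at compile time.
 */
static_assert((int32_t)0xFFFFFFFFu == -1, "expects wrapping conversion to int32_t");

/* Widening from an unsigned 32-bit type zero-extends: -1 becomes 4 GiB - 1. */
static int64_t widen_unsigned(uint32_t raw)
{
    return (int64_t)raw;
}

/* Widening from a signed 32-bit type sign-extends: -1 stays -1. */
static int64_t widen_signed(uint32_t raw)
{
    return (int64_t)(int32_t)raw;
}

/* Toy version of a length check that rejects negative lengths. */
static int toy_check_length(int64_t length)
{
    return length < 0 ? -EINVAL : 0;
}
```

With the raw value 0xFFFFFFFF, the zero-extended length passes the check as a size just below 4 GiB, while the sign-extended length is rejected with -EINVAL, matching the before and after behaviour the message describes.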
Ming Lei 900174223d fs: use __fput_sync in close(2)
JIRA: https://issues.redhat.com/browse/RHEL-34573

commit 021a160abf62c19aff36c920566efb4f690e964a
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Aug 8 19:26:35 2023 +0200

    fs: use __fput_sync in close(2)

    close(2) is a special case which guarantees a shallow kernel stack,
    making delegation to task_work machinery unnecessary. Said delegation is
    problematic as it involves atomic ops and interrupt masking trips, none
    of which are cheap on x86-64. Forcing close(2) to do it looks like an
    oversight in the original work.

    Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
    -> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
    thread returning to userspace land in resume_user_mode_work(), where
    rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
    the thread (and it is by default with contemporary glibc).

    Sample result when benchmarking open1_processes -t 1 from will-it-scale
    (that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
    Rapid CPU (ops/s):
    stock+RSEQ:     1329857
    stock-RSEQ:     1421667 (+7%)
    patched:        1523521 (+14.5% / +7%) (with / without rseq)

    Patched result is the same regardless of rseq as the codepath is avoided.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-30 21:18:18 +08:00
Waiman Long c965fac1e0 fchmodat2: add support for AT_EMPTY_PATH
JIRA: https://issues.redhat.com/browse/RHEL-28616

commit 5daeb41a6fc9d0d81cb2291884b7410e062d8fa1
Author: Aleksa Sarai <cyphar@cyphar.com>
Date:   Fri, 28 Jul 2023 21:58:26 +1000

    fchmodat2: add support for AT_EMPTY_PATH

    This allows userspace to avoid going through /proc/self/fd when dealing
    with all types of file descriptors for chmod(), and makes fchmodat2() a
    proper superset of all other chmod syscalls.

    The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
    that fchmod() doesn't operate on O_PATH file descriptors by design. To
    quote open(2):

    > O_PATH (since Linux 2.6.39)
    > [...]
    > The file itself is not opened, and other file operations (e.g.,
    > read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
    > mmap(2)) fail with the error EBADF.

    However, procfs has allowed userspace to do this operation ever since
    the introduction of O_PATH through magic-links, so adding this feature
    is only an improvement for programs that have to mess around with
    /proc/self/fd/$n today to get this behaviour. In addition,
    fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
    allows chown() operations directly on O_PATH descriptors.

    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    Acked-by: Alexey Gladkov <legion@kernel.org>
    Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 10:05:55 -04:00
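For the AT_EMPTY_PATH commit above, a user-space call might look like the following sketch. It invokes the system call directly through syscall(2) because a libc wrapper may not be available. Assumptions are marked in the comments: the fallback syscall number 452 applies to x86-64 and the generic syscall table only, and toy_fchmod_by_fd() is a hypothetical helper name. On kernels without fchmodat2() the call is expected to fail with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_fchmodat2
#define SYS_fchmodat2 452 /* assumption: x86-64 / generic syscall table number */
#endif

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* Linux UAPI value, used if the headers hide it */
#endif

/*
 * Change the mode of the file that fd refers to, passing an empty path and
 * AT_EMPTY_PATH so that no /proc/self/fd lookup is needed.
 * Returns 0 on success or -1 with errno set (ENOSYS on older kernels).
 */
static int toy_fchmod_by_fd(int fd, mode_t mode)
{
    assert(fd >= 0);
    return (int)syscall(SYS_fchmodat2, fd, "", mode, AT_EMPTY_PATH);
}
```

A caller holding a descriptor opened with O_PATH could pass it to toy_fchmod_by_fd() where fchmod() would fail with EBADF, which is the difference the commit message highlights.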
Waiman Long 1e08831d22 fs: Add fchmodat2()
JIRA: https://issues.redhat.com/browse/RHEL-28616

commit 09da082b07bbae1c11d9560c8502800039aebcea
Author: Alexey Gladkov <legion@kernel.org>
Date:   Tue, 11 Jul 2023 18:16:04 +0200

    fs: Add fchmodat2()

    On the userspace side fchmodat(3) is implemented as a wrapper
    function which implements the POSIX-specified interface. This
    interface differs from the underlying kernel system call, which does not
    have a flags argument. Most implementations require procfs [1][2].

    There doesn't appear to be a good userspace workaround for this issue
    but the implementation in the kernel is pretty straight-forward.

    The new fchmodat2() syscall allows passing the AT_SYMLINK_NOFOLLOW flag,
    unlike the existing fchmodat().

    [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
    [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28

    Co-developed-by: Palmer Dabbelt <palmer@sifive.com>
    Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Acked-by: Arnd Bergmann <arnd@arndb.de>
    Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org>
    [brauner: pre reviews, do flag conversion in do_fchmodat() directly]
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2024-03-27 10:05:54 -04:00
Jeff Moyer da5eea0749 fsnotify: move fsnotify_open() hook into do_dentry_open()
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 7b8c9d7bb4570ee4800642009c8f2d9756004552
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Sun Jun 11 15:24:29 2023 +0300

    fsnotify: move fsnotify_open() hook into do_dentry_open()
    
    fsnotify_open() hook is called only from high level system calls
    context and not called for the very many helpers to open files.
    
    This may make sense for many of the special file open cases, but it is
    inconsistent with fsnotify_close() hook that is called for every last
    fput() on a file object with FMODE_OPENED.
    
    As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
    without ever observing an OPEN event.
    
    Fix this inconsistency by replacing all the fsnotify_open() hooks with
    a single hook inside do_dentry_open().
    
    If there are special cases that would like to opt-out of the possible
    overhead of fsnotify() call in fsnotify_open(), they would probably also
    want to avoid the overhead of fsnotify() call in the rest of the fsnotify
    hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.
    
    However, in the majority of those cases, the s_fsnotify_connectors
    optimization in fsnotify_parent() would be sufficient to avoid the
    overhead of fsnotify() call anyway.
    
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:54 -04:00
Chris von Recklinghausen 4dd834cc12 fs: remove no_llseek
Conflicts: drivers/gpu/drm/drm_fops.c - We already added this change as
	part of RHEL commit
	7a3deb5bcc ("Merge DRM changes from upstream v5.19..v6.0")

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 868941b14441282ba08761b770fc6cad69d5bdb7
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Jun 29 15:07:00 2022 +0200

    fs: remove no_llseek

    Now that all callers of ->llseek are going through vfs_llseek(), we
    don't gain anything by keeping no_llseek around. Nothing actually calls
    it and setting ->llseek to no_llseek is completely equivalent to
    leaving it NULL.

    Longer term (== by the end of merge window) we want to remove all such
    initializations.  To simplify the merge window this commit does *not*
    touch initializers - it only defines no_llseek as NULL (and simplifies
    the tests on file opening).

    At -rc1 we'll need to do a mechanical removal of no_llseek -

    git grep -l -w no_llseek | grep -v porting.rst | while read i; do
            sed -i '/\<no_llseek\>/d' $i
    done
    would do it.

    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:52 -04:00
Chris von Recklinghausen 5102946dae fs: clear or set FMODE_LSEEK based on llseek function
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit e7478158e1378325907edfdd960eca98a1be405b
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Jun 29 15:06:57 2022 +0200

    fs: clear or set FMODE_LSEEK based on llseek function

    Pipe-like behaviour on llseek(2) (i.e. unconditionally failing with
    -ESPIPE) can be expressed in 3 ways:
            1) ->llseek set to NULL in file_operations
            2) ->llseek set to no_llseek in file_operations
            3) FMODE_LSEEK *not* set in ->f_mode.

    Enforce (3) in cases (1) and (2); that will allow to simplify the
    checks and eventually get rid of no_llseek boilerplate.

    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:51 -04:00
Chris von Recklinghausen f2ae8afa36 fs: support mapped mounts of mapped filesystems
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd303368b776eead1c29e6cdda82bde7128b82a7
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Fri Dec 3 12:17:07 2021 +0100

    fs: support mapped mounts of mapped filesystems

    In previous patches we added new and modified existing helpers to handle
    idmapped mounts of filesystems mounted with an idmapping. In this final
    patch we convert all relevant places in the vfs to actually pass the
    filesystem's idmapping into these helpers.

    With this the vfs is in shape to handle idmapped mounts of filesystems
    mounted with an idmapping. Note that this is just the generic
    infrastructure. Actually adding support for idmapped mounts to a
    filesystem mountable with an idmapping is follow-up work.

    In this patch we extend the definition of an idmapped mount from a mount
    that has the initial idmapping attached to it to a mount that has
    an idmapping attached to it which is not the same as the idmapping the
    filesystem was mounted with.

    As before we do not allow the initial idmapping to be attached to a
    mount. In addition this patch prevents that the idmapping the filesystem
    was mounted with can be attached to a mount created based on this
    filesystem.

    This has multiple reasons and advantages. First, attaching the initial
    idmapping or the filesystem's idmapping doesn't make much sense as in
    both cases the values of the i_{g,u}id and other places where k{g,u}ids
    are used do not change. Second, a user that really wants to do this for
    whatever reason can just create a separate dedicated identical idmapping
    to attach to the mount. Third, we can continue to use the initial
    idmapping as an indicator that a mount is not idmapped allowing us to
    continue to keep passing the initial idmapping into the mapping helpers
    to tell them that something isn't an idmapped mount even if the
    filesystem is mounted with an idmapping.

    Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Amir Goldstein <amir73il@gmail.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:33 -04:00
Chris von Recklinghausen e24bc3eeac fs: use low-level mapping helpers
Conflicts: drop changes to fs/ksmbd/smbacl.c fs/ksmbd/smbacl.h - files
	not in RHEL

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 4472071331549e911a5abad41aea6e3be855a1a4
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Fri Dec 3 12:17:03 2021 +0100

    fs: use low-level mapping helpers

    In a few places the vfs needs to interact with bare k{g,u}ids directly
    instead of struct inode. These are just a few. In previous patches we
    introduced low-level mapping helpers that are able to support
    filesystems mounted with an idmapping. This patch simply converts the places
    to use these new helpers.

    Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Amir Goldstein <amir73il@gmail.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:32 -04:00
Chris von Recklinghausen 44d9f4a9fa fs: move mapping helpers
Conflicts:
	drop changes to fs/ksmbd/smbacl.c, fs/ksmbd/smbacl.h - files/fs
		not in CS9
	fs/posix_acl.c - We don't have
		332f606b32b6 ("ovl: enable RCU'd ->get_acl()")
		so don't include linux/namei.h
	include/linux/fs.h - We already have
		8b9f3ac5b01d ("fs: introduce alloc_inode_sb() to allocate filesystems specific inode")
		so keep include of linux/slab.h

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit a793d79ea3e041081cd7cbd8ee43d0b5e4914a2b
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Fri Dec 3 12:16:59 2021 +0100

    fs: move mapping helpers

    The low-level mapping helpers were so far crammed into fs.h. They are
    out of place there. The fs.h header should just contain the higher-level
    mapping helpers that interact directly with vfs objects such as struct
    super_block or struct inode and not the bare mapping helpers. Similarly,
    only vfs and specific fs code shall interact with low-level mapping
    helpers. And so they won't be made accessible automatically through
    regular {g,u}id helpers.

    Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:31 -04:00
Chris von Recklinghausen 3f6b9d910f VFS: add FMODE_CAN_ODIRECT file flag
Bugzilla: https://bugzilla.redhat.com/2160210

commit a2ad63daa88b9d6846976fd2a0b5e4f5cfc58377
Author: NeilBrown <neilb@suse.de>
Date:   Mon May 9 18:20:49 2022 -0700

    VFS: add FMODE_CAN_ODIRECT file flag

    Currently various places test if direct IO is possible on a file by
    checking for the existence of the direct_IO address space operation.
    This is a poor choice, as the direct_IO operation may not be used - it is
    only used if the generic_file_*_iter functions are called for direct IO
    and some filesystems - particularly NFS - don't do this.

    Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
    various places to check this (avoiding pointer dereferences).
    do_dentry_open() will set this flag if ->direct_IO is present, so
    filesystems do not need to be changed.

    NFS *is* changed, to set the flag explicitly and discard the direct_IO
    entry in the address_space_operations for files.

    Other filesystems which currently use noop_direct_IO could usefully be
    changed to set this flag instead.
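As a rough illustration of the scheme described above, the following user-space sketch models both ways the flag can end up set: derived at open time from a non-NULL ->direct_IO, or set explicitly by the filesystem. All names (`FMODE_CAN_ODIRECT_MODEL`, `struct aops_model`, `open_set_odirect()`, and so on) are hypothetical stand-ins and the bit value is arbitrary; this is not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the bit value and struct layouts are illustrative only. */
#define FMODE_CAN_ODIRECT_MODEL 0x1u

struct aops_model {
    int (*direct_IO)(void);   /* non-NULL when the filesystem provides the op */
};

struct odirect_file_model {
    unsigned int f_mode;
    const struct aops_model *a_ops;
};

/* Dummy operation so that a_ops->direct_IO can be non-NULL. */
int dummy_direct_io(void)
{
    return 0;
}

/* Open-time step: derive the flag from ->direct_IO when the op is present. */
void open_set_odirect(struct odirect_file_model *f)
{
    if (f->a_ops && f->a_ops->direct_IO)
        f->f_mode |= FMODE_CAN_ODIRECT_MODEL;
}

/* Callers test one bit instead of dereferencing f->a_ops->direct_IO. */
bool can_odirect(const struct odirect_file_model *f)
{
    return (f->f_mode & FMODE_CAN_ODIRECT_MODEL) != 0;
}

/* Convenience wrapper covering both ways the flag can end up set. */
bool odirect_allowed_after_open(bool has_direct_io_op, bool fs_sets_flag)
{
    struct aops_model aops = { has_direct_io_op ? dummy_direct_io : NULL };
    struct odirect_file_model f = { 0u, &aops };

    if (fs_sets_flag)   /* the explicit path the message describes for NFS */
        f.f_mode |= FMODE_CAN_ODIRECT_MODEL;
    open_set_odirect(&f);
    return can_odirect(&f);
}
```

A filesystem with neither a direct_IO operation nor an explicit flag ends up without the bit, so direct IO is refused in this model.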

    Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brown
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: NeilBrown <neilb@suse.de>
    Tested-by: David Howells <dhowells@redhat.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:01 -04:00
Chris von Recklinghausen cfacd80e4b riscv: compat: syscall: Add compat_sys_call_table implementation
Bugzilla: https://bugzilla.redhat.com/2160210

commit 59c10c52f573faca862cda5ebcdd43831608eb5a
Author: Guo Ren <guoren@linux.alibaba.com>
Date:   Tue Apr 5 15:13:05 2022 +0800

    riscv: compat: syscall: Add compat_sys_call_table implementation

    Implement compat sys_call_table and some system call functions:
    truncate64, ftruncate64, fallocate, pread64, pwrite64,
    sync_file_range, readahead, fadvise64_64 which need argument
    translation.

    Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
    Signed-off-by: Guo Ren <guoren@kernel.org>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Tested-by: Heiko Stuebner <heiko@sntech.de>
    Link: https://lore.kernel.org/r/20220405071314.3225832-12-guoren@kernel.org
    Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:51 -04:00
Ming Lei 34a5f9ec68 fs: remove fs.f_write_hint
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511

commit 7b12e49669c99f63bc12351c57e581f1f14d4adf
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Mar 8 07:05:29 2022 +0100

    fs: remove fs.f_write_hint

    The value is now completely unused except for reporting it back through
    the F_GET_FILE_RW_HINT fcntl, so remove the value and the two fcntls
    for it.

    Trying to use the F_SET_FILE_RW_HINT and F_GET_FILE_RW_HINT fcntls will
    now return EINVAL, just like it would on a kernel that never supported
    this functionality in the first place.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220308060529.736277-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-10-12 09:20:11 +08:00
Benjamin Coddington ba6377f812 NFSD: Instantiate a struct file when creating a regular NFSv4 file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1905809

commit fb70bf124b051d4ded4ce57511dfec6d3ebf2b43
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Wed Mar 30 10:30:54 2022 -0400

    NFSD: Instantiate a struct file when creating a regular NFSv4 file

    There have been reports of races that cause NFSv4 OPEN(CREATE) to
    return an error even though the requested file was created. NFSv4
    does not provide a status code for this case.

    To mitigate some of these problems, reorganize the NFSv4
    OPEN(CREATE) logic to allocate resources before the file is actually
    created, and open the new file while the parent directory is still
    locked.

    Two new APIs are added:

    + Add an API that works like nfsd_file_acquire() but does not open
    the underlying file. The OPEN(CREATE) path can use this API when it
    already has an open file.

    + Add an API that is kin to dentry_open(). NFSD needs to create a
    file and grab an open "struct file *" atomically. The
    alloc_empty_file() has to be done before the inode create. If it
    fails (for example, because the NFS server has exceeded its
    max_files limit), we avoid creating the file and can still return
    an error to the NFS client.

    BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=382
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Tested-by: JianHong Yin <jiyin@redhat.com>

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
2022-06-07 13:31:39 -04:00
Richard Guy Briggs 865ba9ed53 audit: add OPENAT2 record to list "how" info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2035124
Upstream Status: v5.16-rc1
Conflicts:
  - include/uapi/linux/audit.h AUDIT_URINGOP/AUDIT_OPENAT2, went in via selinux tree
    commit 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring")

commit 571e5c0efcb29c5dac8cf2949d3eed84ec43056c
Author: Richard Guy Briggs <rgb@redhat.com>
Date:   Wed May 19 16:00:22 2021 -0400

    audit: add OPENAT2 record to list "how" info

    Since the openat2(2) syscall uses a struct open_how pointer to communicate
    its parameters they are not usefully recorded by the audit SYSCALL record's
    four existing arguments.

    Add a new audit record type OPENAT2 that reports the parameters in its
    third argument, struct open_how with fields oflag, mode and resolve.

    The new record in the context of an event would look like:
    time->Wed Mar 17 16:28:53 2021
    type=PROCTITLE msg=audit(1616012933.531:184): proctitle=
      73797363616C6C735F66696C652F6F70656E617432002F746D702F61756469742D
      7465737473756974652D737641440066696C652D6F70656E617432
    type=PATH msg=audit(1616012933.531:184): item=1 name="file-openat2"
      inode=29 dev=00:1f mode=0100600 ouid=0 ogid=0 rdev=00:00
      obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
      cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
    type=PATH msg=audit(1616012933.531:184):
      item=0 name="/root/rgb/git/audit-testsuite/tests"
      inode=25 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00
      obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT
      cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
    type=CWD msg=audit(1616012933.531:184):
      cwd="/root/rgb/git/audit-testsuite/tests"
    type=OPENAT2 msg=audit(1616012933.531:184):
      oflag=0100302 mode=0600 resolve=0xa
    type=SYSCALL msg=audit(1616012933.531:184): arch=c000003e syscall=437
      success=yes exit=4 a0=3 a1=7ffe315f1c53 a2=7ffe315f1550 a3=18
      items=2 ppid=528 pid=540 auid=0 uid=0 gid=0 euid=0 suid=0
      fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="openat2"
      exe="/root/rgb/git/audit-testsuite/tests/syscalls_file/openat2"
      subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
      key="testsuite-1616012933-bjAUcEPO"
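The OPENAT2 line in the example can be reproduced with a small formatting sketch. The format string below is an assumption inferred from the sample record (octal with a leading 0 for oflag and mode, hex for resolve), and `openat2_record_model()` is a hypothetical helper, not the audit subsystem's function.

```c
#include <stdio.h>

/*
 * Format the three struct open_how values the way the example record shows
 * them: flags and mode in octal with a leading 0, resolve in hex.
 * The format string is an assumption inferred from the sample output.
 */
const char *openat2_record_model(unsigned long long flags,
                                 unsigned long long mode,
                                 unsigned long long resolve)
{
    static char buf[96];   /* static buffer: not reentrant, fine for a sketch */

    snprintf(buf, sizeof(buf), "oflag=0%llo mode=0%llo resolve=0x%llx",
             flags, mode, resolve);
    return buf;
}
```

Calling `openat2_record_model(0100302, 0600, 0xa)` yields the same field text as the sample record above.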

    Link: https://lore.kernel.org/r/d23fbb89186754487850367224b060e26f9b7181.1621363275.git.rgb@redhat.com
    Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    [PM: tweak subject, wrap example, move AUDIT_OPENAT2 to 1337]
    Signed-off-by: Paul Moore <paul@paul-moore.com>

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
2022-03-25 20:02:50 -04:00
Jeffrey Layton 5b7ac1214f fs: remove mandatory file locking support
Bugzilla: http://bugzilla.redhat.com/2017438

commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
Author: Jeff Layton <jlayton@kernel.org>
Date:   Thu Aug 19 14:56:38 2021 -0400

    fs: remove mandatory file locking support

    We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
    off in Fedora and RHEL8. Several other distros have followed suit.

    I've heard of one problem in all that time: Someone migrated from an
    older distro that supported "-o mand" to one that didn't, and the host
    had a fstab entry with "mand" in it which broke on reboot. They didn't
    actually _use_ mandatory locking so they just removed the mount option
    and moved on.

    This patch rips out mandatory locking support wholesale from the kernel,
    along with the Kconfig option and the Documentation file. It also
    changes the mount code to ignore the "mand" mount option instead of
    erroring out, and to throw a big, ugly warning.

    Signed-off-by: Jeff Layton <jlayton@kernel.org>

Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
2021-11-01 13:56:12 -04:00
Linus Torvalds 58ec9059b3 Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs name lookup updates from Al Viro:
 "Small namei.c patch series, mostly to simplify the rules for nameidata
  state. It's actually from the previous cycle - but I didn't post it
  for review in time...

  Changes visible outside of fs/namei.c: file_open_root() calling
  conventions change, some freed bits in LOOKUP_... space"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: make sure nd->depth is always valid
  teach set_nameidata() to handle setting the root as well
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
  switch file_open_root() to struct path
2021-07-03 11:41:14 -07:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Collin Fijalkovich eb6ecbed0a mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE.  This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open().  Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g.  Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()).  It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout.  The dynamic linker must
map these files' segments at a hugepage size aligned vma for the mapping
to be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Christian Brauner cfe80306a0 open: don't silently ignore unknown O-flags in openat2()
The new openat2() syscall verifies that no unknown O-flag values are
set and returns an error to userspace if they are while the older open
syscalls like open() and openat() simply ignore unknown flag values:

  #define O_FLAG_CURRENTLY_INVALID (1 << 31)
  struct open_how how = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
          .resolve = 0,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

  /* succeeds */
  fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

However, openat2() silently truncates the upper 32 bits meaning:

  #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
  #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1ULL << 40)

  struct open_how how_lower32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
  };

  struct open_how how_upper32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

  /* succeeds */
  fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

Fix this by preventing the immediate truncation in build_open_flags().

There's a snafu here though: stripping FMODE_* directly from flags would
cause the upper 32 bits to be truncated as well due to integer promotion
rules since FMODE_* is unsigned int, O_* are signed ints (yuck).

In addition, struct open_flags currently defines flags to be 32 bit
which is reasonable. If we simply were to bump it to 64 bit we would
need to change a lot of code preemptively which doesn't seem worth it.
So simply add a compile-time check verifying that all currently known
O_* flags are within the 32 bit range and fail to build if they aren't
anymore.
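The truncation and the promotion pitfall can be reproduced in isolation. The sketch below is a user-space model under stated assumptions: `KNOWN_FLAGS_MODEL` is a made-up mask, not the real set of O_* bits, and the three functions are hypothetical names. It contrasts narrowing before validation, masking with a 32-bit unsigned constant, and masking at full width, and it adds a compile-time guard in the spirit of the check the message describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask of "known" flags; the real set of O_* bits differs. */
#define KNOWN_FLAGS_MODEL 0x00ffffffu

/* Compile-time guard in the spirit of the commit: known flags fit in 32 bits. */
_Static_assert((uint64_t)KNOWN_FLAGS_MODEL <= UINT32_MAX,
               "known flags must stay within the 32 bit range");

/* Old behaviour: narrow first, validate afterwards. Bits 32..63 vanish. */
bool flags_ok_truncating(uint64_t how_flags)
{
    uint32_t flags = (uint32_t)how_flags;

    return (flags & ~KNOWN_FLAGS_MODEL) == 0;
}

/* Promotion pitfall: ~mask is computed in 32 bits, then zero-extended. */
bool flags_ok_unsigned_mask_pitfall(uint64_t how_flags)
{
    return (how_flags & ~KNOWN_FLAGS_MODEL) == 0;
}

/* Intended behaviour: widen the mask before inverting it. */
bool flags_ok_full_width(uint64_t how_flags)
{
    return (how_flags & ~(uint64_t)KNOWN_FLAGS_MODEL) == 0;
}
```

With an unknown bit at position 40, the first two checks accept the value and only the full-width check rejects it; an unknown bit at position 31 is rejected by all three.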

This change shouldn't regress old open syscalls since they silently
truncate any unknown values anyway. It is a tiny semantic change for
openat2() but it is very unlikely people are passing > 32 bit unknown flags
and the syscall is relatively new too.

Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-05-28 17:44:37 +02:00
Al Viro ffb37ca3bd switch file_open_root() to struct path
... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:56:43 -04:00
Linus Torvalds 7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystems that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently change ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown(2) changing ownership of large sets of files is
     instantaneous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is common with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD: To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime, has been up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.
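To make the id mapping step concrete, here is a minimal sketch of mapping one id through a single range written as from:to:count, the form used by the `both:1000:1001:1` argument in the example near the end of this message. `map_id_model()` and `UNMAPPED_ID_MODEL` are hypothetical names; the kernel's own helpers and data structures are not reproduced here.

```c
#include <stdint.h>

#define UNMAPPED_ID_MODEL UINT32_MAX   /* hypothetical marker for "no mapping" */

/*
 * Map an id through one range of the form from:to:count.
 * Ids inside [from, from + count) shift by the same offset into the
 * target range; everything else is reported as unmapped.
 */
uint32_t map_id_model(uint32_t from, uint32_t to, uint32_t count, uint32_t id)
{
    if (id >= from && id - from < count)
        return to + (id - from);
    return UNMAPPED_ID_MODEL;
}
```

Under this model a file owned by uid 1000 is presented as uid 1001 through a mount using the range 1000:1001:1, while any id outside the range has no mapping.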

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor referring to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that a port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows creating idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks.

  Here's an example of a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Christian Brauner b8b546a061
open: handle idmapped mounts
For core file operations such as changing directories or chrooting,
determining file access, changing mode or ownership the vfs will verify
that the caller is privileged over the inode. Extend the various helpers
to handle idmapped mounts. If the inode is accessed through an idmapped
mount map it into the mount's user namespace. Afterwards the permissions
checks are identical to non-idmapped mounts. When changing file
ownership we need to map the uid and gid from the mount's user
namespace. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner 643fe55a06
open: handle idmapped mounts in do_truncate()
When truncating files the vfs will verify that the caller is privileged
over the inode. Extend it to handle idmapped mounts. If the inode is
accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the permissions checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner 2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then perform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.
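
The "map first, then run the usual checks" idea can be modelled in a few lines of userspace C. This is a deliberately simplified sketch and not the kernel's generic_permission(): it only distinguishes the owner and "other" classes and ignores groups, ACLs, and capabilities. All names are hypothetical, and a NULL map stands in for the initial user namespace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous idmapping extent; NULL means "not idmapped". */
struct sketch_idmap {
	uint32_t fs_first;  /* first ID as stored by the filesystem */
	uint32_t mnt_first; /* first ID as seen through the mount */
	uint32_t count;
};

/* Map the inode owner into the mount; identity without a map. */
uint32_t sketch_i_uid_into_mnt(const struct sketch_idmap *map, uint32_t i_uid)
{
	if (map == NULL)
		return i_uid;
	if (i_uid < map->fs_first || i_uid - map->fs_first >= map->count)
		return UINT32_MAX; /* no mapping */
	return map->mnt_first + (i_uid - map->fs_first);
}

/*
 * mask uses the rwx bit layout: 4 = read, 2 = write, 1 = execute.
 * Returns 0 if access is granted, -EACCES otherwise.
 */
int sketch_permission(const struct sketch_idmap *map, uint32_t i_uid,
		      unsigned int i_mode, uint32_t fsuid, unsigned int mask)
{
	uint32_t owner = sketch_i_uid_into_mnt(map, i_uid);
	unsigned int bits = i_mode;

	assert((mask & ~7u) == 0);

	/* After the mapping step the check is the same in both cases. */
	if (owner != UINT32_MAX && owner == fsuid)
		bits >>= 6; /* owner class */

	return ((bits & 7u) & mask) == mask ? 0 : -EACCES;
}
```

With the map `{ 1000, 1001, 1 }`, a caller with fsuid 1001 is treated as the owner of a mode 0600 inode stored with uid 1000. With a NULL map the same caller falls into the "other" class and is denied, which is the unchanged non-idmapped behavior.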

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 02f92b3868
fs: add file and path permissions helpers
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Jens Axboe 99668f6180 fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.
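
From userspace this might be used as "try the cached lookup first, then fall back to a blocking open". The sketch below assumes a Linux system. The `sketch_` names and the local struct (a mirror of `struct open_how`) are hypothetical, and the fallback syscall number is an assumption for headers that predate openat2(). For simplicity it falls back on any failure, not only EAGAIN, so it also works on kernels without RESOLVE_CACHED.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumed generic syscall number */
#endif

#define SKETCH_RESOLVE_CACHED 0x20ull

/* Local mirror of struct open_how (hypothetical name). */
struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/*
 * Open `path` using only cached lookup data if possible, otherwise
 * fall back to a regular (possibly blocking) open().
 * `flags` must not include O_CREAT or O_TMPFILE in this sketch.
 * Returns a file descriptor, or -1 with errno set by open().
 */
int sketch_open_cached_first(const char *path, int flags)
{
	struct sketch_open_how how = {
		.flags = (uint64_t)flags,
		.resolve = SKETCH_RESOLVE_CACHED,
	};
	long fd;

	assert(path != NULL);

	fd = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
	if (fd >= 0)
		return (int)fd;

	/* EAGAIN means "not cached"; other errors are retried as well. */
	return open(path, flags);
}
```

An application that cares about latency would typically perform the fallback open() from a worker thread instead of inline.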

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:42:26 -05:00
Linus Torvalds faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Eric W. Biederman 8760c909f5 file: Rename __close_fd to close_fd and remove the files parameter
The function __close_fd was added to support binder[1].  Now that
binder has been fixed to no longer need __close_fd[2] all calls
to __close_fd pass current->files.

Therefore transform the files parameter into a local variable
initialized to current->files, and rename __close_fd to close_fd to
reflect this change, and keep it in sync with the similar changes to
__alloc_fd, and __fd_install.

This removes the need for callers to worry about the extra care that
must be taken if anything other than current->files is passed, by
limiting callers to operating only on current->files.

[1] 483ce1d4b8 ("take descriptor-related part of close() to file.c")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Aleksa Sarai 398840f8bb
openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.
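
The rule this patch introduces can be expressed as a small validation helper. This is a userspace model of the check, not the kernel function itself. The helper name is hypothetical and the flag values are local copies of the uapi RESOLVE_* constants.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copies of the uapi values (renamed to avoid header clashes). */
#define SKETCH_RESOLVE_BENEATH 0x08ull
#define SKETCH_RESOLVE_IN_ROOT 0x10ull

/*
 * Reject a resolve mask that requests both scoping modes at once.
 * Returns 0 if the combination is acceptable, -EINVAL otherwise.
 */
int sketch_check_scoping(uint64_t resolve)
{
	const uint64_t both = SKETCH_RESOLVE_BENEATH | SKETCH_RESOLVE_IN_ROOT;

	if ((resolve & both) == both)
		return -EINVAL;
	return 0;
}
```

Either flag alone passes the check. Only the combination is rejected, for example `assert(sketch_check_scoping(0x08 | 0x10) == -EINVAL);`.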

Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-03 10:15:50 +01:00
Kees Cook 633fb6ac39 exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late.  This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.

Move the test into the other inode type checks (which already look for
other pathological conditions[1]).  Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.

Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
		    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()

[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Linus Torvalds e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Christoph Hellwig eb9d7d390e init: add an init_eaccess helper
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it.  Note that this
theoretically changes behavior as it always is based on the effective
permissions.  But during early init that doesn't make a difference.
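
The closest userspace analogue of an existence check based on effective permissions is faccessat() with AT_EACCESS. This is only an illustration of the distinction, assuming a libc that supports the flag. It is not the in-kernel helper, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Check whether `path` exists, evaluated with the effective IDs
 * instead of the real IDs that plain access(2) would use.
 * Returns 0 if the path exists, or a negative errno value.
 */
int sketch_exists_eaccess(const char *path)
{
	assert(path != NULL);
	if (faccessat(AT_FDCWD, path, F_OK, AT_EACCESS) < 0)
		return -errno;
	return 0;
}
```

For a process whose real and effective IDs are equal, as during early init, both variants give the same answer.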

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig 1097742efc init: add an init_chmod helper
Add a simple helper to chmod with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig b873498f99 init: add an init_chown helper
Add a simple helper to chown with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig 4b7ca5014c init: add an init_chroot helper
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chroot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig db63f1e315 init: add an init_chdir helper
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chdir.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig b25ba7c3c9 fs: remove ksys_fchmod
Fold it into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:01 +02:00
Christoph Hellwig 166e07c37c fs: remove ksys_open
Just open code it in the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:00 +02:00
Christoph Hellwig 9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmod operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00