Commit Graph

632 Commits

Author SHA1 Message Date
Rafael Aquini 3e9ef3d8b9 ipc,namespace: batch free ipc_namespace structures
JIRA: https://issues.redhat.com/browse/RHEL-83456

This patch is a backport of the following upstream commit:
commit da27f796a832122ee533c7685438dad1c4e338dd
Author: Rik van Riel <riel@surriel.com>
Date:   Fri Jan 27 13:46:51 2023 -0500

    ipc,namespace: batch free ipc_namespace structures

    Instead of waiting for an RCU grace period between each ipc_namespace
    structure that is being freed, wait an RCU grace period for every batch
    of ipc_namespace structures.
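
The batching idea can be sketched in user space. This is a minimal single-threaded model, not the kernel code: `struct ns_obj`, `queue_free()`, `free_batch()` and `wait_grace_period()` are hypothetical names, and the grace period is only simulated by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure queued for freeing. */
struct ns_obj {
	struct ns_obj *next;
};

static struct ns_obj *pending;      /* objects waiting to be freed */
static unsigned int grace_periods;  /* number of simulated grace periods */

/* Stand-in for an expensive grace-period wait (assumption: just a counter). */
static void wait_grace_period(void)
{
	grace_periods++;
}

/* Queue an object instead of waiting and freeing it on the spot. */
static void queue_free(struct ns_obj *obj)
{
	assert(obj != NULL);
	obj->next = pending;
	pending = obj;
}

/* Detach the whole batch, wait once, then free every object in it. */
static unsigned int free_batch(void)
{
	struct ns_obj *batch = pending;
	unsigned int freed = 0;

	pending = NULL;
	if (!batch)
		return 0;

	wait_grace_period();  /* one wait covers the entire batch */
	while (batch) {
		struct ns_obj *next = batch->next;

		free(batch);
		batch = next;
		freed++;
	}
	return freed;
}
```

Queuing N objects and then calling `free_batch()` once costs a single simulated wait rather than N, which is the effect the commit message describes.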

    Thanks to Al Viro for the suggestion of the helper function.

    This reduces the run time of the test case that allocates ipc_namespaces
    in a loop from 6 minutes to a little over 1 second:

    real    0m1.192s
    user    0m0.038s
    sys     0m1.152s

    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reported-by: Chris Mason <clm@meta.com>
    Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2025-03-21 11:01:58 -04:00
Ian Kent a4f69f84d0 mnt_idmapping: remove check_fsmapping()
JIRA: https://issues.redhat.com/browse/RHEL-62007
Upstream status: Linus

commit e65a29f0235a438ece414d2d99bbf0d31aa97d04
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Nov 22 13:44:37 2023 +0100

    mnt_idmapping: remove check_fsmapping()

    The helper is a bit pointless. Just open-code the check.

    Link: https://lore.kernel.org/r/20231122-vfs-mnt_idmap-v1-1-dae4abdde5bd@kernel.org
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-11-29 15:53:58 +08:00
Ian Kent 69f3621dc7 fs: move mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 3707d84c13670bf09b4a9a4dc6733326d8344b31
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:33 2023 +0100

    fs: move mnt_idmap

    Now that we converted everything to just rely on struct mnt_idmap, move it all
    into a separate file. This ensures that no code can poke around in struct
    mnt_idmap without any dedicated helpers and makes it easier to extend it in the
    future. Filesystems will now not be able to conflate mount and filesystem
    idmappings as they are two distinct types and require distinct helpers that
    cannot be used interchangeably. We are now also able to extend struct mnt_idmap
    as we see fit.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 14:19:56 +08:00
Ian Kent db8603ce12 fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream has commit bb901646d2 ("ovl: let helper
	ovl_i_path_real() return the realinode") which wasn't present
	upstream when this patch was applied, correct manually.
	CentOS Stream does not have upstream commit c7423dbdbc9ec
	("ima: Handle -ESTALE returned by ima_filter_rule_match()")
	which results in a reject of hunk #3 against
	security/integrity/ima/ima_policy.c, so manually apply hunk.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.
	Update the call to i_uid_into_vfsuid() from 2740f64cb7f00
	("filelocks: use mount idmapping for setlease permission check")
	to pass an idmap instead of a user namespace.
	It looks like Linus made a change to the merge commit 8834147f95056
	("Merge tag 'fscache-rewrite-20220111'") to account for idmap
	changes (probably the ones in this commit), so add the change here.

commit e67fe63341b8117d7e0d9acf0f1222d5138b9266
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:30 2023 +0100

    fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap

    Convert to struct mnt_idmap.
    Remove legacy file_mnt_user_ns() and mnt_user_ns().

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two,
    eliminating the possibility of such bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.
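
A toy illustration of why a dedicated type helps. Everything here is an assumption made for the example: the `_model` types, the offset-based mapping, and both helper names. The real idmapping logic is considerably more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified idmapping: ids are shifted by a fixed offset. */
struct user_ns_model {
	uint32_t offset;
};

/* Distinct wrapper type for the idmapping attached to a mount. */
struct mnt_idmap_model {
	const struct user_ns_model *owner;
};

/* Old style: two arguments of the same type, easy to swap by mistake. */
static uint32_t map_id_two_ns(const struct user_ns_model *mnt_ns,
			      const struct user_ns_model *fs_ns, uint32_t id)
{
	return id - fs_ns->offset + mnt_ns->offset;
}

/*
 * New style: the mount side has its own type, so passing the filesystem
 * namespace in its place is diagnosed by the compiler.
 */
static uint32_t map_id_idmap(const struct mnt_idmap_model *idmap,
			     const struct user_ns_model *fs_ns, uint32_t id)
{
	assert(idmap && idmap->owner && fs_ns);
	return map_id_two_ns(idmap->owner, fs_ns, id);
}
```

With the two-namespace variant, swapping the arguments compiles cleanly and silently yields a wrong id; with the wrapper type the same mistake is an incompatible-pointer diagnostic.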

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 11:02:01 +08:00
Ian Kent f301788abb fs: introduce dedicated idmap type for mounts
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: CentOS Stream commit 89eccbf810 ("fs: hold writers when
	changing mount's idmapping") introduced a whitespace change in
	mnt_allow_writers() which needed to be corrected.

commit 256c8aed2b420a7c57ed6469fbb0f8310f5aeec9
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Oct 26 12:51:27 2022 +0200

    fs: introduce dedicated idmap type for mounts

    Last cycle we've already made the interaction with idmapped mounts more
    robust and type safe by introducing the vfs{g,u}id_t type. This cycle we
    concluded the conversion and removed the legacy helpers.

    Currently we still pass around the plain namespace that was attached to
    a mount. This is in general pretty convenient but it makes it easy to
    conflate filesystem and mount namespaces and what different roles they
    have to play. Especially for filesystem developers without much
    experience in this area this is an easy source for bugs.

    Instead of passing the plain namespace we introduce a dedicated type
    struct mnt_idmap and replace the pointer with a pointer to a struct
    mnt_idmap. There are no semantic or size changes for the mount struct
    caused by this.

    We then start converting all places aware of idmapped mounts to rely on
    struct mnt_idmap. Once the conversion is done all helpers down to the
    really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a
    struct mnt_idmap argument instead of two namespace arguments. This way
    it becomes impossible to conflate the two, eliminating the possibility
    of such bugs. Fwiw, I fixed some issues in that area in ntfs3 and ksmbd
    a while ago. Afterwards, only low-level
    code can ultimately use the associated namespace for any permission
    checks. Even most of the vfs can be ultimately completely oblivious
    about this and filesystems will never interact with it directly in any
    form in the future.

    A struct mnt_idmap currently encompasses a simple refcount and a pointer
    to the relevant namespace the mount is idmapped to. If a mount isn't
    idmapped then it will point to a static nop_mnt_idmap. If it is an
    idmapped mount it will point to a new struct mnt_idmap. As usual there
    are no allocations or anything happening for non-idmapped mounts.
    Everything is carefully written to be a nop for non-idmapped mounts as
    has always been the case.

    If an idmapped mount or mount tree is created a new struct mnt_idmap is
    allocated and a reference taken on the relevant namespace. For each
    mount in a mount tree that gets idmapped or a mount that inherits the
    idmap when it is cloned the reference count on the associated struct
    mnt_idmap is bumped. Just a reminder that we only allow a mount to
    change its idmapping a single time and only if it hasn't already been
    attached to the filesystems and has no active writers.
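
The lifetime rules in the two paragraphs above can be modeled roughly as follows. This is a user-space sketch with plain (non-atomic) counters; the `_model` types and the `idmap_alloc()`, `idmap_get()` and `idmap_put()` names are hypothetical and only mirror the description, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the namespace a mount is idmapped to. */
struct user_ns_model {
	int refs;
};

/* A refcount plus a pointer to the relevant namespace. */
struct mnt_idmap_model {
	struct user_ns_model *owner;
	int count;
};

static struct user_ns_model init_ns_model = { .refs = 1 };

/* Shared by every non-idmapped mount; never allocated, counted or freed. */
static struct mnt_idmap_model nop_idmap_model = {
	.owner = &init_ns_model,
	.count = 1,
};

/* Creating an idmapped mount: allocate and take a namespace reference. */
static struct mnt_idmap_model *idmap_alloc(struct user_ns_model *ns)
{
	struct mnt_idmap_model *idmap = malloc(sizeof(*idmap));

	if (!idmap)
		return NULL;
	ns->refs++;
	idmap->owner = ns;
	idmap->count = 1;
	return idmap;
}

/* Cloning or propagating the idmap: bump the count, nop for the static one. */
static struct mnt_idmap_model *idmap_get(struct mnt_idmap_model *idmap)
{
	if (idmap != &nop_idmap_model)
		idmap->count++;
	return idmap;
}

/* Drop a reference; the last put releases the namespace and the idmap. */
static void idmap_put(struct mnt_idmap_model *idmap)
{
	if (idmap == &nop_idmap_model)
		return;
	assert(idmap->count > 0);
	if (--idmap->count == 0) {
		idmap->owner->refs--;
		free(idmap);
	}
}
```

Non-idmapped mounts only ever see `nop_idmap_model`, so they pay for no allocation and no counter traffic in this model.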

    The actual changes are fairly straightforward. This will have huge
    benefits for maintenance and security in the long run even if it causes
    some churn. I'm aware that there's some cost for all of you. And I'll
    commit to doing this work and make this as painless as I can.

    Note that this also makes it possible to extend struct mnt_idmap in
    the future. For example, it would be possible to place the namespace
    pointer in an anonymous union together with an idmapping struct. This
    would allow us to expose an api to userspace that would let it specify
    idmappings directly instead of having to go through the detour of
    setting up namespaces at all.

    This just adds the infrastructure and doesn't do any conversions.

    Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:48 +08:00
Ian Kent 6bad94a0f1 fs: unset MNT_WRITE_HOLD on failure
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 0014edaedfd804dbf35b009808789325ca615716
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Apr 20 15:19:25 2022 +0200

    fs: unset MNT_WRITE_HOLD on failure

    After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD
    and consequently we always need to pair mnt_hold_writers() with
    mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a
    do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for
    the first mount that was changed. Fix this and make sure that the first mount
    will be cleaned up and add some comments to make it more obvious.
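
The pairing rule can be sketched like this. It is a simplified model over an array rather than a mount tree; `WRITE_HOLD`, `struct mount_model` and the three functions are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WRITE_HOLD 0x1u  /* hypothetical flag, modeled on MNT_WRITE_HOLD */

struct mount_model {
	unsigned int flags;
	bool has_writers;  /* makes hold_writers() fail */
};

/* Always sets the flag, even when it reports failure. */
static int hold_writers(struct mount_model *m)
{
	m->flags |= WRITE_HOLD;
	return m->has_writers ? -1 : 0;
}

static void unhold_writers(struct mount_model *m)
{
	m->flags &= ~WRITE_HOLD;
}

/* Hold writers on mounts[0..n); on failure undo every hold taken so far. */
static int prepare_all(struct mount_model *mounts, size_t n)
{
	size_t i, j;

	assert(mounts != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (hold_writers(&mounts[i]) == 0)
			continue;
		/* Start at index 0 so the first mount is cleaned up too. */
		for (j = 0; j <= i; j++)
			unhold_writers(&mounts[j]);
		return -1;
	}
	return 0;
}
```

The cleanup loop covers the failing mount as well, because `hold_writers()` has already set the flag on it by the time it reports the error.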

    Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com
    Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com
    Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org
    Fixes: e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") [1]
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Reported-by: syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com
    Reported-by: syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:46 +08:00
Ian Kent 7e3549e9f9 mount_setattr(): clean the control flow and calling conventions
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit e257039f0fc7da36ac3a522ef9a5cb4ae7852e67
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Feb 28 23:04:20 2022 -0500

    mount_setattr(): clean the control flow and calling conventions

    separate the "cleanup" and "apply" codepaths (they have almost no overlap),
    fold the "cleanup" into "prepare" (which eliminates the need of ->revert)
    and make loops more idiomatic.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:46 +08:00
Ian Kent 1370862807 fs: clean up mount_setattr control flow
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit 87bb5b60019c60e1f902e6885734cc4e5135c2d9
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Feb 3 14:14:11 2022 +0100

    fs: clean up mount_setattr control flow

    Simplify the control flow of mount_setattr_{prepare,commit} so they
    become easier to follow. We kept using both an integer error variable
    that was passed by pointer as well as a pointer as an indicator for
    whether or not we need to revert or commit the prepared changes.
    Simplify this and just use the pointer. If we successfully changed
    properties the revert pointer will be NULL and if we failed to change
    properties it will indicate where we failed and thus need to stop
    reverting.
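
A rough model of the "pointer as the only indicator" convention. The list layout, the `can_change` field and the `prepare()` and `finish()` names are assumptions for this sketch and do not reproduce the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

struct mount_model {
	struct mount_model *next;
	int can_change;  /* 0 makes the prepare step fail at this mount */
	int prepared;
};

/* Returns NULL on success, otherwise the mount where preparation failed. */
static struct mount_model *prepare(struct mount_model *head)
{
	struct mount_model *m;

	assert(head != NULL);
	for (m = head; m; m = m->next) {
		if (!m->can_change)
			return m;
		m->prepared = 1;
	}
	return NULL;
}

/* Commit when revert is NULL; otherwise undo every mount before revert. */
static int finish(struct mount_model *head, struct mount_model *revert)
{
	struct mount_model *m;

	if (!revert)
		return 0;  /* nothing to undo, changes stay applied */
	for (m = head; m && m != revert; m = m->next)
		m->prepared = 0;
	return -1;
}
```

No separate integer error is threaded through by pointer: the returned pointer says both whether to revert and where to stop.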

    Link: https://lore.kernel.org/r/20220203131411.3093040-8-brauner@kernel.org
    Cc: Seth Forshee <seth.forshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:45 +08:00
Ian Kent 04686b2da2 fs: don't open-code mnt_hold_writers()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit ad1844a0127af8fbb87d3d7019907260daf6466b
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Feb 3 14:14:10 2022 +0100

    fs: don't open-code mnt_hold_writers()

    Replace sb_prepare_remount_readonly()'s open-coded mnt_hold_writers()
    implementation with the real helper we introduced in commit fbdc2f6c40
    ("fs: split out functions to hold writers").

    Link: https://lore.kernel.org/r/20220203131411.3093040-7-brauner@kernel.org
    Cc: Seth Forshee <seth.forshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:44 +08:00
Ian Kent b8896a25cc fs: add mnt_allow_writers() and simplify mount_setattr_prepare()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: Make the back port of upstream 89eccbf810 ("fs: hold
	writers when changing mount's idmapping") look like upstream
	commit a26f788b6e7a ("fs: add mnt_allow_writers() and simplify
	mount_setattr_prepare()"). See CentOS Stream commit 89eccbf810.

commit a26f788b6e7a10d193c4cc6889818e4d625e9461
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Feb 3 14:14:08 2022 +0100

    fs: add mnt_allow_writers() and simplify mount_setattr_prepare()

    Add a tiny helper that lets us simplify the control-flow and can be used
    in the next patch to avoid adding another condition open-coded into
    mount_setattr_prepare(). Instead we can add it into the new helper.
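
What such a predicate helper might look like, as a sketch only. The condition used here (writers must be held only when a read-write mount is being made read-only) is an assumption for illustration, as are the `RDONLY` flag and the `_model` types.

```c
#include <assert.h>
#include <stdbool.h>

#define RDONLY 0x1u  /* hypothetical read-only flag */

struct kattr_model {
	unsigned int attr_set;  /* attributes the request wants to set */
};

struct mount_model {
	unsigned int flags;     /* attributes the mount currently has */
};

/*
 * Writers may continue unless this request turns a read-write mount
 * read-only. Further conditions can later be added here, in one place,
 * instead of being open-coded in the caller.
 */
static bool allow_writers(const struct kattr_model *kattr,
			  const struct mount_model *mnt)
{
	assert(kattr && mnt);
	return !(kattr->attr_set & RDONLY) || (mnt->flags & RDONLY);
}
```

A caller then reduces to `if (!allow_writers(kattr, mnt))` followed by the hold-writers step.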

    Link: https://lore.kernel.org/r/20220203131411.3093040-5-brauner@kernel.org
    Cc: Seth Forshee <seth.forshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:43 +08:00
Chris von Recklinghausen f2ae8afa36 fs: support mapped mounts of mapped filesystems
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd303368b776eead1c29e6cdda82bde7128b82a7
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Fri Dec 3 12:17:07 2021 +0100

    fs: support mapped mounts of mapped filesystems

    In previous patches we added new and modified existing helpers to handle
    idmapped mounts of filesystems mounted with an idmapping. In this final
    patch we convert all relevant places in the vfs to actually pass the
    filesystem's idmapping into these helpers.

    With this the vfs is in shape to handle idmapped mounts of filesystems
    mounted with an idmapping. Note that this is just the generic
    infrastructure. Actually adding support for idmapped mounts to a
    filesystem mountable with an idmapping is follow-up work.

    In this patch we extend the definition of an idmapped mount from a mount
    that has the initial idmapping attached to it to a mount that has
    an idmapping attached to it which is not the same as the idmapping the
    filesystem was mounted with.

    As before we do not allow the initial idmapping to be attached to a
    mount. In addition this patch prevents the idmapping the filesystem
    was mounted with from being attached to a mount created based on this
    filesystem.

    This has multiple reasons and advantages. First, attaching the initial
    idmapping or the filesystem's idmapping doesn't make much sense as in
    both cases the values of the i_{g,u}id and other places where k{g,u}ids
    are used do not change. Second, a user that really wants to do this for
    whatever reason can just create a separate dedicated identical idmapping
    to attach to the mount. Third, we can continue to use the initial
    idmapping as an indicator that a mount is not idmapped allowing us to
    continue to keep passing the initial idmapping into the mapping helpers
    to tell them that something isn't an idmapped mount even if the
    filesystem is mounted with an idmapping.
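
The rules in the last three paragraphs can be condensed into two predicates. This is a sketch by pointer identity only; `initial_ns_model`, `is_idmapped_model()` and `may_attach_model()` are hypothetical names and not the helpers this commit touches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an idmapping, compared by identity only. */
struct user_ns_model {
	int id;
};

static const struct user_ns_model initial_ns_model = { 0 };

/*
 * A mount counts as idmapped only if its idmapping is neither the initial
 * one nor the one the filesystem itself was mounted with.
 */
static bool is_idmapped_model(const struct user_ns_model *mnt_ns,
			      const struct user_ns_model *fs_ns)
{
	assert(mnt_ns && fs_ns);
	return mnt_ns != &initial_ns_model && mnt_ns != fs_ns;
}

/* Attaching either of those two idmappings to a mount is refused. */
static bool may_attach_model(const struct user_ns_model *new_ns,
			     const struct user_ns_model *fs_ns)
{
	return is_idmapped_model(new_ns, fs_ns);
}
```

Because the initial idmapping can never be attached, it stays usable as the "not idmapped" marker even when the filesystem itself is mounted with an idmapping.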

    Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Amir Goldstein <amir73il@gmail.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:33 -04:00
Alex Gladkov 177f45dd8a Revert "Disable idmapped mounts"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2132410
JIRA: https://issues.redhat.com/browse/RHELBU-1999
Upstream Status: RHEL-only

Re-enable idmapped mounts. This technology is too important for our
projects to wait until upstream addresses all the issues.

This reverts commit c93c728360.

Author: Alexey Gladkov <agladkov@redhat.com>
Date:   Fri Nov 12 19:29:51 2021 +0100

    Disable idmapped mounts

    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018141
    Upstream Status: RHEL only

    The commit 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP") added
    idmapped mounts. During the merge, Eric W. Biederman raised concerns [1]
    about the security of the changes, but the discussion did not continue.

    [1] https://lore.kernel.org/all/m18s7481xc.fsf@fess.ebiederm.org/

    Signed-off-by: Alexey Gladkov <agladkov@redhat.com>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-06-28 22:23:04 +02:00
Jan Stancek bb21a71e46 Merge: fs: backport idmapped mounts fixes
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2207

Bugzilla: https://bugzilla.redhat.com/2179877

Merge request contains a backport of fixes related to idmapped mounts. They are required to enable idmapped mounts in RHEL.

Signed-off-by: Alex Gladkov <agladkov@redhat.com>

Approved-by: Adrian Reber <areber@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-04 11:53:03 +02:00
Alex Gladkov 89eccbf810 fs: hold writers when changing mount's idmapping
Bugzilla: https://bugzilla.redhat.com/2179877
Conflicts: Context diff due to missing commits:
           a26f788b6e7a ("fs: add mnt_allow_writers() and simplify mount_setattr_prepare()")
           e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions")

           Functionally, the commit is backported to 5.14 by the author
           of the original changes:

           539f1d4879

commit e1bbcd277a53e08d619ffeec56c5c9287f2bf42f
Author: Christian Brauner <brauner@kernel.org>
Date:   Tue May 10 11:58:40 2022 +0200

    fs: hold writers when changing mount's idmapping

    Hold writers when changing a mount's idmapping to make it more robust.

    The vfs layer takes care to retrieve the idmapping of a mount once
    ensuring that the idmapping used for vfs permission checking is
    identical to the idmapping passed down to the filesystem.

    For ioctl codepaths the filesystem itself is responsible for taking the
    idmapping into account if they need to. While all filesystems with
    FS_ALLOW_IDMAP raised take the same precautions as the vfs we should
    enforce it explicitly by making sure there are no active writers on the
    relevant mount while changing the idmapping.

    This is similar to turning a mount ro with the difference that in
    contrast to turning a mount ro changing the idmapping can only ever be
    done once while a mount can transition between ro and rw as much as it
    wants.
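
The contrast between the two transitions can be modeled as below. The error codes and the `struct mount_model` fields are assumptions chosen for the sketch, not values taken from the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mount_model {
	const void *idmap;  /* NULL while the mount is not idmapped */
	int writers;        /* number of active writers */
	int read_only;
};

/* One-way transition: refused with active writers or if already idmapped. */
static int set_idmap(struct mount_model *m, const void *idmap)
{
	assert(m && idmap);
	if (m->idmap)
		return -EPERM;  /* the idmapping can only be set once */
	if (m->writers > 0)
		return -EBUSY;  /* no active writers allowed during the change */
	m->idmap = idmap;
	return 0;
}

/* Two-way transition: a mount may flip between ro and rw repeatedly. */
static int set_read_only(struct mount_model *m, int ro)
{
	assert(m);
	if (ro && m->writers > 0)
		return -EBUSY;
	m->read_only = ro;
	return 0;
}
```

Both paths share the "no active writers" precondition; only the idmapping change is irreversible.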

    This is a minor user-visible change. But it is extremely unlikely to
    matter. The caller must've created a detached mount via OPEN_TREE_CLONE
    and then handed that O_PATH fd to another process or thread which then
    must've gotten a writable fd for that mount and started creating files
    in there while the caller is still changing mount properties. While not
    impossible it will be an extremely rare corner-case and should in
    general be considered a bug in the application. Consider making a mount
    MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to
    perform lookups or exec'ing in parallel by handing them a copy of the
    OPEN_TREE_CLONE fd or another fd beneath that mount.

    Link: https://lore.kernel.org/r/20220510095840.152264-1-brauner@kernel.org
    Cc: Seth Forshee <seth.forshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-03-27 19:52:11 +02:00
Alex Gladkov 5679dd7668 fs: simplify check in mount_setattr_commit()
Bugzilla: https://bugzilla.redhat.com/2179877

commit 03b6abee9ba67c20c4e5253e1a347d8c26edc511
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Feb 3 14:14:09 2022 +0100

    fs: simplify check in mount_setattr_commit()

    In order to determine whether we need to call mnt_unhold_writers() in
    mount_setattr_commit() we currently do not just check whether
    MNT_WRITE_HOLD is set but also if a read-only mount was requested.

    However, checking whether MNT_WRITE_HOLD is set is enough. Setting
    MNT_WRITE_HOLD requires lock_mount_hash() to be held and it must be
    unset before calling unlock_mount_hash(). This guarantees that if we see
    MNT_WRITE_HOLD we know that we were the ones who set it earlier. We
    don't need to care about why we set it. Plus, leaving this additional
    read-only check in makes the code more confusing because it implies that
    MNT_WRITE_HOLD could've been set by another thread when it really can't.
    Remove it and update the associated comment.

    Link: https://lore.kernel.org/r/20220203131411.3093040-6-brauner@kernel.org
    Cc: Seth Forshee <seth.forshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-03-27 19:28:03 +02:00
Alex Gladkov 1f4d508653 fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
Bugzilla: https://bugzilla.redhat.com/2179877

commit bf1ac16edf6770a92bc75cf2373f1f9feea398a4
Author: Seth Forshee <sforshee@digitalocean.com>
Date:   Tue Aug 16 11:47:52 2022 -0500

    fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts

    Idmapped mounts should not allow a user to map file ownership into a
    range of ids which is not under the control of that user. However, we
    currently don't check whether the mounter is privileged with respect to the
    target user namespace.

    Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus
    this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed
    to set up idmapped mounts. But this could change in the future, so add a
    check to refuse to create idmapped mounts when the mounter does not have
    CAP_SYS_ADMIN in the target user namespace.
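
A deliberately simplified model of the added check. Real capability checks in user namespaces have more rules than this; here a credential is treated as privileged over a target namespace only if it holds the capability in that namespace or in one of its ancestors. All names and the `-EPERM` return value are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace: only the parent link matters here. */
struct user_ns_model {
	const struct user_ns_model *parent;
};

/* Hypothetical credentials: a home namespace and one capability bit. */
struct cred_model {
	const struct user_ns_model *ns;
	bool cap_sys_admin;
};

/* Privileged over target if the capability is held in it or an ancestor. */
static bool capable_in(const struct cred_model *cred,
		       const struct user_ns_model *target)
{
	const struct user_ns_model *ns;

	assert(cred && target);
	if (!cred->cap_sys_admin)
		return false;
	for (ns = target; ns; ns = ns->parent)
		if (ns == cred->ns)
			return true;
	return false;
}

/* Refuse an idmapped mount whose target namespace the mounter cannot control. */
static int check_idmap_request(const struct cred_model *cred,
			       const struct user_ns_model *target)
{
	return capable_in(cred, target) ? 0 : -EPERM;
}
```

Under this model a mounter privileged only inside a child namespace cannot map ownership into a sibling or parent namespace.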

    Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems")
    Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.com
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-03-27 19:02:08 +02:00
Chris von Recklinghausen f6ed5ea18b fs: move namespace sysctls and declare fs base directory
Bugzilla: https://bugzilla.redhat.com/2160210

commit ab171b952c6e065779687b44041038efdadb3915
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Fri Jan 21 22:13:27 2022 -0800

    fs: move namespace sysctls and declare fs base directory

    This moves the namespace sysctls to their own file as part of the
    kernel/sysctl.c spring cleaning.

    Since we have now removed all sysctls for "fs", we now have to declare
    it in the filesystem code. We do that using the new helper, which
    reduces boilerplate code.

    We rename init_fs_shared_sysctls() to init_fs_sysctls() to reflect that
    now fs/sysctls.c is taking on the burden of being the first to register
    the base directory as well.

    Lastly, since init code will load in the order in which we link it we
    have to move the sysctl code to be linked in early, so that its early
    init routine runs prior to other fs code.  This way, other filesystem
    code can register their own sysctls using the helpers after this:

      * register_sysctl_init()
      * register_sysctl()

    Link: https://lkml.kernel.org/r/20211129211943.640266-3-mcgrof@kernel.org
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
    Cc: Antti Palosaari <crope@iki.fi>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Lukas Middendorf <kernel@tuxforce.de>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
    Cc: Stephen Kitt <steve@sk2.org>
    Cc: Xiaoming Ni <nixiaoming@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:32 -04:00
Alex Gladkov 52db2052ee fs/mount_setattr: always cleanup mount_kattr
Bugzilla: https://bugzilla.redhat.com/2179877

commit 012e332286e2bb9f6ac77d195f17e74b2963d663
Author: Christian Brauner <brauner@kernel.org>
Date:   Thu Dec 30 20:23:09 2021 +0100

    fs/mount_setattr: always cleanup mount_kattr

    Make sure that finish_mount_kattr() is called after mount_kattr was
    successfully built in both the success and failure case to prevent
    leaking any references we took when we built it.  We returned early if
    path lookup failed, thereby risking the leak of an additional reference we
    took when building mount_kattr when an idmapped mount was requested.
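
The fix follows the usual "single exit through cleanup" shape, sketched here in user space. `build_kattr()`, `finish_kattr()`, `do_setattr()` and the `path_ok` switch are hypothetical stand-ins; the path lookup is only simulated.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ns_model {
	int refs;
};

struct kattr_model {
	struct ns_model *ns;  /* non-NULL when an idmapped mount was requested */
};

/* Building the attributes takes a reference if a namespace was requested. */
static int build_kattr(struct kattr_model *k, struct ns_model *ns)
{
	assert(k != NULL);
	k->ns = ns;
	if (ns)
		ns->refs++;
	return 0;
}

/* Drops whatever build_kattr() took. */
static void finish_kattr(struct kattr_model *k)
{
	if (k->ns) {
		k->ns->refs--;
		k->ns = NULL;
	}
}

static int do_setattr(struct ns_model *ns, bool path_ok)
{
	struct kattr_model k;
	int err = build_kattr(&k, ns);

	if (err)
		return err;

	/* Stand-in for the path lookup that may fail. */
	err = path_ok ? 0 : -ENOENT;

	/* The attribute change itself would run here when err == 0. */

	finish_kattr(&k);  /* reached on both the success and failure path */
	return err;
}
```

Returning directly after the failed lookup would skip `finish_kattr()` and leave the reference count one too high, which is the leak the commit describes.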

    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP")
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-03-20 17:58:28 +01:00
Jeffrey Layton 4a34ae6171 fs: add is_idmapped_mnt() helper
Bugzilla: http://bugzilla.redhat.com/1229736

commit bb49e9e730c2906a958eee273a7819f401543d6c
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date:   Fri Dec 3 12:16:58 2021 +0100

    fs: add is_idmapped_mnt() helper

    Multiple places open-code the same check to determine whether a given
    mount is idmapped. Introduce a simple helper function that can be used
    instead. This allows us to get rid of the fragile open-coding. We will
    later change the check that is used to determine whether a given mount
    is idmapped. Introducing a helper allows us to do this in a single
    place instead of doing it for multiple places.

    Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
    Cc: Seth Forshee <sforshee@digitalocean.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
2022-08-22 12:31:31 -04:00
Herton R. Krzesinski d635b9c68b Merge: mm: update generic MM code to upstream v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/201
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396
Brew URL: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=41434412
 Testing: KT1-lite + regressions and performance (scheduler, network, and fs)
          benchmarks, as documented on the BZ.

In order to provide support for several future feature requests (virtio-mem,
filesystems, core-kernel and memory management) targeted for RHEL-9,
the patchset is bringing the core-MM codebase up to upstream v5.15.

This patchset is composed of upstream cherry picks that represent the
difference between current RHEL-9 v5.14 code base and upstream v5.15 plus
their relevant follow-up fixes.

Omitted-fix: 15eb7c888e749 ("locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT")
	     already backported into RHEL9 via commit de3eb21475

Omitted-fix: 6341eb6f39bb7 ("drm/i915/selftests: exercise shmem_writeback with THP")
	     dependencies for this selftest follow up (and the follow-up itself)
	     shall be dealt with via DRM update work done by the graphics team.

Omitted-fix: f24b062607678 ("mm/damon: grammar s/works/work/")
Omitted-fix: db7a347b26fe0 ("mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation")
Omitted-fix: d78f3853f831e ("mm/damon/dbgfs: fix missed use of damon_dbgfs_lock")
	     albeit DAMON initial integration is part of v5.15, we're explicitly
             introducing it disabled in this backport. DAMON follow-ups, and future
             enablement will be dealt with via a separated (already filed) BZ ticket.

Omitted-fix: e66435936756d ("mm: fix mismerge of folio page flag manipulators")
	     folio pages are a feature integrated into v5.16, and this merge-fix
	     commit is non-relevant to this particular patchset.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: John W. Linville <linville@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Aristeu Rozanski <arozansk@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>
RH-Acked-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-15 11:00:36 -03:00
Herton R. Krzesinski c4420e91ed Merge: Disable idmapped mounts
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/131
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018141
Upstream Status: RHEL only

Commit 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP") added
idmapped mounts. During the merge, Eric W. Biederman raised concerns [1]
about the security of the changes, but the discussion did not continue.

[1] https://lore.kernel.org/all/m18s7481xc.fsf@fess.ebiederm.org/

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
RH-Acked-by: Ondrej Mosnáček <omosnace@redhat.com>
RH-Acked-by: Alexey Gladkov <agladkov@redhat.com>
RH-Acked-by: ebiederm <>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2021-12-03 10:57:22 -03:00
Rafael Aquini 712e09fcae memcg: enable accounting for new namesapces and struct nsproxy
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 30acd0bdfb86548172168a0cc71d455944de0683
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Thu Sep 2 14:55:27 2021 -0700

    memcg: enable accounting for new namesapces and struct nsproxy

    A container admin can create new namespaces and force the kernel to
    allocate up to several pages of memory for the namespaces and their
    associated structures.

    Net and uts namespaces have enabled accounting for such allocations.  It
    makes sense to account for the remaining ones to restrict the host's memory
    consumption from inside the memcg-limited container.

    Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Acked-by: Serge Hallyn <serge@hallyn.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Yutian Yang <nglaive@gmail.com>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:20 -05:00
Rafael Aquini 94b0c951d4 memcg: enable accounting for mnt_cache entries
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 79f6540ba88dfb383ecf057a3425e668105ca774
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Thu Sep 2 14:55:10 2021 -0700

    memcg: enable accounting for mnt_cache entries

    Patch series "memcg accounting from OpenVZ", v7.

    OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
    Initially we used our own accounting subsystem, then partially committed
    it to upstream, and a few years ago switched to cgroups v1.  Now we're
    rebasing again, revising our old patches and trying to push them upstream.

    We try to protect the host system from any misuse of kernel memory
    allocation triggered by untrusted users inside the containers.

    Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
    list, though I would be very grateful for any comments from maintainers
    of affected subsystems or other people added in cc:

    Compared to the upstream, we additionally account the following kernel objects:
    - network devices and their Tx/Rx queues
    - ipv4/v6 addresses and routing-related objects
    - inet_bind_bucket cache objects
    - VLAN group arrays
    - ipv6/sit: ip_tunnel_prl
    - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
    - nsproxy and namespace objects themselves
    - IPC objects: semaphores, message queues and shared memory segments
    - mounts
    - pollfd and select bits arrays
    - signals and posix timers
    - file lock
    - fasync_struct used by the file lease code and driver's fasync queues
    - tty objects
    - per-mm LDT

    We have incorrect/incomplete/obsolete accounting for a few other kernel
    objects: sk_filter, af_packets, netlink and xt_counters for iptables.
    They require rework and will probably be dropped altogether.

    Also we're going to add an accounting for nft, however it is not ready
    yet.

    We have not tested performance on upstream; however, our performance team
    compared our current RHEL7-based production kernel and reports that it
    is at least no worse than the corresponding original RHEL7 kernel.

    This patch (of 10):

    The kernel allocates ~400 bytes of 'struct mount' for any new mount.
    Creating a new mount namespace clones most of the parent mounts, and this
    can be repeated many times.  Additionally, each mount allocates up to
    PATH_MAX=4096 bytes for mnt->mnt_devname.

    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.

    Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Yutian Yang <nglaive@gmail.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Borislav Petkov <bp@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:18 -05:00
Alexey Gladkov c93c728360 Disable idmapped mounts
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018141
Upstream Status: RHEL only

Commit 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP") added
idmapped mounts. During the merge, Eric W. Biederman raised concerns [1]
about the security of the changes, but the discussion did not continue.

[1] https://lore.kernel.org/all/m18s7481xc.fsf@fess.ebiederm.org/

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
2021-11-15 16:21:18 +01:00
Jeffrey Layton 5b7ac1214f fs: remove mandatory file locking support
Bugzilla: http://bugzilla.redhat.com/2017438

commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
Author: Jeff Layton <jlayton@kernel.org>
Date:   Thu Aug 19 14:56:38 2021 -0400

    fs: remove mandatory file locking support

    We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
    off in Fedora and RHEL8. Several other distros have followed suit.

    I've heard of one problem in all that time: Someone migrated from an
    older distro that supported "-o mand" to one that didn't, and the host
    had a fstab entry with "mand" in it which broke on reboot. They didn't
    actually _use_ mandatory locking so they just removed the mount option
    and moved on.

    This patch rips out mandatory locking support wholesale from the kernel,
    along with the Kconfig option and the Documentation file. It also
    changes the mount code to ignore the "mand" mount option instead of
    erroring out, and to throw a big, ugly warning.

    Signed-off-by: Jeff Layton <jlayton@kernel.org>

Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
2021-11-01 13:56:12 -04:00
Linus Torvalds 15517c724c File locking change for v5.14
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEg5JwTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFfX3D/94AsPqwbnIyLgP2KjoltTKtnSsOZba
 tzTKwrA7vPB9b0aFg0i42omwRtGQk5jPaB8J+tBx+oc/JjO+Dpj5FOXOI9PjFr//
 2zXBuMdvYWaPlsl9nw5dFzhybdPnv+HgNXA8PRqy/DifU7/oLWxHzQ4sGvr1XXNc
 vow5KzLe1K4R+hQBwqKLh/9VVE4+foVTAwXWcFBi/RrDf5X97BF7s98JYsNxfpbn
 EYSyLr818TCUlivAUzx+y0JD+qUknZLuNYWZ2HkYvI29y5h1gUCVmn1orlCDdDq/
 G3SCF8M66l4xGW+bpzrEabqIDl+0IWhX+grU+UVdiMH4YuXmY4HPdNgKYIpJolyy
 zdg8cCO3OAHTJiwkj5jSfxFHQnzPr58LSQGDfrIrNjcDKlQbx3c+8R5yMy7Ar7jZ
 XAM6h88PAuBknTDWBQogZtuSqKbV8D2LsAUVgRKA7iKSwYXXUmZdW+UDiOavsHmg
 n2fbi1sPP7woQKyXzFddwxG3+2Nzby8BE7xuyHTXdOrQJNE1PvICH+WVo2m8+FCe
 uLKLXNf4UECO2MaSyd6k90v3AVty4u1EDTbitgcERztWGzazZdTVlmRdpwy35V+B
 wskKALzjGgqFoSwkEAnXQdk4+Gk9EtnrXD+HYfmxGYKk87fOBCgJtqDWYw3RHK7n
 giBfvji9HK3R8A==
 =xZLO
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull mandatory file locking deprecation warning from Jeff Layton:
 "As discussed on the list, this patch just adds a new warning for folks
  who still have mandatory locking enabled and actually mount with '-o
  mand'. I'd like to get this in for v5.14 so we can push this out into
  stable kernels and hopefully reach folks who have mounts with -o mand.

  For now, I'm operating under the assumption that we'll fully remove
  this support in v5.15, but we can move that out if any legitimate
  users of this facility speak up between now and then"

* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: warn about impending deprecation of mandatory locks
2021-08-21 10:50:22 -07:00
Jeff Layton fdd92b64d1 fs: warn about impending deprecation of mandatory locks
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-21 07:32:45 -04:00
Miklos Szeredi 427215d85e ovl: prevent private clone if bind mount is not allowed
Add the following checks from __do_loopback() to clone_private_mount() as
well:

 - verify that the mount is in the current namespace

 - verify that there are no locked children

Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
Fixes: c771d683a6 ("vfs: introduce clone_private_mount()")
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10 10:21:31 +02:00
Christian Brauner dd8b477f9a mount: Support "nosymfollow" in new mount api
Commit dab741e0e0 ("Add a "nosymfollow" mount option.") added support
for the "nosymfollow" mount option allowing to block following symlinks
when resolving paths. The mount option so far was only available in the
old mount api. Make it available in the new mount api as well. Bonus is
that it can be applied to a whole subtree not just a single mount.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <zwisler@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-06-01 12:09:27 +02:00
Christian Brauner 2ca4dcc490 fs/mount_setattr: tighten permission checks
We currently don't have any filesystems that support idmapped mounts
which are mountable inside a user namespace. That was a deliberate
decision for now as a userns root can just mount the filesystem
themselves. So enforce this restriction explicitly until there's a real
use-case for this. This way we can notice it and will have a chance to
adapt and audit our translation helpers and fstests appropriately if we
need to support such filesystems.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-05-12 14:13:16 +02:00
Randy Dunlap 1f287bc4e2 fs/namespace: correct/improve kernel-doc notation
Fix kernel-doc warnings in fs/namespace.c:

./fs/namespace.c:1379: warning: Function parameter or member 'm' not described in 'may_umount_tree'
./fs/namespace.c:1379: warning: Excess function parameter 'mnt' description in 'may_umount_tree'
./fs/namespace.c:1950: warning: Function parameter or member 'path' not described in 'clone_private_mount'

Also convert path_is_mountpoint() comments to kernel-doc.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Allegedly-acked-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20210318025227.4162-1-rdunlap@infradead.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-03-31 14:22:55 -06:00
Linus Torvalds 5ceabb6078 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff pile - no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  whack-a-mole: don't open-code iminor/imajor
  9p: fix misuse of sscanf() in v9fs_stat2inode()
  audit_alloc_mark(): don't open-code ERR_CAST()
  fs/inode.c: make inode_init_always() initialize i_ino to 0
  vfs: don't unnecessarily clone write access for writable fds
2021-02-27 08:07:12 -08:00
Christian Brauner 9caccd4154 fs: introduce MOUNT_ATTR_IDMAP
Introduce a new bind mount property to allow idmapping mounts. The
MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
together with a file descriptor referring to a user namespace.

The user namespace referenced by the namespace file descriptor will be
attached to the bind mount. All interactions with the filesystem going
through that mount will be mapped according to the mapping specified in
the user namespace attached to it.

Using user namespaces to mark mounts means we can reuse all the existing
infrastructure in the kernel that already exists to handle idmappings
and can also use this for permission checking to allow unprivileged users
to create idmapped mounts in the future.

Idmapping a mount is decoupled from the caller's user and mount
namespace. This means idmapped mounts can be created in the initial
user namespace which is an important use-case for systemd-homed,
portable usb-sticks between systems, sharing data between the initial
user namespace and unprivileged containers, and other use-cases that
have been brought up. For example, assume a home directory where all
files are owned by uid and gid 1000 and the home directory is brought to
a new laptop where the user has id 12345. The system administrator can
simply create a mount of this home directory with a mapping of
1000:12345:1 and other mappings to indicate the ids should be kept.
(With this it is e.g. also possible to create idmapped mounts on the
host with an identity mapping 1:1:100000 where the root user is not
mapped. A user with root access that e.g. has been pivot rooted into
such a mount on the host will not be able to execute, read, write, or
create files as root.)
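
The two mappings mentioned above can be sketched in userspace C. This is a
minimal model only, assuming the "1000:12345:1" notation reads as
from:to:count; struct id_extent, map_id() and ID_UNMAPPED are hypothetical
names for illustration and are not the kernel's idmapping helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type: one "from:to:count" extent. */
struct id_extent {
	uint32_t from;   /* first id on the source side */
	uint32_t to;     /* first id on the target side */
	uint32_t count;  /* number of consecutive ids covered */
};

/* Marker returned when no extent covers the id. */
#define ID_UNMAPPED ((uint32_t)-1)

/* Translate id through the first extent that covers it. */
static uint32_t map_id(const struct id_extent *ext, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= ext[i].from && id - ext[i].from < ext[i].count)
			return ext[i].to + (id - ext[i].from);
	}
	return ID_UNMAPPED;
}
```

With the single extent 1000:12345:1, id 1000 translates to 12345 and every
other id is unmapped. With 1:1:100000, ids 1 through 100000 map to
themselves while id 0 (root) has no mapping.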

Given that mapping a mount is decoupled from the caller's user namespace
a sufficiently privileged process such as a container manager can set up
an idmapped mount for the container and the container can simply pivot
root to it. There's no need for the container to do anything. The mount
will appear correctly mapped independent of the user namespace the
container uses. This means we don't need to mark a mount as idmappable.

In order to create an idmapped mount the caller must currently be
privileged in the user namespace of the superblock the mount belongs to.
Once a mount has been idmapped we don't allow it to change its mapping.
This keeps permission checking and life-cycle management simple. Users
wanting to change the idmapping can always create a new detached mount
with a different idmapping.

Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:45 +01:00
Christian Brauner 2a1867219c fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};

The @attr_set and @attr_clr members are used to set and clear mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
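
The atime rule above can be modelled as a small standalone check. This is a
sketch under assumptions, not the kernel's code: the EX_ constants mirror
the values I understand the uapi header <linux/mount.h> to define, and
atime_request_ok() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed to match the MOUNT_ATTR_* values in <linux/mount.h>; redefined
 * under EX_ names so the sketch builds without kernel headers.
 */
#define EX_MOUNT_ATTR_RDONLY      0x00000001ULL
#define EX_MOUNT_ATTR_NOEXEC      0x00000008ULL
#define EX_MOUNT_ATTR__ATIME      0x00000070ULL /* mask covering the atime enum */
#define EX_MOUNT_ATTR_RELATIME    0x00000000ULL
#define EX_MOUNT_ATTR_NOATIME     0x00000010ULL
#define EX_MOUNT_ATTR_STRICTATIME 0x00000020ULL

/* Return true if the atime parts of a set/clear request are consistent. */
static bool atime_request_ok(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr_atime = attr_clr & EX_MOUNT_ATTR__ATIME;

	/* The atime mask must be cleared entirely or not at all. */
	if (clr_atime != 0 && clr_atime != EX_MOUNT_ATTR__ATIME)
		return false;

	/* A non-zero atime value in attr_set needs the full mask in attr_clr. */
	if ((attr_set & EX_MOUNT_ATTR__ATIME) != 0 && clr_atime == 0)
		return false;

	return true;
}
```

Switching to noatime is therefore expressed as attr_set = NOATIME together
with attr_clr = MOUNT_ATTR__ATIME; passing NOATIME alone in either field is
rejected by this model.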

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
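
Treating propagation as an enum rather than a flag set comes down to
accepting exactly one propagation bit. A minimal sketch, assuming the
MS_* propagation values from the legacy mount() interface (redefined here
under EX_ names); is_single_propagation_type() is a hypothetical helper,
and a zero value meaning "leave propagation unchanged" would be handled
separately by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the legacy MS_* propagation flag values. */
#define EX_MS_UNBINDABLE (1ULL << 17)
#define EX_MS_PRIVATE    (1ULL << 18)
#define EX_MS_SLAVE      (1ULL << 19)
#define EX_MS_SHARED     (1ULL << 20)
#define EX_PROPAGATION_MASK \
	(EX_MS_UNBINDABLE | EX_MS_PRIVATE | EX_MS_SLAVE | EX_MS_SHARED)

/* True only if exactly one known propagation type is requested. */
static bool is_single_propagation_type(uint64_t propagation)
{
	if (propagation == 0 || (propagation & ~EX_PROPAGATION_MASK) != 0)
		return false;
	/* A non-zero value with a single bit set is a power of two. */
	return (propagation & (propagation - 1)) == 0;
}
```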

The @userns_fd field lets users specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.

[1]: commit 2e4b7fcd92 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:42:45 +01:00
Christian Brauner 5b490500f9 fs: add attr_flags_to_mnt_flags helper
Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
which we will use in follow-up patches too.

Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner fbdc2f6c40 fs: split out functions to hold writers
When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
aren't currently any active writers. Split this logic out into simple
helpers that we can use in follow-up patches.

Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner e58ace1a0f namespace: only take read lock in do_reconfigure_mnt()
do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
which seems unnecessary since we're not changing the superblock. We're
only checking whether it is already read-only. Setting other mount
attributes is protected by lock_mount_hash() afaict and not by s_umount.

The history of down_write(&sb->s_umount) lock being taken when setting
mount attributes dates back to the introduction of MNT_READONLY in [2].
This introduced the concept of having read-only mounts in contrast to
just having a read-only superblock. When it got introduced it was simply
plumbed into do_remount() which already took down_write(&sb->s_umount)
because it was only used to actually change the superblock before [2].
Afaict, it would've already been possible back then to only use
down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
options were protected by the vfsmount lock already. But that would've
meant special casing the locking for MS_BIND | MS_REMOUNT in
do_remount() which people might not have considered worth it.
Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
lock was simply copied over.
Now that we have this be a separate helper only take the
down_read(&sb->s_umount) lock since we're only interested in checking
whether the super block is currently read-only and blocking any writers
from changing it. Essentially, checking that the super block is
read-only has the advantage that we can avoid having to go into the
slowpath and through MNT_WRITE_HOLD and can simply set the read-only
flag on the mount in set_mount_attributes().

[1]: commit 43f5e655ef ("vfs: Separate changing mount flags full remount")
[2]: commit 2e4b7fcd92 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner d033cb6784 mount: make {lock,unlock}_mount_hash() static
The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file. Remove them from the header and make them static
to reflect this fact. There's no need to have them callable from other
places right now, as Christoph observed.

Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner 68847c9417 namespace: take lock_mount_hash() directly when changing flags
Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
MNT_READONLY we end up taking the lock, dropping it, and retaking it to
change the other mount attributes. Instead, let's acquire the lock once
when changing the mount attributes. This simplifies the locking in these
codepaths, makes them easier to reason about, and avoids having to
reacquire the lock right after dropping it.

Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner a6435940b6 mount: attach mappings to mounts
In order to support per-mount idmappings vfsmounts are marked with user
namespaces. The idmapping of the user namespace will be used to map the
ids of vfs objects when they are accessed through that mount. By default
all vfsmounts are marked with the initial user namespace. The initial
user namespace is used to indicate that a mount is not idmapped. All
operations behave as before.

Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users to
set up idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
Later patches enforce that once a mount has been idmapped it can't be
remapped. This keeps permission checking and life-cycle management
simple. Users wanting to change the idmapping can always create a new
detached mount with a different idmapping.

Add a new mnt_userns member to vfsmount and two simple helpers to
retrieve the mnt_userns from vfsmounts and files.

The idea to attach user namespaces to vfsmounts has been floated around
in various forms at Linux Plumbers in ~2018 with the original idea
tracing back to a discussion in 2017 at a conference in St. Petersburg
between Christoph, Tycho, and myself.

Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:15 +01:00
Al Viro a0a6df9afc umount(2): move the flag validity checks first
Unfortunately, there's userland code that used to rely upon these
checks being done before anything else to check for UMOUNT_NOFOLLOW
support.  That broke in 41525f56e2 ("fs: refactor ksys_umount").
Separate those from the rest of checks and move them to ksys_umount();
unlike everything else in there, this can be sanely done there.

Reported-by: Sargun Dhillon <sargun@sargun.me>
Fixes: 41525f56e2 ("fs: refactor ksys_umount")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 15:31:58 -05:00
Eric Biggers 14e43bf435 vfs: don't unnecessarily clone write access for writable fds
There's no need for mnt_want_write_file() to increment mnt_writers when
the file is already open for writing, provided that
mnt_drop_write_file() is changed to conditionally decrement it.

We seem to have ended up in the current situation because
mnt_want_write_file() used to be paired with mnt_drop_write(), due to
mnt_drop_write_file() not having been added yet.  So originally
mnt_want_write_file() had to always increment mnt_writers.

But later mnt_drop_write_file() was added, and all callers of
mnt_want_write_file() were paired with it.  This makes the compatibility
between mnt_want_write_file() and mnt_drop_write() no longer necessary.

Therefore, make __mnt_want_write_file() and __mnt_drop_write_file() skip
incrementing mnt_writers on files already open for writing.  This
removes the only caller of mnt_clone_write(), so remove that too.
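
The counting rule described above can be illustrated with a small userspace
model. It is a sketch only: the model_* names are hypothetical, the
open_for_write field stands in for the kernel's notion of a file already
holding write access, and the real mnt_want_write_file() can also fail,
which this model ignores.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for a mount and an open file. */
struct model_mount {
	int writers;            /* models mnt_writers */
};

struct model_file {
	struct model_mount *mnt;
	bool open_for_write;    /* file already holds write access */
};

/* Opening for write takes the one writer reference the file keeps. */
static void model_open(struct model_file *f, struct model_mount *m, bool write)
{
	f->mnt = m;
	f->open_for_write = write;
	if (write)
		m->writers++;
}

/* Only files not already open for writing need an extra reference. */
static void model_want_write_file(struct model_file *f)
{
	if (!f->open_for_write)
		f->mnt->writers++;
}

/* Mirror of the above: drop only what model_want_write_file() took. */
static void model_drop_write_file(struct model_file *f)
{
	if (!f->open_for_write)
		f->mnt->writers--;
}
```

The two helpers have to stay symmetric: if the drop side decremented
unconditionally, a writable file would lose the reference it took at open
time.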

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 14:02:08 -05:00
Linus Torvalds 7bb5226c8a Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted patches from previous cycle(s)..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix hostfs_open() use of ->f_path.dentry
  Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode
  fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set
  fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
  fs/namespace.c: WARN if mnt_count has become negative
2020-12-25 10:54:29 -08:00
Linus Torvalds f9b4240b07 fixes-v5.11
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
 ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
 GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
 =w/cO
 -----END PGP SIGNATURE-----

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:
 "This contains several fixes which felt worth being combined into a
  single branch:

   - Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

   - Kirill's work to unify lifecycle management for all namespaces. The
     lifetime counters are used identically for all namespace types.
     Namespaces may of course have additional unrelated counters and
     these are not altered. This work allows us to unify the type of the
     counters and reduces maintenance cost by moving the counter in one
     place and indicating that basic lifetime management is identical
     for all namespaces.

   - Peilin's fix adding three byte padding to Dmitry's
     PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

   - Two small patches to convert from the /* fall through */ comment
     annotation to the fallthrough keyword annotation which I had taken
     into my branch and into -next before df561f6688 ("treewide: Use
     fallthrough pseudo-keyword") made it upstream which fixed this
     tree-wide.

     Since I didn't want to invalidate all testing for other commits I
     didn't rebase and kept them"
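
The shared lifetime counter described above can be pictured with a small
userspace sketch. This is not the kernel code: ns_common_sketch, fake_uts_ns,
fake_ipc_ns and the ns_init/ns_get/ns_put helpers are hypothetical names, and
C11 atomics stand in for whatever counter type the kernel uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical common part embedded in every namespace type. */
struct ns_common_sketch {
    atomic_int count;               /* basic lifetime counter, same for all types */
};

/* Two hypothetical namespace types sharing the same embedded counter. */
struct fake_uts_ns {
    struct ns_common_sketch ns;
    char name[16];
};

struct fake_ipc_ns {
    struct ns_common_sketch ns;
    int extra_counter;              /* unrelated counter, left untouched */
};

/* A new namespace starts with one reference held by its creator. */
void ns_init(struct ns_common_sketch *ns)
{
    atomic_init(&ns->count, 1);
}

/* Take an additional reference. */
void ns_get(struct ns_common_sketch *ns)
{
    atomic_fetch_add(&ns->count, 1);
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool ns_put(struct ns_common_sketch *ns)
{
    return atomic_fetch_sub(&ns->count, 1) == 1;
}
```

Because the helpers take only the embedded common part, the same get/put
code serves every namespace type, which is the maintenance saving the pull
request text refers to.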

* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  nsproxy: use put_nsproxy() in switch_task_namespaces()
  sys: Convert to the new fallthrough notation
  signal: Convert to the new fallthrough notation
  time: Use generic ns_common::count
  cgroup: Use generic ns_common::count
  mnt: Use generic ns_common::count
  user: Use generic ns_common::count
  pid: Use generic ns_common::count
  ipc: Use generic ns_common::count
  uts: Use generic ns_common::count
  net: Use generic ns_common::count
  ns: Add a common refcount into ns_common
  ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
2020-12-14 16:40:27 -08:00
Eric Biggers edf7ddbf1c fs/namespace.c: WARN if mnt_count has become negative
Missing calls to mntget() (or equivalently, too many calls to mntput())
are hard to detect because mntput() delays freeing mounts using
task_work_add(), then again using call_rcu().  As a result, mnt_count
can often be decremented to -1 without getting a KASAN use-after-free
report.  Such cases are still bugs though, and they point to real
use-after-frees being possible.

For an example of this, see the bug fixed by commit 1b0b9cc8d3
("vfs: fsmount: add missing mntget()"), discussed at
https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
This bug *should* have been trivial to find.  But actually, it wasn't
found until syzkaller happened to use fchdir() to manipulate the
reference count just right for the bug to be noticeable.

Address this by making mntput_no_expire() issue a WARN if mnt_count has
become negative.
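
As a rough illustration of the idea, the following userspace sketch models a
reference count that warns when it drops below zero. It is a simplified
model, not the code in fs/namespace.c: fake_mount, fake_mntget and
fake_mntput are hypothetical names, and a message on stderr stands in for
the kernel's WARN.

```c
#include <stdio.h>

/* Hypothetical userspace stand-in for a mount and its reference count. */
struct fake_mount {
    int count;      /* models mnt_count; goes negative on unbalanced puts */
    int warned;     /* set once a negative count has been observed */
};

/* Take a reference. */
void fake_mntget(struct fake_mount *m)
{
    m->count++;
}

/*
 * Drop a reference. Returns 1 if this was the last reference and the
 * object should be released, 0 otherwise. A negative count means a get
 * was missing somewhere, so report it instead of staying silent.
 */
int fake_mntput(struct fake_mount *m)
{
    m->count--;
    if (m->count < 0) {
        fprintf(stderr, "WARNING: count became negative (%d)\n", m->count);
        m->warned = 1;
        return 0;
    }
    return m->count == 0;
}
```

With a balanced get/put sequence the warning never fires; one extra
fake_mntput() after the count has reached zero triggers it immediately,
without depending on the freed memory being reused first.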

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Linus Torvalds 0eac1102e9 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place (the largest group here is
  Christoph's stat cleanups)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove KSTAT_QUERY_FLAGS
  fs: remove vfs_stat_set_lookup_flags
  fs: move vfs_fstatat out of line
  fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
  fs: remove vfs_statx_fd
  fs: omfs: use kmemdup() rather than kmalloc+memcpy
  [PATCH] reduce boilerplate in fsid handling
  fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
  selftests: mount: add nosymfollow tests
  Add a "nosymfollow" mount option.
2020-10-24 12:26:05 -07:00
Jens Axboe 91989c7078 task_work: cleanup notification modes
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (no notification sent) or 1, the latter of
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
  notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
  that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
  notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.
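
A minimal sketch of what such an enum and the conversion of an old boolean
call site might look like, using the values given above. The enum tag and
the helper notify_mode_from_legacy_bool() are illustrative assumptions, not
quotations from the patch.

```c
#include <stdbool.h>

/* Notification modes with the values stated in the commit message. */
enum task_work_notify_mode {
    TWA_NONE = 0,    /* no notification requested */
    TWA_RESUME = 1,  /* notify via TIF_NOTIFY_RESUME */
    TWA_SIGNAL = 2,  /* notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
};

/*
 * Hypothetical helper showing how a legacy true/false argument maps onto
 * the named modes: false was 0 (TWA_NONE) and true was 1 (TWA_RESUME).
 */
enum task_work_notify_mode notify_mode_from_legacy_bool(bool notify)
{
    return notify ? TWA_RESUME : TWA_NONE;
}
```

A caller that used to pass true would now pass TWA_RESUME directly, which
makes the intent visible at the call site.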

Fixes: e91b481623 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 15:05:30 -06:00
Linus Torvalds 22230cd2c5 Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat mount cleanups from Al Viro:
 "The last remnants of mount(2) compat buried by Christoph.

  Buried into NFS, that is.

  Generally I'm less enthusiastic about "let's use in_compat_syscall()
  deep in call chain" kind of approach than Christoph seems to be, but
  in this case it's warranted - that had been an NFS-specific wart,
  hopefully not to be repeated in any other filesystems (read: any new
  filesystem introducing non-text mount options will get NAKed even if
  it doesn't mess the layout up).

  IOW, not worth trying to grow an infrastructure that would avoid that
  use of in_compat_syscall()..."

* 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove compat_sys_mount
  fs,nfs: lift compat nfs4 mount data handling into the nfs code
  nfs: simplify nfs4_parse_monolithic
2020-10-12 16:44:57 -07:00
Christoph Hellwig 028abd9222 fs: remove compat_sys_mount
compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-22 23:45:57 -04:00
Catalin Marinas d563d678aa fs: Handle intra-page faults in copy_mount_options()
The copy_mount_options() function takes a user pointer argument but no
size and it tries to read up to a PAGE_SIZE. However, copy_from_user()
is not guaranteed to return all the accessible bytes if, for example,
the access crosses a page boundary and gets a fault on the second page.
To work around this, the current copy_mount_options() implementation
performs two copy_from_user() passes, first to the end of the current
page and the second to what's left in the subsequent page.

On arm64 with MTE enabled, access to a user page may trigger a fault
after part of the buffer in a page has been copied (when the user
pointer tag, bits 56-59, no longer matches the allocation tag stored in
memory). Allow copy_mount_options() to handle such intra-page faults by
resorting to a byte-at-a-time copy in case of copy_from_user() failure.

Note that copy_from_user() handles the zeroing of the kernel buffer in
case of error.
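
The fallback can be sketched in userspace as follows. Everything here is a
model under stated assumptions: fake_user_buf, fake_copy_from_user(),
fake_get_user() and copy_options_sketch() are hypothetical stand-ins, the
bulk copy is modelled as all-or-nothing, and FAKE_PAGE_SIZE is an arbitrary
small value rather than the real PAGE_SIZE.

```c
#include <stddef.h>
#include <string.h>

#define FAKE_PAGE_SIZE 16

/* Hypothetical "user memory": only the first `accessible` bytes are readable. */
struct fake_user_buf {
    const unsigned char *data;
    size_t accessible;
};

/*
 * Model of a bulk copy that may give up early. Returns the number of bytes
 * NOT copied and zeroes the destination on failure.
 */
size_t fake_copy_from_user(unsigned char *dst, const struct fake_user_buf *src,
                           size_t n)
{
    if (n <= src->accessible) {
        memcpy(dst, src->data, n);
        return 0;
    }
    memset(dst, 0, n);
    return n;
}

/* Model of a single-byte read: 0 on success, -1 on a fault. */
int fake_get_user(unsigned char *c, const struct fake_user_buf *src, size_t off)
{
    if (off >= src->accessible)
        return -1;
    *c = src->data[off];
    return 0;
}

/* Try the bulk copy first; on failure, copy byte by byte up to the fault. */
size_t copy_options_sketch(unsigned char *dst, const struct fake_user_buf *src)
{
    size_t i;

    if (fake_copy_from_user(dst, src, FAKE_PAGE_SIZE) == 0)
        return FAKE_PAGE_SIZE;

    for (i = 0; i < FAKE_PAGE_SIZE; i++) {
        unsigned char c;

        if (fake_get_user(&c, src, i) != 0)
            break;          /* first inaccessible byte: stop here */
        dst[i] = c;
    }
    return i;               /* bytes actually copied; the rest stay zeroed */
}
```

Because the failed bulk copy already zeroed the destination, the bytes
after the fault point remain zero without any extra clearing step.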

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2020-09-04 12:46:07 +01:00