Commit Graph

655 Commits

Author SHA1 Message Date
Patrick Talbert 21a6558f73 Merge: CVE-2024-43882: exec: Fix ToCToU between perm check and set-uid/gid usage
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6200

JIRA: https://issues.redhat.com/browse/RHEL-55562
CVE: CVE-2024-43882

```
exec: Fix ToCToU between perm check and set-uid/gid usage

When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug  7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug  7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches bprm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.

Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
(cherry picked from commit f50733b45d865f91db90919f8311e2127ce5a0cb)
```

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>

---

<small>Created 2025-01-17 01:04 UTC by backporter - [KWF FAQ](https://red.ht/kernel_workflow_doc) - [Slack #team-kernel-workflow](https://redhat-internal.slack.com/archives/C04LRUPMJQ5) - [Source](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/webhook/utils/backporter.py) - [Documentation](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/docs/README.backporter.md) - [Report an issue](https://gitlab.com/cki-project/kernel-workflow/-/issues/new?issue%5Btitle%5D=backporter%20webhook%20issue)</small>

Approved-by: Pavel Reichl <preichl@redhat.com>
Approved-by: Carlos Maiolino <cmaiolino@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2025-02-03 10:00:43 -05:00
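
The race described in the CVE-2024-43882 entry above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `Inode`, `may_exec()` and `effective_uid()` are hypothetical names, the permission check ignores ACLs and capabilities, gid 24 is an arbitrary stand-in for "cdrom", and none of this is the kernel's code. It only shows why re-checking execute permission at the point where the set-uid bit is consumed closes the window.

```python
import stat
from dataclasses import dataclass


@dataclass
class Inode:
    """Hypothetical stand-in for the inode metadata exec looks at."""
    mode: int
    uid: int
    gid: int


def may_exec(inode, uid, groups):
    """Plain owner/group/other execute check (no ACLs, no capabilities)."""
    if uid == inode.uid:
        return bool(inode.mode & stat.S_IXUSR)
    if inode.gid in groups:
        return bool(inode.mode & stat.S_IXGRP)
    return bool(inode.mode & stat.S_IXOTH)


def effective_uid(inode, uid, groups, recheck):
    """Return the uid the new program would run as in this model."""
    if not inode.mode & stat.S_ISUID:
        return uid
    if recheck and not may_exec(inode, uid, groups):
        return uid  # exec bit vanished since the open: no elevation
    return inode.uid


if __name__ == "__main__":
    caller_uid, caller_groups = 1000, {1000}   # not a member of gid 24
    target = Inode(mode=0o755, uid=0, gid=24)  # -rwxr-xr-x root cdrom
    assert may_exec(target, caller_uid, caller_groups)  # open-time check passes
    target.mode = 0o4754                       # chmod o-x,u+s wins the race
    print(effective_uid(target, caller_uid, caller_groups, recheck=False))
    print(effective_uid(target, caller_uid, caller_groups, recheck=True))
```

Run as a script, the model prints 0 (root) without the re-check and 1000 with it. In this model a failed re-check simply skips the elevation; the patch itself is the reference for what the kernel does.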
Patrick Talbert 143d5ac2a9 Merge: CVE-2024-50271: ucounts: Split rlimit and ucount values and max values
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6027

JIRA: https://issues.redhat.com/browse/RHEL-68020

CVE: CVE-2024-50271

- 012f4d5d25e9ef92ee129bd5aa7aa60f692681e1 signal: restore the override_rlimit logic
- de399236e240743ad2dd10d719c37b97ddf31996 ucounts: Split rlimit and ucount values and max values

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2025-02-03 10:00:41 -05:00
CKI Backport Bot 0eec6bb2a5 exec: Fix ToCToU between perm check and set-uid/gid usage
JIRA: https://issues.redhat.com/browse/RHEL-55562
CVE: CVE-2024-43882

commit f50733b45d865f91db90919f8311e2127ce5a0cb
Author: Kees Cook <kees@kernel.org>
Date:   Thu Aug 8 11:39:08 2024 -0700

    exec: Fix ToCToU between perm check and set-uid/gid usage

    When opening a file for exec via do_filp_open(), permission checking is
    done against the file's metadata at that moment, and on success, a file
    pointer is passed back. Much later in the execve() code path, the file
    metadata (specifically mode, uid, and gid) is used to determine if/how
    to set the uid and gid. However, those values may have changed since the
    permissions check, meaning the execution may gain unintended privileges.

    For example, if a file could change permissions from executable and not
    set-id:

    ---------x 1 root root 16048 Aug  7 13:16 target

    to set-id and non-executable:

    ---S------ 1 root root 16048 Aug  7 13:16 target

    it is possible to gain root privileges when execution should have been
    disallowed.

    While this race condition is rare in real-world scenarios, it has been
    observed (and proven exploitable) when package managers are updating
    the setuid bits of installed programs. Such files start with being
    world-executable but then are adjusted to be group-exec with a set-uid
    bit. For example, "chmod o-x,u+s target" makes "target" executable only
    by uid "root" and gid "cdrom", while also becoming setuid-root:

    -rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

    becomes:

    -rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

    But racing the chmod means users without group "cdrom" membership can
    get the permission to execute "target" just before the chmod, and when
    the chmod finishes, the exec reaches bprm_fill_uid(), and performs the
    setuid to root, violating the expressed authorization of "only cdrom
    group members can setuid to root".

    Re-check that we still have execute permissions in case the metadata
    has changed. It would be better to keep a copy from the perm-check time,
    but until we can do that refactoring, the least-bad option is to do a
    full inode_permission() call (under inode lock). It is understood that
    this is safe against dead-locks, but hardly optimal.

    Reported-by: Marco Vanotti <mvanotti@google.com>
    Tested-by: Marco Vanotti <mvanotti@google.com>
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: stable@vger.kernel.org
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Kees Cook <kees@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2025-01-17 01:04:29 +00:00
Radostin Stoyanov 46364cd74c ucounts: Split rlimit and ucount values and max values
JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271

commit de399236e240743ad2dd10d719c37b97ddf31996
Author: Alexey Gladkov <legion@kernel.org>
Date:   Wed May 18 19:17:30 2022 +0200

    ucounts: Split rlimit and ucount values and max values

    Since the semantics of maximum rlimit values are different, it would be
    better not to mix ucount and rlimit values. This will prevent the error
    of using inc_ucount/dec_ucount for rlimit parameters.

    This patch also renames the functions to emphasize the lack of
    connection between rlimit and ucount.

    v3:
    - Fix BUG:KASAN:use-after-free_in_dec_ucount.

    v2:
    - Fix the array-index-out-of-bounds that was found by the lkp project.

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-12-20 15:31:08 +00:00
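
The idea behind the ucounts split above can be sketched as two separate families of counters that cannot be indexed with each other's keys. Everything here is an illustrative assumption: the enum members, the `Ucounts` class and its method names are hypothetical, and the modeled difference in semantics (a ucount increment is refused at its maximum, while an rlimit increment always counts and leaves the comparison to the caller) is a simplification rather than the kernel's exact behavior.

```python
from enum import Enum


class UcountType(Enum):
    """Hypothetical per-user-namespace object counters."""
    USER_NAMESPACES = 0
    PID_NAMESPACES = 1


class RlimitType(Enum):
    """Hypothetical rlimit-backed counters."""
    NPROC = 0
    MSGQUEUE = 1


class Ucounts:
    def __init__(self, ucount_max, rlimit_max):
        # Two independent value tables and two independent max tables.
        self.ucount = {t: 0 for t in UcountType}
        self.rlimit = {t: 0 for t in RlimitType}
        self.ucount_max = dict(ucount_max)
        self.rlimit_max = dict(rlimit_max)

    def inc_ucount(self, kind):
        """Refuse the increment once the maximum is reached."""
        if not isinstance(kind, UcountType):
            raise TypeError("inc_ucount() only accepts UcountType")
        if self.ucount[kind] >= self.ucount_max[kind]:
            return False
        self.ucount[kind] += 1
        return True

    def inc_rlimit(self, kind, amount=1):
        """Always count; return the new total for the caller to judge."""
        if not isinstance(kind, RlimitType):
            raise TypeError("inc_rlimit() only accepts RlimitType")
        self.rlimit[kind] += amount
        return self.rlimit[kind]

    def over_rlimit(self, kind):
        return self.rlimit[kind] > self.rlimit_max[kind]
```

With separate key types, passing an rlimit key to the ucount helper fails immediately instead of silently touching the wrong slot, which is the class of mistake the commit message says the split is meant to prevent.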
Rafael Aquini c428cfb451 mm/ksm: fix ksm exec support for prctl
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 3a9e567ca45fb5280065283d10d9a11f0db61d2b
Author: Jinjiang Tu <tujinjiang@huawei.com>
Date:   Thu Mar 28 19:10:08 2024 +0800

    mm/ksm: fix ksm exec support for prctl

    Patch series "mm/ksm: fix ksm exec support for prctl", v4.

    commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
    MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
    create the mm_slot, so ksmd will not try to scan this task.  The first
    patch fixes the issue.

    The second patch refactors to prepare for the third patch.  The third
    patch extends the selftests of ksm to verify the deduplication really
    happens after fork/exec inherits the KSM setting.

    This patch (of 3):

    commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits
    MMF_VM_MERGE_ANY flag when a task calls execve().  However, it doesn't
    create the mm_slot, so ksmd will not try to scan this task.

    To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init()
    when the mm has MMF_VM_MERGE_ANY flag.

    Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com
    Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com
    Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl")
    Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Stefan Roesch <shr@devkernel.io>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:55 -05:00
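
The KSM bug fixed above can be pictured with a toy model: the new mm inherits the merge flag across exec, but the scanner only visits mms that were registered in its list. The names `Mm`, `Ksm` and `exec_new_mm()` and the flag's bit value are hypothetical; the list stands in for the one headed by `ksm_mm_head`, and the sketch makes no claim about the kernel's data structures.

```python
MMF_VM_MERGE_ANY = 1 << 0  # illustrative bit value, not the kernel's


class Mm:
    def __init__(self, flags=0):
        self.flags = flags


class Ksm:
    def __init__(self):
        self.mm_slots = []  # stand-in for the list headed by ksm_mm_head

    def enter(self, mm):
        """Register an mm so the scanner will visit it."""
        if mm not in self.mm_slots:
            self.mm_slots.append(mm)

    def scan_candidates(self):
        """The scanner only ever sees registered mms."""
        return list(self.mm_slots)


def exec_new_mm(old_mm, ksm, fixed):
    """Model exec building a fresh mm that inherits the merge flag."""
    new_mm = Mm(flags=old_mm.flags & MMF_VM_MERGE_ANY)
    if fixed and new_mm.flags & MMF_VM_MERGE_ANY:
        ksm.enter(new_mm)  # the step the fix adds
    return new_mm


if __name__ == "__main__":
    ksm = Ksm()
    parent = Mm(MMF_VM_MERGE_ANY)
    before_fix = exec_new_mm(parent, ksm, fixed=False)
    after_fix = exec_new_mm(parent, ksm, fixed=True)
    print(before_fix in ksm.scan_candidates())  # flag set, never scanned
    print(after_fix in ksm.scan_candidates())
```

The first mm carries the flag yet is invisible to the scanner, which matches the symptom in the commit message; the second is registered at creation time.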
Rado Vrbovsky 993b335734 Merge: Update arch/{x86,powerpc,arm64}/mm to v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5391

JIRA: https://issues.redhat.com/browse/RHEL-55461  
JIRA: https://issues.redhat.com/browse/RHEL-55465  
JIRA: https://issues.redhat.com/browse/RHEL-55462  
Depends: !5252 

Updated the respective arch mm directories to v6.6. Most of the patches  
have already been updated or included by the respective arch teams and by  
Rafael's mm update to v6.6.   
  
Dropped the following to avoid issues with the ppc64le build:  
41b7a347bf14 powerpc: Book3S 64-bit outline-only KASAN support  
c7b9ed7c34a9 powerpc/64e: KASAN Full support for BOOK3E/64  

Omitted-fix: 7bd6680b47fa Revert "Revert "arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()""
Omitted-fix: 7b59e8ae92fe arm64: dts: qcom: sc7280: Mark SCM as dma-coherent for chrome devices
Omitted-fix: a54b7fa6b9ab arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for trogdor
Omitted-fix: 9a5f0b11e49e arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for IDP
Omitted-fix: cd87d9f58439 x86/mm: further clarify switch_mm_irqs_off() documentation
  
Signed-off-by: Audra Mitchell <audra@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Nico Pache <npache@redhat.com>
Approved-by: Lenny Szubowicz <lszubowi@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-12 08:02:20 +00:00
Rado Vrbovsky c154c6dc53 Merge: fs: backport mnt_idmap type
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4324

JIRA: https://issues.redhat.com/browse/RHEL-33888

This MR backports idmapping changes to sync our RHEL-9 kernel with the
upstream kernel at version 6.3.

Our current kernel has idmapped mounts support, but there have been many
changes since this initial implementation in the base kernel. In
particular we need the type safety changes, and we have seen difficulty
backporting other requested changes on more than one occasion.

The Jira this MR has been raised for is another example of such a request.

It is needed for a back port of a BPF feature to RHEL 9 which allows BPF
programs to do file verification with LSM and fsverity. To satisfy this
request changes made in the upstream 6.3 kernel are needed which is the
reason we have chosen upstream 6.3 as the target release for the MR.

The first fix has been omitted because it appears to be the same as
24b5308cf5ee ("selftests/filesystems: grant executable permission to
run_fat_tests.sh"). In any case the requirement is to make the path
tools/testing/selftests/filesystems/fat/run_fat_tests.sh executable which
is done.

The second and third Omitted patches are a straight apply and revert leaving
the source unchanged.

Omitted-Fix: 1d4beeb4edc7 ("selftests/filesystems: grant executable permission to run_fat_tests.sh")

Omitted-Fix: 4a47c6385bb4 ovl: turn of SB_POSIXACL with idmapped layers temporarily

Omitted-Fix: 7c4d37c269ac Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily"

Signed-off-by: Ian Kent <ikent@redhat.com>

Approved-by: Scott Mayhew <smayhew@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-11 08:26:30 +00:00
Audra Mitchell 8947f5b14c lazy tlb: introduce lazy tlb mm refcount helper functions
JIRA: https://issues.redhat.com/browse/RHEL-55462

This patch is a backport of the following upstream commit:
commit aa464ba9a1e444d5ef95bb63ee3b2ef26fc96ed7
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Fri Feb 3 17:18:34 2023 +1000

    lazy tlb: introduce lazy tlb mm refcount helper functions

    Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
    This makes the lazy tlb mm references more obvious, and allows the
    refcounting scheme to be modified in later changes.  There is no
    functional change with this patch.

    Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-11-04 09:14:17 -05:00
Ian Kent db8603ce12 fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream has commit bb901646d2 ("ovl: let helper
	ovl_i_path_real() return the realinode") which wasn't present
	upstream when this patch was applied, correct manually.
	CentOS Stream does not have upstream commit c7423dbdbc9ec
	("ima: Handle -ESTALE returned by ima_filter_rule_match()")
	which results in a reject of hunk #3 against
	security/integrity/ima/ima_policy.c, so manually apply hunk.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.
	Update the call to i_uid_into_vfsuid() from 2740f64cb7f00
	("filelocks: use mount idmapping for setlease permission check")
	to pass an idmap instead of a user namespace.
	It looks like Linus made a change to the merge request "Merge tag
	8834147f95056 ("fscache-rewrite-20220111") to account for idmap
	changes (probably the ones in this commit), so add the change here.

commit e67fe63341b8117d7e0d9acf0f1222d5138b9266
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:30 2023 +0100

    fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap

    Convert to struct mnt_idmap.
    Remove legacy file_mnt_user_ns() and mnt_user_ns().

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 11:02:01 +08:00
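
The type-safety argument in the mnt_idmap commits can be sketched in a few lines. This is an illustrative model only: `UserNamespace`, `MntIdmap` and `map_uid_for_mount()` are hypothetical names, and the "mapping" is reduced to a constant offset, which is far simpler than real idmappings. The point is that a helper taking one dedicated mount-idmap object leaves no second namespace argument to mix up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserNamespace:
    """Hypothetical namespace: ids are shifted by a constant offset."""
    offset: int


@dataclass(frozen=True)
class MntIdmap:
    """Dedicated wrapper for the idmapping attached to a mount."""
    owner: UserNamespace


def map_uid_for_mount(idmap, raw_uid):
    """Translate an on-disk uid through the mount's idmapping.

    Only a MntIdmap is accepted, so a filesystem-level namespace
    cannot be passed here by accident.
    """
    if not isinstance(idmap, MntIdmap):
        raise TypeError("expected MntIdmap, got " + type(idmap).__name__)
    return raw_uid + idmap.owner.offset


if __name__ == "__main__":
    fs_ns = UserNamespace(offset=0)
    mount_idmap = MntIdmap(owner=UserNamespace(offset=100000))
    print(map_uid_for_mount(mount_idmap, 1000))
    try:
        map_uid_for_mount(fs_ns, 1000)  # conflating the two is rejected
    except TypeError as err:
        print(err)
```

In C the same effect comes from the compiler rejecting the wrong pointer type; the runtime check here is just the closest Python equivalent.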
Ian Kent edf17476c7 fs: port privilege checking helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.

commit 9452e93e6dae862d7aeff2b11236d79bde6f9b66
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:27 2023 +0100

    fs: port privilege checking helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:31 +08:00
Ian Kent 304ec491ee fs: port ->permission() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream commit 48fa94aacd ("ceph: fscrypt_auth handling
	for ceph") is present which causes fuzz 2 in hunk #1 in
	fs/ceph/super.h.
	Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for
	unreachable debug statements") is not present causing fuzz 2
	in hunk #1 against fs/exportfs/expfs.c.
	Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq
	into nameidata") is not present causing a fuzz 1 for hunk #14
	against fs/namei.c.
	CentOS Stream c4f3dd0731 ("nfsd: handle failure to collect
	pre/post-op attrs more sanely") is present and causes rejects
	for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get()
	from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes
	hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c.
	CentOS Stream commit 55177e4b83 ("ovl: mark xwhiteouts directory
	with overlay.opaque='x'") and commit d17b324bb6 ("ovl: use
	ovl_numlower() and ovl_lowerstack() accessors") change the first
	and third hunks of fs/overlayfs/namei.c causing them to fail,
	manually apply.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") causes fuzz 2 in hunk #5 of
	fs/overlayfs/overlayfs.h
	CentOS Stream commit 355a9c490a ("ovl: Add an alternative
	type of whiteout") changes ovl_cache_update_ino() to
	ovl_cache_update() in fs/overlayfs/readdir.c, make the change
	manually.
	Upstream commit 217af7e2f4deb ("apparmor: refactor profile
	rules and attachments") is not in CentOS Stream causing hunk #1
	to fail to apply so manually apply the change.

commit 4609e1f18e19c3b302e1eb4858334bca1532f780
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:22 2023 +0100

    fs: port ->permission() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:20 +08:00
Ian Kent bff9bc5749 fs: use type safe idmapping helpers
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit a2bd096fb2d7f50fb4db246b33e7bfcf5e2eda3a
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Jun 22 22:12:16 2022 +0200

    fs: use type safe idmapping helpers

    We already ported most parts and filesystems over to the new
    vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining
    places so we can remove all the old helpers.
    This is a non-functional change.

    Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:38 +08:00
Ian Kent c6d247f6b2 bprm_fill_uid(): don't open-code file_inode()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit e6ae43812460450bdb42f14c5813ac42d6bc9067
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sat Aug 20 11:46:10 2022 -0400

    bprm_fill_uid(): don't open-code file_inode()

    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:37 +08:00
Rafael Aquini c98797e544 mm: set up vma iterator for vma_iter_prealloc() calls
JIRA: https://issues.redhat.com/browse/RHEL-27743
Conflicts:
  * context differences on the 1st, 3rd, and 15th hunks due to out-of-order
      backport of upstream commits ad9f006351c3 ("mm: always lock new vma
      before inserting into vma tree") and c9d6e982c3f8 ("mm: move vma
      locking out of vma_prepare and dup_anon_vma");
  * context difference on the 4th hunk due to out-of-order backport of upstream
      commit 1419430c8abb ("mmap: fix vma_iterator in error path of vma_merge()")

This patch is a backport of the following upstream commit:
commit b5df09226450165c434084d346fcb6d4858b0d52
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Mon Jul 24 14:31:52 2023 -0400

    mm: set up vma iterator for vma_iter_prealloc() calls

    Set the correct limits for vma_iter_prealloc() calls so that the maple
    tree can be smarter about how many nodes are needed.

    Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Peng Zhang <zhangpeng.00@bytedance.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:19:30 -04:00
Rafael Aquini 410830503d mm: always expand the stack with the mmap write lock held
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * arch/parisc/mm/fault.c: hunks dropped as there were merge conflicts not
       worth fixing for this unsupported hardware arch;
  * drivers/iommu/amd/iommu_v2.c: hunk dropped given out-of-order backport
       of upstream commit 5a0b11a180a9 ("iommu/amd: Remove iommu_v2 module")
  * mm/memory.c: differences on the 2nd hunk due to upstream conflict with
       commit ca5e863233e8 ("mm/gup: remove vmas parameter from
       get_user_pages_remote()") that ended up solved by merge commit
       9471f1f2f502 ("Merge branch 'expand-stack'").

This patch is a backport of the following upstream commit:
commit 8d7071af890768438c14db6172cc8f9f4d04e184
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Jun 24 13:45:51 2023 -0700

    mm: always expand the stack with the mmap write lock held

    This finishes the job of always holding the mmap write lock when
    extending the user stack vma, and removes the 'write_locked' argument
    from the vm helper functions again.

    For some cases, we just avoid expanding the stack at all: drivers and
    page pinning really shouldn't be extending any stacks.  Let's see if any
    strange users really wanted that.

    It's worth noting that architectures that weren't converted to the new
    lock_mm_and_find_vma() helper function are left using the legacy
    "expand_stack()" function, but it has been changed to drop the mmap_lock
    and take it for writing while expanding the vma.  This makes it fairly
    straightforward to convert the remaining architectures.

    As a result of dropping and re-taking the lock, the calling conventions
    for this function have also changed, since the old vma may no longer be
    valid.  So it will now return the new vma if successful, and NULL - and
    the lock dropped - if the area could not be extended.

    Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
    Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
    Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:19 -04:00
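
The changed calling convention described in the commit above (drop the lock, retake it for writing, and return either the new vma or NULL with the lock dropped) can be modeled roughly as follows. All names are hypothetical, the lock is a plain state string rather than a real rwsem, only a downward-growing stack is modeled, page alignment is ignored, and returning with the lock downgraded to read mode on success is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Vma:
    start: int
    end: int


class Mm:
    def __init__(self, stack, lowest_allowed):
        self.stack = stack                    # the single stack vma
        self.lowest_allowed = lowest_allowed  # stand-in for the stack limit
        self.lock = None                      # None, "read" or "write"


def expand_stack_model(mm, addr):
    """Return the vma covering addr, or None with the lock dropped."""
    assert mm.lock == "read", "caller must hold the lock for reading"
    mm.lock = None      # drop the read lock ...
    mm.lock = "write"   # ... and retake it for writing
    vma = mm.stack      # look the vma up again: old pointers may be stale
    if vma.start <= addr < vma.end:
        pass            # already covered, nothing to grow
    elif mm.lowest_allowed <= addr < vma.start:
        vma.start = addr  # grow the stack downwards
    else:
        mm.lock = None  # failure: the lock is NOT held on return
        return None
    mm.lock = "read"    # success: downgrade and hand back the vma
    return vma


if __name__ == "__main__":
    mm = Mm(Vma(0x7000, 0x8000), lowest_allowed=0x4000)
    mm.lock = "read"
    vma = expand_stack_model(mm, 0x6000)
    print(hex(vma.start), mm.lock)
    print(expand_stack_model(mm, 0x1000), mm.lock)
```

The detail callers must respect is the asymmetric lock state: after a failed call there is nothing left to unlock, and after a successful one any vma pointer obtained before the call should be treated as stale.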
Rafael Aquini 15c74a651e execve: expand new process stack manually ahead of time
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * differences are because this commit had a merge conflict upstream with
    commit ca5e863233e8 ("mm/gup: remove vmas parameter from
    get_user_pages_remote()") that ended up solved by merge commit
    9471f1f2f502 ("Merge branch 'expand-stack'").

This patch is a backport of the following upstream commit:
commit f313c51d26aa87e69633c9b46efb37a930faca71
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Jun 19 11:34:15 2023 -0700

    execve: expand new process stack manually ahead of time

    This is a small step towards a model where GUP itself would not expand
    the stack, and any user that needs GUP to not look up existing mappings,
    but actually expand on them, would have to do so manually before-hand,
    and with the mm lock held for writing.

    It turns out that execve() already did almost exactly that, except it
    didn't take the mm lock at all (it's single-threaded so no locking
    technically needed, but it could cause lockdep errors).  And it only did
    it for the CONFIG_STACK_GROWSUP case, since in that case GUP has
    obviously never expanded the stack downwards.

    So just make that CONFIG_STACK_GROWSUP case do the right thing with
    locking, and enable it generally.  This will eventually help GUP, and in
    the meantime avoids a special case and the lockdep issue.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:18 -04:00
Rafael Aquini 7bc9b5120c exec: Remove FOLL_FORCE for stack setup
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit cd57e443831d8eeb083c7165bce195d886e216d4
Author: Kees Cook <keescook@chromium.org>
Date:   Thu Nov 17 16:31:55 2022 -0800

    exec: Remove FOLL_FORCE for stack setup

    It does not appear that FOLL_FORCE should be needed for setting up the
    stack pages. They are allocated using the nascent bprm->vma, which was
    newly created with VM_STACK_FLAGS, which an arch can override, but they
    all appear to include VM_WRITE | VM_MAYWRITE. Remove FOLL_FORCE.

    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: https://lore.kernel.org/lkml/202211171439.CDE720EAD@keescook/
    Signed-off-by: Kees Cook <keescook@chromium.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:17 -04:00
Rafael Aquini 0ce393dc54 mm: make find_extend_vma() fail if write lock not held
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit f440fa1ac955e2898893f9301568435eb5cdfc4b
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Fri Jun 16 15:58:54 2023 -0700

    mm: make find_extend_vma() fail if write lock not held

    Make calls to extend_vma() and find_extend_vma() fail if the write lock
    is required.

    To avoid making this a flag-day event, this still allows the old
    read-locking case for the trivial situations, and passes in a flag to
    say "is it write-locked".  That way write-lockers can say "yes, I'm
    being careful", and legacy users will continue to work in all the common
    cases until they have been fully converted to the new world order.

    Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:17 -04:00
Rafael Aquini f42421e30d exec: simplify initial stack size expansion
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit bfb4a2b95875a47a01234f2de113ec089d524e71
Author: Rolf Eike Beer <eb@emlix.com>
Date:   Wed Oct 19 09:32:35 2022 +0200

    exec: simplify initial stack size expansion

    I had a hard time trying to understand completely why it is using vm_end in
    one side of the expression and vm_start in the other one, and using
    something in the "if" clause that is not an exact copy of what is used
    below. The whole point is that the stack_size variable that was used in the
    "if" clause is the difference between vm_start and vm_end, which is not far
    away but makes this thing harder to read than it must be.

    Signed-off-by: Rolf Eike Beer <eb@emlix.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/2017429.gqNitNVd0C@mobilepool36.emlix.com

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:16 -04:00
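
The equivalence behind "exec: simplify initial stack size expansion" can be checked in user space. The sketch below is not the kernel code. `struct vma_bounds`, `old_stack_base()`, `new_stack_base()` and `check_equivalent()` are hypothetical names, and the sketch assumes the grows-down case, where the old form branched between `vm_end - rlim_stack` and `vm_start - stack_expand` and the simplified form subtracts one clamped value from `vm_end`.

```c
#include <assert.h>

/* Hypothetical stand-in for the two VMA bounds used during stack setup. */
struct vma_bounds {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Assumed shape of the old logic: each branch anchors on a different bound. */
static unsigned long old_stack_base(struct vma_bounds v,
				    unsigned long rlim_stack,
				    unsigned long stack_expand)
{
	unsigned long stack_size = v.vm_end - v.vm_start;

	if (stack_size + stack_expand > rlim_stack)
		return v.vm_end - rlim_stack;
	return v.vm_start - stack_expand;
}

/* Assumed shape of the simplified logic: clamp once, anchor on vm_end only. */
static unsigned long new_stack_base(struct vma_bounds v,
				    unsigned long rlim_stack,
				    unsigned long stack_expand)
{
	unsigned long stack_size = v.vm_end - v.vm_start;
	unsigned long want = stack_size + stack_expand;
	unsigned long clamped = want < rlim_stack ? want : rlim_stack;

	return v.vm_end - clamped;
}

/* Asserts that both forms agree for one set of inputs. */
static void check_equivalent(struct vma_bounds v, unsigned long rlim_stack,
			     unsigned long stack_expand)
{
	assert(old_stack_base(v, rlim_stack, stack_expand) ==
	       new_stack_base(v, rlim_stack, stack_expand));
}
```

Because `vm_start - stack_expand` equals `vm_end - (stack_size + stack_expand)`, both branches of the old form reduce to "`vm_end` minus the smaller of the limit and the wanted size", which is what the second helper computes.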
Rafael Aquini e24b3ade32 mm/gup: remove vmas parameter from get_user_pages_remote()
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  - virt/kvm/async_pf.c: minor context diff due to out-of-order backport of
    upstream commit 08284765f03b7 ("KVM: Get reference to VM's address space
    in the async #PF worker")

This patch is a backport of the following upstream commit:
commit ca5e863233e8f6acd1792fd85d6bc2729a1b2c10
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Wed May 17 20:25:39 2023 +0100

    mm/gup: remove vmas parameter from get_user_pages_remote()

    The only instances of get_user_pages_remote() invocations which used the
    vmas parameter were for a single page which can instead simply look up the
    VMA directly. In particular:-

    - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
      remove it.

    - __access_remote_vm() was already using vma_lookup() when the original
      lookup failed so by doing the lookup directly this also de-duplicates the
      code.

    We are able to perform these VMA operations as we already hold the
    mmap_lock in order to be able to call get_user_pages_remote().

    As part of this work we add get_user_page_vma_remote() which abstracts the
    VMA lookup, error handling and decrementing the page reference count should
    the VMA lookup fail.

    This forms part of a broader set of patches intended to eliminate the vmas
    parameter altogether.

    [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
    Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Christian König <christian.koenig@amd.com>
    Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jarkko Sakkinen <jarkko@kernel.org>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:35 -04:00
Aristeu Rozanski e214620cfb mm: replace vma->vm_flags direct modifications with modifier calls
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me
Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work

commit 1c71222e5f2393b5ea1a41795c67589eea7e3490
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Thu Jan 26 11:37:49 2023 -0800

    mm: replace vma->vm_flags direct modifications with modifier calls

    Replace direct modifications to vma->vm_flags with calls to modifier
    functions to be able to track flag changes and to keep vma locking
    correctness.

    [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
    Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
    Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Punit Agrawal <punit.agrawal@bytedance.com>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:17 -04:00
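
The idea in "mm: replace vma->vm_flags direct modifications with modifier calls" can be illustrated with a user-space mock. Everything below is hypothetical: `struct mock_vma`, the `MOCK_VM_*` bits, the `mock_vm_flags_*()` helpers and the `nr_updates` counter are invented for illustration, and the counter merely stands in for whatever tracking or locking checks the real helpers perform.

```c
#include <assert.h>

typedef unsigned long vm_flags_t;

/* Hypothetical flag bits for the mock; not taken from kernel headers. */
#define MOCK_VM_WRITE    0x2UL
#define MOCK_VM_MAYWRITE 0x20UL

struct mock_vma {
	vm_flags_t vm_flags;     /* only written through the helpers below */
	unsigned int nr_updates; /* hypothetical hook: counts every change */
};

/* Set bits through a single choke point instead of "vma->vm_flags |= ...". */
static void mock_vm_flags_set(struct mock_vma *vma, vm_flags_t flags)
{
	assert(vma);
	vma->nr_updates++;
	vma->vm_flags |= flags;
}

/* Clear bits through a single choke point instead of "vma->vm_flags &= ~...". */
static void mock_vm_flags_clear(struct mock_vma *vma, vm_flags_t flags)
{
	assert(vma);
	vma->nr_updates++;
	vma->vm_flags &= ~flags;
}
```

Routing every write through such helpers gives one place to add checks later, which is the motivation the commit message states.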
Aristeu Rozanski 1832e45d48 mm/mmap: don't use __vma_adjust() in shift_arg_pages()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit cf51e86dfbe39b7cae3a9de650d035af22dd5fb4
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:46 2023 -0500

    mm/mmap: don't use __vma_adjust() in shift_arg_pages()

    Introduce shrink_vma() which uses the vma_prepare() and vma_complete()
    functions to reduce the vma coverage.

    Convert shift_arg_pages() to use expand_vma() and the new shrink_vma()
    function.  Remove support from __vma_adjust() to reduce a vma size since
    shift_arg_pages() is the only user that shrinks a VMA in this way.

    Link: https://lkml.kernel.org/r/20230120162650.984577-46-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:16 -04:00
Aristeu Rozanski 728b8a88b3 mm: don't use __vma_adjust() in __split_vma()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit b2b3b886738fec5e89ca9ebc720eba1a8f615753
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:44 2023 -0500

    mm: don't use __vma_adjust() in __split_vma()

    Use the abstracted locking and maple tree operations.  Since __split_vma()
    is the only user of the __vma_adjust() function to use the insert
    argument, drop that argument.  Remove the NULL passed through from
    fs/exec's shift_arg_pages() and mremap() at the same time.

    Link: https://lkml.kernel.org/r/20230120162650.984577-44-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:16 -04:00
Aristeu Rozanski c1ff56ff95 mm: add vma iterator to vma_adjust() arguments
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit b373037fa9bb374f26bbabc0779fe990d02d33b7
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:37 2023 -0500

    mm: add vma iterator to vma_adjust() arguments

    Change the vma_adjust() function definition to accept the vma iterator and
    pass it through to __vma_adjust().

    Update fs/exec to use the new vma_adjust() function parameters.

    Update mm/mremap to use the new vma_adjust() function parameters.

    Revert the __split_vma() calls back from __vma_adjust() to vma_adjust()
    and pass through the vma iterator.

    Link: https://lkml.kernel.org/r/20230120162650.984577-37-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:16 -04:00
Aristeu Rozanski 8054ffde35 mm: change mprotect_fixup to vma iterator
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 2286a6914c776ec34cd97e4573b1466d055cb9de
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Fri Jan 20 11:26:18 2023 -0500

    mm: change mprotect_fixup to vma iterator

    Use the vma iterator so that the iterator can be invalidated or updated to
    avoid each caller doing so.

    Link: https://lkml.kernel.org/r/20230120162650.984577-18-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:14 -04:00
Chris von Recklinghausen 777a83832f exec: use VMA iterator instead of linked list
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 19066e58682ec156aac8d6cf94b79ab2f122a556
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Sep 6 19:48:56 2022 +0000

    exec: use VMA iterator instead of linked list

    Remove a use of the vm_next list by doing the initial lookup with the VMA
    iterator and then using it to find the next entry.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:50 -04:00
Chris von Recklinghausen eb370ae179 mm: remove vmacache
JIRA: https://issues.redhat.com/browse/RHEL-27736

commit 7964cf8caa4dfa42c4149f3833d3878713cda3dc
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:51 2022 +0000

    mm: remove vmacache

    By using the maple tree and the maple tree state, the vmacache is no
    longer beneficial and is complicating the VMA code.  Remove the vmacache
    to reduce the work in keeping it up to date and code complexity.

    Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-01 11:19:46 -04:00
Jeff Moyer da5eea0749 fsnotify: move fsnotify_open() hook into do_dentry_open()
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 7b8c9d7bb4570ee4800642009c8f2d9756004552
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Sun Jun 11 15:24:29 2023 +0300

    fsnotify: move fsnotify_open() hook into do_dentry_open()
    
    The fsnotify_open() hook is called only from high-level system call
    context and is not called from the many helpers that open files.
    
    This may make sense for many of the special file open cases, but it is
    inconsistent with the fsnotify_close() hook, which is called for every
    last fput() on a file object with FMODE_OPENED.
    
    As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
    without ever observing an OPEN event.
    
    Fix this inconsistency by replacing all the fsnotify_open() hooks with
    a single hook inside do_dentry_open().
    
    If there are special cases that would like to opt-out of the possible
    overhead of fsnotify() call in fsnotify_open(), they would probably also
    want to avoid the overhead of fsnotify() call in the rest of the fsnotify
    hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.
    
    However, in the majority of those cases, the s_fsnotify_connectors
    optimization in fsnotify_parent() would be sufficient to avoid the
    overhead of fsnotify() call anyway.
    
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:54 -04:00
Chris von Recklinghausen 3e2b7d226a mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region
Conflicts: fs/exec.c - We don't have
	7964cf8caa4d ("mm: remove vmacache")
	so keep zeroing tsk->mm->vmacache_seqnum and calling vmacache_flush

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit dda1c41a07b4a4c3f99b5b28c1e8c485205fe860
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Oct 26 15:48:30 2022 +0200

    mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region

    lru_gen_add_mm() has been added within an IRQ-off region in the commit
    mentioned below.  The other invocations of lru_gen_add_mm() are not within
    an IRQ-off region.

    The invocation within IRQ-off region is problematic on PREEMPT_RT because
    the function is using a spin_lock_t which must not be used within
    IRQ-disabled regions.

    The other invocations of lru_gen_add_mm() occur while
    task_struct::alloc_lock is acquired.  Move lru_gen_add_mm() after
    interrupts are enabled and before task_unlock().

    Link: https://lkml.kernel.org/r/20221026134830.711887-1-bigeasy@linutronix.de
    Fixes: bd74fdaea1460 ("mm: multi-gen LRU: support page table walks")
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Yu Zhao <yuzhao@google.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: "Eric W . Biederman" <ebiederm@xmission.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:15:07 -04:00
Chris von Recklinghausen b92cce1ea6 mm: multi-gen LRU: support page table walks
Conflicts:
	fs/exec.c - We already have
		33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"")
		so don't add call to timens_on_fork back in
	include/linux/mmzone.h - We already have
		e6ad640bc404 ("mm: deduplicate cacheline padding code")
		so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_)
	mm/vmscan.c - The backport of
		badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs")
		added an #include <linux/debugfs.h>. Keep it.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit bd74fdaea146029e4fa12c6de89adbe0779348a9
Author: Yu Zhao <yuzhao@google.com>
Date:   Sun Sep 18 02:00:05 2022 -0600

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page tables
    to search for young PTEs and promote hot pages.  A kill switch will be
    added in the next patch to disable this behavior.  When disabled, the
    aging relies on the rmap only.

    NB: this behavior has nothing in common with the page table scanning in the
    2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
    to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the traversal
    of an entire mm_struct list; the term "walk" will be applied to page
    tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct follows
    its owner task to the new memcg when this task is migrated.  Given an
    lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot pages
    before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them gets
    a unique mm_struct; therefore they can run concurrently.  Page table
    walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
    pages it left in the previous memcg will not be promoted when its current
    memcg is under reclaim.  Similarly, page table walkers will not promote
    pages from nodes other than the one under reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so that
       page table walkers can skip processes that have been sleeping since
       the last iteration.
    2. It uses generational Bloom filters to record populated branches so
       that page table walkers can reduce their search space based on the
       query results, e.g., to skip page tables containing mostly holes or
       misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries when
       CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table. This
       improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[8, 10]%
                    Ops/sec      KB/sec
          patch1-7: 1147696.57   44640.29
          patch1-8: 1245274.91   48435.66

      Configurations:
        no change

    Client benchmark results:
      kswapd profiles:
        patch1-7
          48.16%  lzo1x_1_do_compress (real work)
           8.20%  page_vma_mapped_walk (overhead)
           7.06%  _raw_spin_unlock_irq
           2.92%  ptep_clear_flush
           2.53%  __zram_bvec_write
           2.11%  do_raw_spin_lock
           2.02%  memmove
           1.93%  lru_gen_look_around
           1.56%  free_unref_page_list
           1.40%  memset

        patch1-8
          49.44%  lzo1x_1_do_compress (real work)
           6.19%  page_vma_mapped_walk (overhead)
           5.97%  _raw_spin_unlock_irq
           3.13%  get_pfn_folio
           2.85%  ptep_clear_flush
           2.42%  __zram_bvec_write
           2.08%  do_raw_spin_lock
           1.92%  memmove
           1.44%  alloc_zspage
           1.36%  memset

      Configurations:
        no change

    Thanks to the following developers for their efforts [3].
      kernel test robot <lkp@intel.com>

    [1] https://lwn.net/Articles/23732/
    [2] https://llvm.org/docs/ScudoHardenedAllocator.html
    [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

    Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michael Larabel <Michael@MichaelLarabel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:46 -04:00
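
Optimization 2 in "mm: multi-gen LRU: support page table walks" relies on Bloom filters to remember which page table branches were populated. The sketch below shows only the general technique and is not the kernel's implementation. The filter size, the multiplicative hash and all `bloom_*` names are assumptions, and a single filter with a reset stands in for the per-generation filters the message alludes to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 1024u /* hypothetical size; must be a multiple of 8 */

struct bloom {
	uint8_t bits[FILTER_BITS / 8];
};

/* Derive two bit positions from one item (assumed hash, illustration only). */
static void bloom_keys(uintptr_t item, unsigned int key[2])
{
	uint64_t h = (uint64_t)item * 0x9E3779B97F4A7C15ull;

	key[0] = (unsigned int)(h >> 32) % FILTER_BITS;
	key[1] = (unsigned int)(h >> 48) % FILTER_BITS;
}

/* Record that a branch (identified by item) was found populated. */
static void bloom_set(struct bloom *b, uintptr_t item)
{
	unsigned int key[2];

	assert(b);
	bloom_keys(item, key);
	b->bits[key[0] / 8] |= (uint8_t)(1u << (key[0] % 8));
	b->bits[key[1] / 8] |= (uint8_t)(1u << (key[1] % 8));
}

/* Returns 0 if the item was definitely never set, 1 if it may have been. */
static int bloom_test(const struct bloom *b, uintptr_t item)
{
	unsigned int key[2];

	assert(b);
	bloom_keys(item, key);
	return ((b->bits[key[0] / 8] >> (key[0] % 8)) & 1) &&
	       ((b->bits[key[1] / 8] >> (key[1] % 8)) & 1);
}

/* Start a fresh generation by forgetting everything recorded so far. */
static void bloom_reset(struct bloom *b)
{
	assert(b);
	memset(b->bits, 0, sizeof(b->bits));
}
```

A walker using such a filter can skip any branch for which `bloom_test()` returns 0. False positives only cost an unnecessary visit, and an item that was set is never reported as absent before the next reset.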
Chris von Recklinghausen 321740bffd Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
Conflicts: kernel/fork.c - We already have
	2b5f9dad32ed ("fs/exec: switch timens when a task gets a new mm")
	so don't add back in the code it deleted.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 33a2d6bc3480f9f8ac8c8def29854f98cc8bfee2
Author: Andrei Vagin <avagin@gmail.com>
Date:   Tue Sep 13 03:25:51 2022 -0700

    Revert "fs/exec: allow to unshare a time namespace on vfork+exec"

    This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5.

    Alexey pointed out a few undesirable side effects of the reverted change.
    First, it doesn't take into account that CLONE_VFORK can be used with
    CLONE_THREAD. Second, a child process doesn't enter a target time
    namespace if its parent dies before the child calls exec. It happens
    because the parent clears vfork_done.

    Eric W. Biederman suggests installing a time namespace as a task gets a
    new mm. It includes all new processes cloned without CLONE_VM and all
    tasks that call exec(). This is a user API change, but we think there
    aren't users that depend on the old behavior.

    It is too late to make such changes in this release, so let's roll back
    this patch and introduce the right one in the next release.

    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Kees Cook <keescook@chromium.org>
    Signed-off-by: Andrei Vagin <avagin@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:08 -04:00
Chris von Recklinghausen 276e42586d fs/exec: allow to unshare a time namespace on vfork+exec
Conflicts: kernel/fork.c - We already have
	2b5f9dad32ed ("fs/exec: switch timens when a task gets a new mm")
	so code to change is gone.

JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5
Author: Andrei Vagin <avagin@gmail.com>
Date:   Sun Jun 12 23:07:22 2022 -0700

    fs/exec: allow to unshare a time namespace on vfork+exec

    Right now, a new process can't be forked in another time namespace
    if it shares mm with its parent. It is prohibited, because each time
    namespace has its own vvar page that is mapped into a process address
    space.

    When a process calls exec, it gets a new mm and so it could be "legal"
    to switch time namespace in that case. This was not implemented and
    now if we want to do this, we need to add another clone flag to not
    break backward compatibility.

    We don't have any user requests to switch times on exec except the
    vfork+exec combination, so there is no reason to add a new clone flag.
    As for vfork+exec, this should be safe to allow switching timens with
    the current clone flag. Right now, vfork (CLONE_VFORK | CLONE_VM) fails
    if a child is forked into another time namespace. With this change,
    vfork creates a new process in parent's timens, and the following exec
    does the actual switch to the target time namespace.

    Suggested-by: Florian Weimer <fweimer@redhat.com>
    Signed-off-by: Andrei Vagin <avagin@gmail.com>
    Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220613060723.197407-1-avagin@gmail.com

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:12:50 -04:00
Chris von Recklinghausen 97f7745102 fs/coredump: move coredump sysctls into its own file
Bugzilla: https://bugzilla.redhat.com/2160210

commit f0bc21b268c1464603192a00851cdbbf7c2cdc36
Author: Xiaoming Ni <nixiaoming@huawei.com>
Date:   Fri Jan 21 22:13:38 2022 -0800

    fs/coredump: move coredump sysctls into its own file

    This moves the fs/coredump.c respective sysctls to its own file.

    Link: https://lkml.kernel.org/r/20211129211943.640266-6-mcgrof@kernel.org
    Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
    Cc: Antti Palosaari <crope@iki.fi>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Lukas Middendorf <kernel@tuxforce.de>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
    Cc: Stephen Kitt <steve@sk2.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:33 -04:00
Chris von Recklinghausen c47399a873 fs: move fs/exec.c sysctls into its own file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 66ad398634c21e0a42ce10002ae06c39352da0d1
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Fri Jan 21 22:13:17 2022 -0800

    fs: move fs/exec.c sysctls into its own file

    kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
    dishes, which makes it very difficult to maintain.

    To help with this maintenance let's start by moving sysctls to places
    where they actually belong.  The proc sysctl maintainers do not want to
    know what sysctl knobs you wish to add for your own piece of code, we
    just care about the core logic.

    So move the fs/exec.c respective sysctls to its own file.

    Since checkpatch complains about style issues with the old code, this
    move also fixes a few of those minor style issues:

      * Use pr_warn() instead of printk(WARNING
      * New empty lines are wanted at the beginning of routines

    Link: https://lkml.kernel.org/r/20211129205548.605569-9-mcgrof@kernel.org
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Antti Palosaari <crope@iki.fi>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Lukas Middendorf <kernel@tuxforce.de>
    Cc: Stephen Kitt <steve@sk2.org>
    Cc: Xiaoming Ni <nixiaoming@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:32 -04:00
Oleg Nesterov 9da64baa41 fs/exec: switch timens when a task gets a new mm
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2116442

commit 2b5f9dad32ed19e8db3b0f10a84aa824a219803b
Author: Andrei Vagin <avagin@gmail.com>
Date:   Tue Sep 20 17:31:19 2022 -0700

    fs/exec: switch timens when a task gets a new mm

    Changing a time namespace requires remapping a vvar page, so we don't want
    to allow doing that if any other tasks can use the same mm.

    Currently, we install a time namespace when a task is created with a new
    vm. exec() is another case when a task gets a new mm and so it can switch
    a time namespace safely, but it isn't handled now.

    One more issue of the current interface is that clone() with CLONE_VM isn't
    allowed if the current task has unshared a time namespace
    (timens_for_children doesn't match the current timens).

    Both of these issues inconvenience users. For example, Alexey and
    Florian reported that posix_spawn() uses vfork+exec and that this pattern
    doesn't work with time namespaces because of both issues described above.
    LXC needed to work around the exec() issue by calling setns.

    In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on
    vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But
    it adds extra complexity and some undesirable side effects. Eric suggested
    fixing the issues properly because there is every reason to suppose that
    there are no users that depend on the old behavior.

    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Florian Weimer <fweimer@redhat.com>
    Cc: Kees Cook <keescook@chromium.org>
    Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Origin-author: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrei Vagin <avagin@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2023-01-11 10:43:02 +01:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing a bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains a bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Frantisek Hrbata f422c448a1 Merge: io_uring: update to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1142

# Merge Request Required Information

## Summary of Changes
Update the io_uring code base and its dependencies to v5.15.  We will not enable the functionality at this time, this is only a preparatory patch series.  The patch series does touch other subsystems, though:

91ef658fb8b8-namei-ignore-ERR-NULL-names-in-putname.patch
0ee50b47532a-namei-change-filename_parentat-calling-conventions.patch
584d3226d665-namei-make-do_mkdirat-take-struct-filename.patch
7797251bb5ab-namei-make-do_mknodat-take-struct-filename.patch
da2d0cede330-namei-make-do_symlinkat-take-struct-filename.patch
8228e2c31319-namei-add-getname_uflags.patch
020250f31c4c-namei-make-do_linkat-take-struct-filename.patch
45f30dab3957-namei-update-do_-helpers-to-return-ints.patch
d32f89da7fa8-net-add-accept-helper-not-installing-fd.patch
2112ff5ce0c1-iov_iter-track-truncated-size.patch
0766ec82e5fb-namei-Fix-use-after-free-in-kern_path_locked.patch
8fb0f47a9d7a-iov_iter-add-helper-to-save-iov_iter-state.patch
7dedd3e18077-Revert-iov_iter-track-truncated-size.patch
3a862cacf867-fs-add-anon_inode_getfile_secure-similar-to-anon_inode_getfd_secure.patch

As a result, file system, block and networking tests should be run.

Omitted-fix: 81132a39c152 ("fs: remove fget_many and fput_many interface")
             This is outside the scope of this MR, and isn't a "fix" so much as a performance enhancement.

## Approved Bugzilla Ticket
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Ming Lei <ming.lei@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-21 09:47:33 -04:00
Chris von Recklinghausen 579cdcc5b5 mm/mprotect: use mmu_gather
Bugzilla: https://bugzilla.redhat.com/2120352

commit 4a18419f71cdf9155d2d2a6c79546f720978b990
Author: Nadav Amit <namit@vmware.com>
Date:   Mon May 9 18:20:50 2022 -0700

    mm/mprotect: use mmu_gather

    Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.

    This patchset is intended to remove unnecessary TLB flushes during
    mprotect() syscalls.  Once this patch-set makes it through, similar and
    further optimizations for MADV_COLD and userfaultfd would be possible.

    Basically, there are 3 optimizations in this patch-set:

    1. Use TLB batching infrastructure to batch flushes across VMAs and do
       better/fewer flushes.  This would also be handy for later userfaultfd
       enhancements.

    2. Avoid unnecessary TLB flushes.  This optimization is the one that
       provides most of the performance benefits.  Unlike previous versions,
       we now only avoid flushes that would not result in spurious
       page-faults.

    3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
       prevent the A/D bits from changing.

    Andrew asked for some benchmark numbers.  I do not have a
    deterministic macrobenchmark in which it is easy to show benefit.  I
    therefore ran a microbenchmark: a loop that does the following on
    anonymous memory, just as a sanity check to see that time is saved by
    avoiding TLB flushes.  The loop goes:

            mprotect(p, PAGE_SIZE, PROT_READ)
            mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
            *p = 0; // make the page writable

    The test was run in KVM guest with 1 or 2 threads (the second thread was
    busy-looping).  I measured the time (cycles) of each operation:

                    1 thread                2 threads
                    mmots   +patch          mmots   +patch
    PROT_READ       3494    2725 (-22%)     8630    7788 (-10%)
    PROT_READ|WRITE 3952    2724 (-31%)     9075    2865 (-68%)

    [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
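
The microbenchmark loop quoted above can be sketched in userspace roughly as follows. This is an illustration only, not the author's benchmark: the helper name `mprotect_loop_ns` is invented here, it times the whole loop rather than each operation, and it reports wall-clock nanoseconds rather than the cycle counts shown in the table.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average nanoseconds per loop iteration, or -1.0 on
 * failure. One iteration is mprotect(PROT_READ), mprotect(PROT_READ|PROT_WRITE)
 * and a single write to the page, on one page of anonymous memory.
 */
double mprotect_loop_ns(size_t iters)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || iters == 0)
        return -1.0;

    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1.0;

    struct timespec start, end;
    double result = -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        goto out;

    for (size_t i = 0; i < iters; i++) {
        if (mprotect(p, (size_t)page, PROT_READ) != 0 ||
            mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) != 0)
            goto out;
        *(volatile char *)p = 0; /* touch the page, as in the quoted loop */
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        goto out;

    result = ((double)(end.tv_sec - start.tv_sec) * 1e9 +
              (double)(end.tv_nsec - start.tv_nsec)) / (double)iters;
out:
    munmap(p, (size_t)page);
    return result;
}
```

Running `mprotect_loop_ns(1000000)` on kernels with and without the series gives a rough comparison; a second busy-looping thread, as in the two-thread column, would have to be added separately.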

    The exact numbers are really meaningless, but the benefit is clear.  There
    are 2 interesting results though.

    (1) PROT_READ is cheaper, while one can expect it not to be affected.
    This is presumably due to a TLB miss that is saved.

    (2) Without memory access (*p = 0), the speedup of the patch is even
    greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush.
    As a result both operations on the patched kernel take roughly ~1500
    cycles (with either 1 or 2 threads), whereas on mmots their cost is as
    high as presented in the table.

    This patch (of 3):

    change_pXX_range() currently does not use mmu_gather, but instead
    implements its own deferred TLB flushes scheme.  This both complicates the
    code, as developers need to be aware of different invalidation schemes,
    and prevents opportunities to avoid TLB flushes or perform them in finer
    granularity.

    The use of mmu_gather for modified PTEs has benefits in various scenarios
    even if pages are not released.  For instance, if only a single page needs
    to be flushed out of a range of many pages, only that page would be
    flushed.  If a THP page is flushed, on x86 a single TLB invlpg instruction
    can be used instead of 512 instructions (or a full TLB flush, which
    Linux would actually use by default).  mprotect() over multiple VMAs
    requires a single flush.

    Use mmu_gather in change_pXX_range().  As the pages are not released, only
    record the flushed range using tlb_flush_pXX_range().

    Handle THP similarly and get rid of flush_cache_range() which becomes
    redundant since tlb_start_vma() calls it when needed.

    Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
    Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Andrew Cooper <andrew.cooper3@citrix.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yu Zhao <yuzhao@google.com>
    Cc: Nick Piggin <npiggin@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:08 -04:00
Chris von Recklinghausen 34f31c17ed kthread: Don't allocate kthread_struct for init and umh
Bugzilla: https://bugzilla.redhat.com/2120352

commit 343f4c49f2438d8920f1f76fa823ee59b91f02e4
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Apr 11 11:40:14 2022 -0500

    kthread: Don't allocate kthread_struct for init and umh

    If kthread_is_per_cpu runs concurrently with free_kthread_struct the
    kthread_struct that was just freed may be read from.

    This bug was introduced by commit 40966e316f86 ("kthread: Ensure
    struct kthread is present for all kthreads"), when kthread_struct
    started to be allocated for all tasks that have PF_KTHREAD set.  This
    in turn required the kthread_struct to be freed in kernel_execve and
    violated the assumption that kthread_struct will have the same
    lifetime as the task.

    Looking a bit deeper this only applies to callers of kernel_execve
    which is just the init process and the user mode helper processes.
    These processes really don't want to be kernel threads but are for
    historical reasons, mostly because copy_thread does not know how to
    pass a kernel mode function to a process that does not have
    PF_KTHREAD or PF_IO_WORKER set.

    Solve this by not allocating kthread_struct for the init process and
    the user mode helper processes.

    This is done by adding a kthread member to struct kernel_clone_args.
    Setting kthread in fork_idle and kernel_thread.  Adding
    user_mode_thread that works like kernel_thread except it does not set
    kthread.  In fork only allocating the kthread_struct if .kthread is set.

    I have looked at kernel/kthread.c and since commit 40966e316f86
    ("kthread: Ensure struct kthread is present for all kthreads") there
    have been no assumptions added that to_kthread or __to_kthread will
    not return NULL.

    There are a few callers of to_kthread or __to_kthread that assume a
    non-NULL struct kthread pointer will be returned.  These functions are
    kthread_data(), kthread_parkme(), kthread_exit(), kthread(),
    kthread_park(), kthread_unpark(), kthread_stop().  All of those functions
    can reasonably be expected to be called only when it is known that a
    task is a kthread, so that assumption seems reasonable.

    Cc: stable@vger.kernel.org
    Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
    Reported-by: Максим Кутявин <maximkabox13@gmail.com>
    Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:28:07 -04:00
Chris von Recklinghausen ea2fa2fb80 uaccess: remove CONFIG_SET_FS
Conflicts: in arch/, only keep changes to arch/Kconfig and
	arch/arm64/kernel/traps.c. All other arch files in the upstream version
	of this patch are dropped.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 967747bbc084b93b54e66f9047d342232314cd25
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Fri Feb 11 21:42:45 2022 +0100

    uaccess: remove CONFIG_SET_FS

    There are no remaining callers of set_fs(), so CONFIG_SET_FS
    can be removed globally, along with the thread_info field and
    any references to it.

    This turns access_ok() into a cheaper check against TASK_SIZE_MAX.

    As CONFIG_SET_FS is now gone, drop all remaining references to
    set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().

    Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
    Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
    Acked-by: Dinh Nguyen <dinguyen@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:45 -04:00
Chris von Recklinghausen e545ae66be signal: Remove the helper signal_group_exit
Bugzilla: https://bugzilla.redhat.com/2120352

commit 49697335e0b441b0553598c1b48ee9ebb053d2f1
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Jun 24 02:14:30 2021 -0500

    signal: Remove the helper signal_group_exit

    This helper is misleading.  It tests for an ongoing exec as well as
    the process having received a fatal signal.

    Sometimes it is appropriate to treat an on-going exec differently than
    a process that is shutting down due to a fatal signal.  In particular
    taking the fast path out of exit_signals instead of retargeting
    signals is not appropriate during exec, and not changing the exit
    code in do_group_exit during exec.

    Removing the helper makes it more obvious what is going on as both
    cases must be coded for explicitly.

    While removing the helper fix the two cases where I have observed
    using signal_group_exit resulted in the wrong result.

    In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
    retargeted during an exec.

    In do_group_exit use 0 as the exit code during an exec as de_thread
    does not set group_exit_code.  As best as I can determine
    group_exit_code has been set to 0 most of the time during
    de_thread.  During a thread group stop group_exit_code is set to the
    stop signal and when the thread group receives SIGCONT group_exit_code
    is reset to 0.
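
The distinction drawn above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: `struct signal_model`, `struct task_model`, the flag value, and the function names other than `signal_group_exit` are stand-ins invented here to mirror the fields the message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIGNAL_GROUP_EXIT 0x4u /* stand-in flag value for this model */

struct task_model { int unused; };

/* Simplified stand-in for the signal state discussed in the message. */
struct signal_model {
    unsigned int flags;
    struct task_model *group_exit_task; /* non-NULL while an exec de-threads */
    int group_exit_code;
};

/* What the removed helper conflated: fatal exit OR ongoing exec. */
bool old_signal_group_exit(const struct signal_model *sig)
{
    assert(sig != NULL);
    return (sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exit_task != NULL;
}

/* The explicit test that exit_signals is described as using instead. */
bool fatal_exit_in_progress(const struct signal_model *sig)
{
    assert(sig != NULL);
    return (sig->flags & SIGNAL_GROUP_EXIT) != 0;
}

/* Exit code choice described for do_group_exit: 0 during an exec. */
int exit_code_for(const struct signal_model *sig, int requested)
{
    assert(sig != NULL);
    if (sig->flags & SIGNAL_GROUP_EXIT)
        return sig->group_exit_code;
    if (sig->group_exit_task != NULL)
        return 0;
    return requested;
}
```

With only an exec in progress, `old_signal_group_exit()` is true while `fatal_exit_in_progress()` is false, which is the case the message says must be coded for explicitly.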

    Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen e44c428417 signal: Rename group_exit_task group_exec_task
Bugzilla: https://bugzilla.redhat.com/2120352

commit 60700e38fb68e800607ca7a027060d5419fc5798
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sun Jun 6 13:47:53 2021 -0500

    signal: Rename group_exit_task group_exec_task

    The only remaining user of group_exit_task is exec.  Rename the field
    so that it is clear which part of the code uses it.

    Update the comment above the definition of group_exec_task to document
    how it is currently used.

    Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:36 -04:00
Chris von Recklinghausen e1e51160dc kthread: Ensure struct kthread is present for all kthreads
Bugzilla: https://bugzilla.redhat.com/2120352

commit 40966e316f86b8cfd83abd31ccb4df729309d3e7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Dec 2 09:56:14 2021 -0600

    kthread: Ensure struct kthread is present for all kthreads

    Today the rules are a bit iffy and arbitrary about which kernel
    threads have struct kthread present.  Both idle threads and threads
    started with create_kthread want struct kthread present so that is
    effectively all kernel threads.  Make the rule that if PF_KTHREAD
    and the task is running then struct kthread is present.

    This will allow the kernel thread code to use tsk->exit_code
    with different semantics from ordinary processes.

    To ensure that struct kthread is present for all
    kernel threads, move its allocation into copy_process.

    Add a deallocation of struct kthread in exec for processes
    that were kernel threads.

    Move the allocation of struct kthread for the initial thread
    earlier so that it is not repeated for each additional idle
    thread.

    Move the initialization of struct kthread into set_kthread_struct
    so that the structure is always and reliably initialized.

    Clear set_child_tid in free_kthread_struct to ensure the kthread
    struct is reliably freed during exec.  The function
    free_kthread_struct does not need to clear vfork_done during exec as
    exec_mm_release called from exec_mmap has already cleared vfork_done.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen 3623566c7f signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
Conflicts: drop changes to arch/m68k/kernel/traps.c - unsupported arch

Bugzilla: https://bugzilla.redhat.com/2120352

commit e21294a7aaae32c5d7154b187113a04db5852e37
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Oct 25 10:50:57 2021 -0500

    signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

    Now that force_fatal_sig exists it is unnecessary and a bit confusing
    to use force_sigsegv in cases where the simpler force_fatal_sig is
    wanted.  So change every instance we can to make the code clearer.

    Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
    Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:26 -04:00
Chris von Recklinghausen d51a01798b exec: Check for a pending fatal signal instead of core_state
Bugzilla: https://bugzilla.redhat.com/2120352

commit 7e3c4fb7fc19bcf20657de3edb718ec1b26c7df3
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Sep 3 10:26:05 2021 -0500

    exec: Check for a pending fatal signal instead of core_state

    Prevent exec continuing when a fatal signal is pending by replacing
    mmap_read_lock with mmap_read_lock_killable.  This is always the right
    thing to do as userspace will never observe an exec complete when
    there is a fatal signal pending.

    With that change it becomes unnecessary to explicitly test for a core
    dump in progress.  In coredump_wait zap_threads arranges under
    mmap_write_lock for all tasks that use a mm to also have SIGKILL
    pending, which means mmap_read_lock_killable will always return -EINTR
    when old_mm->core_state is present.

    Link: https://lkml.kernel.org/r/87fstux27w.fsf@disp2133
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:24 -04:00
Wander Lairson Costa b89dd8173e posix-cpu-timers: Cleanup CPU timers before freeing them during exec
Bugzilla: https://bugzilla.redhat.com/2116968

commit e362359ace6f87c201531872486ff295df306d13
Author: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Date:   Tue Aug 9 14:07:51 2022 -0300

    posix-cpu-timers: Cleanup CPU timers before freeing them during exec

    Commit 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a
    task") started looking up tasks by PID when deleting a CPU timer.

    When a non-leader thread calls execve, it will switch PIDs with the leader
    process. Then, as it calls exit_itimers, posix_cpu_timer_del cannot find
    the task because the timer still points to the old PID.

    That means that armed timers won't be disarmed, that is, they won't be
    removed from the timerqueue_list. exit_itimers will still release their
    memory, and when that list is later processed, it leads to a
    use-after-free.

    Clean up the timers from the de-threaded task before freeing them. This
    prevents a reported use-after-free.

    Fixes: 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a task")
    Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/20220809170751.164716-1-cascardo@canonical.com

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2022-08-29 16:15:20 -03:00
Wander Lairson Costa a532f4903a fix race between exit_itimers() and /proc/pid/timers
Bugzilla: https://bugzilla.redhat.com/2116968

commit d5b36a4dbd06c5e8e36ca8ccc552f679069e2946
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Mon Jul 11 18:16:25 2022 +0200

    fix race between exit_itimers() and /proc/pid/timers

    As Chris explains, the comment above exit_itimers() is not correct,
    we can race with proc_timers_seq_ops. Change exit_itimers() to clear
    signal->posix_timers with ->siglock held.

    Cc: <stable@vger.kernel.org>
    Reported-by: chris@accessvector.net
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
2022-08-29 16:15:01 -03:00
Jeff Moyer 7fcd6b5262 namei: add getname_uflags()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656

commit 8228e2c313194f13f1d1806ed5734a26c38d49ac
Author: Dmitry Kadashev <dkadashev@gmail.com>
Date:   Thu Jul 8 13:34:42 2021 +0700

    namei: add getname_uflags()
    
    There are a couple of places where we already open-code the (flags &
    AT_EMPTY_PATH) check and io_uring will likely add another one in the
    future.  Let's just add a simple helper getname_uflags() that handles
    this directly and use it.
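
The open-coded check that the helper replaces can be modelled in userspace as below. This is only a sketch: `lookup_flags_from_uflags` and `LOOKUP_EMPTY_MODEL` are names invented for illustration, standing in for the kernel-internal lookup flag that the real helper passes on when resolving the name.

```c
#define _GNU_SOURCE
#include <fcntl.h> /* AT_EMPTY_PATH */

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* value from the Linux UAPI headers */
#endif

/* Stand-in for the kernel-internal "empty path allowed" lookup flag. */
#define LOOKUP_EMPTY_MODEL 0x4000

/*
 * Hypothetical model of the (flags & AT_EMPTY_PATH) check: translate the
 * user-visible AT_* flag into the internal lookup flag and ignore the rest.
 */
int lookup_flags_from_uflags(int uflags)
{
    return (uflags & AT_EMPTY_PATH) ? LOOKUP_EMPTY_MODEL : 0;
}
```

Centralising the translation in one helper means each caller no longer repeats the bit test, which is the motivation given above.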
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2022-07-15 14:58:36 -04:00
Rafael Aquini a65076a3c1 exec: Force single empty string when argv is empty
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2097485

This patch is a backport of the following upstream commit:
commit dcd46d897adb70d63e025f175a00a89797d31a43
Author: Kees Cook <keescook@chromium.org>
Date:   Mon Jan 31 16:09:47 2022 -0800

    exec: Force single empty string when argv is empty

    Quoting[1] Ariadne Conill:

    "In several other operating systems, it is a hard requirement that the
    second argument to execve(2) be the name of a program, thus prohibiting
    a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
    but it is not an explicit requirement[2]:

        The argument arg0 should point to a filename string that is
        associated with the process being started by one of the exec
        functions.
    ...
    Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
    but there was no consensus to support fixing this issue then.
    Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
    of this bug in a shellcode, we can reconsider.

    This issue is being tracked in the KSPP issue tracker[5]."

    While the initial code searches[6][7] turned up what appeared to be
    mostly corner case tests, trying to just reject argv == NULL
    (or an immediately terminated pointer list) quickly started tripping[8]
    existing userspace programs.

    The next best approach is forcing a single empty string into argv and
    adjusting argc to match. The number of programs depending on argc == 0
    seems a smaller set than those calling execve with a NULL argv.

    Account for the additional stack space in bprm_stack_limits(). Inject an
    empty string when argc == 0 (and set argc = 1). Warn about the case so
    userspace has some notice about the change:

        process './argc0' launched './argc0' with NULL argv: empty string added

    Additionally WARN() and reject NULL argv usage for kernel threads.
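
From the userspace side, a program can guard against both the old case (argc == 0) and the new one (a single empty argv[0]) before using its own name. The following is a minimal sketch; the helper `program_name` and its fallback parameter are hypothetical and not part of this commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return argv[0] when it is usable, otherwise the
 * caller-supplied fallback. Covers argc == 0 (kernels without this change)
 * and argv[0] == "" (the empty string injected by kernels with it).
 */
const char *program_name(int argc, char *const argv[], const char *fallback)
{
    assert(fallback != NULL);

    if (argc < 1 || argv == NULL || argv[0] == NULL || argv[0][0] == '\0')
        return fallback;
    return argv[0];
}
```

A typical call is `program_name(argc, argv, "unknown")` at the top of main(), before argv[0] is used for messages or path lookups.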

    [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
    [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
    [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
    [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
    [5] https://github.com/KSPP/linux/issues/176
    [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
    [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
    [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
    Reported-by: Ariadne Conill <ariadne@dereferenced.org>
    Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Acked-by: Ariadne Conill <ariadne@dereferenced.org>
    Acked-by: Andy Lutomirski <luto@kernel.org>
    Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2022-06-23 21:23:01 -04:00
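
The argc == 0 handling described in the exec commit above can be modeled in userspace. This is a minimal sketch, not the kernel code: `struct exec_args`, `count_args()`, `fixup_args()` and `empty_argv` are hypothetical names introduced for illustration. It only mirrors the described outcome, where a NULL or immediately terminated argv is replaced by a single empty string and argc is set to 1.

```c
#include <stddef.h>

/* Hypothetical userspace model of the fixup; not the kernel implementation. */
struct exec_args {
    int argc;
    const char *const *argv;
};

/* Replacement vector used when the caller supplied no arguments. */
static const char *const empty_argv[] = { "", NULL };

/* Count entries up to the NULL terminator; a NULL argv counts as zero. */
static int count_args(const char *const *argv)
{
    int n = 0;

    if (argv == NULL)
        return 0;
    while (argv[n] != NULL)
        n++;
    return n;
}

/* Model of the fixup: argc == 0 becomes argc == 1 with argv[0] == "". */
static struct exec_args fixup_args(const char *const *argv)
{
    struct exec_args out;
    int argc = count_args(argv);

    if (argc == 0) {
        out.argc = 1;
        out.argv = empty_argv;
    } else {
        out.argc = argc;
        out.argv = argv;
    }
    return out;
}
```

With this model, `fixup_args(NULL)` and `fixup_args` on a vector holding only the NULL terminator both yield argc == 1 with an empty argv[0], while a normal vector is passed through unchanged. A program that indexes argv[0] or walks from argv[1] therefore never starts past the end of the vector.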
Rafael Aquini 571362221a mm/pagemap: add mmap_assert_locked() annotations to find_vma*()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5b78ed24e8ec48602c1d6f5a188e58d000c81e2b
Author: Luigi Rizzo <lrizzo@google.com>
Date:   Thu Sep 2 14:56:46 2021 -0700

    mm/pagemap: add mmap_assert_locked() annotations to find_vma*()

    find_vma() and variants need protection when used.  This patch adds
    mmap_assert_locked() calls in the functions.

    To make sure the invariant is satisfied, we also need to add a
    mmap_read_lock() around the get_user_pages_remote() call in
    get_arg_page().  The lock is not strictly necessary because the mm has
    been newly created, but the extra cost is limited because the same mutex
    was also acquired shortly before in __bprm_mm_init(), so it is hot and
    uncontended.

    [penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs]
      Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp

    Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com
    Signed-off-by: Luigi Rizzo <lrizzo@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:41:34 -05:00
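
The pattern in the mm/pagemap commit above, where a lookup helper asserts that its caller holds the protecting lock, can be illustrated with a toy userspace analogue. This is a sketch under stated assumptions: `toy_mm`, `toy_vma`, `toy_assert_locked()`, `toy_find_vma()` and `lock_violations` are hypothetical names, the `locked` flag stands in for the real mmap lock, and a counter stands in for the kernel's assertion failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; illustration only. */
struct toy_vma {
    unsigned long start;
    unsigned long end;   /* exclusive */
};

struct toy_mm {
    bool locked;                 /* stands in for the mmap lock state */
    const struct toy_vma *vmas;  /* sorted by address */
    size_t nr;
};

/* Counts calls made without the lock; the kernel would warn or BUG instead. */
static int lock_violations;

static void toy_assert_locked(const struct toy_mm *mm)
{
    if (!mm->locked)
        lock_violations++;
}

/* Return the first area whose end lies above addr, or NULL if none does. */
static const struct toy_vma *toy_find_vma(const struct toy_mm *mm,
                                          unsigned long addr)
{
    size_t i;

    toy_assert_locked(mm);   /* the annotation: callers must hold the lock */
    for (i = 0; i < mm->nr; i++) {
        if (addr < mm->vmas[i].end)
            return &mm->vmas[i];
    }
    return NULL;
}
```

Putting the assertion inside the lookup means every caller is checked, which is why the commit also has to take the lock around the get_user_pages_remote() call in get_arg_page() even though the mm is newly created.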