Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Patrick Talbert	21a6558f73	Merge: CVE-2024-43882: exec: Fix ToCToU between perm check and set-uid/gid usage MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6200 JIRA: https://issues.redhat.com/browse/RHEL-55562 CVE: CVE-2024-43882 ``` exec: Fix ToCToU between perm check and set-uid/gid usage When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a file pointer is passed back. Much later in the execve() code path, the file metadata (specifically mode, uid, and gid) is used to determine if/how to set the uid and gid. However, those values may have changed since the permissions check, meaning the execution may gain unintended privileges. For example, if a file could change permissions from executable and not set-id: ---------x 1 root root 16048 Aug 7 13:16 target to set-id and non-executable: ---S------ 1 root root 16048 Aug 7 13:16 target it is possible to gain root privileges when execution should have been disallowed. While this race condition is rare in real-world scenarios, it has been observed (and proven exploitable) when package managers are updating the setuid bits of installed programs. Such files start with being world-executable but then are adjusted to be group-exec with a set-uid bit. For example, "chmod o-x,u+s target" makes "target" executable only by uid "root" and gid "cdrom", while also becoming setuid-root: -rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target becomes: -rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target But racing the chmod means users without group "cdrom" membership can get the permission to execute "target" just before the chmod, and when the chmod finishes, the exec reaches bprm_fill_uid(), and performs the setuid to root, violating the expressed authorization of "only cdrom group members can setuid to root". Re-check that we still have execute permissions in case the metadata has changed.
It would be better to keep a copy from the perm-check time, but until we can do that refactoring, the least-bad option is to do a full inode_permission() call (under inode lock). It is understood that this is safe against dead-locks, but hardly optimal. Reported-by: Marco Vanotti <mvanotti@google.com> Tested-by: Marco Vanotti <mvanotti@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> (cherry picked from commit f50733b45d865f91db90919f8311e2127ce5a0cb) ``` Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com> --- <small>Created 2025-01-17 01:04 UTC by backporter - [KWF FAQ](https://red.ht/kernel_workflow_doc) - [Slack #team-kernel-workflow](https://redhat-internal.slack.com/archives/C04LRUPMJQ5) - [Source](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/webhook/utils/backporter.py) - [Documentation](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/docs/README.backporter.md) - [Report an issue](https://gitlab.com/cki-project/kernel-workflow/-/issues/new?issue%5Btitle%5D=backporter%20webhook%20issue)</small> Approved-by: Pavel Reichl <preichl@redhat.com> Approved-by: Carlos Maiolino <cmaiolino@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Patrick Talbert <ptalbert@redhat.com>	2025-02-03 10:00:43 -05:00
Patrick Talbert	143d5ac2a9	Merge: CVE-2024-50271: ucounts: Split rlimit and ucount values and max values MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6027 JIRA: https://issues.redhat.com/browse/RHEL-68020 CVE: CVE-2024-50271 - 012f4d5d25e9ef92ee129bd5aa7aa60f692681e1 signal: restore the override_rlimit logic - de399236e240743ad2dd10d719c37b97ddf31996 ucounts: Split rlimit and ucount values and max values Signed-off-by: Radostin Stoyanov <radostin@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: Herton R. Krzesinski <herton@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Patrick Talbert <ptalbert@redhat.com>	2025-02-03 10:00:41 -05:00
CKI Backport Bot	0eec6bb2a5	exec: Fix ToCToU between perm check and set-uid/gid usage JIRA: https://issues.redhat.com/browse/RHEL-55562 CVE: CVE-2024-43882 commit f50733b45d865f91db90919f8311e2127ce5a0cb Author: Kees Cook <kees@kernel.org> Date: Thu Aug 8 11:39:08 2024 -0700 exec: Fix ToCToU between perm check and set-uid/gid usage When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a file pointer is passed back. Much later in the execve() code path, the file metadata (specifically mode, uid, and gid) is used to determine if/how to set the uid and gid. However, those values may have changed since the permissions check, meaning the execution may gain unintended privileges. For example, if a file could change permissions from executable and not set-id: ---------x 1 root root 16048 Aug 7 13:16 target to set-id and non-executable: ---S------ 1 root root 16048 Aug 7 13:16 target it is possible to gain root privileges when execution should have been disallowed. While this race condition is rare in real-world scenarios, it has been observed (and proven exploitable) when package managers are updating the setuid bits of installed programs. Such files start with being world-executable but then are adjusted to be group-exec with a set-uid bit. For example, "chmod o-x,u+s target" makes "target" executable only by uid "root" and gid "cdrom", while also becoming setuid-root: -rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target becomes: -rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target But racing the chmod means users without group "cdrom" membership can get the permission to execute "target" just before the chmod, and when the chmod finishes, the exec reaches bprm_fill_uid(), and performs the setuid to root, violating the expressed authorization of "only cdrom group members can setuid to root". Re-check that we still have execute permissions in case the metadata has changed.
It would be better to keep a copy from the perm-check time, but until we can do that refactoring, the least-bad option is to do a full inode_permission() call (under inode lock). It is understood that this is safe against dead-locks, but hardly optimal. Reported-by: Marco Vanotti <mvanotti@google.com> Tested-by: Marco Vanotti <mvanotti@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>	2025-01-17 01:04:29 +00:00
Radostin Stoyanov	46364cd74c	ucounts: Split rlimit and ucount values and max values JIRA: https://issues.redhat.com/browse/RHEL-68020 CVE: CVE-2024-50271 commit de399236e240743ad2dd10d719c37b97ddf31996 Author: Alexey Gladkov <legion@kernel.org> Date: Wed May 18 19:17:30 2022 +0200 ucounts: Split rlimit and ucount values and max values Since the semantics of maximum rlimit values are different, it would be better not to mix ucount and rlimit values. This will prevent the error of using inc_count/dec_ucount for rlimit parameters. This patch also renames the functions to emphasize the lack of connection between rlimit and ucount. v3: - Fix BUG:KASAN:use-after-free_in_dec_ucount. v2: - Fix the array-index-out-of-bounds that was found by the lkp project. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Radostin Stoyanov <radostin@redhat.com>	2024-12-20 15:31:08 +00:00
Rafael Aquini	c428cfb451	mm/ksm: fix ksm exec support for prctl JIRA: https://issues.redhat.com/browse/RHEL-27745 This patch is a backport of the following upstream commit: commit 3a9e567ca45fb5280065283d10d9a11f0db61d2b Author: Jinjiang Tu <tujinjiang@huawei.com> Date: Thu Mar 28 19:10:08 2024 +0800 mm/ksm: fix ksm exec support for prctl Patch series "mm/ksm: fix ksm exec support for prctl", v4. commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. The first patch fixes the issue. The second patch refactors to prepare for the third patch. The third patch extends the selftests of ksm to verify the deduplication really happens after fork/exec inherits the KSM setting. This patch (of 3): commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag. Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-12-09 12:24:55 -05:00
Rado Vrbovsky	993b335734	Merge: Update arch/{x86,powerpc,arm64}/mm to v6.6 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5391 JIRA: https://issues.redhat.com/browse/RHEL-55461 JIRA: https://issues.redhat.com/browse/RHEL-55465 JIRA: https://issues.redhat.com/browse/RHEL-55462 Depends: !5252 Updated the respective arch mm directories to v6.6. Most of the patches have already been updated or included by the respective arch teams and by Rafael's mm update to v6.6. Dropped the following to avoid issues with the ppc64le build: 41b7a347bf14 powerpc: Book3S 64-bit outline-only KASAN support c7b9ed7c34a9 powerpc/64e: KASAN Full support for BOOK3E/64 Omitted-fix: 7bd6680b47fa Revert "Revert "arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()"" Omitted-fix: 7b59e8ae92fe arm64: dts: qcom: sc7280: Mark SCM as dma-coherent for chrome devices Omitted-fix: a54b7fa6b9ab arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for trogdor Omitted-fix: 9a5f0b11e49e arm64: dts: qcom: sc7180: Mark SCM as dma-coherent for IDP Omitted-fix: cd87d9f58439 x86/mm: further clarify switch_mm_irqs_off() documentation Signed-off-by: Audra Mitchell <audra@redhat.com> Approved-by: Rafael Aquini <raquini@redhat.com> Approved-by: Vladis Dronov <vdronov@redhat.com> Approved-by: Herton R. Krzesinski <herton@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Nico Pache <npache@redhat.com> Approved-by: Lenny Szubowicz <lszubowi@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-11-12 08:02:20 +00:00
Rado Vrbovsky	c154c6dc53	Merge: fs: backport mnt_idmap type MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4324 JIRA: https://issues.redhat.com/browse/RHEL-33888 This MR back ports idmapping changes to sync our RHEL-9 kernel with the upstream kernel to version 6.3. Our current kernel has idmapped mounts support but there have been many changes since this initial implementation in the base kernel. In particular we need the type safety changes and we have seen difficulty back porting other requested changes on more than one occasion. The Jira this MR has been raised for is another example of such a request. It is needed for a back port of a BPF feature to RHEL 9 which allows BPF programs to do file verification with LSM and fsverity. To satisfy this request changes made in the upstream 6.3 kernel are needed which is the reason we have chosen upstream 6.3 as the target release for the MR. The first fix has been omitted because it appears to be the same as 24b5308cf5ee ("selftests/filesystems: grant executable permission to run_fat_tests.sh"). In any case the requirement is to make the path tools/testing/selftests/filesystems/fat/run_fat_tests.sh executable which is done. The second and third Omitted patches are a straight apply and revert leaving the source unchanged. Omitted-Fix: 1d4beeb4edc7 ("selftests/filesystems: grant executable permission to run_fat_tests.sh") Omitted-Fix: 4a47c6385bb4 ovl: turn of SB_POSIXACL with idmapped layers temporarily Omitted-Fix: 7c4d37c269ac Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily" Signed-off-by: Ian Kent <ikent@redhat.com> Approved-by: Scott Mayhew <smayhew@redhat.com> Approved-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: Xin Long <lxin@redhat.com> Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com> Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>	2024-11-11 08:26:30 +00:00
Audra Mitchell	8947f5b14c	lazy tlb: introduce lazy tlb mm refcount helper functions JIRA: https://issues.redhat.com/browse/RHEL-55462 This patch is a backport of the following upstream commit: commit aa464ba9a1e444d5ef95bb63ee3b2ef26fc96ed7 Author: Nicholas Piggin <npiggin@gmail.com> Date: Fri Feb 3 17:18:34 2023 +1000 lazy tlb: introduce lazy tlb mm refcount helper functions Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Audra Mitchell <audra@redhat.com>	2024-11-04 09:14:17 -05:00
Ian Kent	db8603ce12	fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap JIRA: https://issues.redhat.com/browse/RHEL-33888 Status: Linus Conflicts: Dropped hunks for ksmbd because the source is not present in the CentOS Stream source tree. CentOS Stream has commit `bb901646d2` ("ovl: let helper ovl_i_path_real() return the realinode") which wasn't present upstream when this patch was applied, correct manually. CentOS Stream does not have upstream commit c7423dbdbc9ec ("ima: Handle -ESTALE returned by ima_filter_rule_match()") which results in a reject of hunk #3 against security/integrity/ima/ima_policy.c, so manually apply hunk. Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") together with Upstream commit facd61053cff1 ("fuse: fixes after adapting to new posix acl api") results in a conflict in fs/fuse/acl.c, adjust to suit. Update the call to i_uid_into_vfsuid() from 2740f64cb7f00 ("filelocks: use mount idmapping for setlease permission check") to pass an idmap instead of a user namespace. It looks like Linus made a change to the merge request Merge tag 8834147f95056 ("fscache-rewrite-20220111") to account for idmap changes (probably the ones in this commit), so add the change here. commit e67fe63341b8117d7e0d9acf0f1222d5138b9266 Author: Christian Brauner <brauner@kernel.org> Date: Fri Jan 13 12:49:30 2023 +0100 fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns(). Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level.
Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Ian Kent <ikent@redhat.com>	2024-10-16 11:02:01 +08:00
Ian Kent	edf17476c7	fs: port privilege checking helpers to mnt_idmap JIRA: https://issues.redhat.com/browse/RHEL-33888 Status: Linus Conflicts: For consistency drop btrfs hunks because it isn't supported in CentOS Stream and other backports also drop such hunks. Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") together with Upstream commit facd61053cff1 ("fuse: fixes after adapting to new posix acl api") results in a conflict in fs/fuse/acl.c, adjust to suit. commit 9452e93e6dae862d7aeff2b11236d79bde6f9b66 Author: Christian Brauner <brauner@kernel.org> Date: Fri Jan 13 12:49:27 2023 +0100 fs: port privilege checking helpers to mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Ian Kent <ikent@redhat.com>	2024-10-16 10:45:31 +08:00
Ian Kent	304ec491ee	fs: port ->permission() to pass mnt_idmap JIRA: https://issues.redhat.com/browse/RHEL-33888 Status: Linus Conflicts: For consistency drop btrfs hunks because it isn't supported in CentOS Stream and other backports also drop such hunks. CentOS Stream commit `48fa94aacd` ("ceph: fscrypt_auth handling for ceph") is present which causes fuzz 2 in hunk #1 in fs/ceph/super.h. Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for unreachable debug statements") is not present causing fuzz 2 in hunk #1 against fs/exportfs/expfs.c. Dropped hunks for ksmbd because the source is not present in the CentOS Stream source tree. Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq into nameidata") is not present causing a fuzz 1 for hunk #14 against fs/namei.c. CentOS Stream `c4f3dd0731` ("nfsd: handle failure to collect pre/post-op attrs more sanely") is present and causes rejects for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually. Dropped hunks for ntfs3 because the source is not present in the CentOS Stream source tree. CentOS Stream commit `98ba731fc7` ("ovl: Move xattr support to new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get() from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c. CentOS Stream commit `55177e4b83` ("ovl: mark xwhiteouts directory with overlay.opaque='x'") and commit `d17b324bb6` ("ovl: use ovl_numlower() and ovl_lowerstack() accessors") change the first and third hunks of fs/overlayfs/namei.c causing them to fail, manually apply. CentOS Stream commit `98ba731fc7` ("ovl: Move xattr support to new xattrs.c file") causes fuzz 2 in hunk #5 of fs/overlayfs/overlayfs.h CentOS Stream commit `355a9c490a` ("ovl: Add an alternative type of whiteout") changes ovl_cache_update_ino() to ovl_cache_update() in fs/overlayfs/readdir.c, make the change manually.
Upstream commit 217af7e2f4deb ("apparmor: refactor profile rules and attachments") is not in CentOS Stream causing hunk #1 to fail to apply so manually apply the change. commit 4609e1f18e19c3b302e1eb4858334bca1532f780 Author: Christian Brauner <brauner@kernel.org> Date: Fri Jan 13 12:49:22 2023 +0100 fs: port ->permission() to pass mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Ian Kent <ikent@redhat.com>	2024-10-16 10:45:20 +08:00
Ian Kent	bff9bc5749	fs: use type safe idmapping helpers JIRA: https://issues.redhat.com/browse/RHEL-33888 Status: Linus commit a2bd096fb2d7f50fb4db246b33e7bfcf5e2eda3a Author: Christian Brauner <brauner@kernel.org> Date: Wed Jun 22 22:12:16 2022 +0200 fs: use type safe idmapping helpers We already ported most parts and filesystems over for v6.0 to the new vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining places so we can remove all the old helpers. This is a non-functional change. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Ian Kent <ikent@redhat.com>	2024-10-15 16:12:38 +08:00
Ian Kent	c6d247f6b2	bprm_fill_uid(): don't open-code file_inode() JIRA: https://issues.redhat.com/browse/RHEL-33888 Status: Linus commit e6ae43812460450bdb42f14c5813ac42d6bc9067 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Aug 20 11:46:10 2022 -0400 bprm_fill_uid(): don't open-code file_inode() Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ian Kent <ikent@redhat.com>	2024-10-15 16:12:37 +08:00
Rafael Aquini	c98797e544	mm: set up vma iterator for vma_iter_prealloc() calls JIRA: https://issues.redhat.com/browse/RHEL-27743 Conflicts: * context differences on the 1st, 3rd, and 15th hunks due to out-of-order backport of upstream commits ad9f006351c3 ("mm: always lock new vma before inserting into vma tree") and c9d6e982c3f8 ("mm: move vma locking out of vma_prepare and dup_anon_vma"); * context difference on the 4th hunk due to out-of-order backport of upstream commit 1419430c8abb ("mmap: fix vma_iterator in error path of vma_merge()") This patch is a backport of the following upstream commit: commit b5df09226450165c434084d346fcb6d4858b0d52 Author: Liam R. Howlett <Liam.Howlett@oracle.com> Date: Mon Jul 24 14:31:52 2023 -0400 mm: set up vma iterator for vma_iter_prealloc() calls Set the correct limits for vma_iter_prealloc() calls so that the maple tree can be smarter about how many nodes are needed. Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-10-01 11:19:30 -04:00
Rafael Aquini	410830503d	mm: always expand the stack with the mmap write lock held JIRA: https://issues.redhat.com/browse/RHEL-27742 Conflicts: * arch/parisc/mm/fault.c: hunks dropped as there were merge conflicts not worth fixing for this unsupported hardware arch; * drivers/iommu/amd/iommu_v2.c: hunk dropped given out-of-order backport of upstream commit 5a0b11a180a9 ("iommu/amd: Remove iommu_v2 module") * mm/memory.c: differences on the 2nd hunk due to upstream conflict with commit ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") that ended up solved by merge commit 9471f1f2f502 ("Merge branch 'expand-stack'"). This patch is a backport of the following upstream commit: commit 8d7071af890768438c14db6172cc8f9f4d04e184 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Sat Jun 24 13:45:51 2023 -0700 mm: always expand the stack with the mmap write lock held This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64 Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:19 -04:00
Rafael Aquini	15c74a651e	execve: expand new process stack manually ahead of time JIRA: https://issues.redhat.com/browse/RHEL-27742 Conflicts: * differences are because this commit had a merge conflict upstream with commit ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") that ended up solved by merge commit 9471f1f2f502 ("Merge branch 'expand-stack'"). This patch is a backport of the following upstream commit: commit f313c51d26aa87e69633c9b46efb37a930faca71 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon Jun 19 11:34:15 2023 -0700 execve: expand new process stack manually ahead of time This is a small step towards a model where GUP itself would not expand the stack, and any user that needs GUP to not look up existing mappings, but actually expand on them, would have to do so manually before-hand, and with the mm lock held for writing. It turns out that execve() already did almost exactly that, except it didn't take the mm lock at all (it's single-threaded so no locking technically needed, but it could cause lockdep errors). And it only did it for the CONFIG_STACK_GROWSUP case, since in that case GUP has obviously never expanded the stack downwards. So just make that CONFIG_STACK_GROWSUP case do the right thing with locking, and enable it generally. This will eventually help GUP, and in the meantime avoids a special case and the lockdep issue. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:18 -04:00
Rafael Aquini	7bc9b5120c	exec: Remove FOLL_FORCE for stack setup JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit cd57e443831d8eeb083c7165bce195d886e216d4 Author: Kees Cook <keescook@chromium.org> Date: Thu Nov 17 16:31:55 2022 -0800 exec: Remove FOLL_FORCE for stack setup It does not appear that FOLL_FORCE should be needed for setting up the stack pages. They are allocated using the nascent bprm->vma, which was newly created with VM_STACK_FLAGS, which an arch can override, but they all appear to include VM_WRITE | VM_MAYWRITE. Remove FOLL_FORCE. Cc: Eric Biederman <ebiederm@xmission.com> Cc: David Hildenbrand <david@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Link: https://lore.kernel.org/lkml/202211171439.CDE720EAD@keescook/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:17 -04:00
Rafael Aquini	0ce393dc54	mm: make find_extend_vma() fail if write lock not held JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit f440fa1ac955e2898893f9301568435eb5cdfc4b Author: Liam R. Howlett <Liam.Howlett@oracle.com> Date: Fri Jun 16 15:58:54 2023 -0700 mm: make find_extend_vma() fail if write lock not held Make calls to extend_vma() and find_extend_vma() fail if the write lock is required. To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order. Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:17 -04:00
Rafael Aquini	f42421e30d	exec: simplify initial stack size expansion JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit bfb4a2b95875a47a01234f2de113ec089d524e71 Author: Rolf Eike Beer <eb@emlix.com> Date: Wed Oct 19 09:32:35 2022 +0200 exec: simplify initial stack size expansion I had a hard time trying to understand completely why it is using vm_end in one side of the expression and vm_start in the other one, and using something in the "if" clause that is not an exact copy of what is used below. The whole point is that the stack_size variable that was used in the "if" clause is the difference between vm_start and vm_end, which is not far away but makes this thing harder to read than it must be. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/2017429.gqNitNVd0C@mobilepool36.emlix.com Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:37:16 -04:00
Rafael Aquini	e24b3ade32	mm/gup: remove vmas parameter from get_user_pages_remote() JIRA: https://issues.redhat.com/browse/RHEL-27742 Conflicts: - virt/kvm/async_pf.c: minor context diff due to out-of-order backport of upstream commit 08284765f03b7 ("KVM: Get reference to VM's address space in the async #PF worker") This patch is a backport of the following upstream commit: commit ca5e863233e8f6acd1792fd85d6bc2729a1b2c10 Author: Lorenzo Stoakes <lstoakes@gmail.com> Date: Wed May 17 20:25:39 2023 +0100 mm/gup: remove vmas parameter from get_user_pages_remote() The only instances of get_user_pages_remote() invocations which used the vmas parameter were for a single page which can instead simply look up the VMA directly. In particular:- - __update_ref_ctr() looked up the VMA but did nothing with it so we simply remove it. - __access_remote_vm() was already using vma_lookup() when the original lookup failed so by doing the lookup directly this also de-duplicates the code. We are able to perform these VMA operations as we already hold the mmap_lock in order to be able to call get_user_pages_remote(). As part of this work we add get_user_page_vma_remote() which abstracts the VMA lookup, error handling and decrementing the page reference count should the VMA lookup fail. This forms part of a broader set of patches intended to eliminate the vmas parameter altogether. 
[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64) Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390) Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:35:35 -04:00
Aristeu Rozanski	e214620cfb	mm: replace vma->vm_flags direct modifications with modifier calls JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me Conflicts: dropped stuff we don't support when not applying cleanly, left the rest for sake of saving work commit 1c71222e5f2393b5ea1a41795c67589eea7e3490 Author: Suren Baghdasaryan <surenb@google.com> Date: Thu Jan 26 11:37:49 2023 -0800 mm: replace vma->vm_flags direct modifications with modifier calls Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. 
McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:17 -04:00
Aristeu Rozanski	1832e45d48	mm/mmap: don't use __vma_adjust() in shift_arg_pages() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit cf51e86dfbe39b7cae3a9de650d035af22dd5fb4 Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Fri Jan 20 11:26:46 2023 -0500 mm/mmap: don't use __vma_adjust() in shift_arg_pages() Introduce shrink_vma() which uses the vma_prepare() and vma_complete() functions to reduce the vma coverage. Convert shift_arg_pages() to use expand_vma() and the new shrink_vma() function. Remove support from __vma_adjust() to reduce a vma size since shift_arg_pages() is the only user that shrinks a VMA in this way. Link: https://lkml.kernel.org/r/20230120162650.984577-46-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:16 -04:00
Aristeu Rozanski	728b8a88b3	mm: don't use __vma_adjust() in __split_vma() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit b2b3b886738fec5e89ca9ebc720eba1a8f615753 Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Fri Jan 20 11:26:44 2023 -0500 mm: don't use __vma_adjust() in __split_vma() Use the abstracted locking and maple tree operations. Since __split_vma() is the only user of the __vma_adjust() function to use the insert argument, drop that argument. Remove the NULL passed through from fs/exec's shift_arg_pages() and mremap() at the same time. Link: https://lkml.kernel.org/r/20230120162650.984577-44-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:16 -04:00
Aristeu Rozanski	c1ff56ff95	mm: add vma iterator to vma_adjust() arguments JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit b373037fa9bb374f26bbabc0779fe990d02d33b7 Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Fri Jan 20 11:26:37 2023 -0500 mm: add vma iterator to vma_adjust() arguments Change the vma_adjust() function definition to accept the vma iterator and pass it through to __vma_adjust(). Update fs/exec to use the new vma_adjust() function parameters. Update mm/mremap to use the new vma_adjust() function parameters. Revert the __split_vma() calls back from __vma_adjust() to vma_adjust() and pass through the vma iterator. Link: https://lkml.kernel.org/r/20230120162650.984577-37-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:16 -04:00
Aristeu Rozanski	8054ffde35	mm: change mprotect_fixup to vma iterator JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit 2286a6914c776ec34cd97e4573b1466d055cb9de Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Fri Jan 20 11:26:18 2023 -0500 mm: change mprotect_fixup to vma iterator Use the vma iterator so that the iterator can be invalidated or updated to avoid each caller doing so. Link: https://lkml.kernel.org/r/20230120162650.984577-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:14 -04:00
Chris von Recklinghausen	777a83832f	exec: use VMA iterator instead of linked list JIRA: https://issues.redhat.com/browse/RHEL-27736 commit 19066e58682ec156aac8d6cf94b79ab2f122a556 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Tue Sep 6 19:48:56 2022 +0000 exec: use VMA iterator instead of linked list Remove a use of the vm_next list by doing the initial lookup with the VMA iterator and then using it to find the next entry. Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2024-04-01 11:19:50 -04:00
Chris von Recklinghausen	eb370ae179	mm: remove vmacache JIRA: https://issues.redhat.com/browse/RHEL-27736 commit 7964cf8caa4dfa42c4149f3833d3878713cda3dc Author: Liam R. Howlett <Liam.Howlett@Oracle.com> Date: Tue Sep 6 19:48:51 2022 +0000 mm: remove vmacache By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2024-04-01 11:19:46 -04:00
Jeff Moyer	da5eea0749	fsnotify: move fsnotify_open() hook into do_dentry_open() JIRA: https://issues.redhat.com/browse/RHEL-12076 commit 7b8c9d7bb4570ee4800642009c8f2d9756004552 Author: Amir Goldstein <amir73il@gmail.com> Date: Sun Jun 11 15:24:29 2023 +0300 fsnotify: move fsnotify_open() hook into do_dentry_open() fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files. This may make sense for many of the special file open cases, but it is inconsistent with fsnotify_close() hook that is called for every last fput() on a file object with FMODE_OPENED. As a result, it is possible to observe ACCESS, MODIFY and CLOSE events without ever observing an OPEN event. Fix this inconsistency by replacing all the fsnotify_open() hooks with a single hook inside do_dentry_open(). If there are special cases that would like to opt-out of the possible overhead of fsnotify() call in fsnotify_open(), they would probably also want to avoid the overhead of fsnotify() call in the rest of the fsnotify hooks, so they should be opening that file with the __FMODE_NONOTIFY flag. However, in the majority of those cases, the s_fsnotify_connectors optimization in fsnotify_parent() would be sufficient to avoid the overhead of fsnotify() call anyway. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230611122429.1499617-1-amir73il@gmail.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2023-11-02 15:31:54 -04:00
Chris von Recklinghausen	3e2b7d226a	mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region Conflicts: fs/exec.c - We don't have 7964cf8caa4d ("mm: remove vmacache") so keep zeroing tsk->mm->vmacache_seqnum and calling vmacache_flush JIRA: https://issues.redhat.com/browse/RHEL-1848 commit dda1c41a07b4a4c3f99b5b28c1e8c485205fe860 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed Oct 26 15:48:30 2022 +0200 mm: multi-gen LRU: move lru_gen_add_mm() out of IRQ-off region lru_gen_add_mm() has been added within an IRQ-off region in the commit mentioned below. The other invocations of lru_gen_add_mm() are not within an IRQ-off region. The invocation within IRQ-off region is problematic on PREEMPT_RT because the function is using a spin_lock_t which must not be used within IRQ-disabled regions. The other invocations of lru_gen_add_mm() occur while task_struct::alloc_lock is acquired. Move lru_gen_add_mm() after interrupts are enabled and before task_unlock(). Link: https://lkml.kernel.org/r/20221026134830.711887-1-bigeasy@linutronix.de Fixes: bd74fdaea1460 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:15:07 -04:00
Chris von Recklinghausen	b92cce1ea6	mm: multi-gen LRU: support page table walks Conflicts: fs/exec.c - We already have 33a2d6bc3480 ("Revert "fs/exec: allow to unshare a time namespace on vfork+exec"") so don't add call to timens_on_fork back in include/linux/mmzone.h - We already have e6ad640bc404 ("mm: deduplicate cacheline padding code") so keep CACHELINE_PADDING(_pad2_) over ZONE_PADDING(_pad2_) mm/vmscan.c - The backport of badc28d4924b ("mm: shrinkers: fix deadlock in shrinker debugfs") added an #include <linux/debugfs.h>. Keep it. JIRA: https://issues.redhat.com/browse/RHEL-1848 commit bd74fdaea146029e4fa12c6de89adbe0779348a9 Author: Yu Zhao <yuzhao@google.com> Date: Sun Sep 18 02:00:05 2022 -0600 mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. 
Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. 
kernel test robot <lkp@intel.com> [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:13:46 -04:00
Chris von Recklinghausen	321740bffd	Revert "fs/exec: allow to unshare a time namespace on vfork+exec" Conflicts: kernel/fork.c - We already have 2b5f9dad32ed ("fs/exec: switch timens when a task gets a new mm") so don't add back in the code it deleted. JIRA: https://issues.redhat.com/browse/RHEL-1848 commit 33a2d6bc3480f9f8ac8c8def29854f98cc8bfee2 Author: Andrei Vagin <avagin@gmail.com> Date: Tue Sep 13 03:25:51 2022 -0700 Revert "fs/exec: allow to unshare a time namespace on vfork+exec" This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5. Alexey pointed out a few undesirable side effects of the reverted change. First, it doesn't take into account that CLONE_VFORK can be used with CLONE_THREAD. Second, a child process doesn't enter a target time namespace, if its parent dies before the child calls exec. It happens because the parent clears vfork_done. Eric W. Biederman suggests installing a time namespace as a task gets a new mm. It includes all new processes cloned without CLONE_VM and all tasks that call exec(). This is a user API change, but we think there aren't users that depend on the old behavior. It is too late to make such changes in this release, so let's roll back this patch and introduce the right one in the next release. Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:13:08 -04:00
Chris von Recklinghausen	276e42586d	fs/exec: allow to unshare a time namespace on vfork+exec Conflicts: kernel/fork.c - We already have 2b5f9dad32ed ("fs/exec: switch timens when a task gets a new mm") so code to change is gone. JIRA: https://issues.redhat.com/browse/RHEL-1848 commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5 Author: Andrei Vagin <avagin@gmail.com> Date: Sun Jun 12 23:07:22 2022 -0700 fs/exec: allow to unshare a time namespace on vfork+exec Right now, a new process can't be forked in another time namespace if it shares mm with its parent. It is prohibited, because each time namespace has its own vvar page that is mapped into a process address space. When a process calls exec, it gets a new mm and so it could be "legal" to switch time namespace in that case. This was not implemented and now if we want to do this, we need to add another clone flag to not break backward compatibility. We don't have any user requests to switch times on exec except the vfork+exec combination, so there is no reason to add a new clone flag. As for vfork+exec, this should be safe to allow switching timens with the current clone flag. Right now, vfork (CLONE_VFORK | CLONE_VM) fails if a child is forked into another time namespace. With this change, vfork creates a new process in parent's timens, and the following exec does the actual switch to the target time namespace. Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220613060723.197407-1-avagin@gmail.com Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-10-20 06:12:50 -04:00
Chris von Recklinghausen	97f7745102	fs/coredump: move coredump sysctls into its own file Bugzilla: https://bugzilla.redhat.com/2160210 commit f0bc21b268c1464603192a00851cdbbf7c2cdc36 Author: Xiaoming Ni <nixiaoming@huawei.com> Date: Fri Jan 21 22:13:38 2022 -0800 fs/coredump: move coredump sysctls into its own file This moves the fs/coredump.c respective sysctls to its own file. Link: https://lkml.kernel.org/r/20211129211943.640266-6-mcgrof@kernel.org Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Stephen Kitt <steve@sk2.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:18:33 -04:00
Chris von Recklinghausen	c47399a873	fs: move fs/exec.c sysctls into its own file Bugzilla: https://bugzilla.redhat.com/2160210 commit 66ad398634c21e0a42ce10002ae06c39352da0d1 Author: Luis Chamberlain <mcgrof@kernel.org> Date: Fri Jan 21 22:13:17 2022 -0800 fs: move fs/exec.c sysctls into its own file kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move the fs/exec.c respective sysctls to its own file. Since checkpatch complains about style issues with the old code, this move also fixes a few of those minor style issues: * Use pr_warn() instead of printk(WARNING * New empty lines are wanted at the beginning of routines Link: https://lkml.kernel.org/r/20211129205548.605569-9-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2023-03-24 11:18:32 -04:00
Oleg Nesterov	9da64baa41	fs/exec: switch timens when a task gets a new mm Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2116442 commit 2b5f9dad32ed19e8db3b0f10a84aa824a219803b Author: Andrei Vagin <avagin@gmail.com> Date: Tue Sep 20 17:31:19 2022 -0700 fs/exec: switch timens when a task gets a new mm Changing a time namespace requires remapping a vvar page, so we don't want to allow doing that if any other tasks can use the same mm. Currently, we install a time namespace when a task is created with a new vm. exec() is another case when a task gets a new mm and so it can switch a time namespace safely, but it isn't handled now. One more issue of the current interface is that clone() with CLONE_VM isn't allowed if the current task has unshared a time namespace (timens_for_children doesn't match the current timens). Both these issues make some inconvenience for users. For example, Alexey and Florian reported that posix_spawn() uses vfork+exec and this pattern doesn't work with time namespaces due to the both described issues. LXC needed to workaround the exec() issue by calling setns. In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But it adds extra complexity and some undesirable side effects. Eric suggested fixing the issues properly because here are all the reasons to suppose that there are no users that depend on the old behavior. Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Origin-author: "Eric W. 
Biederman" <ebiederm@xmission.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2023-01-11 10:43:02 +01:00
Frantisek Hrbata	e9e9bc8da2	Merge: mm changes through v5.18 for 9.2 Merge conflicts: ----------------- Conflicts with !1142(merged) "io_uring: update to v5.15" fs/io-wq.c - static bool io_wqe_create_worker(struct io_wqe wqe, struct io_wqe_acct acct) !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals") along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142) - static int io_wqe_worker(void data) !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers") Resolved in favor of HEAD(!1142) - static void io_init_new_worker(struct io_wqe wqe, struct io_worker worker, HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread") Resolved in favor of !1370 - static void create_worker_cont(struct callback_head cb) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static void io_workqueue_create(struct work_struct work) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static bool create_io_worker(struct io_wq wq, struct io_wqe wqe, int index) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static bool io_wq_work_match_item(struct io_wq_work work, void data) !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure") Resolved in favor of HEAD(!1142) - static void io_wqe_enqueue(struct io_wqe wqe, struct io_wq_work work) !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure") removed wrongly merged run_cancel label Resolved in favor of HEAD(!1142) - static bool io_task_work_match(struct callback_head cb, void data) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") Resolved in favor of HEAD(!1142) - static void 
io_wq_exit_workers(struct io_wq wq) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") Resolved in favor of HEAD(!1142) - int io_wq_max_workers(struct io_wq wq, int new_count) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") fs/io_uring.c - static int io_register_iowq_max_workers(struct io_ring_ctx ctx, !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Resolved in favor of HEAD(!1142) include/uapi/linux/io_uring.h - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_ constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items") just a comment conflict Resolved in favor of HEAD(!1142) kernel/exit.c - void __noreturn do_exit(long code) - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions") Resolved in favor of !1370 Conflicts with !1357(merged) "NFS refresh for RHEL-9.2" fs/nfs/callback.c - nfs4_callback_svc(void vrqstp) !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed Resolved in favor of HEAD(!1357) fs/nfs/file.c !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio") Resolved in favor of HEAD(!1370) fs/nfsd/nfssvc.c - nfsd(void vrqstp) !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") Resolved in favor of HEAD(!1357) ----------------- MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370 Bugzilla: https://bugzilla.redhat.com/2120352 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722 Patches 1-9 are changes to selftests Patches 10-31 are reverts of RHEL-only patches to address COR CVE Patches 32-320 are the machine dependent mm changes ported by Rafael Patch 321 reverts the backport of 6692c98c7df5. See below. 
Patches 322-981 are the machine independent mm changes Patches 982-1016 are David Hildenbrand's upstream changes to address the COR CVE RHEL commit `b23c298982` fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5 is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added after 40966e316f86. Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: bc369921d670 io-wq: max_worker fixes fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking 
ring lock fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match() fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create() fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774 Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die() unsupported arch Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die() unsupported arch Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c unsupported arch Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `105d2d4832` Merge DRM changes from upstream v5.16..v5.17 Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `99fc716fc4` Merge DRM changes from upstream v5.17..v5.18 Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `105d2d4832` Merge DRM changes from upstream v5.16..v5.17 Omitted-fix: 
751a9d69b197 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `99fc716fc4` Merge DRM changes from upstream v5.17..v5.18 Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter reverted later Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter" revert of above omitted fix Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak unsupported fs Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Lyude Paul <lyude@redhat.com> Approved-by: Donald Dutile <ddutile@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-10-23 19:49:41 +02:00
Frantisek Hrbata	f422c448a1	Merge: io_uring: update to v5.15 MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1142 # Merge Request Required Information ## Summary of Changes Update the io_uring code base and its dependencies to v5.15. We will not enable the functionality at this time, this is only a preparatory patch series. The patch series does touch other subsystems, though: 91ef658fb8b8-namei-ignore-ERR-NULL-names-in-putname.patch 0ee50b47532a-namei-change-filename_parentat-calling-conventions.patch 584d3226d665-namei-make-do_mkdirat-take-struct-filename.patch 7797251bb5ab-namei-make-do_mknodat-take-struct-filename.patch da2d0cede330-namei-make-do_symlinkat-take-struct-filename.patch 8228e2c31319-namei-add-getname_uflags.patch 020250f31c4c-namei-make-do_linkat-take-struct-filename.patch 45f30dab3957-namei-update-do_-helpers-to-return-ints.patch d32f89da7fa8-net-add-accept-helper-not-installing-fd.patch 2112ff5ce0c1-iov_iter-track-truncated-size.patch 0766ec82e5fb-namei-Fix-use-after-free-in-kern_path_locked.patch 8fb0f47a9d7a-iov_iter-add-helper-to-save-iov_iter-state.patch 7dedd3e18077-Revert-iov_iter-track-truncated-size.patch 3a862cacf867-fs-add-anon_inode_getfile_secure-similar-to-anon_inode_getfd_secure.patch As a result, file system, block and networking tests should be run. Omitted-fix: 81132a39c152 ("fs: remove fget_many and fput_many interface") This is outside the scope of this MR, and isn't a "fix" so much as a performance enhancement. ## Approved Bugzilla Ticket Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Approved-by: Antoine Tenart <atenart@redhat.com> Approved-by: Ming Lei <ming.lei@redhat.com> Approved-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-10-21 09:47:33 -04:00
Chris von Recklinghausen	579cdcc5b5	mm/mprotect: use mmu_gather Bugzilla: https://bugzilla.redhat.com/2120352 commit 4a18419f71cdf9155d2d2a6c79546f720978b990 Author: Nadav Amit <namit@vmware.com> Date: Mon May 9 18:20:50 2022 -0700 mm/mprotect: use mmu_gather Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6. This patchset is intended to remove unnecessary TLB flushes during mprotect() syscalls. Once this patch-set make it through, similar and further optimizations for MADV_COLD and userfaultfd would be possible. Basically, there are 3 optimizations in this patch-set: 1. Use TLB batching infrastructure to batch flushes across VMAs and do better/fewer flushes. This would also be handy for later userfaultfd enhancements. 2. Avoid unnecessary TLB flushes. This optimization is the one that provides most of the performance benefits. Unlike previous versions, we now only avoid flushes that would not result in spurious page-faults. 3. Avoiding TLB flushes on change_huge_pmd() that are only needed to prevent the A/D bits from changing. Andrew asked for some benchmark numbers. I do not have an easy determinate macrobenchmark in which it is easy to show benefit. I therefore ran a microbenchmark: a loop that does the following on anonymous memory, just as a sanity check to see that time is saved by avoiding TLB flushes. The loop goes: mprotect(p, PAGE_SIZE, PROT_READ) mprotect(p, PAGE_SIZE, PROT_READ\|PROT_WRITE) p = 0; // make the page writable The test was run in KVM guest with 1 or 2 threads (the second thread was busy-looping). I measured the time (cycles) of each operation: 1 thread 2 threads mmots +patch mmots +patch PROT_READ 3494 2725 (-22%) 8630 7788 (-10%) PROT_READ\|WRITE 3952 2724 (-31%) 9075 2865 (-68%) [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ] The exact numbers are really meaningless, but the benefit is clear. There are 2 interesting results though. (1) PROT_READ is cheaper, while one can expect it not to be affected. 
This is presumably due to a TLB miss that is saved (2) Without memory access (p = 0), the speedup of the patch is even greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush. As a result both operations on the patched kernel take roughly ~1500 cycles (with either 1 or 2 threads), whereas on mmotm their cost is as high as presented in the table. This patch (of 3): change_pXX_range() currently does not use mmu_gather, but instead implements its own deferred TLB flushes scheme. This both complicates the code, as developers need to be aware of different invalidation schemes, and prevents opportunities to avoid TLB flushes or perform them in finer granularity. The use of mmu_gather for modified PTEs has benefits in various scenarios even if pages are not released. For instance, if only a single page needs to be flushed out of a range of many pages, only that page would be flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction can be used instead of 512 instructions (or a full TLB flush, which Linux would actually use by default). mprotect() over multiple VMAs requires a single flush. Use mmu_gather in change_pXX_range(). As the pages are not released, only record the flushed range using tlb_flush_pXX_range(). Handle THP similarly and get rid of flush_cache_range() which becomes redundant since tlb_start_vma() calls it when needed. 
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:28:08 -04:00
Chris von Recklinghausen	34f31c17ed	kthread: Don't allocate kthread_struct for init and umh Bugzilla: https://bugzilla.redhat.com/2120352 commit 343f4c49f2438d8920f1f76fa823ee59b91f02e4 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Mon Apr 11 11:40:14 2022 -0500 kthread: Don't allocate kthread_struct for init and umh If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"). When kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task. Looking a bit deeper this only applies to callers of kernel_execve which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly that copy_thread does not know how to take a kernel mode function to the process with for processes without PF_KTHREAD or PF_IO_WORKER set. Solve this by not allocating kthread_struct for the init process and the user mode helper processes. This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set. I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL. There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. 
These functions are kthread_data(), kthread_parkme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably be expected to be called when it is known that a task is a kthread so that assumption seems reasonable. Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <maximkabox13@gmail.com> Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:28:07 -04:00
Chris von Recklinghausen	ea2fa2fb80	uaccess: remove CONFIG_SET_FS Conflicts: in arch/, only keep changes to arch/Kconfig and arch/arm64/kernel/traps.c. All other arch files in the upstream version of this patch are dropped. Bugzilla: https://bugzilla.redhat.com/2120352 commit 967747bbc084b93b54e66f9047d342232314cd25 Author: Arnd Bergmann <arnd@arndb.de> Date: Fri Feb 11 21:42:45 2022 +0100 uaccess: remove CONFIG_SET_FS There are no remaining callers of set_fs(), so CONFIG_SET_FS can be removed globally, along with the thread_info field and any references to it. This turns access_ok() into a cheaper check against TASK_SIZE_MAX. As CONFIG_SET_FS is now gone, drop all remaining references to set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel(). Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic] Acked-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:45 -04:00
Chris von Recklinghausen	e545ae66be	signal: Remove the helper signal_group_exit Bugzilla: https://bugzilla.redhat.com/2120352 commit 49697335e0b441b0553598c1b48ee9ebb053d2f1 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Thu Jun 24 02:14:30 2021 -0500 signal: Remove the helper signal_group_exit This helper is misleading. It tests for an ongoing exec as well as the process having received a fatal signal. Sometimes it is appropriate to treat an on-going exec differently than a process that is shutting down due to a fatal signal. In particular taking the fast path out of exit_signals instead of retargeting signals is not appropriate during exec, and not changing the exit code in do_group_exit during exec. Removing the helper makes it more obvious what is going on as both cases must be coded for explicitly. While removing the helper, fix the two cases where I have observed that using signal_group_exit resulted in the wrong result. In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are retargeted during an exec. In do_group_exit use 0 as the exit code during an exec as de_thread does not set group_exit_code. As best as I can determine group_exit_code has been set to 0 most of the time during de_thread. During a thread group stop group_exit_code is set to the stop signal and when the thread group receives SIGCONT group_exit_code is reset to 0. Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:36 -04:00
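The distinction drawn in the signal_group_exit commit above can be sketched as a small user-space model. This is only an illustration in Python, not kernel code: the `SignalState` class, the function names other than `signal_group_exit`, and the flag value are assumptions made for the example. It shows why a single combined predicate hides the difference between a fatal group exit and an exec in progress, and how the exit code is chosen once the two cases are tested separately.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative flag value; the model only needs a distinct bit.
SIGNAL_GROUP_EXIT = 0x4


@dataclass
class SignalState:
    """Hypothetical stand-in for the relevant fields of signal_struct."""
    flags: int = 0
    group_exec_task: Optional[str] = None  # set while exec is de-threading
    group_exit_code: int = 0


def old_signal_group_exit(sig: SignalState) -> bool:
    """Model of the removed helper: true for a fatal exit OR an ongoing exec."""
    return bool(sig.flags & SIGNAL_GROUP_EXIT) or sig.group_exec_task is not None


def fatal_group_exit(sig: SignalState) -> bool:
    """Explicit test used in place of the helper where only a fatal exit matters."""
    return bool(sig.flags & SIGNAL_GROUP_EXIT)


def group_exit_code_for(sig: SignalState, requested: int) -> int:
    """Model of the exit-code choice described for do_group_exit."""
    if sig.flags & SIGNAL_GROUP_EXIT:
        return sig.group_exit_code  # a group exit is already under way
    if sig.group_exec_task is not None:
        return 0                    # exec in progress: de_thread sets no exit code
    return requested
```

In this model, a state with only `group_exec_task` set satisfies `old_signal_group_exit` but not `fatal_group_exit`, which is the case the commit message says was handled incorrectly.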
Chris von Recklinghausen	e44c428417	signal: Rename group_exit_task group_exec_task Bugzilla: https://bugzilla.redhat.com/2120352 commit 60700e38fb68e800607ca7a027060d5419fc5798 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Sun Jun 6 13:47:53 2021 -0500 signal: Rename group_exit_task group_exec_task The only remaining user of group_exit_task is exec. Rename the field so that it is clear which part of the code uses it. Update the comment above the definition of group_exec_task to document how it is currently used. Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:36 -04:00
Chris von Recklinghausen	e1e51160dc	kthread: Ensure struct kthread is present for all kthreads Bugzilla: https://bugzilla.redhat.com/2120352 commit 40966e316f86b8cfd83abd31ccb4df729309d3e7 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Thu Dec 2 09:56:14 2021 -0600 kthread: Ensure struct kthread is present for all kthreads Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and threads started with create_kthread want struct kthread present so that is effectively all kernel threads. Make the rule that if PF_KTHREAD and the task is running then struct kthread is present. This will allow the kernel thread code to use tsk->exit_code with different semantics from ordinary processes. To ensure that struct kthread is present for all kernel threads move its allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads. Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread. Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initialized. Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec as exec_mm_release called from exec_mmap has already cleared vfork_done. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:33 -04:00
Chris von Recklinghausen	3623566c7f	signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) Conflicts: drop changes to arch/m68k/kernel/traps.c - unsupported arch Bugzilla: https://bugzilla.redhat.com/2120352 commit e21294a7aaae32c5d7154b187113a04db5852e37 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Mon Oct 25 10:50:57 2021 -0500 signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) Now that force_fatal_sig exists it is unnecessary and a bit confusing to use force_sigsegv in cases where the simpler force_fatal_sig is wanted. So change every instance we can to make the code clearer. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:26 -04:00
Chris von Recklinghausen	d51a01798b	exec: Check for a pending fatal signal instead of core_state Bugzilla: https://bugzilla.redhat.com/2120352 commit 7e3c4fb7fc19bcf20657de3edb718ec1b26c7df3 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Fri Sep 3 10:26:05 2021 -0500 exec: Check for a pending fatal signal instead of core_state Prevent exec continuing when a fatal signal is pending by replacing mmap_read_lock with mmap_read_lock_killable. This is always the right thing to do as userspace will never observe an exec complete when there is a fatal signal pending. With that change it becomes unnecessary to explicitly test for a core dump in progress. In coredump_wait zap_threads arranges under mmap_write_lock for all tasks that use a mm to also have SIGKILL pending, which means mmap_read_lock_killable will always return -EINTR when old_mm->core_state is present. Link: https://lkml.kernel.org/r/87fstux27w.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:24 -04:00
Wander Lairson Costa	b89dd8173e	posix-cpu-timers: Cleanup CPU timers before freeing them during exec Bugzilla: https://bugzilla.redhat.com/2116968 commit e362359ace6f87c201531872486ff295df306d13 Author: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Date: Tue Aug 9 14:07:51 2022 -0300 posix-cpu-timers: Cleanup CPU timers before freeing them during exec Commit `55e8c8eb2c` ("posix-cpu-timers: Store a reference to a pid not a task") started looking up tasks by PID when deleting a CPU timer. When a non-leader thread calls execve, it will switch PIDs with the leader process. Then, as it calls exit_itimers, posix_cpu_timer_del cannot find the task because the timer still points out to the old PID. That means that armed timers won't be disarmed, that is, they won't be removed from the timerqueue_list. exit_itimers will still release their memory, and when that list is later processed, it leads to a use-after-free. Clean up the timers from the de-threaded task before freeing them. This prevents a reported use-after-free. Fixes: `55e8c8eb2c` ("posix-cpu-timers: Store a reference to a pid not a task") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220809170751.164716-1-cascardo@canonical.com Signed-off-by: Wander Lairson Costa <wander@redhat.com>	2022-08-29 16:15:20 -03:00
Wander Lairson Costa	a532f4903a	fix race between exit_itimers() and /proc/pid/timers Bugzilla: https://bugzilla.redhat.com/2116968 commit d5b36a4dbd06c5e8e36ca8ccc552f679069e2946 Author: Oleg Nesterov <oleg@redhat.com> Date: Mon Jul 11 18:16:25 2022 +0200 fix race between exit_itimers() and /proc/pid/timers As Chris explains, the comment above exit_itimers() is not correct, we can race with proc_timers_seq_ops. Change exit_itimers() to clear signal->posix_timers with ->siglock held. Cc: <stable@vger.kernel.org> Reported-by: chris@accessvector.net Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Wander Lairson Costa <wander@redhat.com>	2022-08-29 16:15:01 -03:00
Jeff Moyer	7fcd6b5262	namei: add getname_uflags() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2107656 commit 8228e2c313194f13f1d1806ed5734a26c38d49ac Author: Dmitry Kadashev <dkadashev@gmail.com> Date: Thu Jul 8 13:34:42 2021 +0700 namei: add getname_uflags() There are a couple of places where we already open-code the (flags & AT_EMPTY_PATH) check and io_uring will likely add another one in the future. Let's just add a simple helper getname_uflags() that handles this directly and use it. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/ Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>	2022-07-15 14:58:36 -04:00
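The helper described in the getname_uflags() commit above boils down to translating a user-supplied `AT_EMPTY_PATH` bit into an internal "empty path allowed" lookup flag. The following is a rough user-space model in Python, not the kernel implementation: the `LOOKUP_EMPTY` value and the simplified `getname_flags` body are assumptions for illustration only, and the real functions operate on user pointers and `struct filename`.

```python
import errno
import os

AT_EMPTY_PATH = 0x1000  # Linux uapi value from <fcntl.h>
LOOKUP_EMPTY = 0x4000   # kernel-internal flag; value shown for illustration


def getname_flags(path: str, flags: int) -> str:
    """Simplified model: an empty path is rejected unless LOOKUP_EMPTY is set."""
    if path == "" and not (flags & LOOKUP_EMPTY):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT))
    return path


def getname_uflags(path: str, uflags: int) -> str:
    """Model of the helper: replaces the open-coded (flags & AT_EMPTY_PATH) check."""
    flags = LOOKUP_EMPTY if (uflags & AT_EMPTY_PATH) else 0
    return getname_flags(path, flags)
```

With this shape, callers pass the user-visible flags straight through instead of each repeating the `AT_EMPTY_PATH` test themselves.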
Rafael Aquini	a65076a3c1	exec: Force single empty string when argv is empty Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2097485 This patch is a backport of the following upstream commit: commit dcd46d897adb70d63e025f175a00a89797d31a43 Author: Kees Cook <keescook@chromium.org> Date: Mon Jan 31 16:09:47 2022 -0800 exec: Force single empty string when argv is empty Quoting[1] Ariadne Conill: "In several other operating systems, it is a hard requirement that the second argument to execve(2) be the name of a program, thus prohibiting a scenario where argc < 1. POSIX 2017 also recommends this behaviour, but it is not an explicit requirement[2]: The argument arg0 should point to a filename string that is associated with the process being started by one of the exec functions. ... Interestingly, Michael Kerrisk opened an issue about this in 2008[3], but there was no consensus to support fixing this issue then. Hopefully now that CVE-2021-4034 shows practical exploitative use[4] of this bug in a shellcode, we can reconsider. This issue is being tracked in the KSPP issue tracker[5]." While the initial code searches[6][7] turned up what appeared to be mostly corner case tests, trying to just reject argv == NULL (or an immediately terminated pointer list) quickly started tripping[8] existing userspace programs. The next best approach is forcing a single empty string into argv and adjusting argc to match. The number of programs depending on argc == 0 seems a smaller set than those calling execve with a NULL argv. Account for the additional stack space in bprm_stack_limits(). Inject an empty string when argc == 0 (and set argc = 1). Warn about the case so userspace has some notice about the change: process './argc0' launched './argc0' with NULL argv: empty string added Additionally WARN() and reject NULL argv usage for kernel threads. 
[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/ [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408 [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt [5] https://github.com/KSPP/linux/issues/176 [6] https://codesearch.debian.net/search?q=execve%5C+%5C%28%5B%5E%2C%5D%2B%2C+NULL&literal=0 [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs%5C%28%5B%5E%2C%5D%2B%2C%5CsNULL&literal=0 [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/ Reported-by: Ariadne Conill <ariadne@dereferenced.org> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Ariadne Conill <ariadne@dereferenced.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org Signed-off-by: Rafael Aquini <aquini@redhat.com>	2022-06-23 21:23:01 -04:00
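The empty-argv handling described in the exec commit above can be modeled in a few lines. This is a hedged user-space sketch in Python, not the kernel code path: the function name `fixup_argv`, the `from_kernel` parameter, and the returned warning flag are hypothetical and exist only to illustrate the rule "inject one empty string when argc == 0, but reject the case for kernel callers".

```python
from typing import List, Optional, Sequence, Tuple


def fixup_argv(argv: Optional[Sequence[str]],
               from_kernel: bool = False) -> Tuple[List[str], bool]:
    """Return (argv_seen_by_new_program, warned).

    Models the behaviour described in the commit message:
    - user-space exec with NULL or empty argv gets a single "" injected,
      so the new program always sees argc >= 1, and a warning is noted;
    - a kernel caller passing an empty argv is rejected outright.
    """
    args = list(argv) if argv else []
    if len(args) == 0:
        if from_kernel:
            raise ValueError("empty argv rejected for kernel callers")
        return [""], True   # argc forced to 1, warning emitted
    return args, False
```

A setuid program that blindly reads `argv[1]` after this fixup finds the normal argv terminator rather than walking into the environment block, which is the class of bug the quoted CVE-2021-4034 write-up relied on.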
Rafael Aquini	571362221a	mm/pagemap: add mmap_assert_locked() annotations to find_vma() Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396 This patch is a backport of the following upstream commit: commit 5b78ed24e8ec48602c1d6f5a188e58d000c81e2b Author: Luigi Rizzo <lrizzo@google.com> Date: Thu Sep 2 14:56:46 2021 -0700 mm/pagemap: add mmap_assert_locked() annotations to find_vma() find_vma() and variants need protection when used. This patch adds mmap_assert_locked() calls in the functions. To make sure the invariant is satisfied, we also need to add a mmap_read_lock() around the get_user_pages_remote() call in get_arg_page(). The lock is not strictly necessary because the mm has been newly created, but the extra cost is limited because the same mutex was also acquired shortly before in __bprm_mm_init(), so it is hot and uncontended. [penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs] Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com Signed-off-by: Luigi Rizzo <lrizzo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com>	2021-11-29 11:41:34 -05:00

655 Commits