Commit Graph

461 Commits

Author SHA1 Message Date
Ming Lei ec1abb2bd5 fs: Move enum rw_hint into a new header file
JIRA: https://issues.redhat.com/browse/RHEL-79409

commit fe3944fb245ab99570552a3bf970b00058a9ca6d
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Fri Feb 2 12:39:23 2024 -0800

    fs: Move enum rw_hint into a new header file

    Move enum rw_hint into a new header file to prepare for using this data
    type in the block layer. Add the attribute __packed to reduce the space
    occupied by instances of this data type from four bytes to one byte.
    Change the data type of i_write_hint from u8 into enum rw_hint.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Chao Yu <chao@kernel.org> # for the F2FS part
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20240202203926.2478590-5-bvanassche@acm.org
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2025-03-14 16:48:20 +08:00
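
Note on the size claim above: the effect of the packed attribute on an enum can be sketched in plain C. This is a minimal illustration, assuming GCC or Clang; the enumerator names and the struct are hypothetical stand-ins, not the kernel's enum rw_hint or struct inode.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative enumerators only; these are not the kernel definitions. */
enum hint_plain {
	HINT_PLAIN_NOT_SET,
	HINT_PLAIN_SHORT,
	HINT_PLAIN_LONG
};

/* Same shape, but packed so the compiler picks the smallest integer type. */
enum hint_packed {
	HINT_PACKED_NOT_SET,
	HINT_PACKED_SHORT,
	HINT_PACKED_LONG
} __attribute__((packed));

/* Hypothetical struct: the packed enum fits where a one-byte field used to be. */
struct inode_like {
	unsigned char blkbits;
	enum hint_packed write_hint;
};

size_t plain_hint_size(void)
{
	return sizeof(enum hint_plain);
}

size_t packed_hint_size(void)
{
	return sizeof(enum hint_packed);
}

/* Runtime check of the assumption this sketch relies on. */
void check_hint_sizes(void)
{
	assert(packed_hint_size() == 1);
	assert(packed_hint_size() < plain_hint_size());
}
```

Because the packed enum occupies one byte, a field can change type from a one-byte integer to the enum without growing the containing structure, which is what makes the type change described in the message cheap.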
Rafael Aquini 43fc22497a fs: improve dump_mapping() robustness
JIRA: https://issues.redhat.com/browse/RHEL-27745

This patch is a backport of the following upstream commit:
commit 8b3d838139bcd1e552f1899191f734264ce2a1a5
Author: Baolin Wang <baolin.wang@linux.alibaba.com>
Date:   Tue Jan 16 15:53:35 2024 +0800

    fs: improve dump_mapping() robustness

    We met a kernel crash issue when running stress-ng testing, and the
    system crashes when printing the dentry name in dump_mapping().

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
    pc : dentry_name+0xd8/0x224
    lr : pointer+0x22c/0x370
    sp : ffff800025f134c0
    ......
    Call trace:
      dentry_name+0xd8/0x224
      pointer+0x22c/0x370
      vsnprintf+0x1ec/0x730
      vscnprintf+0x2c/0x60
      vprintk_store+0x70/0x234
      vprintk_emit+0xe0/0x24c
      vprintk_default+0x3c/0x44
      vprintk_func+0x84/0x2d0
      printk+0x64/0x88
      __dump_page+0x52c/0x530
      dump_page+0x14/0x20
      set_migratetype_isolate+0x110/0x224
      start_isolate_page_range+0xc4/0x20c
      offline_pages+0x124/0x474
      memory_block_offline+0x44/0xf4
      memory_subsys_offline+0x3c/0x70
      device_offline+0xf0/0x120
      ......

    The root cause is that, one thread is doing page migration, and we will
    use the target page's ->mapping field to save 'anon_vma' pointer between
    page unmap and page move, and now the target page is locked and refcount
    is 1.

    At the same time, another stress-ng thread is performing memory hotplug,
    attempting to offline the target page that is being migrated. It discovers
    that the refcount of this target page is 1, preventing the offline operation,
    thus proceeding to dump the page. However, page_mapping() of the target
    page may return an incorrect file mapping to crash the system in dump_mapping(),
    since the target page->mapping only saves 'anon_vma' pointer without setting
    PAGE_MAPPING_ANON flag.

    The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate:
    fix getting incorrect page mapping during page migration"). In addition,
    Matthew suggested we should also improve dump_mapping()'s robustness to
    be resilient against the kernel crash [1].

    With checking the 'dentry.parent' and 'dentry.d_name.name' used by
    dentry_name(), I can see dump_mapping() will output the invalid dentry
    instead of crashing the system when this issue is reproduced again.

    [12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07
    [12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870
    [12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff)
    [12211.189150] page_type: 0xffffffff()
    [12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0
    [12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000
    [12211.189155] page dumped because: unmovable page

    [1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/

    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.1705391420.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-12-09 12:24:10 -05:00
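
Note: the idea of checking the dentry fields before formatting them can be sketched in userspace C. This is a simplified model under stated assumptions: struct dentry_model and describe_dentry() are hypothetical names, and a NULL test stands in for the checks on 'dentry.parent' and 'dentry.d_name.name' mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for the fields the message mentions. */
struct dentry_model {
	const struct dentry_model *parent;
	const char *name;
};

/*
 * Format a description of d into buf.
 * Returns 1 if the name was printed, 0 if the dentry was reported as invalid.
 */
int describe_dentry(char *buf, size_t len, const struct dentry_model *d)
{
	assert(buf != NULL && len > 0);

	/* Refuse to follow pointers that are obviously unusable. */
	if (d == NULL || d->parent == NULL || d->name == NULL) {
		snprintf(buf, len, "invalid dentry:%p", (const void *)d);
		return 0;
	}

	snprintf(buf, len, "dentry name:\"%s\"", d->name);
	return 1;
}
```

A NULL test only catches the simplest kind of bad pointer. A non-NULL pointer to garbage would still be dereferenced in this sketch.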
CKI Backport Bot a1abdec3cc fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
JIRA: https://issues.redhat.com/browse/RHEL-64530
CVE: CVE-2024-49934

commit 7f7b850689ac06a62befe26e1fd1806799e7f152
Author: Li Zhijian <lizhijian@fujitsu.com>
Date:   Mon Aug 26 13:55:03 2024 +0800

    fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name

    A crash is observed when hot-removing a memory device while a user is
    accessing hugetlb. See the call trace below:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 14045 at arch/x86/mm/fault.c:1278 do_user_addr_fault+0x2a0/0x790
    Modules linked in: kmem device_dax cxl_mem cxl_pmem cxl_port cxl_pci dax_hmem dax_pmem nd_pmem cxl_acpi nd_btt cxl_core crc32c_intel nvme virtiofs fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc s
    mirror dm_region_hash dm_log dm_mod
    CPU: 1 PID: 14045 Comm: daxctl Not tainted 6.10.0-rc2-lizhijian+ #492
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    RIP: 0010:do_user_addr_fault+0x2a0/0x790
    Code: 48 8b 00 a8 04 0f 84 b5 fe ff ff e9 1c ff ff ff 4c 89 e9 4c 89 e2 be 01 00 00 00 bf 02 00 00 00 e8 b5 ef 24 00 e9 42 fe ff ff <0f> 0b 48 83 c4 08 4c 89 ea 48 89 ee 4c 89 e7 5b 5d 41 5c 41 5d 41
    RSP: 0000:ffffc90000a575f0 EFLAGS: 00010046
    RAX: ffff88800c303600 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000001000 RSI: ffffffff82504162 RDI: ffffffff824b2c36
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a57658
    R13: 0000000000001000 R14: ffff88800bc2e040 R15: 0000000000000000
    FS:  00007f51cb57d880(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000001000 CR3: 00000000072e2004 CR4: 00000000001706f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     ? __warn+0x8d/0x190
     ? do_user_addr_fault+0x2a0/0x790
     ? report_bug+0x1c3/0x1d0
     ? handle_bug+0x3c/0x70
     ? exc_invalid_op+0x14/0x70
     ? asm_exc_invalid_op+0x16/0x20
     ? do_user_addr_fault+0x2a0/0x790
     ? exc_page_fault+0x31/0x200
     exc_page_fault+0x68/0x200
    <...snip...>
    BUG: unable to handle page fault for address: 0000000000001000
     #PF: supervisor read access in kernel mode
     #PF: error_code(0x0000) - not-present page
     PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
     Oops: Oops: 0000 [#1] PREEMPT SMP PTI
     ---[ end trace 0000000000000000 ]---
     BUG: unable to handle page fault for address: 0000000000001000
     #PF: supervisor read access in kernel mode
     #PF: error_code(0x0000) - not-present page
     PGD 800000000ad92067 P4D 800000000ad92067 PUD 7677067 PMD 0
     Oops: Oops: 0000 [#1] PREEMPT SMP PTI
     CPU: 1 PID: 14045 Comm: daxctl Kdump: loaded Tainted: G        W          6.10.0-rc2-lizhijian+ #492
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
     RIP: 0010:dentry_name+0x1f4/0x440
    <...snip...>
    ? dentry_name+0x2fa/0x440
    vsnprintf+0x1f3/0x4f0
    vprintk_store+0x23a/0x540
    vprintk_emit+0x6d/0x330
    _printk+0x58/0x80
    dump_mapping+0x10b/0x1a0
    ? __pfx_free_object_rcu+0x10/0x10
    __dump_page+0x26b/0x3e0
    ? vprintk_emit+0xe0/0x330
    ? _printk+0x58/0x80
    ? dump_page+0x17/0x50
    dump_page+0x17/0x50
    do_migrate_range+0x2f7/0x7f0
    ? do_migrate_range+0x42/0x7f0
    ? offline_pages+0x2f4/0x8c0
    offline_pages+0x60a/0x8c0
    memory_subsys_offline+0x9f/0x1c0
    ? lockdep_hardirqs_on+0x77/0x100
    ? _raw_spin_unlock_irqrestore+0x38/0x60
    device_offline+0xe3/0x110
    state_store+0x6e/0xc0
    kernfs_fop_write_iter+0x143/0x200
    vfs_write+0x39f/0x560
    ksys_write+0x65/0xf0
    do_syscall_64+0x62/0x130

    Previously, some sanity checks were added to dump_mapping() before the
    print facility parses '%pd', but it is still possible to run into an
    invalid dentry.d_name.name.

    Since dump_mapping() only needs to dump the filename, retrieve it
    by itself in a safer way to prevent an unnecessary crash.

    Note that whether the filename is retrieved with '%pd' or with
    strncpy_from_kernel_nofault(), it could be unreliable.

    Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
    Link: https://lore.kernel.org/r/20240826055503.1522320-1-lizhijian@fujitsu.com
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-11-28 00:29:48 +00:00
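
Note: the "copy into a local buffer, fall back to a placeholder" pattern described above can be sketched in userspace C. This is only a model: copy_name_checked() is a hypothetical stand-in that rejects NULL, whereas the kernel's strncpy_from_kernel_nofault() can also survive faulting addresses, which ordinary userspace code cannot reproduce. The buffer size and the "<invalid>" placeholder are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_BUF_LEN 64

/*
 * Hypothetical stand-in for a fault-tolerant string copy.
 * Copies at most len - 1 bytes and always NUL-terminates.
 * Returns 0 on success, -1 if the source cannot be read (here: NULL only).
 */
int copy_name_checked(char *dst, const char *src, size_t len)
{
	size_t n;

	assert(dst != NULL && len > 0);
	if (src == NULL)
		return -1;

	for (n = 0; n + 1 < len && src[n] != '\0'; n++)
		dst[n] = src[n];
	dst[n] = '\0';
	return 0;
}

/* Fill out (NAME_BUF_LEN bytes) with the name, or a placeholder on failure. */
void dump_name(char *out, const char *maybe_name)
{
	if (copy_name_checked(out, maybe_name, NAME_BUF_LEN) < 0)
		strcpy(out, "<invalid>");
}
```

The caller only ever prints its own bounded buffer, so a bad source pointer degrades the output rather than the system. As the message notes, the name that is printed may still be unreliable.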
Rado Vrbovsky 6c6c3d9cfe Merge: XFS: Update #2 for RHEL9.6 (upstream v6.6-6.7)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5633

XFS: update for RHEL9.6. Backport upstream v6.6-6.7, including fixes from after v6.7.

JIRA: https://issues.redhat.com/browse/RHEL-62760

Depends: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5188

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>

Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Brian Foster <bfoster@redhat.com>
Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Eric Sandeen <esandeen@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-25 13:17:37 +00:00
Bill O'Donnell 770d4ac268 filemap: add a per-mapping stable writes flag
JIRA: https://issues.redhat.com/browse/RHEL-62760

Conflicts: context errors in pagemap.h

commit 762321dab9a72760bf9aec48362f932717c9424d
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Oct 25 16:10:17 2023 +0200

    filemap: add a per-mapping stable writes flag

    folio_wait_stable waits for writeback to finish before modifying the
    contents of a folio again, e.g. to support checksumming of the data
    in the block integrity code.

    Currently this behavior is controlled by the SB_I_STABLE_WRITES flag
    on the super_block, which means it is uniform for the entire file system.
    This is wrong for the block device pseudofs which is shared by all
    block devices, or file systems that can use multiple devices like XFS
    with the RT subvolume or btrfs (although btrfs currently reimplements
    folio_wait_stable anyway).

    Add a per-address_space AS_STABLE_WRITES flag to control the behavior
    in a more fine grained way.  The existing SB_I_STABLE_WRITES is kept
    to initialize AS_STABLE_WRITES to the existing default which covers
    most cases.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231025141020.192413-2-hch@lst.de
    Tested-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2024-11-09 10:06:46 -06:00
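
Note: the relationship between the superblock-wide default and the per-mapping flag can be sketched as follows. This is a toy model; the struct names, flag values, and helper names are hypothetical and only mirror the roles of SB_I_STABLE_WRITES and AS_STABLE_WRITES described in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SB_STABLE_WRITES (1u << 0) /* plays the role of SB_I_STABLE_WRITES */
#define MODEL_AS_STABLE_WRITES (1u << 0) /* plays the role of AS_STABLE_WRITES */

struct sb_model {
	unsigned int iflags;
};

struct mapping_model {
	unsigned int flags;
};

/* A new mapping inherits the file-system-wide default once, at init time. */
void model_mapping_init(struct mapping_model *m, const struct sb_model *sb)
{
	assert(m != NULL && sb != NULL);
	m->flags = 0;
	if (sb->iflags & MODEL_SB_STABLE_WRITES)
		m->flags |= MODEL_AS_STABLE_WRITES;
}

/* Afterwards each mapping can be adjusted independently. */
void model_mapping_set_stable(struct mapping_model *m)
{
	m->flags |= MODEL_AS_STABLE_WRITES;
}

void model_mapping_clear_stable(struct mapping_model *m)
{
	m->flags &= ~MODEL_AS_STABLE_WRITES;
}

bool model_mapping_stable(const struct mapping_model *m)
{
	return (m->flags & MODEL_AS_STABLE_WRITES) != 0;
}
```

With this split, two mappings that share one superblock (for example, two block devices on the shared pseudo file system) can disagree about whether writers must wait for writeback.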
Scott Mayhew f67a532577 fs: drop the timespec64 arg from generic_update_time
JIRA: https://issues.redhat.com/browse/RHEL-59704

commit 541d4c798a598854fcce7326d947cbcbd35701d6
Author: Jeff Layton <jlayton@kernel.org>
Date:   Mon Aug 7 15:38:34 2023 -0400

    fs: drop the timespec64 arg from generic_update_time

    In future patches we're going to change how the ctime is updated
    to keep track of when it has been queried. The way that the update_time
    operation works (and a lot of its callers) make this difficult, since
    they grab a timestamp early and then pass it down to eventually be
    copied into the inode.

    All of the existing update_time callers pass in the result of
    current_time() in some fashion. Drop the "time" parameter from
    generic_update_time, and rework it to fetch its own timestamp.

    This change means that an update_time could fetch a different timestamp
    than was seen in inode_needs_update_time. update_time is only ever
    called with one of two flag combinations: Either S_ATIME is set, or
    S_MTIME|S_CTIME|S_VERSION are set.

    With this change we now treat the flags argument as an indicator that
    some value needed to be updated when last checked, rather than an
    indication to update specific timestamps.

    Rework the logic for updating the timestamps and put it in a new
    inode_update_timestamps helper that other update_time routines can use.
    S_ATIME is treated as it always has been, but if any of the other three
    are set, then we attempt to update all three.

    Also, some callers of generic_update_time need to know what timestamps
    were actually updated. Change it to return an S_* flag mask to indicate
    that and rework the callers to expect it.

    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
2024-10-25 12:35:45 -04:00
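
Note: the flag handling described above can be sketched with a small model. Everything here is an assumption made for illustration: the MODEL_* flags, struct inode_model, the integer clock, and the version bump are simplified stand-ins, not the kernel's inode_update_timestamps().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ATIME   1
#define MODEL_MTIME   2
#define MODEL_CTIME   4
#define MODEL_VERSION 8

struct inode_model {
	long atime;
	long mtime;
	long ctime;
	unsigned long version;
};

/* Fake clock so the example is deterministic; callers may change it. */
long model_clock = 0;

long model_current_time(void)
{
	return model_clock;
}

/*
 * The helper fetches its own timestamp instead of receiving one.
 * Returns a mask of the fields that actually changed.
 */
int model_update_timestamps(struct inode_model *inode, int flags)
{
	int updated = 0;
	long now = model_current_time();

	assert(inode != NULL);

	/* Any of the three "modification" flags triggers an attempt on all three. */
	if (flags & (MODEL_MTIME | MODEL_CTIME | MODEL_VERSION)) {
		if (inode->ctime != now) {
			inode->ctime = now;
			updated |= MODEL_CTIME;
		}
		if (inode->mtime != now) {
			inode->mtime = now;
			updated |= MODEL_MTIME;
		}
		/* Simplification: bump the version only when a time changed. */
		if (updated) {
			inode->version++;
			updated |= MODEL_VERSION;
		}
	}

	/* The access time is handled on its own. */
	if ((flags & MODEL_ATIME) && inode->atime != now) {
		inode->atime = now;
		updated |= MODEL_ATIME;
	}

	return updated;
}
```

Returning the mask lets a caller react only to the fields that really changed, which is the reason the message gives for changing the return value.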
Scott Mayhew 0874600fa0 fs: convert to ctime accessor functions
JIRA: https://issues.redhat.com/browse/RHEL-59704

commit 2276e5ba8567f683c49a36ba885d0fe6abe2b45e
Author: Jeff Layton <jlayton@kernel.org>
Date:   Wed Jul 5 15:00:50 2023 -0400

    fs: convert to ctime accessor functions

    In later patches, we're going to change how the inode's ctime field is
    used. Switch to using accessor functions instead of raw accesses of
    inode->i_ctime.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Message-Id: <20230705190309.579783-23-jlayton@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
2024-10-25 12:35:44 -04:00
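
Note: the accessor pattern can be sketched in userspace C. The struct and function names below are hypothetical models; the point is only that callers stop touching the field directly, so its representation can change later without another tree-wide edit.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical inode model; callers should treat ctime_private as opaque. */
struct inode_model {
	struct timespec ctime_private;
};

/* Read the change time through an accessor instead of the raw field. */
struct timespec model_get_ctime(const struct inode_model *inode)
{
	assert(inode != NULL);
	return inode->ctime_private;
}

/* Store a full timestamp and hand it back for convenient chaining. */
struct timespec model_set_ctime_to_ts(struct inode_model *inode,
				      struct timespec ts)
{
	assert(inode != NULL);
	inode->ctime_private = ts;
	return ts;
}

/* Convenience wrapper for callers that hold seconds and nanoseconds. */
struct timespec model_set_ctime(struct inode_model *inode,
				time_t sec, long nsec)
{
	struct timespec ts;

	ts.tv_sec = sec;
	ts.tv_nsec = nsec;
	return model_set_ctime_to_ts(inode, ts);
}
```

A caller that previously wrote to the field directly would instead call model_set_ctime(inode, sec, nsec) and read the value back with model_get_ctime(inode).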
Ian Kent 4f34d24b81 fs: port fs{g,u}id helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit c14329d39f2daa8132e1bbe5cc531da387bcf44a
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:31 2023 +0100

    fs: port fs{g,u}id helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 12:30:42 +08:00
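
Note: the "impossible to conflate" argument is about types, and can be sketched with a toy model. The structs, the offset arithmetic, and the function names are all hypothetical; real idmappings are not simple offsets. The sketch only shows why a dedicated type for the mount's mapping turns a silent argument swap into a compile error.

```c
#include <assert.h>
#include <stddef.h>

/* Toy namespace: maps an id by a fixed offset (illustration only). */
struct userns_model {
	long offset;
};

/* Dedicated type for the mapping attached to a mount. */
struct idmap_model {
	struct userns_model owner;
};

/* Old style: two parameters of the same type, so swapping them compiles. */
long map_id_two_ns(const struct userns_model *mnt_ns,
		   const struct userns_model *fs_ns, long id)
{
	assert(mnt_ns != NULL && fs_ns != NULL);
	return id - fs_ns->offset + mnt_ns->offset;
}

/* New style: the mount side has its own type, so a swap is rejected. */
long map_id_idmap(const struct idmap_model *idmap,
		  const struct userns_model *fs_ns, long id)
{
	assert(idmap != NULL && fs_ns != NULL);
	return id - fs_ns->offset + idmap->owner.offset;
}
```

In this model, calling map_id_two_ns() with the two namespaces reversed compiles and quietly yields a different id, while passing a struct userns_model where map_id_idmap() expects a struct idmap_model is diagnosed by the compiler as an incompatible pointer type.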
Ian Kent db8603ce12 fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream has commit bb901646d2 ("ovl: let helper
	ovl_i_path_real() return the realinode") which wasn't present
	upstream when this patch was applied, correct manually.
	CentOS Stream does not have upstream commit c7423dbdbc9ec
	("ima: Handle -ESTALE returned by ima_filter_rule_match()")
	which results in a reject of hunk #3 against
	security/integrity/ima/ima_policy.c, so manually apply hunk.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.
	Update the call to i_uid_into_vfsuid() from 2740f64cb7f00
	("filelocks: use mount idmapping for setlease permission check")
	to pass an idmap instead of a user namespace.
	It looks like Linus made a change to the merge request "Merge tag
	8834147f95056 ("fscache-rewrite-20220111") to account for idmap
	changes (probably the ones in this commit), so add the change here.

commit e67fe63341b8117d7e0d9acf0f1222d5138b9266
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:30 2023 +0100

    fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap

    Convert to struct mnt_idmap.
    Remove legacy file_mnt_user_ns() and mnt_user_ns().

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 11:02:01 +08:00
Ian Kent edf17476c7 fs: port privilege checking helpers to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	Upstream merge commit 05e6295f7b5e0 ("Merge tag 'fs.idmapped.v6.3'
	of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping")
	together with Upstream commit facd61053cff1 ("fuse: fixes after
	adapting to new posix acl api") results in a conflict in
	fs/fuse/acl.c, adjust to suit.

commit 9452e93e6dae862d7aeff2b11236d79bde6f9b66
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:27 2023 +0100

    fs: port privilege checking helpers to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:31 +08:00
Ian Kent dcafde7644 fs: port inode_owner_or_capable() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
        CentOS Stream and other backports also drop such hunks.
        CentOS Stream does not have upstream commit 3db1de0e582c3 ("f2fs:
        change the current atomic write way") which results in a reject for
        hunk #1 in f2fs_ioc_start_atomic_write() of fs/f2fs/file.c, so
        make the required change manually.
	Upstream commit 7bc155fec5b371 ("f2fs: kill volatile write
	support") is not present in CentOS Stream so make the additional
	changes needed.

commit 01beba7957a26f9b7179127e8ad56bb5a0f56138
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:26 2023 +0100

    fs: port inode_owner_or_capable() to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:27 +08:00
Ian Kent 2171c567b5 fs: port inode_init_owner() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream does not have upstream commit 3db1de0e582c3 ("f2fs:
	change the current atomic write way") so there is no call to
	f2fs_get_tmpfile() in f2fs_ioc_start_atomic_write() to change.
	The above patch also adds the definition of f2fs_get_tmpfile()
	to fs/f2fs/f2fs.h so it's not there to change resulting in a
	hunk reject for fs/f2fs/f2fs.h.
        Upstream commit 787caf1bdcd9f ("f2fs: fix to enable compress for
        newly created file if extension matches") is not present in CentOS
        Stream resulting in a number of rejects against fs/f2fs/namei.c,
        manually apply these changes.
	Dropped hunks for ntfs3 because the source is not present in
	the CentOS Stream source tree.
	CentOS Stream commit 892da692fa ("shmem: support idmapped
	mounts for tmpfs") which causes a reject in fs/shmem.c, manually
	apply the hunk (note: taking account of these changes at the times
	they are needed will result in an updated mm/shmem.c once this
	series is completed).
	Update to add incremental changes needed due to CentOS Stream
	commit 469e1d13f6 ("shmem: quota support").

commit f2d40141d5d90b882e2c35b226f9244a63b82b6e
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:25 2023 +0100

    fs: port inode_init_owner() to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:26 +08:00
Ian Kent 304ec491ee fs: port ->permission() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream commit 48fa94aacd ("ceph: fscrypt_auth handling
	for ceph") is present which causes fuzz 2 in hunk #1 in
	fs/ceph/super.h.
	Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for
	unreachable debug statements") is not present causing fuzz 2
	in hunk #1 against fs/exportfs/expfs.c.
	Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq
	into nameidata") is not present causing a fuzz 1 for hunk #14
	against fs/namei.c.
	CentOS Stream c4f3dd0731 ("nfsd: handle failure to collect
	pre/post-op attrs more sanely") is present and causes rejects
	for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get()
	from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes
	hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c.
	CentOS Stream commit 55177e4b83 ("ovl: mark xwhiteouts directory
	with overlay.opaque='x'") and commit d17b324bb6 ("ovl: use
	ovl_numlower() and ovl_lowerstack() accessors") change the first
	and third hunks of fs/overlayfs/namei.c causing them to fail,
	manually apply.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") causes fuzz 2 in hunk #5 of
	fs/overlayfs/overlayfs.h
	CentOS Stream commit 355a9c490a ("ovl: Add an alternative
	type of whiteout") changes ovl_cache_update_ino() to
	ovl_cache_update() in fs/overlayfs/readdir.c, make the change
	manually.
	Upstream commit 217af7e2f4deb ("apparmor: refactor profile
	rules and attachments") is not in CentOS Stream causing hunk #1
	to fail to apply so manually apply the change.

commit 4609e1f18e19c3b302e1eb4858334bca1532f780
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:22 2023 +0100

    fs: port ->permission() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:20 +08:00
Ian Kent f7a70a9fc1 fs: port vfs_*() helpers to struct mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: There was a whitespace difference possibly due to CentOS
	Stream commit c912400e45 ("fs: Fix description of
	vfs_tmpfile()")
	CentOS Stream commit c4f3dd0731 ("nfsd: handle failure to
	collect pre/post-op attrs more sanely") is present which caused
	a hunk reject in fs/nfsd/nfs3proc.c and two hunks to be rejected
	in fs/nfsd/vfs.c the hunks were manually applied.
	Upstream commit 79b05beaa5c34 ("af_unix: Acquire/Release
	per-netns hash table's locks.") is not present in CentOS Stream
	fixed the conflict manually.
	Dropped ksmbd hunks, ksmbd source is not present.
	Upstream commit 3350607dc5637 ("security: Create file_truncate
	hook from path_truncate hook") is not present in CentOS Stream.

commit abf08576afe31506b812c8c1be9714f78613f300
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:10 2023 +0100

    fs: port vfs_*() helpers to struct mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 08:29:51 +08:00
Ian Kent bff9bc5749 fs: use type safe idmapping helpers
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit a2bd096fb2d7f50fb4db246b33e7bfcf5e2eda3a
Author: Christian Brauner <brauner@kernel.org>
Date:   Wed Jun 22 22:12:16 2022 +0200

    fs: use type safe idmapping helpers

    We already ported most parts and filesystems over for v6.0 to the new
    vfs{g,u}id_t type and associated helpers. Convert the remaining
    places so we can remove all the old helpers.
    This is a non-functional change.

    Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:38 +08:00
Ian Kent dc1f3bea48 attr: use consistent sgid stripping checks
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:37 2022 +0200

    attr: use consistent sgid stripping checks

    Currently setgid stripping in file_remove_privs()'s should_remove_suid()
    helper is inconsistent with other parts of the vfs. Specifically, it only
    raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
    inode isn't in the caller's groups and the caller isn't privileged over the
    inode although we require this already in setattr_prepare() and
    setattr_copy() and so all filesystem implement this requirement implicitly
    because they have to use setattr_{prepare,copy}() anyway.

    But the inconsistency shows up in setgid stripping bugs for overlayfs in
    xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
    generic/687). For example, we test whether suid and setgid stripping works
    correctly when performing various write-like operations as an unprivileged
    user (fallocate, reflink, write, etc.):

    echo "Test 1 - qa_user, non-exec file $verb"
    setup_testfile
    chmod a+rws $junk_file
    commit_and_check "$qa_user" "$verb" 64k 64k

    The test basically creates a file with 6666 permissions. While the file has
    the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
    regular filesystem like xfs what will happen is:

    sys_fallocate()
    -> vfs_fallocate()
       -> xfs_file_fallocate()
          -> file_modified()
             -> __file_remove_privs()
                -> dentry_needs_remove_privs()
                   -> should_remove_suid()
                -> __remove_privs()
                   newattrs.ia_valid = ATTR_FORCE | kill;
                   -> notify_change()
                      -> setattr_copy()

    In should_remove_suid() we can see that ATTR_KILL_SUID is raised
    unconditionally because the file in the test has S_ISUID set.

    But we also see that ATTR_KILL_SGID won't be set because while the file
    is S_ISGID it is not S_IXGRP (see above) which is a condition for
    ATTR_KILL_SGID being raised.

    So by the time we call notify_change() we have attr->ia_valid set to
    ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
    ATTR_KILL_SUID is set and does:

    ia_valid = attr->ia_valid |= ATTR_MODE
    attr->ia_mode = (inode->i_mode & ~S_ISUID);

    which means that when we call setattr_copy() later we will definitely
    update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

    Now we call into the filesystem's ->setattr() inode operation which will
    end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

    if (ia_valid & ATTR_MODE) {
            umode_t mode = attr->ia_mode;
            vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
            if (!vfsgid_in_group_p(vfsgid) &&
                !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                    mode &= ~S_ISGID;
            inode->i_mode = mode;
    }

    and since the caller in the test is neither capable nor in the group of the
    inode the S_ISGID bit is stripped.

    But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
    has the consequence that neither the setgid nor the suid bits are stripped
    even though it should be stripped because the inode isn't in the caller's
    groups and the caller isn't privileged over the inode.

    If overlayfs is in the mix things become a bit more complicated and the bug
    shows up more clearly. When e.g., ovl_setattr() is hit from
    ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
    ATTR_KILL_SGID might be raised but because the check in notify_change() is
    questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
    stripped the S_ISGID bit isn't removed even though it should be stripped:

    sys_fallocate()
    -> vfs_fallocate()
       -> ovl_fallocate()
          -> file_remove_privs()
             -> dentry_needs_remove_privs()
                -> should_remove_suid()
             -> __remove_privs()
                newattrs.ia_valid = ATTR_FORCE | kill;
                -> notify_change()
                   -> ovl_setattr()
                      // TAKE ON MOUNTER'S CREDS
                      -> ovl_do_notify_change()
                         -> notify_change()
                      // GIVE UP MOUNTER'S CREDS
         // TAKE ON MOUNTER'S CREDS
         -> vfs_fallocate()
            -> xfs_file_fallocate()
               -> file_modified()
                  -> __file_remove_privs()
                     -> dentry_needs_remove_privs()
                        -> should_remove_suid()
                     -> __remove_privs()
                        newattrs.ia_valid = ATTR_FORCE | kill;
                        -> notify_change()

    The fix for all of this is to make file_remove_privs()'s
    should_remove_suid() helper to perform the same checks as we already
    require in setattr_prepare() and setattr_copy() and have notify_change()
    not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
    first place because the caller must calculate the flags via
    should_remove_suid() anyway which would raise ATTR_KILL_SGID.

    While we're at it we move should_remove_suid() from inode.c to attr.c
    where it belongs with the rest of the iattr helpers. Especially since it
    returns ATTR_KILL_S{G,U}ID flags. We also rename it to
    setattr_should_drop_suidgid() to better reflect that it indicates both
    setuid and setgid bit removal and also that it returns attr flags.

    Running xfstests with this doesn't report any regressions. We should really
    try and use consistent checks.

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:34 +08:00
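
The difference between the old and the unified setgid check described in "attr: use consistent sgid stripping checks" can be modelled in userspace. This is only a sketch under stated assumptions: KILL_SUID, KILL_SGID and both function names are hypothetical stand-ins for the kernel's ATTR_KILL_S{U,G}ID flags and helpers, the caller's group membership and privilege are collapsed into a single boolean, and the additional CAP_FSETID and regular-file conditions of the real helper are left out.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical stand-ins for the kernel's ATTR_KILL_SUID / ATTR_KILL_SGID. */
#define KILL_SUID 0x1
#define KILL_SGID 0x2

/* Old rule: setgid is only killed when the group-execute bit is also set. */
static int old_should_remove_suid(unsigned int mode)
{
	int kill = 0;

	if (mode & S_ISUID)
		kill |= KILL_SUID;
	if ((mode & S_ISGID) && (mode & S_IXGRP))
		kill |= KILL_SGID;
	return kill;
}

/*
 * Unified rule: setgid is also killed when the caller is neither in the
 * inode's group nor privileged over it (modelled by one boolean here).
 */
static int new_should_drop_suidgid(unsigned int mode, bool in_group_or_capable)
{
	int kill = 0;

	if (mode & S_ISUID)
		kill |= KILL_SUID;
	if (mode & S_ISGID) {
		if ((mode & S_IXGRP) || !in_group_or_capable)
			kill |= KILL_SGID;
	}
	return kill;
}
```

With the 6666 test file from the commit message and an unprivileged caller outside the group, the old rule reports only the setuid flag while the unified rule reports both. For a 2666 file the old rule reports nothing at all, which is the gap the commit closes.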
Ian Kent b064d0b523 fs: move should_remove_suid()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

commit e243e3f94c804ecca9a8241b5babe28f35258ef4
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:35 2022 +0200

    fs: move should_remove_suid()

    Move the helper from inode.c to attr.c. This keeps the core of the
    set{g,u}id stripping logic in one place when we add follow-up changes.
    It is the better place anyway, since should_remove_suid() returns
    ATTR_KILL_S{G,U}ID flags.

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:25 +08:00
Ian Kent d8268a324b attr: add in_group_or_capable()
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: CentOS Stream has commit 42bfe37a25 ("fs: add ctime
	accessors infrastructure") so adjust the context to match.

commit 11c2a8700cdcabf9b639b7204a1e38e2a0b6798e
Author: Christian Brauner <brauner@kernel.org>
Date:   Mon Oct 17 17:06:34 2022 +0200

    attr: add in_group_or_capable()

    In setattr_{copy,prepare}() we need to perform the same permission
    checks to determine whether we need to drop the setgid bit or not.
    Instead of open-coding it twice add a simple helper that encapsulates the
    logic. We will reuse this helper to make dropping the setgid bit during
    write operations more consistent in a follow up patch.

    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-15 16:12:24 +08:00
Rafael Aquini 9a083ffc2d list_lru: allow explicit memcg and NUMA node selection
JIRA: https://issues.redhat.com/browse/RHEL-40684
Conflicts:
    * include/linux/list_lru.h: minor context differences due to missing
         upstream v6.8 commit 7679e14098c9 ("mm: list_lru: Update kernel
         documentation to follow the requirements")

This patch is a backport of the following upstream commit:
commit 0a97c01cd20bb96359d8c9dedad92a061ed34e0b
Author: Nhat Pham <nphamcs@gmail.com>
Date:   Thu Nov 30 11:40:18 2023 -0800

    list_lru: allow explicit memcg and NUMA node selection

    Patch series "workload-specific and memory pressure-driven zswap
    writeback", v8.

    There are currently several issues with zswap writeback:

    1. There is only a single global LRU for zswap, making it impossible to
       perform workload-specific shrinking - an memcg under memory pressure
       cannot determine which pages in the pool it owns, and often ends up
       writing pages from other memcgs. This issue has been previously
       observed in practice and mitigated by simply disabling
       memcg-initiated shrinking:

       https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

       But this solution leaves a lot to be desired, as we still do not
       have an avenue for an memcg to free up its own memory locked up in
       the zswap pool.

    2. We only shrink the zswap pool when the user-defined limit is hit.
       This means that if we set the limit too high, cold data that are
       unlikely to be used again will reside in the pool, wasting precious
       memory. It is hard to predict how much zswap space will be needed
       ahead of time, as this depends on the workload (specifically, on
       factors such as memory access patterns and compressibility of the
       memory pages).

    This patch series solves these issues by separating the global zswap LRU
    into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e.,
    memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
    shrinker does not have any parameter that must be tuned by the user, and
    can be opted in or out on a per-memcg basis.

    As a proof of concept, we ran the following synthetic benchmark: build the
    linux kernel in a memory-limited cgroup, and allocate some cold data in
    tmpfs to see if the shrinker could write them out and improve the overall
    performance.  Depending on the amount of cold data generated, we observe
    from 14% to 35% reduction in kernel CPU time used in the kernel builds.

    This patch (of 6):

    The interface of list_lru is based on the assumption that the list node
    and the data it represents belong to the same node/memcg, i.e. that the
    list node was allocated on the correct node/memcg.  While this assumption
    is valid for existing slab objects LRU
    such as dentries and inodes, it is undocumented, and rather inflexible for
    certain potential list_lru users (such as the upcoming zswap shrinker and
    the THP shrinker).  It has caused us a lot of issues during our
    development.

    This patch changes list_lru interface so that the caller must explicitly
    specify numa node and memcg when adding and removing objects.  The old
    list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
    list_lru_del_obj(), respectively.

    It also extends the list_lru API with a new function, list_lru_putback,
    which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
    does not increment the LRU node count (as list_lru_isolate does not
    decrement the node count).  list_lru_putback also allows for explicit
    memcg and NUMA node selection.

    Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
    Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
    Signed-off-by: Nhat Pham <nphamcs@gmail.com>
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Seth Jennings <sjenning@redhat.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Vitaly Wool <vitaly.wool@konsulko.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-06-28 12:24:14 -04:00
Chris von Recklinghausen 41d58c77af mm: vmscan: refactor updating current->reclaim_state
Conflicts: mm/slob.c - We already have
	6630e950d532 ("mm/slob: remove slob.c")
	so the file is gone.

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit c7b23b68e2aa93f86a206222d23ccd9a21f5982a
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Thu Apr 13 10:40:34 2023 +0000

    mm: vmscan: refactor updating current->reclaim_state

    During reclaim, we keep track of pages reclaimed from other means than
    LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
    which we stash a pointer to in current task_struct.

    However, we keep track of more than just reclaimed slab pages through
    this.  We also use it for clean file pages dropped through pruned inodes,
    and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
    helper function that wraps updating it through current, so that future
    changes to this logic are contained within include/linux/swap.h.

    Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:00:59 -04:00
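
The helper idea from "mm: vmscan: refactor updating current->reclaim_state" can be illustrated with a small userspace model. This is a sketch only: struct task stands in for the kernel's task_struct, the task is passed explicitly instead of going through current, and account_reclaimed_pages() is a hypothetical name chosen for illustration, not necessarily the name used in include/linux/swap.h.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel structure named in the commit. */
struct reclaim_state {
	unsigned long reclaimed;	/* renamed from reclaimed_slab */
};

/* Simplified stand-in for task_struct. */
struct task {
	struct reclaim_state *reclaim_state;	/* NULL outside of reclaim */
};

/*
 * Hypothetical helper: account pages that were freed by means other than
 * LRU-based reclaim (slab, pruned inodes, filesystem buffers). It does
 * nothing when the task is not currently in reclaim.
 */
static void account_reclaimed_pages(struct task *tsk, unsigned long pages)
{
	if (tsk->reclaim_state)
		tsk->reclaim_state->reclaimed += pages;
}
```

Callers only report a page count, so the NULL check and the field name stay in one place if the accounting logic changes later.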
Ming Lei c3745bf638 fs: simplify invalidate_inodes
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit e127b9bccdb04e5fc4444431de37309a68aedafa
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Aug 11 12:08:28 2023 +0200

    fs: simplify invalidate_inodes

    kill_dirty has always been true for a long time, so hard code it and
    remove the unused return value.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Message-Id: <20230811100828.1897174-18-hch@lst.de>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:46:43 +08:00
Lucas Zampieri 60e48163b8
Merge: ceph+fscrypt: add fscrypt support
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3502

JIRA: https://issues.redhat.com/browse/RHEL-19813

Add ceph fscrypt feature support. This will also backport the latest fs-crypt patches together.

Upstream still has some patches that have not been applied yet; those will need to wait a few more days.

commit acfaf9453bde61be725c104ba29b7b0902bc246e
Author: Xiubo Li <xiubli@redhat.com>
Date:   Tue Dec 12 12:11:41 2023 +0800

    ceph: implement -o test_dummy_encryption mount option

    JIRA: https://issues.redhat.com/browse/RHEL-19542

    commit 6b5717bd30ab7f35792d20b71211055bdb43e6de
    Author: Jeff Layton <jlayton@kernel.org>
    Date:   Tue Sep 8 09:47:40 2020 -0400

        ceph: implement -o test_dummy_encryption mount option

        Add support for the test_dummy_encryption mount option. This allows us
        to test the encrypted codepaths in ceph without having to manually set
        keys, etc.

        [ lhenriques: fix potential fsc->fsc_dummy_enc_policy memory leak in
          ceph_real_mount() ]

        Signed-off-by: Jeff Layton <jlayton@kernel.org>
        Reviewed-by: Xiubo Li <xiubli@redhat.com>
        Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
        Reviewed-by: Milind Changire <mchangir@redhat.com>
        Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>

Related to RHEL-19813

Approved-by: Venky Shankar <vshankar@redhat.com>
Approved-by: Milind Changire <mchangir@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-12 14:43:30 -03:00
Lucas Zampieri 35d215e024
Merge: USB/TB code rebase of supported drivers to upstream v6.6
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3875

JIRA: https://issues.redhat.com/browse/RHEL-28809

BUILD INFO: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=59819606

FUNCTIONAL TESTING: QA

SMOKE TESTING:
On an HP EliteBook 840 G8, smoke testing showed no regressions between the regular and the rebased kernel with the insertion of:
* Generic 4 port USB HUB:
  - USB 2.0 Sandisk typeA 16G
  - USB Headset Gamer ELG Surround 7.1
* USB 2.0 Kingston TypeA 32G
* USB 3.2 Gen2 Sandisk TypeC 32G

DESCRIPTION:
This rebases supported USB and Thunderbolt drivers to upstream kernel v6.6.

By design, changes on this rebase are limited to supported usb and thunderbolt drivers.

Changes which happen to touch the drivers but are tree-wide are selectively or partially pulled in, when relevant.

Signed-off-by: Desnes Nunes <desnesn@redhat.com>

Approved-by: Bastien Nocera <bnocera@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>

Merged-by: Lucas Zampieri <lzampier@redhat.com>
2024-04-10 15:35:27 -03:00
Xiubo Li 1f4cd8ccfb fs: change test in inode_insert5 for adding to the sb list
JIRA: https://issues.redhat.com/browse/RHEL-19813

commit 18cc912b8a2acaf32589241fbac47192ab90db14
Author: Jeff Layton <jlayton@kernel.org>
Date:   Thu Mar 31 16:29:00 2022 -0400

    fs: change test in inode_insert5 for adding to the sb list

    inode_insert5 currently looks at I_CREATING to decide whether to insert
    the inode into the sb list. This test is a bit ambiguous, as I_CREATING
    state is not directly related to that list.

    This test is also problematic for some upcoming ceph changes to add
    fscrypt support. We need to be able to allocate an inode using new_inode
    and insert it into the hash later iff we end up using it, and doing that
    now means that we double add it and corrupt the list.

    What we really want to know in this test is whether the inode is already
    in its superblock list, and then add it if it isn't. Have it test for
    list_empty instead and ensure that we always initialize the list by
    doing it in inode_init_once. It's only ever removed from the list with
    list_del_init, so that should be sufficient.

    Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
2024-03-26 10:24:11 +08:00
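
Why testing list_empty() is sufficient in "fs: change test in inode_insert5 for adding to the sb list" can be seen in a minimal userspace model of a circular doubly linked list. This is a sketch under stated assumptions: the list primitives are re-implemented here in simplified form, and sb_list_add_once() is a hypothetical name for the guarded insertion, not a kernel function.

```c
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An initialized, unlinked entry points at itself. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink and re-initialize, so list_empty(entry) is true again. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

/* Hypothetical guard: link the entry only if it is not on a list yet. */
static bool sb_list_add_once(struct list_head *entry, struct list_head *sb_list)
{
	if (!list_empty(entry))
		return false;
	list_add(entry, sb_list);
	return true;
}

static int list_count(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

The guard depends on two invariants the commit message calls out: the entry is always initialized before first use, and it is only ever removed with the re-initializing delete. If either were violated, list_empty() on the entry would no longer tell whether it is linked.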
Prarit Bhargava c8a912b06d locking: remove spin_lock_prefetch
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit c8afaa1b0f8bc93d013ab2ea6b9649958af3f1d3
Author: Mateusz Guzik <mjguzik@gmail.com>
Date:   Sat Aug 12 18:15:54 2023 +0200

    locking: remove spin_lock_prefetch

    The only remaining consumer is new_inode, where it showed up in 2001 as
    commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
    with a changelog which does not mention it.

    Since then the line got only touched up to keep compiling.

    While it may have been of benefit back in the day, it is guaranteed to
    at best not get in the way in the multicore setting -- as the code
    performs *a lot* of work between the prefetch and actual lock acquire,
    any contention means the cacheline is already invalid by the time the
    routine calls spin_lock().  It adds spurious traffic, for short.

    On top of it prefetch is notoriously tricky to use for single-threaded
    purposes, making it questionable from the get go.

    As such, remove it.

    I admit upfront I did not see value in benchmarking this change, but I
    can do it if that is deemed appropriate.

    Removal from new_inode and of the entire thing are in the same patch as
    requested by Linus, so whatever weird looks can be directed at that guy.

    Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
    Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:43:20 -04:00
Desnes Nunes 42bfe37a25 fs: add ctime accessors infrastructure
JIRA: https://issues.redhat.com/browse/RHEL-28809
Conflicts:
* Avoiding commit <11c2a8700cdc> ("attr: add in_group_or_capable()")

commit 9b6304c1d53745c300b86f202d0dcff395e2d2db
Author: Jeff Layton <jlayton@kernel.org>
Date: Wed, 5 Jul 2023 14:58:10 -0400

  struct timespec64 has unused bits in the tv_nsec field that can be used
  for other purposes. In future patches, we're going to change how the
  inode->i_ctime is accessed in certain inodes in order to make use of
  them. In order to do that safely though, we'll need to eradicate raw
  accesses of the inode->i_ctime field from the kernel.

  Add new accessor functions for the ctime that we use to replace them.

  Reviewed-by: Jan Kara <jack@suse.cz>
  Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
  Message-Id: <20230705185812.579118-2-jlayton@kernel.org>
  Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Desnes Nunes <desnesn@redhat.com>
2024-03-18 15:42:25 -03:00
Ming Lei f6a6eee21c fs: remove the special !CONFIG_BLOCK def_blk_fops
JIRA: https://issues.redhat.com/browse/RHEL-1516
Conflicts: context difference because direct-io.o in Makefile

commit bda2795a630b2f6c417675bfbf4d90ef7503dfc7
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 8 07:44:05 2023 -0700

    fs: remove the special !CONFIG_BLOCK def_blk_fops

    def_blk_fops always returns -ENODEV, which doesn't match the return value
    of a non-existing block device with CONFIG_BLOCK, which is -ENXIO.
    Just remove the extra implementation and fall back to the default
    no_open_fops that always returns -ENXIO.

    Fixes: 9361401eb7 ("[PATCH] BLOCK: Make it possible to disable the block layer [try #6]")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20230508144405.41792-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-09-18 15:59:29 +08:00
Jeff Moyer bde319ceec fs: Add async write file modification handling.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 66fa3cedf16abc82d19b943e3289c82e685419d5
Author: Stefan Roesch <shr@fb.com>
Date:   Thu Jun 23 10:51:53 2022 -0700

    fs: Add async write file modification handling.
    
    This adds a file_modified_async() function to return -EAGAIN if the
    request either requires to remove privileges or needs to update the file
    modification time. This is required for async buffered writes, so the
    request gets handled in the io worker of io-uring.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Link: https://lore.kernel.org/r/20220623175157.1715274-11-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:47:02 -04:00
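
The behaviour described in "fs: Add async write file modification handling." can be sketched as a userspace model. Everything here is an assumption made for illustration: struct file_state and file_modified_model() are hypothetical, the two booleans stand in for the kernel's checks on whether privileges must be removed or timestamps updated, and nowait stands in for a non-blocking request.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what a write would have to do first. */
struct file_state {
	bool needs_priv_removal;	/* setuid/setgid bits must be dropped */
	bool needs_time_update;		/* modification time must be updated */
};

/*
 * Model of the async variant: in nowait mode, any work that might block
 * is refused with -EAGAIN so the caller can retry from a context that is
 * allowed to block. In blocking mode the work is simply done.
 */
static int file_modified_model(struct file_state *f, bool nowait)
{
	if (f->needs_priv_removal) {
		if (nowait)
			return -EAGAIN;
		f->needs_priv_removal = false;
	}
	if (f->needs_time_update) {
		if (nowait)
			return -EAGAIN;
		f->needs_time_update = false;
	}
	return 0;
}
```

A non-blocking attempt leaves the state untouched and returns -EAGAIN. A blocking retry then performs the work, which in this model corresponds to the request being handled in the io worker.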
Jeff Moyer 0ba1326a1a fs: Split off inode_needs_update_time and __file_update_time
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237
Conflicts: We are missing commit e60feb445fce ("fs: export an
  inode_update_time helper"), so keep using update_time().

commit 6a2aa5d85de534471dd023773236f113eaef26f0
Author: Stefan Roesch <shr@fb.com>
Date:   Thu Jun 23 10:51:52 2022 -0700

    fs: Split off inode_needs_update_time and __file_update_time
    
    This splits off the functions inode_needs_update_time() and
    __file_update_time() from the function file_update_time().
    
    This is required to support async buffered writes.
    No intended functional changes in this patch.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Link: https://lore.kernel.org/r/20220623175157.1715274-10-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:46:02 -04:00
Jeff Moyer 3ae1c2ccf9 fs: __file_remove_privs(): restore call to inode_has_no_xattr()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit 41191cf6bf565f4139046d7be68ec30c290af92d
Author: Stefan Roesch <shr@fb.com>
Date:   Tue Aug 16 08:31:58 2022 -0700

    fs: __file_remove_privs(): restore call to inode_has_no_xattr()
    
    This restores the call to inode_has_no_xattr() in the function
    __file_remove_privs(). In case the dentry_needs_remove_privs() returned
    0, the function inode_has_no_xattr() was not called.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Fixes: faf99b563558 ("fs: add __remove_file_privs() with flags parameter")
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Link: https://lore.kernel.org/r/20220816153158.1925040-1-shr@fb.com
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:45:02 -04:00
Jeff Moyer bc26c4000b fs: add __remove_file_privs() with flags parameter
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2068237

commit faf99b563558f74188b7ca34faae1c1da49a7261
Author: Stefan Roesch <shr@fb.com>
Date:   Thu Jun 23 10:51:51 2022 -0700

    fs: add __remove_file_privs() with flags parameter
    
    This adds the function __remove_file_privs, which allows the caller to
    pass the kiocb flags parameter.
    
    No intended functional changes in this patch.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Link: https://lore.kernel.org/r/20220623175157.1715274-9-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-04-29 07:44:02 -04:00
Chris von Recklinghausen c991a31fce mm: Remove __delete_from_page_cache()
Bugzilla: https://bugzilla.redhat.com/2160210

commit 6ffcd825e7d0416d78fd41cd5b7856a78122cc8c
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Tue Jun 28 20:41:40 2022 -0400

    mm: Remove __delete_from_page_cache()

    This wrapper is no longer used.  Remove it and all references to it.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:19:15 -04:00
Chris von Recklinghausen eecd85274e fs: move inode sysctls to its own file
Bugzilla: https://bugzilla.redhat.com/2160210

commit 1d67fe585049d3e2448b997af78c68cbf90ada09
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Fri Jan 21 22:12:52 2022 -0800

    fs: move inode sysctls to its own file

    Patch series "sysctl: 4th set of kernel/sysctl cleanups".

    This is slimming down the fs uses of kernel/sysctl.c to the point that
    the next step is to just get rid of the fs base directory for it and
    move that elsewhere, so that next patch series starts dealing with that
    to demo how we can end up cleaning up a full base directory from
    kernel/sysctl.c, one at a time.

    This patch (of 9):

    kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
    dishes, this makes it very difficult to maintain.

    To help with this maintenance let's start by moving sysctls to places
    where they actually belong.  The proc sysctl maintainers do not want to
    know what sysctl knobs you wish to add for your own piece of code, we
    just care about the core logic.

    So move the inode sysctls to its own file.  Since we are no longer using
    this outside of fs/ remove the extern declaration of its respective proc
    helper.

    We use early_initcall() as it is the earliest we can use.

    [arnd@arndb.de: avoid unused-variable warning]
      Link: https://lkml.kernel.org/r/20211203190123.874239-1-arnd@kernel.org

    Link: https://lkml.kernel.org/r/20211129205548.605569-1-mcgrof@kernel.org
    Link: https://lkml.kernel.org/r/20211129205548.605569-2-mcgrof@kernel.org
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Xiaoming Ni <nixiaoming@huawei.com>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Stephen Kitt <steve@sk2.org>
    Cc: Lukas Middendorf <kernel@tuxforce.de>
    Cc: Antti Palosaari <crope@iki.fi>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:31 -04:00
Andrey Albershteyn 40322e8902 fs: move S_ISGID stripping into the vfs_*() helpers
Bugzilla: http://bugzilla.redhat.com/2128900
Bugzilla: http://bugzilla.redhat.com/2128898
Conflicts: Multiple mnt_userns arguments

commit 1639a49ccdce58ea248841ed9b23babcce6dbb0b
Author: Yang Xu <xuyang2018.jy@fujitsu.com>
Date:   Thu Jul 14 14:11:27 2022 +0800

    fs: move S_ISGID stripping into the vfs_*() helpers

    Move setgid handling out of individual filesystems and into the VFS
    itself to stop the proliferation of setgid inheritance bugs.

    Creating files that have both the S_IXGRP and S_ISGID bit raised in
    directories that themselves have the S_ISGID bit set requires additional
    privileges to avoid security issues.

    When a filesystem creates a new inode it needs to take care that the
    caller is either in the group of the newly created inode or they have
    CAP_FSETID in their current user namespace and are privileged over the
    parent directory of the new inode. If any of these two conditions is
    true then the S_ISGID bit can be raised for an S_IXGRP file and if not
    it needs to be stripped.

    However, there are several key issues with the current implementation:

    * S_ISGID stripping logic is entangled with umask stripping.

      If a filesystem doesn't support or enable POSIX ACLs then umask
      stripping is done directly in the vfs before calling into the
      filesystem.
      If the filesystem does support POSIX ACLs then umask stripping may be
      done in the filesystem itself when calling posix_acl_create().

      Umask stripping has an effect on S_ISGID inheritance, e.g., by
      stripping the S_IXGRP bit from the file to be created, and all relevant
      filesystems have to call posix_acl_create() before inode_init_owner(),
      where we currently take care of S_ISGID handling. As a result, S_ISGID
      handling is order dependent. IOW, whether or not you get a setgid bit
      depends on POSIX ACLs and umask and in what order they are called.

      Note that technically filesystems are free to impose their own
      ordering between posix_acl_create() and inode_init_owner(), meaning
      that there are additional ordering issues that influence S_ISGID
      inheritance.

    * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
      stripping logic.

      While that may be intentional (e.g. network filesystems might just
      defer setgid stripping to a server) it is often just a security issue.

    This is not just ugly, it's unsustainably messy, especially since we do
    still have bugs in this area years after the initial round of setgid
    bugfixes.

    So the current state is quite messy and while we won't be able to make
    it completely clean as posix_acl_create() is still a filesystem specific
    call we can improve the S_ISGID stripping situation quite a bit by
    hoisting it out of inode_init_owner() and into the vfs creation
    operations. This means we alleviate the burden for filesystems to handle
    S_ISGID stripping correctly and can standardize the ordering between
    S_ISGID and umask stripping in the vfs.

    We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
    in the VFS before umask handling. This means S_ISGID handling is
    unaffected by whether umask stripping is done by the VFS
    itself (if no POSIX ACLs are supported or enabled) or in the filesystem
    in posix_acl_create() (if POSIX ACLs are supported).
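
The ordering problem described above can be illustrated with a small
userspace model. This is only a sketch, not the kernel code:
strip_sgid_unprivileged() and apply_umask() are hypothetical stand-ins, the
parent directory is assumed to be setgid, and the caller is assumed to be
neither in the directory's group nor holding CAP_FSETID.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical model: the parent directory is setgid and the caller is
 * unprivileged, so S_ISGID must be dropped from any non-directory that is
 * created with the group-execute bit set.
 */
mode_t strip_sgid_unprivileged(mode_t mode)
{
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;	/* nothing to strip */
	if (S_ISDIR(mode))
		return mode;	/* directories keep inheriting S_ISGID */
	return mode & ~(mode_t)S_ISGID;
}

/* Hypothetical stand-in for umask stripping. */
mode_t apply_umask(mode_t mode, mode_t mask)
{
	assert((mask & ~(mode_t)0777) == 0);	/* permission bits only */
	return mode & ~mask;
}

/* Order this commit standardizes on: S_ISGID handling, then umask. */
mode_t sgid_then_umask(mode_t mode, mode_t mask)
{
	return apply_umask(strip_sgid_unprivileged(mode), mask);
}

/* Problematic order: umask may clear S_IXGRP before the S_ISGID check. */
mode_t umask_then_sgid(mode_t mode, mode_t mask)
{
	return strip_sgid_unprivileged(apply_umask(mode, mask));
}
```

In this model, a requested mode of S_ISGID | 0770 with a umask of 0010
gives 0760 when S_ISGID is handled first, but keeps S_ISGID (02760) when
the umask is applied first, because the S_IXGRP bit is already gone by the
time the setgid check runs.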

    The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
    create new filesystem objects. We need to move them into there to make
    sure that filesystems like overlayfs that have callchains like:

    sys_mknod()
    -> do_mknodat(mode)
       -> .mknod = ovl_mknod(mode)
          -> ovl_create(mode)
             -> vfs_mknod(mode)

    get S_ISGID stripping done when calling into lower filesystems via
    vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
    vfs_mknod() takes care of that. This is in any case semantically cleaner
    because S_ISGID stripping is a VFS security requirement.

    Security hooks so far have seen the mode with the umask applied but
    without S_ISGID handling done. The relevant hooks are called outside of
    vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
    helpers the security hooks would now see the mode without umask
    stripping applied. For now we fix this by passing the mode with umask
    settings applied to not risk any regressions for LSM hooks. IOW, nothing
    changes for LSM hooks. It is worth pointing out that security hooks
    never saw the mode that is seen by the filesystem when actually creating
    the file. They have always been completely misplaced for that to work.

    The following filesystems use inode_init_owner() and thus relied on
    S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
    hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
    reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

    All of the above filesystems end up calling inode_init_owner() when new
    filesystem objects are created through the ->mkdir(), ->mknod(),
    ->create(), ->tmpfile(), ->rename() inode operations.

    Since directories always inherit the S_ISGID bit, with the exception of
    xfs when irix_sgid_inherit mode is turned on, S_ISGID stripping doesn't
    apply. The ->symlink() and ->link() inode operations trivially inherit
    the mode from the target and the ->rename() inode operation inherits the
    mode from the source inode. All other creation inode operations will get
    S_ISGID handling via vfs_prepare_mode() when called from their relevant
    vfs_*() helpers.

    In addition to this there are filesystems which allow the creation of
    filesystem objects through ioctl()s or - in the case of spufs -
    circumventing the vfs in other ways. If filesystem objects are created
    through ioctl()s the vfs doesn't know about it and can't apply regular
    permission checking including S_ISGID logic. Therefore, a filesystem
    relying on S_ISGID stripping in inode_init_owner() in their ioctl()
    callpath will be affected by moving this logic into the vfs. We audited
    those filesystems:

    * btrfs allows the creation of filesystem objects through various
      ioctls(). Snapshot creation literally takes a snapshot and so the mode
      is fully preserved and S_ISGID stripping doesn't apply.

      Creating a new subvolume relies on inode_init_owner() in
      btrfs_new_subvol_inode() but only creates directories and doesn't
      raise S_ISGID.

    * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
      xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
      the actual extents, ocfs2 uses a separate ioctl() that also creates the
      target file.

      Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
      inode_init_owner() to strip the S_ISGID bit. This is the only place
      where a filesystem needs to call mode_strip_sgid() directly but this
      is self-inflicted pain.

    * spufs doesn't go through the vfs at all and doesn't use ioctl()s
      either. Instead it has a dedicated system call spufs_create() which
      allows the creation of filesystem objects. But spufs only creates
      directories and doesn't allow S_ISGID bits, i.e. it specifically only
      allows 0777 bits.

    * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

    The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
    option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
    and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
    these filesystems are mounted with their respective GRPID option then
    newly created files inherit the parent directory's group
    unconditionally. In these cases none of the filesystems call
    inode_init_owner() and thus never stripped the S_ISGID bit for newly
    created files. Moving this logic into the VFS means that they now get
    the S_ISGID bit stripped. This is a user visible change. If this leads
    to regressions we will either need to figure out a better way or we need
    to revert. However, given the various setgid bugs that we found just in
    the last two years this is a regression risk we should take.

    Associated with this change is a new set of fstests to enforce the
    semantics for all new filesystems.

    Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
    Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [2]
    Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [3]
    Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [4]
    Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
    Suggested-by: Dave Chinner <david@fromorbit.com>
    Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
    [<brauner@kernel.org>: rewrote commit message]
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
2023-01-04 14:44:19 +01:00
Andrey Albershteyn 18e579e600 fs: add mode_strip_sgid() helper
Bugzilla: http://bugzilla.redhat.com/2128900
Bugzilla: http://bugzilla.redhat.com/2128898

commit 2b3416ceff5e6bd4922f6d1c61fb68113dd82302
Author: Yang Xu <xuyang2018.jy@fujitsu.com>
Date:   Thu Jul 14 14:11:25 2022 +0800

    fs: add mode_strip_sgid() helper

    Add a dedicated helper to handle the setgid bit when creating a new file
    in a setgid directory. This is a preparatory patch for moving setgid
    stripping into the vfs. The patch contains no functional changes.

    Currently the setgid stripping logic is open-coded directly in
    inode_init_owner() and the individual filesystems are responsible for
    handling setgid inheritance. Since this has proven to be brittle as
    evidenced by old issues we uncovered over the last months (see [1] to
    [3] below) we will try to move this logic into the vfs.
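
The decision such a helper has to make can be sketched in userspace as
follows. This is a model under stated assumptions, not the kernel
function: struct create_ctx and model_strip_sgid() are hypothetical names,
and the group-membership and CAP_FSETID checks are reduced to booleans
supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for the facts the kernel derives from the parent
 * directory and the caller's credentials.
 */
struct create_ctx {
	mode_t dir_mode;	/* mode of the parent directory */
	bool caller_in_group;	/* caller is a member of the directory's group */
	bool caller_has_fsetid;	/* caller holds CAP_FSETID over the directory */
};

/* Userspace model of the setgid decision for a newly created object. */
mode_t model_strip_sgid(const struct create_ctx *ctx, mode_t mode)
{
	assert(ctx != NULL);

	/* Only a file with both S_ISGID and S_IXGRP raised is of interest. */
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;
	/* Directories inherit S_ISGID; non-setgid parents need no handling. */
	if (S_ISDIR(mode) || !(ctx->dir_mode & S_ISGID))
		return mode;
	/* Sufficiently privileged callers may keep the bit. */
	if (ctx->caller_in_group || ctx->caller_has_fsetid)
		return mode;
	return mode & ~(mode_t)S_ISGID;
}
```

Keeping the whole decision in one function is what makes it possible to
call it from a single place later, instead of relying on each filesystem
to reach it through its own creation path.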

    Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
    Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [2]
    Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
    Link: https://lore.kernel.org/r/1657779088-2242-1-git-send-email-xuyang2018.jy@fujitsu.com
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
2023-01-04 14:44:19 +01:00
Nico Pache d0afd73f0f writeback: Fix inode->i_io_list not be protected by inode->i_lock error
commit 10e14073107dd0b6d97d9516a02845a8e501c2c9
Author: Jchao Sun <sunjunchao2870@gmail.com>
Date:   Tue May 24 08:05:40 2022 -0700

    writeback: Fix inode->i_io_list not be protected by inode->i_lock error

    Commit b35250c081 ("writeback: Protect inode->i_io_list with
    inode->i_lock") made inode->i_io_list not only protected by
    wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
    was missed. Add lock there and also update comment describing
    things protected by inode->i_lock. This also fixes a race where
    __mark_inode_dirty() could move inode under flush worker's hands
    and thus sync(2) could miss writing some inodes.

    Fixes: b35250c081 ("writeback: Protect inode->i_io_list with inode->i_lock")
    Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
    CC: stable@vger.kernel.org
    Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:38 -07:00
Chris von Recklinghausen a9a76f1477 mm,fs: split dump_mapping() out from dump_page()
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3e9d80a891df3b1a5d77db47fa7fdf33ba71e5cb
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Jan 14 14:05:04 2022 -0800

    mm,fs: split dump_mapping() out from dump_page()

    dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
    able to call it when we don't have a struct page.  Split it out and move
    it to fs/inode.c.  Take the opportunity to simplify some of the debug
    messages a little.

    Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
Chris von Recklinghausen 3a3ee2bede vfs: keep inodes with page cache off the inode shrinker LRU
Conflicts:
	mm/filemap.c - We already have
		452e9e6992fe ("filemap: Add filemap_remove_folio and __filemap_remove_folio")
		so just add the spin_lock call.
	mm/truncate.c - The backport of
		51dcbdac28d4 ("mm: Convert find_lock_entries() to use a folio_batch")
		listed the lack of this patch as a conflict. Keep the
		'fbatch->nr = j;' line.
	mm/vmscan.c - We already have
		be7c07d60e13 ("mm/vmscan: Convert __remove_mapping() to take a folio")
		so change a couple of lines from 'if (!PageSwapCache(page))'
		to 'if (!folio_test_swapcache(folio))'

Bugzilla: https://bugzilla.redhat.com/2120352

commit 51b8c1fe250d1bd70c1722dc3c414f5cff2d7cca
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Mon Nov 8 18:31:24 2021 -0800

    vfs: keep inodes with page cache off the inode shrinker LRU

    Historically (pre-2.5), the inode shrinker used to reclaim only empty
    inodes and skip over those that still contained page cache.  This caused
    problems on highmem hosts: struct inode could fill lowmem zones
    before the cache was getting reclaimed in the highmem zones.

    To address this, the inode shrinker started to strip page cache to
    facilitate reclaiming lowmem.  However, this comes with its own set of
    problems: the shrinkers may drop actively used page cache just because
    the inodes are not currently open or dirty - think working with a large
    git tree.  It further doesn't respect cgroup memory protection settings
    and can cause priority inversions between containers.

    Nowadays, the page cache also holds non-resident info for evicted cache
    pages in order to detect refaults.  We've come to rely heavily on this
    data inside reclaim for protecting the cache workingset and driving swap
    behavior.  We also use it to quantify and report workload health through
    psi.  The latter in turn is used for fleet health monitoring, as well as
    driving automated memory sizing of workloads and containers, proactive
    reclaim and memory offloading schemes.

    The consequence of dropping page cache prematurely is that we're seeing
    subtle and not-so-subtle failures in all of the above-mentioned
    scenarios, with the workload generally entering unexpected thrashing
    states while losing the ability to reliably detect it.

    To fix this on non-highmem systems at least, going back to rotating
    inodes on the LRU isn't feasible.  We've tried (commit a76cf1a474
    ("mm: don't reclaim inodes with many attached pages")) and failed
    (commit 69056ee6a8 ("Revert "mm: don't reclaim inodes with many
    attached pages"")).

    The issue is mostly that shrinker pools attract pressure based on their
    size, and when objects get skipped the shrinkers remember this as
    deferred reclaim work.  This accumulates excessive pressure on the
    remaining inodes, and we can quickly eat into heavily used ones, or
    dirty ones that require IO to reclaim, when there potentially is plenty
    of cold, clean cache around still.

    Instead, this patch keeps populated inodes off the inode LRU in the
    first place - just like an open file or dirty state would.  An otherwise
    clean and unused inode then gets queued when the last cache entry
    disappears.  This solves the problem without reintroducing the reclaim
    issues, and generally is a bit more scalable than having to wade through
    potentially hundreds of thousands of busy inodes.
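
The queueing rule described above can be modelled in a few lines of
userspace C. This is a deliberately simplified sketch: struct model_inode
and all model_*() functions are hypothetical, there is no locking, and the
"LRU" is reduced to a single flag per inode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified inode; field names are illustrative only. */
struct model_inode {
	unsigned int refcount;		/* open references */
	bool dirty;			/* has dirty state */
	unsigned long cache_entries;	/* page cache entries still attached */
	bool on_lru;			/* queued for the shrinker */
};

/* Reclaim candidates must be unused, clean and free of page cache. */
bool model_lru_eligible(const struct model_inode *inode)
{
	return inode->refcount == 0 && !inode->dirty &&
	       inode->cache_entries == 0;
}

void model_add_lru(struct model_inode *inode)
{
	if (model_lru_eligible(inode))
		inode->on_lru = true;
}

/* Dropping the last reference queues the inode only if it is empty. */
void model_put(struct model_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount == 0)
		model_add_lru(inode);
}

/* Additions do not touch the LRU; the shrinker re-checks later. */
void model_cache_add(struct model_inode *inode)
{
	inode->cache_entries++;
}

/* Removing the last cache entry queues an otherwise idle inode. */
void model_cache_del(struct model_inode *inode)
{
	assert(inode->cache_entries > 0);
	if (--inode->cache_entries == 0)
		model_add_lru(inode);
}

/* Shrinker visit: returns true only if the inode is actually reclaimed. */
bool model_shrink(struct model_inode *inode)
{
	if (!inode->on_lru)
		return false;
	inode->on_lru = false;		/* taken off the list either way */
	return model_lru_eligible(inode);
}
```

In this model an inode that still has cache entries when its last
reference is dropped never reaches the list, and an inode that gained a
cache entry after being queued is dropped from the list by the shrinker
without being reclaimed.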

    Locking is a bit tricky because the locks protecting the inode state
    (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
    irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
    serialized through i_lock, taken before the i_pages lock, to make sure
    depopulated inodes are queued reliably.  Additions may race with
    deletions, but we'll check again in the shrinker.  If additions race
    with the shrinker itself, we're protected by the i_lock: if find_inode()
    or iput() win, the shrinker will bail on the elevated i_count or
    I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
    will set I_FREEING and inhibit further igets(), which will cause the
    other side to create a new instance of the inode instead.

    Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:31 -04:00
Patrick Talbert 0e06ec8e0e Merge: mm: Optimize list lru memory consumption
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/690

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2013413
Omitted-fix: b9663a6ff828 ("tools: Add kmem_cache_alloc_lru()")
	The tools/include/linux/slab.h and the radix-tree tests have not
	been merged into CS9 yet.

This MR backports the upstream patch series "Optimize list lru memory
consumption" to reduce memory consumption of kmalloc-32 slab cache
on systems with a large number of memory cgroups (containers). In the
extreme case, this patch series can save GBs of memory.

Signed-off-by: Waiman Long <longman@redhat.com>
~~~
Waiman Long (26):
  Compiler Attributes: add __alloc_size() for better bounds checking
  slab: clean up function prototypes
  slab: add __alloc_size attributes for better bounds checking
  mm/list_lru.c: prefer struct_size over open coded arithmetic
  memcg, kmem: further deprecate kmem.limit_in_bytes
  mm: list_lru: remove holding lru lock
  mm: list_lru: fix the return value of list_lru_count_one()
  mm: list_lru: only add memcg-aware lrus to the global lru list
  memcg: add per-memcg vmalloc stat
  memcg: add per-memcg total kernel memory stat
  mm: list_lru: transpose the array of per-node per-memcg lru lists
  mm: introduce kmem_cache_alloc_lru
  fs: introduce alloc_inode_sb() to allocate filesystems specific inode
  fs: allocate inode by using alloc_inode_sb()
  mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
  xarray: use kmem_cache_alloc_lru to allocate xa_node
  mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
  mm: list_lru: allocate list_lru_one only when needed
  mm: list_lru: rename memcg_drain_all_list_lrus to
    memcg_reparent_list_lrus
  mm: list_lru: replace linear array with xarray
  mm: memcontrol: reuse memory cgroup ID for kmem ID
  mm: memcontrol: fix cannot alloc the maximum memcg ID
  mm: list_lru: rename list_lru_per_memcg to list_lru_memcg
  mm: memcontrol: rename memcg_cache_id to memcg_kmem_id
  slab: remove __alloc_size attribute from __kmalloc_track_caller
  NFSv4.2: Fix missing removal of SLAB_ACCOUNT on kmem_cache allocation

 .../admin-guide/cgroup-v1/memory.rst          |  11 +-
 Documentation/admin-guide/cgroup-v2.rst       |   8 +
 Documentation/filesystems/porting.rst         |   6 +
 Makefile                                      |  15 +
 block/bdev.c                                  |   2 +-
 drivers/dax/super.c                           |   2 +-
 fs/adfs/super.c                               |   2 +-
 fs/affs/super.c                               |   2 +-
 fs/afs/super.c                                |   2 +-
 fs/befs/linuxvfs.c                            |   2 +-
 fs/bfs/inode.c                                |   2 +-
 fs/btrfs/inode.c                              |   2 +-
 fs/ceph/inode.c                               |   2 +-
 fs/cifs/cifsfs.c                              |   2 +-
 fs/coda/inode.c                               |   2 +-
 fs/dcache.c                                   |   3 +-
 fs/ecryptfs/super.c                           |   2 +-
 fs/efs/super.c                                |   2 +-
 fs/erofs/super.c                              |   2 +-
 fs/exfat/super.c                              |   2 +-
 fs/ext2/super.c                               |   2 +-
 fs/ext4/super.c                               |   2 +-
 fs/fat/inode.c                                |   2 +-
 fs/freevxfs/vxfs_super.c                      |   2 +-
 fs/fuse/inode.c                               |   2 +-
 fs/gfs2/super.c                               |   2 +-
 fs/hfs/super.c                                |   2 +-
 fs/hfsplus/super.c                            |   2 +-
 fs/hostfs/hostfs_kern.c                       |   2 +-
 fs/hpfs/super.c                               |   2 +-
 fs/hugetlbfs/inode.c                          |   2 +-
 fs/inode.c                                    |   2 +-
 fs/isofs/inode.c                              |   2 +-
 fs/jffs2/super.c                              |   2 +-
 fs/jfs/super.c                                |   2 +-
 fs/minix/inode.c                              |   2 +-
 fs/nfs/inode.c                                |   2 +-
 fs/nfs/nfs42xattr.c                           |   2 +-
 fs/nilfs2/super.c                             |   2 +-
 fs/ntfs/inode.c                               |   2 +-
 fs/ocfs2/dlmfs/dlmfs.c                        |   2 +-
 fs/ocfs2/super.c                              |   2 +-
 fs/openpromfs/inode.c                         |   2 +-
 fs/orangefs/super.c                           |   2 +-
 fs/overlayfs/super.c                          |   2 +-
 fs/proc/inode.c                               |   2 +-
 fs/qnx4/inode.c                               |   2 +-
 fs/qnx6/inode.c                               |   2 +-
 fs/reiserfs/super.c                           |   2 +-
 fs/romfs/super.c                              |   2 +-
 fs/squashfs/super.c                           |   2 +-
 fs/sysv/inode.c                               |   2 +-
 fs/ubifs/super.c                              |   2 +-
 fs/udf/super.c                                |   2 +-
 fs/ufs/super.c                                |   2 +-
 fs/vboxsf/super.c                             |   2 +-
 fs/xfs/xfs_icache.c                           |   2 +-
 fs/zonefs/super.c                             |   2 +-
 include/linux/compiler-gcc.h                  |   8 +
 include/linux/compiler_attributes.h           |  10 +
 include/linux/compiler_types.h                |  12 +
 include/linux/fs.h                            |  11 +
 include/linux/list_lru.h                      |  17 +-
 include/linux/memcontrol.h                    |  63 ++-
 include/linux/slab.h                          | 101 ++--
 include/linux/swap.h                          |   5 +-
 include/linux/xarray.h                        |   9 +-
 ipc/mqueue.c                                  |   2 +-
 lib/xarray.c                                  |  10 +-
 mm/list_lru.c                                 | 464 ++++++++----------
 mm/memcontrol.c                               | 213 ++------
 mm/shmem.c                                    |   2 +-
 mm/slab.c                                     |  39 +-
 mm/slab.h                                     |  25 +-
 mm/slob.c                                     |   6 +
 mm/slub.c                                     |  42 +-
 mm/vmalloc.c                                  |  13 +-
 mm/workingset.c                               |   2 +-
 net/socket.c                                  |   2 +-
 net/sunrpc/rpc_pipe.c                         |   2 +-
 scripts/checkpatch.pl                         |   3 +-
 81 files changed, 600 insertions(+), 610 deletions(-)

Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Aristeu Rozanski <arozansk@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-09 09:48:03 +02:00
Waiman Long e4e08faece fs: introduce alloc_inode_sb() to allocate filesystems specific inode
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2013413
Conflicts: A merge conflict for the insertion of <linux/slab.h> hunk into
	   include/linux/fs.h due to missing upstream commit a793d79ea3e0
	   ("fs: move mapping helpers").

commit 8b9f3ac5b01db85c6cf74c2c3a71280cc3045c9c
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Tue, 22 Mar 2022 14:41:00 -0700

    fs: introduce alloc_inode_sb() to allocate filesystems specific inode

    The allocated inode cache is supposed to be added to its memcg list_lru
    which should be allocated as well in advance.  That can be done by
    kmem_cache_alloc_lru() which allocates the object and list_lru.  File
    systems are the main users of it.  So introduce alloc_inode_sb() to allocate
    file system specific inodes and set up the inode reclaim context
    properly.  The file system is supposed to use alloc_inode_sb() to
    allocate inodes.

    In later patches, we will convert all users to the new API.
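
The shape of the helper and of a converted caller can be sketched with
userspace mocks. Everything below other than the names alloc_inode_sb()
and kmem_cache_alloc_lru() is an assumption made for illustration: the
struct definitions, the GFP_KERNEL value, the "prepared" counter and the
examplefs filesystem are stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock types: userspace stand-ins, NOT the kernel definitions. */
struct list_lru { unsigned long prepared; };
struct kmem_cache { size_t object_size; };
struct super_block { struct list_lru s_inode_lru; };
typedef unsigned int gfp_t;
#define GFP_KERNEL 0u	/* placeholder value for the mock */

/* Mock: allocate an object and note that the LRU was set up for it. */
void *kmem_cache_alloc_lru(struct kmem_cache *cache, struct list_lru *lru,
			   gfp_t gfp)
{
	(void)gfp;
	assert(cache != NULL && lru != NULL);
	lru->prepared++;
	return malloc(cache->object_size);
}

/* Helper in the spirit of the commit: tie the allocation to sb's inode LRU. */
void *alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache,
		     gfp_t gfp)
{
	return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
}

/* Hypothetical filesystem ("examplefs") converted to the helper. */
struct examplefs_inode_info { unsigned long private_flags; };

struct kmem_cache examplefs_inode_cache = {
	sizeof(struct examplefs_inode_info)
};

struct examplefs_inode_info *examplefs_alloc_inode(struct super_block *sb)
{
	return alloc_inode_sb(sb, &examplefs_inode_cache, GFP_KERNEL);
}
```

The point of routing every filesystem through one helper is that the
superblock's inode LRU is always the one handed to the allocator, so
individual filesystems do not need to know about the list_lru at all.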

    Link: https://lkml.kernel.org/r/20220228122126.37293-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Alex Shi <alexs@kernel.org>
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Fam Zheng <fam.zheng@bytedance.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kari Argillander <kari.argillander@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Qi Zheng <zhengqi.arch@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Waiman Long <longman@redhat.com>
2022-04-07 14:11:12 -04:00
Aristeu Rozanski cee75af7e2 fs: Remove FS_THP_SUPPORT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2019485
Tested: ran Rafael's set of sanity tests, other than known issues, seems ok

commit ff36da69bc90d80b0c73f47f4b2e270b3ff6da99
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sun Aug 29 06:07:03 2021 -0400

    fs: Remove FS_THP_SUPPORT

    Instead of setting a bit in the fs_flags to set a bit in the
    address_space, set the bit in the address_space directly.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-04-07 09:58:32 -04:00
Rafael Aquini cf23a735d5 mm: Fully initialize invalidate_lock, amend lock class later
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 23ca067b3295d935835b71f743235f9e5ab31cc5
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Sep 1 10:44:03 2021 +0200

    mm: Fully initialize invalidate_lock, amend lock class later

    The function __init_rwsem() is not part of the official API, it is just a helper
    function used by init_rwsem().
    Changing the lock's class and name should be done by using
    lockdep_set_class_and_name() after the lock has been fully initialized. The overhead
    of the additional class struct and setting it twice is negligible and it works
    across all locks.

    Fully initialize the lock with init_rwsem() and then set the custom class and
    name for the lock.

    Fixes: 730633f0b7f95 ("mm: Protect operations adding pages to page cache with invalidate_lock")
    Link: https://lore.kernel.org/r/20210901084403.g4fezi23cixemlhh@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:43:50 -05:00
Rafael Aquini 5b5c28b990 fs: inode: count invalidated shadow pages in pginodesteal
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 7ae12c809f6a31d3da7b96339dbefa141884c711
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Sep 2 14:53:24 2021 -0700

    fs: inode: count invalidated shadow pages in pginodesteal

    pginodesteal is supposed to capture the impact that inode reclaim has on
    the page cache state.  Currently, it doesn't consider shadow pages that
    get dropped this way, even though this can have a significant impact on
    paging behavior, memory pressure calculations etc.

    To improve visibility into these effects, make sure shadow pages get
    counted when they get dropped through inode reclaim.

    This changes the return value semantics of invalidate_mapping_pages()
    slightly, but the only two users are the inode shrinker itself
    and a usb driver that logs it for debugging purposes.
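
The change in return value can be illustrated with a toy model. This is a
sketch only: enum slot_state and model_invalidate() are hypothetical, and
the model drops every entry unconditionally, whereas the real function
skips pages it cannot invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one page cache slot. */
enum slot_state { SLOT_EMPTY, SLOT_PAGE, SLOT_SHADOW };

/*
 * Drop every entry in the range and return how many were counted.
 * With count_shadows == false only real pages are counted (old
 * semantics); with count_shadows == true shadow entries are counted
 * as well (new semantics). Shadow entries are dropped in both cases.
 */
unsigned long model_invalidate(enum slot_state *slots, size_t n,
			       bool count_shadows)
{
	unsigned long count = 0;

	assert(slots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (slots[i] == SLOT_PAGE)
			count++;
		else if (slots[i] == SLOT_SHADOW && count_shadows)
			count++;
		slots[i] = SLOT_EMPTY;
	}
	return count;
}
```

For a range holding one page and two shadow entries, the model reports 1
under the old semantics and 3 under the new ones, which is the difference
a counter such as pginodesteal would now pick up.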

    Link: https://lkml.kernel.org/r/20210614211904.14420-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:53 -05:00
Rafael Aquini 4d9de0c3d3 mm: Protect operations adding pages to page cache with invalidate_lock
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 730633f0b7f951726e87f912a6323641f674ae34
Author: Jan Kara <jack@suse.cz>
Date:   Thu Jan 28 19:19:45 2021 +0100

    mm: Protect operations adding pages to page cache with invalidate_lock

    Currently, serializing operations such as page fault, read, or readahead
    against hole punching is rather difficult. The basic race scheme is
    like:

    fallocate(FALLOC_FL_PUNCH_HOLE)                 read / fault / ..
      truncate_inode_pages_range()
                                                      <create pages in page
                                                       cache here>
      <update fs block mapping and free blocks>

    Now the problem is in this way read / page fault / readahead can
    instantiate pages in page cache with potentially stale data (if blocks
    get quickly reused). Avoiding this race is not simple - page locks do
    not work because we want to make sure there are *no* pages in given
    range. inode->i_rwsem does not work because page fault happens under
    mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
    the performance for mixed read-write workloads suffer.

    So create a new rw_semaphore in the address_space - invalidate_lock -
    that protects adding of pages to page cache for page faults / reads /
    readahead.

    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:22 -05:00
Hugh Dickins 786b31121a mm: remove nrexceptional from inode: remove BUG_ON
clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)) is unsafe: we
know of two ways in which nodes can and do (on rare occasions) get left
behind.  Until those are fixed, do not BUG_ON() nor even WARN_ON().

Yes, this will then leak those nodes (or the next user of the struct
inode may use them); but this has been happening for years, and the new
BUG_ON(!mapping_empty) was only guilty of revealing that.  A proper fix
will follow, but no hurry.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104292229380.16080@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Matthew Wilcox (Oracle) 8bc3c481b3 mm: remove nrexceptional from inode
We no longer track anything in nrexceptional, so remove it, saving 8 bytes
per inode.

Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Linus Torvalds 34a456eb1f fs.idmapped.helpers.v5.13
Merge tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs mapping helper updates from Christian Brauner:
 "This adds kernel-doc to all new idmapping helpers and improves their
  naming which was triggered by a discussion with some fs developers.
  Some of the names are based on suggestions by Vivek and Al.

  Also replace the open-coded permission checking in a few places with
  simple helpers. Overall this should lead to more clarity and make the
  code easier to maintain"

* tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: introduce two inode i_{u,g}id initialization helpers
  fs: introduce fsuidgid_has_mapping() helper
  fs: document and rename fsid helpers
  fs: document mapping helpers
2021-04-27 12:49:42 -07:00
Miklos Szeredi 51db776a43 vfs: remove unused ioctl helpers
Remove vfs_ioc_setflags_prepare(), vfs_ioc_fssetxattr_check() and
simple_fill_fsxattr(), which are no longer used.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-04-12 15:04:30 +02:00
Christian Brauner db998553cf fs: introduce two inode i_{u,g}id initialization helpers
Give filesystems two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many
details.

Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:15:26 +01:00
Christian Brauner a65e58e791 fs: document and rename fsid helpers
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_userns.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly;
they are needed only in some cases where filesystems allocate objects
that carry {g,u}ids which are either filesystem specific (e.g. xfs quota
objects) or don't have a clean set of helpers as inodes have.

Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:13:32 +01:00
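The two helper commits above are easier to picture with a toy model: the caller's fs{u,g}id is first translated into the mount's idmapping, and the result is what gets stored in the new object's i_uid / i_gid. The following user-space sketch makes several assumptions. The idmapping is reduced to a single contiguous extent, ids are plain `uint32_t` values rather than kuid_t / kgid_t, and all `toy_*` names are hypothetical and are not the kernel's API.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical single-extent idmapping for a mount: ids
 * [first, first + count) as seen through the mount correspond to
 * kernel ids [lower_first, lower_first + count).
 */
struct toy_idmap {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* A non-idmapped mount is modelled as the identity mapping (assumption). */
static const struct toy_idmap toy_identity_map = { 0, 0, UINT32_MAX };

/*
 * Toy analogue of a mapped_fs{u,g}id()-style helper: express the
 * caller's fsid in terms of the mount's idmapping. Ids outside the
 * extent have no mapping.
 */
static uint32_t toy_mapped_fsid(const struct toy_idmap *map, uint32_t fsid)
{
    if (fsid < map->lower_first || fsid - map->lower_first >= map->count)
        return TOY_INVALID_ID;
    return map->first + (fsid - map->lower_first);
}

/* Toy inode carrying only the ownership fields. */
struct toy_inode {
    uint32_t i_uid;
    uint32_t i_gid;
};

/*
 * Toy analogue of the two initialization helpers: the filesystem hands
 * over the inode and the mount's mapping and does not deal with the
 * translation details itself.
 */
static void toy_inode_init_owner_ids(struct toy_inode *inode,
                                     const struct toy_idmap *map,
                                     uint32_t caller_fsuid,
                                     uint32_t caller_fsgid)
{
    inode->i_uid = toy_mapped_fsid(map, caller_fsuid);
    inode->i_gid = toy_mapped_fsid(map, caller_fsgid);
}
```

With an extent of `{ 0, 10000, 1000 }`, a caller with fsuid 10005 creates an inode whose stored uid is 5. With `toy_identity_map` the same call stores the caller's ids unchanged, which is the non-idmapped case.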