JIRA: https://issues.redhat.com/browse/RHEL-65728
commit 13ae04d8d45227c2ba51e188daf9fc13d08a1b12
Author: Darrick J. Wong <djwong@kernel.org>
Date: Fri Dec 15 10:03:27 2023 -0800
xfs: force all buffers to be written during btree bulk load
While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:
AIL: Repair0:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
delwri_submit # oops
Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:
AIL: Repair0: Repair1:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add
commit
BAD: committed tree root before all blocks written
delwri_submit # too late now
I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.
When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.
This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer. This ordering requirement is
required for data consistency.
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57114
commit 9ed851f695c71d325758f8c18e265da9316afd26
Author: Darrick J. Wong <djwong@kernel.org>
Date: Thu Aug 10 07:48:03 2023 -0700
xfs: allow scanning ranges of the buffer cache for live buffers
After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing. It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.
In other words, online fsck needs to find all the live (i.e. non-stale)
buffers for a range of fsblocks so that it can invalidate them.
Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for. For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway. Add an internal lookup flag to avoid the
assert. Skip buffers that are already XBF_STALE.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-40684
Conflicts:
* include/linux/list_lru.h: minor context differences due to missing
upstream v6.8 commit 7679e14098c9 ("mm: list_lru: Update kernel
documentation to follow the requirements")
This patch is a backport of the following upstream commit:
commit 0a97c01cd20bb96359d8c9dedad92a061ed34e0b
Author: Nhat Pham <nphamcs@gmail.com>
Date: Thu Nov 30 11:40:18 2023 -0800
list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
This patch (of 6):
The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg. While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.
This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.
It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.
Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Conflicts: mm/slob.c - We already have
6630e950d532 ("mm/slob: remove slob.c")
so the file is gone.
JIRA: https://issues.redhat.com/browse/RHEL-27741
commit c7b23b68e2aa93f86a206222d23ccd9a21f5982a
Author: Yosry Ahmed <yosryahmed@google.com>
Date: Thu Apr 13 10:40:34 2023 +0000
mm: vmscan: refactor updating current->reclaim_state
During reclaim, we keep track of pages reclaimed from other means than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
which we stash a pointer to in current task_struct.
However, we keep track of more than just reclaimed slab pages through
this. We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a
helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.
Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.c
om
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29564
Conflicts: drop change on f2fs, bcachefs and reiserfs; context
difference on ext4 & fs/super.c change.
commit 22650a99821dda3d05f1c334ea90330b4982de56
Author: Christian Brauner <brauner@kernel.org>
Date: Tue Mar 26 13:47:22 2024 +0100
fs,block: yield devices early
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.
Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29564
commit 1b9e2d90141c5e25faefbb7891f0ed8606aa02cf
Author: Christian Brauner <brauner@kernel.org>
Date: Tue Jan 23 14:26:24 2024 +0100
xfs: port block device access to files
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29262
commit e340dd63f6a11402424b3d77e51149bce8fcba7d
Author: Jan Kara <jack@suse.cz>
Date: Wed Sep 27 11:34:34 2023 +0200
xfs: Convert to bdev_open_by_path()
Convert xfs to use bdev_open_by_path() and pass the handle around.
CC: "Darrick J. Wong" <djwong@kernel.org>
CC: linux-xfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29262
Conflicts: drop f2fs change which isn't needed for rhel9
commit 2ea6f68932f73a6a9d82160d3ad0a49a5a6bb183
Author: Christoph Hellwig <hch@lst.de>
Date: Wed Aug 2 17:41:25 2023 +0200
fs: use the super_block as holder when mounting file systems
The file system type is not a very useful holder as it doesn't allow us
to go back to the actual file system instance. Pass the super_block instead
which is useful when passed back to the file system driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-7-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29262
commit 35a93b148b0363dca23c3db1cc9d48100eb8b276
Author: Christoph Hellwig <hch@lst.de>
Date: Wed Aug 9 15:05:38 2023 -0700
xfs: close the external block devices in xfs_mount_free
blkdev_put must not be called under sb->s_umount to avoid a lock order
reversal with disk->open_mutex. Move closing the buftargs into ->kill_sb
to archive that. Note that the flushing of the disk caches and
block device mapping invalidated needs to stay in ->put_super as the main
block device is closed in kill_block_super already.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230809220545.1308228-7-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29262
commit 41233576e9a4515dc9b0bd1cbbb896b520a1f486
Author: Christoph Hellwig <hch@lst.de>
Date: Wed Aug 9 15:05:37 2023 -0700
xfs: close the RT and log block devices in xfs_free_buftarg
Closing the block devices logically belongs into xfs_free_buftarg, So
instead of open coding it in the caller move it there and add a check
for the s_bdev so that the main device isn't close as that's done by the
VFS helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Message-Id: <20230809220545.1308228-6-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-2002
commit 032e160305f6872e590c77f11896fb28365c6d6c
Author: Darrick J. Wong <djwong@kernel.org>
Date: Mon Nov 28 17:24:42 2022 -0800
xfs: invalidate block device page cache during unmount
Every now and then I see fstests failures on aarch64 (64k pages) that
trigger on the following sequence:
mkfs.xfs $dev
mount $dev $mnt
touch $mnt/a
umount $mnt
xfs_db -c 'path /a' -c 'print' $dev
99% of the time this succeeds, but every now and then xfs_db cannot find
/a and fails. This turns out to be a race involving udev/blkid, the
page cache for the block device, and the xfs_db process.
udev is triggered whenever anyone closes a block device or unmounts it.
The default udev rules invoke blkid to read the fs super and create
symlinks to the bdev under /dev/disk. For this, it uses buffered reads
through the page cache.
xfs_db also uses buffered reads to examine metadata. There is no
coordination between xfs_db and udev, which means that they can run
concurrently. Note there is no coordination between the kernel and
blkid either.
On a system with 64k pages, the page cache can cache the superblock and
the root inode (and hence the root dir) with the same 64k page. If
udev spawns blkid after the mkfs and the system is busy enough that it
is still running when xfs_db starts up, they'll both read from the same
page in the pagecache.
The unmount writes updated inode metadata to disk directly. The XFS
buffer cache does not use the bdev pagecache, nor does it invalidate the
pagecache on umount. If the above scenario occurs, the pagecache no
longer reflects what's on disk, xfs_db reads the stale metadata, and
fails to find /a. Most of the time this succeeds because closing a bdev
invalidates the page cache, but when processes race, everyone loses.
Fix the problem by invalidating the bdev pagecache after flushing the
bdev, so that xfs_db will see up to date metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730
Conflicts: change to xfs_perag_get() and xfs_perag_put() api from previous out
of order patch from upstream fa044ae70 xfs: pass perag to xfs_read_agf
required changes to xfs_notify_failure.c
commit 6f643c57d57c56d4677bc05f1fca2ef3f249797c
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date: Fri Jun 3 13:37:30 2022 +0800
xfs: implement ->notify_failure() for XFS
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.
If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data. For now all we do is
kill processes with that file mapped into their address spaces, but future
patches could actually do something about corrupt metadata.
After that, the memory failure needs to notify the processes who are using
those files.
Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2192730
Conflicts: dropped hunks for erofs and ext2
commit 8012b866085523758780850087102421dbcce522
Author: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Date: Fri Jun 3 13:37:25 2022 +0800
dax: introduce holder for dax_device
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.
The patchset fsdax-rmap is aimed to support shared pages tracking for
fsdax.
It moves owner tracking from dax_assocaite_entry() to pmem device driver,
by introducing an interface ->memory_failure() for struct pagemap. This
interface is called by memory_failure() in mm, and implemented by pmem
device.
Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.
Finally we are able to try to fix the corrupted data in filesystem and do
other necessary processing, such as killing processes who are using the
files affected.
The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
The patchset fsdax-reflink attempts to add CoW support for fsdax, and
takes XFS, which has both reflink and fsdax features, as an example.
One of the key mechanisms needed to be implemented in fsdax is CoW. Copy
the data from srcmap before we actually write data to the destination
iomap. And we just copy range in which data won't be changed.
Another mechanism is range comparison. In page cache case, readpage() is
used to load data on disk to page cache in order to be able to compare
data. In fsdax case, readpage() does not work. So, we need another
compare data with direct access support.
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
This patch (of 14):
To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule. So that we
can finally track to the filesystem we needed.
The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.
Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 231f91ab504ecebcb88e942341b3d7dd91de45f1
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon Jul 18 18:20:37 2022 -0700
xfs: xfs_buf cache destroy isn't RCU safe
Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
report an non-empty xfs_buf slab on module remove. This isn't easily
to reproduce, but is clearly a side effect of converting the buffer
caceh to RUC freeing and lockless lookups. Sachin bisected and
Darrick hit it when testing the patchset directly.
Turns out that the xfs_buf slab is not destroyed when all the other
XFS slab caches are destroyed. Instead, it's got it's own little
wrapper function that gets called separately, and so it doesn't have
an rcu_barrier() call in it that is needed to drain all the rcu
callbacks before the slab is destroyed.
Fix it by removing the xfs_buf_init/terminate wrappers that just
allocate and destroy the xfs_buf slab, and move them to the same
place that all the other slab caches are set up and destroyed.
Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 298f34224506 ("xfs: lockless buffer lookup")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 298f342245066309189d8637ca7339d56840c3e1
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:05:07 2022 +1000
xfs: lockless buffer lookup
Now that we have a standalone fast path for buffer lookup, we can
easily convert it to use rcu lookups. When we continually hammer the
buffer cache with trylock lookups, we end up with a huge amount of
lock contention on the per-ag buffer hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by converting the lookup
fast path to leverages the rhashtable's ability to do rcu protected
lookups.
We avoid races with the buffer release path by using
atomic_inc_not_zero() on the buffer hold count. Any buffer that is
in the LRU will have a non-zero count, thereby allowing the lockless
fast path to be taken in most cache hit situations. If the buffer
hold count is zero, then it is likely going through the release path
so in that case we fall back to the existing lookup miss slow path.
The slow path will then do an atomic lookup and insert under the
buffer hash lock and hence serialise correctly against buffer
release freeing the buffer.
The use of rcu protected lookups means that buffer handles now need
to be freed by RCU callbacks (same as inodes). We still free the
buffer pages before the RCU callback - we won't be trying to access
them at all on a buffer that has zero references - but we need the
buffer handle itself to be present for the entire rcu protected read
side to detect a zero hold count correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 32dd4f9c506b1bf147c24cf05423cd893bc06e38
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:04:43 2022 +1000
xfs: remove a superflous hash lookup when inserting new buffers
Currently on the slow path insert we repeat the initial hash table
lookup before we attempt the insert, resulting in a two traversals
of the hash table to ensure the insert is valid. The rhashtable API
provides a method for an atomic lookup and insert operation, so we
can avoid one of the hash table traversals by using this method.
Adapted from a large patch containing this optimisation by Christoph
Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit d8d9bbb0ee6c79191b704d88c8ae712b89e0d2bb
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:04:38 2022 +1000
xfs: reduce the number of atomic when locking a buffer after lookup
Avoid an extra atomic operation in the non-trylock case by only
doing a trylock if the XBF_TRYLOCK flag is set. This follows the
pattern in the IO path with NOWAIT semantics where the
"trylock-fail-lock" path showed 5-10% reduced throughput compared to
just using single lock call when not under NOWAIT conditions. So
make that same change here, too.
See commit 942491c9e6 ("xfs: fix AIM7 regression") for details.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 348000804a0f4dea74219a927e081d6e7dee792f
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:04:31 2022 +1000
xfs: merge xfs_buf_find() and xfs_buf_get_map()
Now that we factored xfs_buf_find(), we can start separating into
distinct fast and slow paths from xfs_buf_get_map(). We start by
moving the lookup map and perag setup to _get_map(), and then move
all the specifics of the fast path lookup into xfs_buf_lookup()
and call it directly from _get_map(). We the move all the slow path
code to xfs_buf_find_insert(), which is now also called directly
from _get_map(). As such, xfs_buf_find() now goes away.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit de67dc575434dca8d60b1e181ed5dd296392ffce
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:02:46 2022 +1000
xfs: break up xfs_buf_find() into individual pieces
xfs_buf_find() is made up of three main parts: lookup, insert and
locking. The interactions with xfs_buf_get_map() require it to be
called twice - once for a pure lookup, and again on lookup failure
so the insert path can be run. We want to simplify this down a lot,
so split it into a fast path lookup, a slow path insert and a "lock
the found buffer" helper. This will then let us integrate these
operations more effectively into xfs_buf_get_map() in future
patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 85c73bf726e41be276bcad3325d9a8aef10be289
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 7 22:05:18 2022 +1000
xfs: rework xfs_buf_incore() API
Make it consistent with the other buffer APIs to return a error and
the buffer is placed in a parameter.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit b9b3fe152e4966cf8562630de67aa49e2f9c9222
Author: Dave Chinner <david@fromorbit.com>
Date: Thu Apr 21 08:44:59 2022 +1000
xfs: convert buffer flags to unsigned.
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned. This manifests as a compiler error such as:
/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
{ XBF_UNMAPPED, "UNMAPPED" }
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
#define XBF_UNMAPPED (1 << 31)/* do not map the buffer */
as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
unsigned long for the flag. Since this results in the value of
XBF_UNMAPPED causing a signed integer overflow, the result is
technically undefined behavior, which gcc-5 does not accept as an
integer constant.
This is based on a patch from Arnd Bergman <arnd@arndb.de>.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit dbd0f5299302f8506637592e2373891a748c6990
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Mar 17 09:09:10 2022 -0700
xfs: check buffer pin state after locking in delwri_submit
AIL flushing can get stuck here:
[316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
[316649.007807] Not tainted 5.17.0-rc6-dgc+ #975
[316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316649.011720] task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000
[316649.014112] Call Trace:
[316649.014841] <TASK>
[316649.015492] __schedule+0x30d/0x9e0
[316649.017745] schedule+0x55/0xd0
[316649.018681] io_schedule+0x4b/0x80
[316649.019683] xfs_buf_wait_unpin+0x9e/0xf0
[316649.021850] __xfs_buf_submit+0x14a/0x230
[316649.023033] xfs_buf_delwri_submit_buffers+0x107/0x280
[316649.024511] xfs_buf_delwri_submit_nowait+0x10/0x20
[316649.025931] xfsaild+0x27e/0x9d0
[316649.028283] kthread+0xf6/0x120
[316649.030602] ret_from_fork+0x1f/0x30
in the situation where flushing gets preempted between the unpin
check and the buffer trylock under nowait conditions:
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, buffer_list, b_list) {
if (!wait_list) {
if (xfs_buf_ispinned(bp)) {
pinned++;
continue;
}
Here >>>>>>
if (!xfs_buf_trylock(bp))
continue;
This means submission is stuck until something else triggers a log
force to unpin the buffer.
To get onto the delwri list to begin with, the buffer pin state has
already been checked, and hence it's relatively rare we get a race
between flushing and encountering a pinned buffer in delwri
submission to begin with. Further, to increase the pin count the
buffer has to be locked, so the only way we can hit this race
without failing the trylock is to be preempted between the pincount
check seeing zero and the trylock being run.
Hence to avoid this problem, just invert the order of trylock vs
pin check. We shouldn't hit that many pinned buffers here, so
optimising away the trylock for pinned buffers should not matter for
performance at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Several patches to this file were backported out of order. The result of this merge resolution matches upstream after the inclusion of all of the patches we have backported.
# Conflicts:
# fs/iomap/buffered-io.c
Bugzilla: https://bugzilla.redhat.com/2160210
commit e33c267ab70de4249d22d7eab1cc7d68a889bac2
Author: Roman Gushchin <roman.gushchin@linux.dev>
Date: Tue May 31 20:22:24 2022 -0700
mm: shrinkers: provide shrinkers with names
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Conflicts:
drop changes to fs/f2fs/gc.c fs/f2fs/data.c fs/f2fs/inode.c
fs/f2fs/node.c fs/f2fs/recovery.c fs/f2fs/segment.c fs/f2fs/super.c -
unsupported config
fs/ext4/inline.c - The backport of
36d116e99da7 ("ext4: Use scoped memory APIs in ext4_da_write_begin()")
added an include of linux/sched/mm.h, citing the lack of this
patch in its conflicts section. Keep it.
Bugzilla: https://bugzilla.redhat.com/2160210
commit 4034247a0d6ab281ba3293798ce67af494d86129
Author: NeilBrown <neilb@suse.de>
Date: Fri Jan 14 14:07:14 2022 -0800
mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble
.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2162211
Conflicts: dropped ext2 and erofs hunks, as they're not supported in RHEL.
Fixed up ext4 conflict due to RHEL differences.
commit cd913c76f489def1a388e3a5b10df94948ede3f5
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Nov 29 11:21:59 2021 +0100
dax: return the partition offset from fs_dax_get_by_bdev
Prepare for the removal of the block_device from the DAX I/O path by
returning the partition offset from fs_dax_get_by_bdev so that the file
systems have it at hand for use during I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2162211
commit 5b5abbefec1bea98abba8f1cffcf72c11c32a92d
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Nov 29 11:21:55 2021 +0100
xfs: move dax device handling into xfs_{alloc,free}_buftarg
Hide the DAX device lookup from the xfs_super.c code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"
fs/io-wq.c
- static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
!1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
- static int io_wqe_worker(void *data)
!1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
Resolved in favor of HEAD(!1142)
- static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
Resolved in favor of !1370
- static void create_worker_cont(struct callback_head *cb)
!1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
Resolved in favor of HEAD(!1142)
- static void io_workqueue_create(struct work_struct *work)
!1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
Resolved in favor of HEAD(!1142)
- static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
!1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
Resolved in favor of HEAD(!1142)
- static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
!1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
Resolved in favor of HEAD(!1142)
- static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
!1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
removed wrongly merged run_cancel label
Resolved in favor of HEAD(!1142)
- static bool io_task_work_match(struct callback_head *cb, void *data)
!1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
Resolved in favor of HEAD(!1142)
- static void io_wq_exit_workers(struct io_wq *wq)
!1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
Resolved in favor of HEAD(!1142)
- int io_wq_max_workers(struct io_wq *wq, int *new_count)
!1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
- static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
!1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
- !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
just a comment conflict
Resolved in favor of HEAD(!1142)
kernel/exit.c
- void __noreturn do_exit(long code)
- !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
Resolved in favor of !1370
Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"
fs/nfs/callback.c
- nfs4_callback_svc(void *vrqstp)
!1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
Resolved in favor of HEAD(!1357)
fs/nfs/file.c
!1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
- nfsd(void *vrqstp)
!1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
Resolved in favor of HEAD(!1357)
-----------------
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370
Bugzilla: https://bugzilla.redhat.com/2120352
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722
Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildebrand's upstream changes to address the COR CVE
RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.
Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716
Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716
Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716
Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716
Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716
Omitted: fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: bc369921d670 io-wq: max_worker fixes
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743
Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743
Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743
Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743
Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774
Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743
Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656
Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
unsupported arch
Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
unsupported arch
Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
unsupported arch
Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17
Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18
Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17
Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18
Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
reverted later
Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
revert of above omitted fix
Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
unsupported fs
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2125724
Conflicts:
Small conflict at xfs_inode_alloc() due to out of order
backport. Inode alloc using kmem_cache_alloc() has been
converted to use alloc_inode_sb() before this patch.
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit 182696fb021fc196e5cbe641565ca40fcf0f885a)
Bugzilla: https://bugzilla.redhat.com/2125724
Remove these typedefs by referencing kmem_cache directly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
(cherry picked from commit e7720afad068a6729d9cd3aaa08212f2f5a7ceff)
Bugzilla: https://bugzilla.redhat.com/2120352
commit b9b1335e640308acc1b8f26c739b804c80a6c147
Author: NeilBrown <neilb@suse.de>
Date: Tue Mar 22 14:39:10 2022 -0700
remove bdi_congested() and wb_congested() and related functions
These functions are no longer useful as no BDIs report congestions any
more.
Removing the test on bdi_write_contested() in current_may_throttle()
could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
is set.
So replace the calls by 'false' and simplify the code - and remove the
functions.
[akpm@linux-foundation.org: fix build]
Link: https://lkml.kernel.org/r/164549983742.9187.2570198746005819592.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs]
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511
commit d03025aef8676e826b69f8e3ec9bb59a5ad0c31d
Author: Bart Van Assche <bvanassche@acm.org>
Date: Thu Jul 14 11:07:28 2022 -0700
fs/xfs: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for the
combination of a request operation with request flags.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-63-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 01728b44ef1b714756607be0210fbcf60c78efce
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Mar 17 09:09:13 2022 -0700
xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
I've been chasing a recent resurgence in generic/388 recovery
failure and/or corruption events. The events have largely been
uninitialised inode chunks being tripped over in log recovery
such as:
XFS (pmem1): User initiated shutdown received.
pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
XFS (pmem1): Unmounting Filesystem
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Starting recovery (logdev: internal)
XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): First 128 bytes of corrupted metadata buffer:
00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
XFS (pmem1): log mount/recovery failed: error -117
XFS (pmem1): log mount failed
There have been isolated random other issues, too - xfs_repair fails
because it finds some corruption in symlink blocks, rmap
inconsistencies, etc - but they are nowhere near as common as the
uninitialised inode chunk failure.
The problem has clearly happened at runtime before recovery has run;
I can see the ICREATE log item in the log shortly before the
actively recovered range of the log. This means the ICREATE was
definitely created and written to the log, but for some reason the
tail of the log has been moved past the ordered buffer log item that
tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
log moving past the ICREATE log item before the inode chunk buffer
is written to disk.
Tracing the fsstress processes that are running when the filesystem
shut down immediately pin-pointed the problem:
user shutdown marks xfs_mount as shutdown
godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received.
.....
aild tries to push ordered inode cluster buffer
xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
calls xfs_iflush_abort() to kill writeback of the inode.
Inode is removed from AIL, drops cluster buffer reference.
xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
.....
All inodes on cluster buffer are aborted, then the cluster buffer
itself is aborted and removed from the AIL *without writeback*:
xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
The inode buffer was at 7/20344 when it was removed from the AIL.
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
.....
Userspace is still running, doing stuff. an fsstress process runs
syncfs() or sync() and we end up in sync_fs_one_sb() which issues
a log force. This pushes on the CIL:
fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
<...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f
And the CIL fills up iclogs with pending changes. This picks up
the current tail from the AIL:
<...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149
<...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e
<...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9
<...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
And it moves the tail of the log to 7/21000 from 7/20344. This
*moves the tail of the log beyond the ICREATE transaction* that was
at 7/20344 and pinned by the inode cluster buffer that was cancelled
above.
....
godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
And finally the log itself is now shutdown, stopping all further
writes to the log. But this is too late to prevent the corruption
that moving the tail of the log forwards after we start cancelling
writeback causes.
The fundamental problem here is that we are using the wrong shutdown
checks for log items. We've long conflated mount shutdown with log
shutdown state, and I started separating that recently with the
atomic shutdown state changes in commit b36d4651e165 ("xfs: make
forced shutdown processing atomic"). The changes in that commit
series are directly responsible for being able to diagnose this
issue because it clearly separated mount shutdown from log shutdown.
Essentially, once we start cancelling writeback of log items and
removing them from the AIL because the filesystem is shut down, we
*cannot* update the journal because we may have cancelled the items
that pin the tail of the log. That moves the tail of the log
forwards without having written the metadata back, hence we have
corrupt in memory state and writing to the journal propagates that
to the on-disk state.
What commit b36d4651e165 makes clear is that log item state needs to
change relative to log shutdown, not mount shutdown. IOWs, anything
that aborts metadata writeback needs to check log shutdown state
because log items directly affect log consistency. Having them check
mount shutdown state introduces the above race condition where we
cancel metadata writeback before the log shuts down.
To fix this, this patch works through all log items and converts
shutdown checks to use xlog_is_shutdown() rather than
xfs_is_shutdown(), so that we don't start aborting metadata
writeback before we shut off journal writes.
AFAICT, this race condition is a zero day IO error handling bug in
XFS that dates back to the introduction of XLOG_IO_ERROR,
XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 4c7f65aea7b7fe66c08f8f7304c1ea3f7a871d5a
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:48:54 2021 -0700
xfs: rename buffer cache index variable b_bn
To stop external users from using b_bn as the disk address of the
buffer, rename it to b_rhash_key to indicate that it is the buffer
cache index, not the block number of the buffer. Code that needs the
disk address should use xfs_buf_daddr() to obtain it.
Do the rename and clean up any of the remaining internal b_bn users.
Also clean up any remaining b_bn cruft that is now unused.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 04fcad80cd068731a779fb442f78234732683755
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:57 2021 -0700
xfs: introduce xfs_buf_daddr()
Introduce a helper function xfs_buf_daddr() to extract the disk
address of the buffer from the struct xfs_buf. This will replace
direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as
the XFS_BUF_ADDR() macro.
This patch introduces the helper function and replaces all uses of
XFS_BUF_ADDR() as this is just a simple sed replacement.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 75c8c50fa16a23f8ac89ea74834ae8ddd1558d75
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:53 2021 -0700
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
Remove the shouty macro and instead use the inline function that
matches other state/feature check wrapper naming. This conversion
was done with sed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 2e973b2cd4cdb993be94cca4c33f532f1ed05316
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:52 2021 -0700
xfs: convert remaining mount flags to state flags
The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.
Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 38c26bfd90e1999650d5ef40f90d721f05916643
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:37 2021 -0700
xfs: replace xfs_sb_version checks with feature flag checks
Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.
Large parts of this conversion were done with sed with commands like
this:
for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done
With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.
The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:
$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 98fe2c3cef21b784e2efd1d9d891430d95b4f073
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon Aug 9 10:10:01 2021 -0700
xfs: remove kmem_alloc_io()
Since commit 59bb47985c ("mm, sl[aou]b: guarantee natural alignment
for kmalloc(power-of-two)"), the core slab code now guarantees slab
alignment in all situations sufficient for IO purposes (i.e. minimum
of 512 byte alignment of >= 512 byte sized heap allocations) we no
longer need the workaround in the XFS code to provide this
guarantee.
Replace the use of kmem_alloc_io() with kmem_alloc() or
kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
interface altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917
Conflicts: drop change on scsi/ufs/ufshpb.c & ntfs3, deal with
trivial conflicts for iomap & erofs.
commit 07888c665b405b1cd3577ddebfeb74f4717a84c4
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jan 24 10:11:05 2022 +0100
block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403
commit a11d7fc2d05fb509cd9e33d4093507d6eda3ad53
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Aug 9 16:17:44 2021 +0200
block: remove the bd_bdi in struct block_device
Just retrieve the bdi from the disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
It's a one line wrapper around blkdev_issue_flush(). Just replace it
with direct calls to blkdev_issue_flush().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
'error' will be initialized, so clean up the redundant initialization.
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
If we want to use active references to the perag to be able to gate
shrink removing AGs and hence perags safely, we've got a fair bit of
work to do actually use perags in all the places we need to.
There's a lot of code that iterates ag numbers and then
looks up perags from that, often multiple times for the same perag
in the one operation. If we want to use reference counted perags for
access control, then we need to convert all these uses to perag
iterators, not agno iterators.
[Patches 1-4]
The first step of this is consolidating all the perag management -
init, free, get, put, etc into a common location. THis is spread all
over the place right now, so move it all into libxfs/xfs_ag.[ch].
This does expose kernel only bits of the perag to libxfs and hence
userspace, so the structures and code is rearranged to minimise the
number of ifdefs that need to be added to the userspace codebase.
The perag iterator in xfs_icache.c is promoted to a first class API
and expanded to the needs of the code as required.
[Patches 5-10]
These are the first basic perag iterator conversions and changes to
pass the perag down the stack from those iterators where
appropriate. A lot of this is obvious, simple changes, though in
some places we stop passing the perag down the stack because the
code enters into an as yet unconverted subsystem that still uses raw
AGs.
[Patches 11-16]
These replace the agno passed in the btree cursor for per-ag btree
operations with a perag that is passed to the cursor init function.
The cursor takes it's own reference to the perag, and the reference
is dropped when the cursor is deleted. Hence we get reference
coverage for the entire time the cursor is active, even if the code
that initialised the cursor drops it's reference before the cursor
or any of it's children (duplicates) have been deleted.
The first patch adds the perag infrastructure for the cursor, the
next four patches convert a btree cursor at a time, and the last
removes the agno from the cursor once it is unused.
[Patches 17-21]
These patches are a demonstration of the simplifications and
cleanups that come from plumbing the perag through interfaces that
select and then operate on a specific AG. In this case the inode
allocation algorithm does up to three walks across all AGs before it
either allocates an inode or fails. Two of these walks are purely
just to select the AG, and even then it doesn't guarantee inode
allocation success so there's a third walk if the selected AG
allocation fails.
These patches collapse the selection and allocation into a single
loop, simplifies the error handling because xfs_dir_ialloc() always
returns ENOSPC if no AG was selected for inode allocation or we fail
to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
wrapper, converts inode allocation to run entirely from a single
perag instance, and then factors xfs_dialloc() into a much, much
simpler loop which is easy to understand.
Hence we end up with the same inode allocation logic, but it only
needs two complete iterations at worst, makes AG selection and
allocation atomic w.r.t. shrink and chops out out over 100 lines of
code from this hot code path.
[Patch 22]
Converts the unlink path to pass perags through it.
There's more conversion work to be done, but this patchset gets
through a large chunk of it in one hit. Most of the iterators are
converted, so once this is solidified we can move on to converting
these to active references for being able to free perags while the
fs is still active.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmC3HUgUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h2yaw/+P0JzpI+6n06Ei00mjgE/Du/WhMLi
0JQ93Grlj+miuGGT9DgGCiRpoZnefhEk+BH6JqoEw1DQ3T5ilmAzrHLUUHSQC3+S
dv85sJduheQ6yHuoO+4MzkaSq6JWKe7E9gZwAsVyBul5aSjdmaJaQdPwYMTXSXo0
5Uqq8ECFkMcaHVNjcBfasgR/fdyWy2Qe4PFTHTHdQpd+DNZ9UXgFKHW2og+1iry/
zDIvdIppJULA09TvVcZuFjd/1NzHQ/fLj5PAzz8GwagB4nz2x3s78Zevmo5yW/jK
3/+50vXa8ldhiHDYGTS3QXvS0xJRyqUyD47eyWOOiojZw735jEvAlCgjX6+0X1HC
k3gCkQLv8l96fRkvUpgnLf/fjrUnlCuNBkm9d1Eq2Tied8dvLDtiEzoC6L05Nqob
yd/nIUb1zwJFa9tsoheHhn0bblTGX1+zP0lbRJBje0LotpNO9DjGX5JoIK4GR7F8
y1VojcdgRI14HlxUnbF3p8wmQByN+M2tnp6GSdv9BA65bjqi05Rj/steFdZHBV6x
wiRs8Yh6BTvMwKgufHhRQHfRahjNHQ/T/vOE+zNbWqemS9wtEUDop+KvPhC36R/k
o/cmr23cF8ESX2eChk7XM4On3VEYpcvp2zSFgrFqZYl6RWOwEis3Htvce3KuSTPp
8Xq70te0gr2DVUU=
=YNzW
-----END PGP SIGNATURE-----
Merge tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2
xfs: initial agnumber -> perag conversions for shrink
If we want to use active references to the perag to be able to gate
shrink removing AGs and hence perags safely, we've got a fair bit of
work to do actually use perags in all the places we need to.
There's a lot of code that iterates ag numbers and then
looks up perags from that, often multiple times for the same perag
in the one operation. If we want to use reference counted perags for
access control, then we need to convert all these uses to perag
iterators, not agno iterators.
[Patches 1-4]
The first step of this is consolidating all the perag management -
init, free, get, put, etc into a common location. THis is spread all
over the place right now, so move it all into libxfs/xfs_ag.[ch].
This does expose kernel only bits of the perag to libxfs and hence
userspace, so the structures and code is rearranged to minimise the
number of ifdefs that need to be added to the userspace codebase.
The perag iterator in xfs_icache.c is promoted to a first class API
and expanded to the needs of the code as required.
[Patches 5-10]
These are the first basic perag iterator conversions and changes to
pass the perag down the stack from those iterators where
appropriate. A lot of this is obvious, simple changes, though in
some places we stop passing the perag down the stack because the
code enters into an as yet unconverted subsystem that still uses raw
AGs.
[Patches 11-16]
These replace the agno passed in the btree cursor for per-ag btree
operations with a perag that is passed to the cursor init function.
The cursor takes it's own reference to the perag, and the reference
is dropped when the cursor is deleted. Hence we get reference
coverage for the entire time the cursor is active, even if the code
that initialised the cursor drops it's reference before the cursor
or any of it's children (duplicates) have been deleted.
The first patch adds the perag infrastructure for the cursor, the
next four patches convert a btree cursor at a time, and the last
removes the agno from the cursor once it is unused.
[Patches 17-21]
These patches are a demonstration of the simplifications and
cleanups that come from plumbing the perag through interfaces that
select and then operate on a specific AG. In this case the inode
allocation algorithm does up to three walks across all AGs before it
either allocates an inode or fails. Two of these walks are purely
just to select the AG, and even then it doesn't guarantee inode
allocation success so there's a third walk if the selected AG
allocation fails.
These patches collapse the selection and allocation into a single
loop, simplifies the error handling because xfs_dir_ialloc() always
returns ENOSPC if no AG was selected for inode allocation or we fail
to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
wrapper, converts inode allocation to run entirely from a single
perag instance, and then factors xfs_dialloc() into a much, much
simpler loop which is easy to understand.
Hence we end up with the same inode allocation logic, but it only
needs two complete iterations at worst, makes AG selection and
allocation atomic w.r.t. shrink and chops out out over 100 lines of
code from this hot code path.
[Patch 22]
Converts the unlink path to pass perags through it.
There's more conversion work to be done, but this patchset gets
through a large chunk of it in one hit. Most of the iterators are
converted, so once this is solidified we can move on to converting
these to active references for being able to free perags while the
fs is still active.
* tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits)
xfs: remove xfs_perag_t
xfs: use perag through unlink processing
xfs: clean up and simplify xfs_dialloc()
xfs: inode allocation can use a single perag instance
xfs: get rid of xfs_dir_ialloc()
xfs: collapse AG selection for inode allocation
xfs: simplify xfs_dialloc_select_ag() return values
xfs: remove agno from btree cursor
xfs: use perag for ialloc btree cursors
xfs: convert allocbt cursors to use perags
xfs: convert refcount btree cursor to use perags
xfs: convert rmap btree cursor to using a perag
xfs: add a perag to the btree cursor
xfs: pass perags around in fsmap data dev functions
xfs: push perags through the ag reservation callouts
xfs: pass perags through to the busy extent code
xfs: convert secondary superblock walk to use perags
xfs: convert xfs_iwalk to use perag references
xfs: convert raw ag walks to use for_each_perag
xfs: make for_each_perag... a first class citizen
...
It only has one caller and is now a simple function, so merge it
into the caller.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Use a single goto label for freeing the buffer and returning an
error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Only used in one place, so just open code the logic in the macro.
Based on a patch from Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Ever since we stopped using the Linux page cache to back XFS buffers
there is no need to take the start sector into account for
calculating the number of pages in a buffer, as the data always
start from the beginning of the buffer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to suit this series]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
->b_offset can only be non-zero for _XBF_KMEM backed buffers, so
remove all code dealing with it for page backed buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to fit this patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>