JIRA: https://issues.redhat.com/browse/RHEL-65728
commit 13ae04d8d45227c2ba51e188daf9fc13d08a1b12
Author: Darrick J. Wong <djwong@kernel.org>
Date: Fri Dec 15 10:03:27 2023 -0800
xfs: force all buffers to be written during btree bulk load
While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state. Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:
AIL: Repair0:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
delwri_submit # oops
Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:
AIL: Repair0: Repair1:
pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list
stale buf X:
clear DELWRI_Q
does not clear b_list
free space X
commit
find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add
commit
BAD: committed tree root before all blocks written
delwri_submit # too late now
I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list. If a buffer is
logged and committed, the buffer can end up on that AIL buffer list. If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up. Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list. The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.
If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block. Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.
When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.
This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer. This ordering requirement is
required for data consistency.
Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.
To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list. By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-57114
commit 9ed851f695c71d325758f8c18e265da9316afd26
Author: Darrick J. Wong <djwong@kernel.org>
Date: Thu Aug 10 07:48:03 2023 -0700
xfs: allow scanning ranges of the buffer cache for live buffers
After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing. It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.
In other words, online fsck needs to find all the live (i.e. non-stale)
buffers for a range of fsblocks so that it can invalidate them.
Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for. For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway. Add an internal lookup flag to avoid the
assert. Skip buffers that are already XBF_STALE.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29564
commit 1b9e2d90141c5e25faefbb7891f0ed8606aa02cf
Author: Christian Brauner <brauner@kernel.org>
Date: Tue Jan 23 14:26:24 2024 +0100
xfs: port block device access to files
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-29262
commit e340dd63f6a11402424b3d77e51149bce8fcba7d
Author: Jan Kara <jack@suse.cz>
Date: Wed Sep 27 11:34:34 2023 +0200
xfs: Convert to bdev_open_by_path()
Convert xfs to use bdev_open_by_path() and pass the handle around.
CC: "Darrick J. Wong" <djwong@kernel.org>
CC: linux-xfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 231f91ab504ecebcb88e942341b3d7dd91de45f1
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon Jul 18 18:20:37 2022 -0700
xfs: xfs_buf cache destroy isn't RCU safe
Darrick and Sachin Sant reported that xfs/435 and xfs/436 would
report an non-empty xfs_buf slab on module remove. This isn't easily
to reproduce, but is clearly a side effect of converting the buffer
caceh to RUC freeing and lockless lookups. Sachin bisected and
Darrick hit it when testing the patchset directly.
Turns out that the xfs_buf slab is not destroyed when all the other
XFS slab caches are destroyed. Instead, it's got it's own little
wrapper function that gets called separately, and so it doesn't have
an rcu_barrier() call in it that is needed to drain all the rcu
callbacks before the slab is destroyed.
Fix it by removing the xfs_buf_init/terminate wrappers that just
allocate and destroy the xfs_buf slab, and move them to the same
place that all the other slab caches are set up and destroyed.
Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 298f34224506 ("xfs: lockless buffer lookup")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 298f342245066309189d8637ca7339d56840c3e1
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 14 12:05:07 2022 +1000
xfs: lockless buffer lookup
Now that we have a standalone fast path for buffer lookup, we can
easily convert it to use rcu lookups. When we continually hammer the
buffer cache with trylock lookups, we end up with a huge amount of
lock contention on the per-ag buffer hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by converting the lookup
fast path to leverages the rhashtable's ability to do rcu protected
lookups.
We avoid races with the buffer release path by using
atomic_inc_not_zero() on the buffer hold count. Any buffer that is
in the LRU will have a non-zero count, thereby allowing the lockless
fast path to be taken in most cache hit situations. If the buffer
hold count is zero, then it is likely going through the release path
so in that case we fall back to the existing lookup miss slow path.
The slow path will then do an atomic lookup and insert under the
buffer hash lock and hence serialise correctly against buffer
release freeing the buffer.
The use of rcu protected lookups means that buffer handles now need
to be freed by RCU callbacks (same as inodes). We still free the
buffer pages before the RCU callback - we won't be trying to access
them at all on a buffer that has zero references - but we need the
buffer handle itself to be present for the entire rcu protected read
side to detect a zero hold count correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit 85c73bf726e41be276bcad3325d9a8aef10be289
Author: Dave Chinner <dchinner@redhat.com>
Date: Thu Jul 7 22:05:18 2022 +1000
xfs: rework xfs_buf_incore() API
Make it consistent with the other buffer APIs to return a error and
the buffer is placed in a parameter.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832
commit b9b3fe152e4966cf8562630de67aa49e2f9c9222
Author: Dave Chinner <david@fromorbit.com>
Date: Thu Apr 21 08:44:59 2022 +1000
xfs: convert buffer flags to unsigned.
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned. This manifests as a compiler error such as:
/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
{ XBF_UNMAPPED, "UNMAPPED" }
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
#define XBF_UNMAPPED (1 << 31)/* do not map the buffer */
as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
unsigned long for the flag. Since this results in the value of
XBF_UNMAPPED causing a signed integer overflow, the result is
technically undefined behavior, which gcc-5 does not accept as an
integer constant.
This is based on a patch from Arnd Bergman <arnd@arndb.de>.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2162211
Conflicts: dropped ext2 and erofs hunks, as they're not supported in RHEL.
Fixed up ext4 conflict due to RHEL differences.
commit cd913c76f489def1a388e3a5b10df94948ede3f5
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Nov 29 11:21:59 2021 +0100
dax: return the partition offset from fs_dax_get_by_bdev
Prepare for the removal of the block_device from the DAX I/O path by
returning the partition offset from fs_dax_get_by_bdev so that the file
systems have it at hand for use during I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2162211
commit 5b5abbefec1bea98abba8f1cffcf72c11c32a92d
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Nov 29 11:21:55 2021 +0100
xfs: move dax device handling into xfs_{alloc,free}_buftarg
Hide the DAX device lookup from the xfs_super.c code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 4c7f65aea7b7fe66c08f8f7304c1ea3f7a871d5a
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:48:54 2021 -0700
xfs: rename buffer cache index variable b_bn
To stop external users from using b_bn as the disk address of the
buffer, rename it to b_rhash_key to indicate that it is the buffer
cache index, not the block number of the buffer. Code that needs the
disk address should use xfs_buf_daddr() to obtain it.
Do the rename and clean up any of the remaining internal b_bn users.
Also clean up any remaining b_bn cruft that is now unused.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 04fcad80cd068731a779fb442f78234732683755
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:57 2021 -0700
xfs: introduce xfs_buf_daddr()
Introduce a helper function xfs_buf_daddr() to extract the disk
address of the buffer from the struct xfs_buf. This will replace
direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as
the XFS_BUF_ADDR() macro.
This patch introduces the helper function and replaces all uses of
XFS_BUF_ADDR() as this is just a simple sed replacement.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 8cf07f3dd56195316be97758cb8b4e1d7183ea84
Author: Dave Chinner <dchinner@redhat.com>
Date: Wed Aug 18 18:46:24 2021 -0700
xfs: sb verifier doesn't handle uncached sb buffer
The verifier checks explicitly for bp->b_bn == XFS_SB_DADDR to match
the primary superblock buffer, but the primary superblock is an
uncached buffer and so bp->b_bn is always -1ULL. Hence this never
matches and the CRC error reporting is wholly dependent on the
mount superblock already being populated so CRC feature checks pass
and allow CRC errors to be reported.
Fix this so that the primary superblock CRC error reporting is not
dependent on already having read the superblock into memory.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
commit 98fe2c3cef21b784e2efd1d9d891430d95b4f073
Author: Dave Chinner <dchinner@redhat.com>
Date: Mon Aug 9 10:10:01 2021 -0700
xfs: remove kmem_alloc_io()
Since commit 59bb47985c ("mm, sl[aou]b: guarantee natural alignment
for kmalloc(power-of-two)"), the core slab code now guarantees slab
alignment in all situations sufficient for IO purposes (i.e. minimum
of 512 byte alignment of >= 512 byte sized heap allocations) we no
longer need the workaround in the XFS code to provide this
guarantee.
Replace the use of kmem_alloc_io() with kmem_alloc() or
kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
interface altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
->b_offset can only be non-zero for _XBF_KMEM backed buffers, so
remove all code dealing with it for page backed buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to fit this patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buftarg_drain() is called from xfs_log_quiesce() to ensure the
buffer cache is reclaimed during unmount. xfs_log_quiesce() is also
called from xfs_quiesce_attr(), however, which means that cache
state is completely drained for filesystem freeze and read-only
remount. While technically harmless, this is unnecessarily
heavyweight. Both freeze and read-only mounts allow reads and thus
allow population of the buffer cache. Therefore, the transitional
sequence in either case really only needs to quiesce outstanding
writes to return the filesystem in a generally read-only state.
Additionally, some users have reported that attempts to freeze a
filesystem concurrent with a read-heavy workload causes the freeze
process to stall for a significant amount of time. This occurs
because, as mentioned above, the read workload repopulates the
buffer LRU while the freeze task attempts to drain it.
To improve this situation, replace the drain in xfs_log_quiesce()
with a buffer I/O quiesce and lift the drain into the unmount path.
This removes buffer LRU reclaim from freeze and read-only [re]mount,
but ensures the LRU is still drained before the filesystem unmounts.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_wait_buftarg() is vaguely named and somewhat overloaded. Its
primary purpose is to reclaim all buffers from the provided buffer
target LRU. In preparation to refactor xfs_wait_buftarg() into
serialization and LRU draining components, rename the function and
associated helpers to something more descriptive. This patch has no
functional changes with the minor exception of renaming a
tracepoint.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Prepare for kernel xfs_buf alignment by getting rid of the
xfs_buf_t typedef from userspace.
[darrick: This patch is a port of a userspace patch removing the
xfs_buf_t typedef in preparation to make the userspace xfs_buf code
behave more like its kernel counterpart.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Instead of poking deeply into buffer cache internals when re-reading the
superblock during log recovery just generalize _xfs_buf_read and use it
there. Note that we don't have to explicitly set up the ops as they
must be set from the initial read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
No need to keep a separate helper for this logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Log recovery has it's own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
This is largely a rearrangement of the code at this point - no IO
completion functionality changes at this point, just how the
code is run is modified.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
While this is largely a refactor of existing functionality, we
broaden the scope of the flag to beyond where inodes are explicitly
attached because future changes need to know what type of log items
are attached to the buffer. Adding this buffer flag may invoke the
inode iodone callback in cases where it wouldn't have been
previously, but this is not a functional change because the callback
is identical to the normal buffer write iodone callback when inodes
are not attached.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS has some inconsistent log message rate limiting with respect to
buffer alerts. The metadata I/O error notification uses the generic
ratelimited alert, the buffer push code uses a custom rate limit and
the similar quiesce time failure checks are not rate limited at all
(when they should be).
The custom rate limit defined in the buf item code is specifically
crafted for buffer alerts. It is more aggressive than generic rate
limiting code because it must accommodate a high frequency of I/O
error events in a relative short timeframe.
Factor out the custom rate limit state from the buf item code into a
per-buftarg rate limit so various alerts are limited based on the
target. Define a buffer alert helper function and use it for the
buffer alerts that are already ratelimited.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We use the same buffer I/O failure code in a few different places.
It's not much code, but it's not necessarily self-explanatory.
Factor it into a helper and document it in one place.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a helper function to get rid of buffers that we have decided are
corrupt after the verifiers have run. This function is intended to
handle metadata checks that can't happen in the verifiers, such as
inter-block relationship checking. Note that we now mark the buffer
stale so that it will not end up on any LRU and will be purged on
release.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Instead of passing __func__ to the error reporting function, let's use
the return address builtins so that the messages actually tell you which
higher level function called the buffer functions. This was previously
true for the xfs_buf_read callers, but not for the xfs_trans_read_buf
callers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert xfs_buf_read() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get_uncached() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_read_map() to return numeric error codes like most
everywhere else in xfs. This involves moving the open-coded logic that
reports metadata IO read / corruption errors and stales the buffer into
xfs_buf_read_map so that the logic is all in one place.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert xfs_buf_get_map() to return numeric error codes like most
everywhere else in xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Needed to feed into the allocation routine to guarantee the memory
buffers we add to bios are correctly aligned to the underlying
device.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We need to derive the mount pointer from a buffer in a lot of place.
Add a direct pointer to short cut the pointer chasing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This field is now always idential to b_length.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the log code doesn't abuse this field any more we can
declare it as a struct xfs_buf_log_item pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the log code uses bios directly we can drop various special
cases in the buffer cache code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Assining a numerical value that is not close to the flags
defined near by is just asking for conflicts later on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are several functions which take a flag argument that is
only ever passed as "0," so remove these arguments.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_buf_zero is the only caller of xfs_buf_iomove. Remove support
for copying from or to the buffer in xfs_buf_iomove and merge the
two functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create a separate magic16 check function so that we don't run afoul of
static checkers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The inode btree verifier code is shared between the inode btree and
free inode btree because the underlying metadata formats are
essentially equivalent. A side effect of this is that the verifier
cannot determine whether a particular btree block should have an
inobt or finobt magic value.
This logic allows an unfortunate xfs_repair bug to escape detection
where certain level > 0 nodes of the finobt are stamped with inobt
magic by xfs_repair finobt reconstruction. This is fortunately not a
severe problem since the inode btree magic values do not contribute
to any changes in kernel behavior, but we do need a means to detect
and prevent this problem in the future.
Add a field to xfs_buf_ops to store the v4 and v5 superblock magic
values expected by a particular verifier. Add a helper to check an
on-disk magic value against the value expected by the verifier. Call
the helper from the shared [f]inobt verifier code for magic value
verification. This ensures that the inode btree blocks each have the
appropriate magic value based on specific tree type and superblock
version.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Improve the documentation around xfs_buf_ensure_ops, which is the
function that is responsible for cleaning up the b_ops state of buffers
that go through xrep_findroot_block but don't match anything. Rename
the function to xfs_buf_reverify.
[darrick: this started off as bfoster mods of a previous patch of mine,
but the renaming part is now this separate patch.]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
If a caller supplies buffer ops when trying to read a buffer and the
buffer doesn't already have buf ops assigned, ensure that the ops are
assigned to the buffer and the verifier is run on that buffer.
Note that current XFS code is careful to assign buffer ops after a
xfs_{trans_,}buf_read call in which ops were not supplied. However, we
should apply ops defensively in case there is ever a coding mistake; and
an upcoming repair patch will need to be able to read a buffer without
assigning buf ops.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The old lock tracking infrastructure in xfs using the b_last_holder
field seems to only be useful if you can get into the system with a
debugger; it seems that the existing tracepoints would be the way to
go these days, and this old infrastructure can be removed.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The buffer I/O submission path consists of separate function calls
per type. The buffer I/O type is already controlled via buffer
state (XBF_ASYNC), however, so there is no real need for separate
submission functions.
Combine the buffer submission functions into a single function that
processes the buffer appropriately based on XBF_ASYNC. Retain an
internal helper with a conditional wait parameter to continue to
support batched !XBF_ASYNC submission/completion required by delwri
queues.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>