Commit Graph

148 Commits

Author SHA1 Message Date
Pavel Reichl e963e209c2 xfs: move log discard work to xfs_discard.c
JIRA: https://issues.redhat.com/browse/RHEL-7961

Because we are going to use the same list-based discard submission
interface for fstrim-based discards, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit 428c4435b063bebc7eddcd7d311546d6933efa62)
Signed-off-by: Pavel Reichl <preichl@redhat.com>
2024-06-20 00:18:25 +02:00
Bill O'Donnell 1ce0a32435 xfs: fix per-cpu CIL structure aggregation racing with dying cpus
JIRA: https://issues.redhat.com/browse/RHEL-15844

commit ecd49f7a36fbccc884471f86fc43de6ca8d1f786
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Mon Sep 11 08:39:02 2023 -0700

    xfs: fix per-cpu CIL structure aggregation racing with dying cpus

    In commit 7c8ade2121200 ("xfs: implement percpu cil space used
    calculation"), the XFS committed (log) item list code was converted to
    use per-cpu lists and space tracking to reduce cpu contention when
    multiple threads are modifying different parts of the filesystem and
    hence end up contending on the log structures during transaction commit.
    Each CPU tracks its own commit items and space usage, and these do not
    have to be merged into the main CIL until either someone wants to push
    the CIL items, or we run over a soft threshold and switch to slower (but
    more accurate) accounting with atomics.

    Unfortunately, the for_each_cpu iteration suffers from the same race
    with cpu dying problem that was identified in commit 8b57b11cca88f
    ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
    cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
    result, both CIL percpu structure aggregation functions fail to collect
    the items and accounted space usage at the correct point in time.

    If we're lucky, the items that are collected from the online cpus exceed
    the space given to those cpus, and the log immediately shuts down in
    xlog_cil_insert_items due to the (apparent) log reservation overrun.
    This happens periodically with generic/650, which exercises cpu hotplug
    vs. the filesystem code:

    smpboot: CPU 3 is now offline
    XFS (sda3): ctx ticket reservation ran out. Need to up reservation
    XFS (sda3): ticket reservation summary:
    XFS (sda3):   unit res    = 9268 bytes
    XFS (sda3):   current res = -40 bytes
    XFS (sda3):   original count  = 1
    XFS (sda3):   remaining count = 1
    XFS (sda3): Filesystem has been shut down due to log error (0x2).

    Applying the same sort of fix from 8b57b11cca88f to the CIL code seems
    to make the generic/650 problem go away, but I've been told that tglx
    was not happy when he saw:

    "...the only thing we actually need to care about is that
    percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
    there are no CPUs dying, it has no additional overhead except for a
    cpumask_or() operation."

    The CPU hotplug code is rather complex and difficult to understand and I
    don't want to try to understand the cpu hotplug locking well enough to
    use cpu_dying mask.  Furthermore, there's a performance improvement that
    could be had here.  Attach a private cpu mask to the CIL structure so
    that we can track exactly which cpus have accessed the percpu data at
    all.  It doesn't matter if the cpu has since gone offline; log item
    aggregation will still find the items.  Better yet, we skip cpus that
    have not recently logged anything.

    Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU
    hot remove racing with an xfs mount can crash if the cpu_dead notifier
    tries to access the log but the mount hasn't yet set up the log.

    Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
    Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
    Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
    Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/
    Cc: tglx@linutronix.de
    Cc: peterz@infradead.org
    Reported-by: ritesh.list@gmail.com
    Reported-by: sandeen@sandeen.net
    Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-11-19 16:21:09 -06:00
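
A minimal sketch of the private-cpumask fix described above. The shape follows the commit message; the field and helper names here (xc_cpumask, xlog_cil_pcp, and so on) are illustrative rather than a verbatim copy of the patch:

    struct xfs_cil {
            struct xlog_cil_pcp __percpu    *xc_pcp;
            struct cpumask          xc_cpumask;     /* cpus that touched xc_pcp */
            /* ... */
    };

    /* commit fast path: record that this cpu has dirtied its pcp state */
    static void xlog_cil_pcp_insert(struct xfs_cil *cil,
                                    struct xfs_log_item *lip)
    {
            int                     cpu = get_cpu();
            struct xlog_cil_pcp     *cilpcp = this_cpu_ptr(cil->xc_pcp);

            if (!cpumask_test_cpu(cpu, &cil->xc_cpumask))
                    cpumask_set_cpu(cpu, &cil->xc_cpumask);
            list_add_tail(&lip->li_cil, &cilpcp->log_items);
            put_cpu();
    }

    /*
     * Aggregation walks the private mask, not cpu_online_mask, so items
     * logged on a cpu that has since gone offline are still collected.
     */
    static void xlog_cil_pcp_aggregate(struct xfs_cil *cil,
                                       struct list_head *items)
    {
            int     cpu;

            for_each_cpu(cpu, &cil->xc_cpumask) {
                    struct xlog_cil_pcp *cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);

                    list_splice_init(&cilpcp->log_items, items);
            }
    }
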
Bill O'Donnell c1f32f65de xfs: xlog_sync() manually adjusts grant head space
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit d9f68777b2515452828d97b521ff8e3517b42eb1
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:56:09 2022 +1000

    xfs: xlog_sync() manually adjusts grant head space

    When xlog_sync() rounds off the tail of the iclog that is being
    flushed, it manually subtracts that space from the grant heads. This
    space is actually reserved by the transaction ticket that covers
    the xlog_sync() call from xlog_write(), but we don't plumb the
    ticket down far enough for it to account for the space consumed in
    the current log ticket.

    The grant heads are hot, so we really should be accounting this to
    the ticket if we can, rather than adding thousands of extra grant
    head updates every CIL commit.

    Interestingly, this actually indicates a potential log space overrun
    can occur when we force the log. By the time that xfs_log_force()
    pushes out an active iclog and consumes the roundoff space, the
    reservation for that roundoff space has been returned to the grant
    heads and is no longer covered by a reservation. In theory the
    roundoff added to log force on an already full log could push the
    write head past the tail. In practice, the CIL commit that writes to
    the log and needs the iclog pushed will have reserved space for
    roundoff, so when it releases the ticket there will still be
    physical space for the roundoff to be committed to the log, even
    though it is no longer reserved. This roundoff won't be enough space
    to allow a transaction to be woken if the log is full, so overruns
    should not actually occur in practice.

    That said, it indicates that we should not release the CIL context
    log ticket until after we've released the commit iclog. It also
    means that xlog_sync() still needs the direct grant head
    manipulation if we don't provide it with a ticket. Log forces are
    rare when we are in fast paths running 1.5 million transactions/s
    that make the grant heads hot, so let's optimise the hot case and
    pass CIL log tickets down to the xlog_sync() code.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:38 -05:00
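
The accounting change reduces to a sketch like the one below, assuming the t_curr_res ticket field and an xlog_grant_sub_space()-style helper from the upstream log code; the no-ticket branch is the pre-existing fallback:

    /* xlog_sync() roundoff accounting, with an optional CIL ticket */
    static void xlog_sync_roundoff(struct xlog *log,
                                   struct xlog_ticket *ticket,
                                   int roundoff)
    {
            if (ticket) {
                    /* hot path: charge the ticket, no grant head atomics */
                    ticket->t_curr_res -= roundoff;
                    return;
            }
            /* no ticket (e.g. a log force): adjust grant heads directly */
            xlog_grant_sub_space(log, &log->l_reserve_head.grant, roundoff);
            xlog_grant_sub_space(log, &log->l_write_head.grant, roundoff);
    }
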
Bill O'Donnell c477d293c3 xfs: avoid cil push lock if possible
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 1ccb0745a97fb8b38913b39b8ecb1aea39fdbcb0
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:56:08 2022 +1000

    xfs: avoid cil push lock if possible

    Because now it hurts when the CIL fills up.

      - 37.20% __xfs_trans_commit
          - 35.84% xfs_log_commit_cil
             - 19.34% _raw_spin_lock
                - do_raw_spin_lock
                     19.01% __pv_queued_spin_lock_slowpath
             - 4.20% xfs_log_ticket_ungrant
                  0.90% xfs_log_space_wake

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:37 -05:00
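
The essence of the change is an unlocked early-out before xc_push_lock is taken. A sketch, assuming the aggregated space counter and the XLOG_CIL_SPACE_LIMIT() threshold from the related patches in this series:

    static void xlog_cil_push_background(struct xlog *log)
    {
            struct xfs_cil  *cil = log->l_cilp;
            int             space_used = atomic_read(&cil->xc_ctx->space_used);

            /* below the background threshold and nobody throttled: done */
            if (space_used < XLOG_CIL_SPACE_LIMIT(log) &&
                !waitqueue_active(&cil->xc_push_wait))
                    return;

            spin_lock(&cil->xc_push_lock);
            /* ... recheck under the lock and queue the push work ... */
            spin_unlock(&cil->xc_push_lock);
    }
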
Bill O'Donnell 05c9d4b182 xfs: move CIL ordering to the logvec chain
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 4eb56069cb2835fafe569e27e746e6a4c9735186
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:56:08 2022 +1000

    xfs: move CIL ordering to the logvec chain

    Adding a list_sort() call to the CIL push work while the xc_ctx_lock
    is held exclusively has resulted in fairly long lock hold times and
    that stops all front end transaction commits from making progress.

    We can move the sorting out of the xc_ctx_lock if we can transfer
    the ordering information to the log vectors as they are detached
    from the log items and then we can sort the log vectors.  With these
    changes, we can move the list_sort() call to just before we call
    xlog_write() when we aren't holding any locks at all.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:37 -05:00
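
With order IDs carried on the log vectors, the deferred ordering becomes a plain list_sort() on the detached chain, run with no CIL locks held. A sketch of the comparison helper and the call site:

    #include <linux/list_sort.h>

    static int xlog_cil_order_cmp(void *priv, const struct list_head *a,
                                  const struct list_head *b)
    {
            struct xfs_log_vec *l1 = container_of(a, struct xfs_log_vec, lv_list);
            struct xfs_log_vec *l2 = container_of(b, struct xfs_log_vec, lv_list);

            return l1->lv_order_id > l2->lv_order_id;
    }

    /* in the push work, after xc_ctx_lock has been dropped */
    list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
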
Bill O'Donnell 709be3d0cf xfs: convert log vector chain to use list heads
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 169248536a2b28e4228ba63772936c1ba979c9c0
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:55:59 2022 +1000

    xfs: convert log vector chain to use list heads

    Because the next change is going to require sorting log vectors, and
    that requires arbitrary rearrangement of the list which cannot be
    done easily with a singly linked list.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:37 -05:00
Bill O'Donnell 91d5aa42fa xfs: convert CIL to unordered per cpu lists
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c0fb4765c5086cfd00f1158f5f44e7e1906530ad
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:54:59 2022 +1000

    xfs: convert CIL to unordered per cpu lists

    So that we can remove the cil_lock which is a global serialisation
    point. We've already got ordering sorted, so all we need to do is
    treat the CIL list like the busy extent list and reconstruct it
    before the push starts.

    This is what we're trying to avoid:

     -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
        - 46.35% xfs_log_commit_cil
           - 41.54% _raw_spin_lock
              - 67.30% do_raw_spin_lock
                   66.96% __pv_queued_spin_lock_slowpath

    Which happens on a 32p system when running a 32-way 'rm -rf'
    workload. After this patch:

    -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
       - 17.67% xfs_log_commit_cil
          - 6.51% xfs_log_ticket_ungrant
               1.40% xfs_log_space_wake
            2.32% memcpy_erms
          - 2.18% xfs_buf_item_committing
             - 2.12% xfs_buf_item_release
                - 1.03% xfs_buf_unlock
                     0.96% up
                  0.72% xfs_buf_rele
            1.33% xfs_inode_item_format
            1.19% down_read
            0.91% up_read
            0.76% xfs_buf_item_format
          - 0.68% kmem_alloc_large
             - 0.67% kmem_alloc
                  0.64% __kmalloc
            0.50% xfs_buf_item_size

    It kinda looks like the workload is running out of log space all
    the time. But all the spinlock contention is gone and the
    transaction commit rate has gone from 800k/s to 1.3M/s so the amount
    of real work being done has gone up a *lot*.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:37 -05:00
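
A sketch of the two sides after the conversion: the commit path touches only the local cpu's list, and the push reconstructs a single list the same way the busy extent list is reaggregated. (The "racing with dying cpus" fix near the top of this log later replaces the online-mask walk with a private cpumask.)

    /* transaction commit: no cil_lock, just the local per-cpu list */
    cilpcp = get_cpu_ptr(cil->xc_pcp);
    list_add_tail(&lip->li_cil, &cilpcp->log_items);
    put_cpu_ptr(cilpcp);

    /* CIL push: splice the per-cpu lists back into one, then sort */
    for_each_online_cpu(cpu) {
            cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
            list_splice_init(&cilpcp->log_items, &cil_items);
    }
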
Bill O'Donnell a98fa29c6e xfs: Add order IDs to log items in CIL
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 016a23388cdcb2740deb1379dc408f21c84efb11
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:53:59 2022 +1000

    xfs: Add order IDs to log items in CIL

    Before we split the ordered CIL up into per cpu lists, we need a
    mechanism to track the order of the items in the CIL. We need to do
    this because there are rules around the order in which related items
    must physically appear in the log even inside a single checkpoint
    transaction.

    An example of this is intents - an intent must appear in the log
    before its intent done record so that log recovery can cancel the
    intent correctly. If we have these two records misordered in the
    CIL, then they will not be recovered correctly by journal replay.

    We also will not be able to move items to the tail of
    the CIL list when they are relogged, hence the log items will need
    some mechanism to allow the correct log item order to be recreated
    before we write log items to the journal.

    Hence we need to have a mechanism for recording the global order of
    transactions in the log items so that we can recover that order
    from unordered per-cpu lists.

    Do this with a simple monotonic increasing commit counter in the CIL
    context. Each log item in the transaction gets stamped with the
    current commit order ID before it is added to the CIL. If the item
    is already in the CIL, leave it where it is instead of moving it to
    the tail of the list and instead sort the list before we start the
    push work.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:36 -05:00
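
A sketch of the stamping, assuming a li_order_id field on the log item and an order_id counter in the CIL context as described above; relogged items keep their list position and only receive a fresh order ID:

    /* in xlog_cil_insert_items(): one order ID per transaction commit */
    ctx->order_id++;
    list_for_each_entry(lip, &tp->t_items, li_trans) {
            if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
                    continue;

            lip->li_order_id = ctx->order_id;
            if (!list_empty(&lip->li_cil))
                    continue;       /* relogged: leave it where it is */
            list_add_tail(&lip->li_cil, &cil->xc_cil);
    }
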
Bill O'Donnell b2b7df002b xfs: convert CIL busy extents to per-cpu
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit df7a4a2134b0a201c93e96efe4bb2be747f9da9f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:52:59 2022 +1000

    xfs: convert CIL busy extents to per-cpu

    To get them out from under the CIL lock.

    This is an unordered list, so we can simply punt it to per-cpu lists
    during transaction commits and reaggregate it back into a single
    list during the CIL push work.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:36 -05:00
Bill O'Donnell 5b779d9782 xfs: track CIL ticket reservation in percpu structure
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 1dd2a2c18e314ad89200f8296c86dd4ecd53dea6
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:51:59 2022 +1000

    xfs: track CIL ticket reservation in percpu structure

    To get it out from under the cil spinlock.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:36 -05:00
Bill O'Donnell 87b93c5418 xfs: implement percpu cil space used calculation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 7c8ade212120085439eddc4cfddfa29d41c3f426
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Jul 7 18:50:59 2022 +1000

    xfs: implement percpu cil space used calculation

    Now that we have the CIL percpu structures in place, implement the
    space used counter as a per-cpu counter.

    We have to be really careful now about ensuring that the checks and
    updates run without arbitrary delays, which means they need to run
    with pre-emption disabled. We do this by careful placement of
    the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures
    for that CPU.

    We need to be able to reliably detect that the CIL has reached
    the hard limit threshold so we can take extra reservations for the
    iclog headers when the space used overruns the original reservation.
    Hence we factor out xlog_cil_over_hard_limit() from
    xlog_cil_push_background().

    The global CIL space used is an atomic variable that is backed by
    per-cpu aggregation to minimise the number of atomic updates we do
    to the global state in the fast path. While we are under the soft
    limit, we aggregate only when the per-cpu aggregation is over the
    proportion of the soft limit assigned to that CPU. This means that
    all CPUs can use all but one byte of their aggregation threshold
    and we will not go over the soft limit.

    Hence once we detect that we've gone over both a per-cpu aggregation
    threshold and the soft limit, we know that we have only
    exceeded the soft limit by one per-cpu aggregation threshold. Even
    if all CPUs hit this at the same time, we can't be over the hard
    limit, so we can run an aggregation back into the atomic counter
    at this point and still be under the hard limit.

    At this point, we will be over the soft limit and hence we'll
    aggregate into the global atomic used space directly rather than the
    per-cpu counters, hence providing accurate detection of hard limit
    excursion for accounting and reservation purposes.

    Hence we get the best of both worlds - lockless, scalable per-cpu
    fast path plus accurate, atomic detection of hard limit excursion.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:36 -05:00
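
A sketch of the two-regime accounting described above: below the soft limit each cpu accumulates privately and folds into the global atomic only when it crosses its share of the soft limit; over the soft limit, updates go straight to the atomic. Structure and names are illustrative:

    static void xlog_cil_account_space(struct xlog *log, struct xfs_cil *cil,
                                       struct xfs_cil_ctx *ctx, int len)
    {
            struct xlog_cil_pcp     *cilpcp;
            int                     space_used = atomic_read(&ctx->space_used);

            cilpcp = get_cpu_ptr(cil->xc_pcp);
            if (space_used >= XLOG_CIL_SPACE_LIMIT(log)) {
                    /* over the soft limit: accurate atomic accounting */
                    atomic_add(len, &ctx->space_used);
            } else {
                    cilpcp->space_used += len;
                    /*
                     * Fold once this cpu exceeds its share of the soft
                     * limit; even if every cpu folds at the same moment,
                     * the sum stays under the hard limit.
                     */
                    if (cilpcp->space_used >
                        XLOG_CIL_SPACE_LIMIT(log) / num_online_cpus()) {
                            atomic_add(cilpcp->space_used, &ctx->space_used);
                            cilpcp->space_used = 0;
                    }
            }
            put_cpu_ptr(cilpcp);
    }
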
Bill O'Donnell 8da0b4c669 xfs: introduce per-cpu CIL tracking structure
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit af1c2146a50b1ffe7e10cae1f7e64ab56b7f8c1f
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Jul 2 02:13:52 2022 +1000

    xfs: introduce per-cpu CIL tracking structure

    The CIL push lock is highly contended on larger machines, becoming a
    hard bottleneck at about 700,000 transaction commits/s on >16p
    machines. To address this, start moving the CIL tracking
    infrastructure to utilise per-CPU structures.

    We need to track the space used, the amount of log reservation space
    reserved to write the CIL, the log items in the CIL and the busy
    extents that need to be completed by the CIL commit.  This requires
    a couple of per-cpu counters, an unordered per-cpu list and a
    globally ordered per-cpu list.

    Create a per-cpu structure to hold these and all the management
    interfaces needed, as well as the hooks to handle hotplug CPUs.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:35 -05:00
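
The per-cpu structure itself is small. A sketch of it and of the hotplug registration, with the callback body elided; CPUHP_XFS_DEAD is the state named in the "racing with dying cpus" fix near the top of this log, the rest is illustrative:

    /* per-cpu CIL tracking: two counters, two lists */
    struct xlog_cil_pcp {
            int32_t                 space_used;
            uint32_t                space_reserved;
            struct list_head        busy_extents;
            struct list_head        log_items;
    };

    /* fold a dying cpu's state back into the CIL */
    static int xlog_cil_pcp_dead(unsigned int cpu)
    {
            /* splice log_items/busy_extents back, fold space_used, ... */
            return 0;
    }

    /* registration at init time */
    error = cpuhp_setup_state_nocalls(CPUHP_XFS_DEAD, "xfs:dead",
                                      NULL, xlog_cil_pcp_dead);
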
Bill O'Donnell f343c3e86f xfs: rework per-iclog header CIL reservation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 31151cc342dd9cc2c5a5954f3e7b2dcf2fb50f64
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Jul 2 02:12:52 2022 +1000

    xfs: rework per-iclog header CIL reservation

    For every iclog that a CIL push will use up, we need to ensure we
    have space reserved for the iclog header in each iclog. It is
    extremely difficult to do this accurately with a per-cpu counter
    without expensive summing of the counter in every commit. However,
    we know what the maximum CIL size is going to be because of the
    hard space limit we have, and hence we know exactly how many iclogs
    we are going to need to write out the CIL.

    We are constrained by the requirement that small transactions only
    have reservation space for a single iclog header built into them.
    At commit time we don't know how much of the current transaction
    reservation is made up of iclog header reservations as calculated by
    xfs_log_calc_unit_res() when the ticket was reserved. As larger
    reservations have multiple header spaces reserved, we can steal
    more than one iclog header reservation at a time, but we only steal
    the exact number needed for the given log vector size delta.

    As a result, we don't know exactly when we are going to steal iclog
    header reservations, nor do we know exactly how many we are going to
    need for a given CIL.

    To make things simple, start by calculating the worst case number of
    iclog headers a full CIL push will require. Record this into an
    atomic variable in the CIL. Then add a byte counter to the log
    ticket that records exactly how much iclog header space has been
    reserved in this ticket by xfs_log_calc_unit_res(). This tells us
    exactly how much space we can steal from the ticket at transaction
    commit time.

    Now, at transaction commit time, we can check if the CIL has a full
    iclog header reservation and, if not, steal the entire reservation
    the current ticket holds for iclog headers. This minimises the
    number of times we need to do atomic operations in the fast path,
    but still guarantees we get all the reservations we need.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:35 -05:00
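
A sketch of the commit-time stealing, assuming an atomic header pool on the CIL (xc_iclog_hdrs here) and a per-ticket record of the iclog header space that xfs_log_calc_unit_res() built into the reservation; all names are illustrative:

    /* transaction commit: top up the checkpoint's header reservation */
    if (atomic_read(&cil->xc_iclog_hdrs) < xlog_cil_iclog_hdrs(log)) {
            int     steal = tic->t_iclog_hdrs * log->l_iclog_hsize;

            /* move this ticket's entire header reservation to the CIL */
            atomic_add(tic->t_iclog_hdrs, &cil->xc_iclog_hdrs);
            tic->t_curr_res -= steal;
            ctx->ticket->t_curr_res += steal;
            ctx->ticket->t_unit_res += steal;
    }

where xlog_cil_iclog_hdrs(log) would return the worst-case header count for a hard-limit-sized CIL push.
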
Bill O'Donnell 9666a36223 xfs: lift init CIL reservation out of xc_cil_lock
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 12380d237b819bd6cf2183f10b55ab47cdaab5e6
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Jul 2 02:11:52 2022 +1000

    xfs: lift init CIL reservation out of xc_cil_lock

    The xc_cil_lock is the most highly contended lock in XFS now. To
    start the process of getting rid of it, lift the initial reservation
    of the CIL log space out from under the xc_cil_lock.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:35 -05:00
Bill O'Donnell de4b062f1c xfs: use the CIL space used counter for emptiness checks
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 88591e7f06a42b241ca2a3e3e7067a8f17661947
Author: Dave Chinner <dchinner@redhat.com>
Date:   Sat Jul 2 02:10:52 2022 +1000

    xfs: use the CIL space used counter for emptiness checks

    In the next patches we are going to make the CIL list itself
    per-cpu, and so we cannot use list_empty() to check if the list is
    empty. Replace the list_empty() checks with a flag in the CIL to
    indicate we have committed at least one transaction to the CIL and
    hence the CIL is not empty.

    We need this flag to be an atomic so that we can clear it without
    holding any locks in the commit fast path, but we also need to be
    careful to avoid atomic operations in the fast path. Hence we use
    the fact that test_bit() is not an atomic op to first check if the
    flag is set and then run the atomic test_and_clear_bit() operation
    to clear it and steal the initial unit reservation for the CIL
    context checkpoint.

    When we are switching to a new context in a push, we place the
    setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
    allows all the other places that need to check whether the CIL is
    empty to use test_bit() and still be serialised correctly with the
    CIL context swaps that set the bit.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:35 -05:00
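
A sketch of the fast-path pattern: the non-atomic test_bit() keeps commits from dirtying the flag's cacheline, and only the one committer that actually clears the bit pays for the atomic and steals the unit reservation:

    /* commit fast path */
    if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
        test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
            /* first commit into this context takes the unit reservation */
            ctx_res = ctx->ticket->t_unit_res;
    }

    /* context switch during a push, serialised against the checks */
    spin_lock(&cil->xc_push_lock);
    set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
    spin_unlock(&cil->xc_push_lock);
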
Bill O'Donnell a3bc053561 xfs: can't use kmem_zalloc() for attribute buffers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 45ff8b471cdc58701a7ba5c5dcd8dfc57ae06829
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu May 12 15:12:57 2022 +1000

    xfs: can't use kmem_zalloc() for attribute buffers

    Because heap allocation of 64kB buffers will fail:

    ....
     XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
     XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
    ....

    I'd use kvmalloc() instead, but....

    - 48.19% xfs_attr_create_intent
      - 46.89% xfs_attri_init
         - kvmalloc_node
            - 46.04% __kmalloc_node
               - kmalloc_large_node
                  - 45.99% __alloc_pages
                     - 39.39% __alloc_pages_slowpath.constprop.0
                        - 38.89% __alloc_pages_direct_compact
                           - 38.71% try_to_compact_pages
                              - compact_zone_order
                              - compact_zone
                                 - 21.09% isolate_migratepages_block
                                      10.31% PageHuge
                                      5.82% set_pfnblock_flags_mask
                                      0.86% get_pfnblock_flags_mask
                                 - 4.48% __reset_isolation_suitable
                                      4.44% __reset_isolation_pfn
                                 - 3.56% __pageblock_pfn_to_page
                                      1.33% pfn_to_online_page
                                   2.83% get_pfnblock_flags_mask
                                 - 0.87% migrate_pages
                                      0.86% compaction_alloc
                                   0.84% find_suitable_fallback
                     - 6.60% get_page_from_freelist
                          4.99% clear_page_erms
                        - 1.19% _raw_spin_lock_irqsave
                           - do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
            - 0.86% __vmalloc_node_range
                 0.65% __alloc_pages_bulk

    .... this is just yet another reminder of how much kvmalloc() sucks.
    So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use
    that instead....

    We also clean up the attribute name and value lengths as they no
    longer need to be rounded out to sizes compatible with log vectors.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:24 -05:00
Bill O'Donnell c84b4e1db6 xfs: intent item whiteouts
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 0d227466be84332d1888724e1e74dac34bff6d71
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 4 11:50:29 2022 +1000

    xfs: intent item whiteouts

    When we log modifications based on intents, we add both intent
    and intent done items to the modification being made. These get
    written to the log to ensure that the operation is re-run if the
    intent done is not found in the log.

    However, for operations that complete wholly within a single
    checkpoint, the change in the checkpoint is atomic and will never
    need replay. In this case, we don't need to actually write the
    intent and intent done items to the journal because log recovery
    will never need to manually restart this modification.

    Log recovery currently handles intent/intent done matching by
    inserting the intent into the AIL, then removing it when a matching
    intent done item is found. Hence for all the intent-based operations
    that complete within a checkpoint, we spend all that time parsing
    the intent/intent done items just to cancel them and do nothing with
    them.

    Hence it follows that the only time we actually need intents in the
    log is when the modification crosses checkpoint boundaries in the
    log and so may only be partially complete in the journal. Hence if
    we commit an intent done item to the CIL and the intent item is in
    the same checkpoint, we don't actually have to write them to the
    journal because log recovery will always cancel the intents.

    We've never really worried about the overhead of logging intents
    unnecessarily like this because the intents we log are generally
    very much smaller than the change being made. e.g. freeing an extent
    involves modifying at least two freespace btree blocks and the AGF,
    so the EFI/EFD overhead is only a small increase in space and
    processing time compared to the overall cost of freeing an extent.

    However, delayed attributes change this cost equation dramatically,
    especially for inline attributes. In the case of adding an inline
    attribute, we only log the inode core and attribute fork at present.
    With delayed attributes, we now log the attr intent which includes
    the name and value, the inode core and attr fork, and finally the
    attr intent done item. We increase the number of items we log from 1
    to 3, and the number of log vectors (regions) goes up from 3 to 7.
    Hence we triple the number of objects that the CIL has to process,
    and more than double the number of log vectors that need to be
    written to the journal.

    At scale, this means delayed attributes cause a non-pipelined CIL to
    become CPU bound processing all the extra items, resulting in a > 40%
    performance degradation on 16-way file+xattr create workloads.
    Pipelining the CIL (as per 5.15) reduces the performance degradation
    to 20%, but now the limitation is the rate at which the log items
    can be written to the iclogs and iclogs be dispatched for IO and
    completed.

    Even log IO completion is slowed down by these intents, because it
    now has to process 3x the number of items in the checkpoint.
    Processing completed intents is especially inefficient here, because
    we first insert the intent into the AIL, then remove it from the AIL
    when the intent done is processed. IOWs, we are also doing expensive
    operations in log IO completion we could completely avoid if we
    didn't log completed intent/intent done pairs.

    Enter log item whiteouts.

    When an intent done is committed, we can check to see if the
    associated intent is in the same checkpoint as we are currently
    committing the intent done to. If so, we can mark the intent log
    item with a whiteout and immediately free the intent done item
    rather than committing it to the CIL. We can basically skip the
    entire formatting and CIL insertion steps for the intent done item.

    However, we cannot remove the intent item from the CIL at this point
    because the unlocked per-cpu CIL item lists do not permit removal
    without holding the CIL context lock exclusively. Transaction commit
    only holds the context lock shared, hence the best we can do is mark
    the intent item with a whiteout so that the CIL push can release it
    rather than writing it to the log.

    This means we never write the intent to the log if the intent done
    has also been committed to the same checkpoint, but we'll always
    write the intent if the intent done has not been committed or has
    been committed to a different checkpoint. This will result in
    correct log recovery behaviour in all cases, without the overhead of
    logging unnecessary intents.

    This intent whiteout concept is generic - we can apply it to all
    intent/intent done pairs that have a direct 1:1 relationship. The
    way deferred ops iterate and relog intents mean that all intents
    currently have a 1:1 relationship with their done intent, and hence
    we can apply this cancellation to all existing intent/intent done
    implementations.

    For delayed attributes with a 16-way 64kB xattr create workload,
    whiteouts reduce the amount of journalled metadata from ~2.5GB/s
    down to ~600MB/s and improve the creation rate from 9000/s to
    14000/s.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:14 -05:00
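
The commit-side check reads roughly as below: for each intent-done item in the transaction, find the intent it cancels, and if that intent lives in the current checkpoint, whiteout the intent and free the done item instead of inserting it. A sketch; treat the helper names as illustrative:

    static bool xlog_cil_process_intents(struct xfs_cil *cil,
                                         struct xfs_trans *tp)
    {
            struct xfs_log_item     *lip, *ilip, *next;
            bool                    whiteout = false;

            list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
                    if (!xlog_item_is_intent_done(lip))
                            continue;

                    /* the intent this done item cancels */
                    ilip = lip->li_ops->iop_intent(lip);
                    if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
                            continue;

                    /* the push will free the intent instead of logging it */
                    set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);

                    /* drop the done item: never formatted, never inserted */
                    xfs_trans_del_item(lip);
                    lip->li_ops->iop_release(lip);
                    whiteout = true;
            }
            return whiteout;
    }
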
Bill O'Donnell ef54ac32d0 xfs: factor and move some code in xfs_log_cil.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 22b1afc57e42da13f840d630ad484d4d99504839
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed May 4 11:46:30 2022 +1000

    xfs: factor and move some code in xfs_log_cil.c

    In preparation for adding support for intent item whiteouts.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:13 -05:00
Bill O'Donnell 5d2f8935a2 xfs: CIL context doesn't need to count iovecs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 593e34391faafd72102bd79c43994f32e9dd0c91
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:36:56 2022 +1000

    xfs: CIL context doesn't need to count iovecs

    Now that we account for log opheaders in the log item formatting
    code, we don't actually use the aggregated count of log iovecs in
    the CIL for anything. Remove it and the tracking code that
    calculates it.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:05 -05:00
Bill O'Donnell d9caa4693c xfs: xlog_write() doesn't need optype anymore
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 14b07ecd5cd2545fcee3ff29119405f7cf2f59ad
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:36:48 2022 +1000

    xfs: xlog_write() doesn't need optype anymore

    So remove it from the interface and callers.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:04 -05:00
Bill O'Donnell 92b04654a8 xfs: pass lv chain length into xlog_write()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit d80fc2914f9125a723d9af7038b1592fa8d1ea96
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:35:19 2022 +1000

    xfs: pass lv chain length into xlog_write()

    The caller of xlog_write() usually has a close accounting of the
    aggregated vector length contained in the log vector chain passed to
    xlog_write(). There is no need to iterate the chain to calculate the
    length of the data in xlog_write_calculate_len() if the caller is
    already iterating that chain to build it.

    Passing in the vector length avoids doing an extra chain iteration,
    which can be a significant amount of work given that large CIL
    commits can have hundreds of thousands of vectors attached to the
    chain.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:03 -05:00
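
A sketch of the idea, with the xlog_write() argument list abbreviated since the exact signature changes over this series: the byte count is accumulated while the chain is built, then handed straight to xlog_write():

    /* chain the new log vector and account its bytes in the same pass */
    lv->lv_next = NULL;
    *chain_tail = lv;
    chain_tail = &lv->lv_next;
    num_bytes += lv->lv_bytes;

    /* ... */

    /* no xlog_write_calculate_len() re-walk of the chain needed */
    error = xlog_write(log, lv_chain, ticket, /* ..., */ num_bytes);
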
Bill O'Donnell 73db27bc4d xfs: reserve space and initialise xlog_op_header in item formatting
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 8d547cf9d2392585204075243f29022a619550f2
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:34:59 2022 +1000

    xfs: reserve space and initialise xlog_op_header in item formatting

    Current xlog_write() adds op headers to the log manually for every
    log item region that is in the vector passed to it. While
    xlog_write() needs to stamp the transaction ID into the ophdr, we
    already know its length, flags, clientid, etc. at CIL commit time.

    This means the only time that xlog_write() really needs to format
    reserve space for a new ophdr is when a region is split across two
    iclogs. Adding the opheader and accounting for it as part of the
    normal formatted item region means we simplify the accounting
    of space used by a transaction and we don't have to special case
    reserving space for the ophdrs in xlog_write(). It also means
    we can largely initialise the ophdr in transaction commit instead
    of xlog_write, making the xlog_write formatting inner loop much
    tighter.

    xlog_prepare_iovec() is now too large to stay as an inline function,
    so we move it out of line and into xfs_log.c.

    Object sizes:
    text       data     bss     dec     hex filename
    1125934  305951     484 1432369  15db31 fs/xfs/built-in.a.before
    1123360  305951     484 1429795  15d123 fs/xfs/built-in.a.after

    So the code is roughly 2.5kB smaller with xlog_prepare_iovec() now
    out of line, even though it grew in size itself.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:02 -05:00
Bill O'Donnell e1aa62f1cd xfs: log tickets don't need log client id
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit c7610dceed39d978ef1ee0f2ab5a3c8d2d54d120
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:34:33 2022 +1000

    xfs: log tickets don't need log client id

    We currently set the log ticket client ID when we reserve a
    transaction. This client ID is only ever written to the log by
    a CIL checkpoint or unmount records, and so anything using a high
    level transaction allocated through xfs_trans_alloc() does not need
    a log ticket client ID to be set.

    For the CIL checkpoint, the client ID written to the journal is
    always XFS_TRANSACTION, and for the unmount record it is always
    XFS_LOG, and nothing else writes to the log. All of these operations
    tell xlog_write() exactly what they need to write to the log (the
    optype) and build their own opheaders for start, commit and unmount
    records. Hence we no longer need to set the client id in either the
    log ticket or the xfs_trans.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:02 -05:00
Bill O'Donnell f735250293 xfs: embed the xlog_op_header in the commit record
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 54021b624261fe5b429d7ab4d081c3b2cca153a8
Author: Dave Chinner <david@fromorbit.com>
Date:   Thu Apr 21 10:34:15 2022 +1000

    xfs: embed the xlog_op_header in the commit record

    Remove the final case where xlog_write() has to prepend an opheader
    to a log transaction. Similar to the start record, the commit record
    is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can
    just make this the payload for the region being passed to
    xlog_write() and remove the special handling in xlog_write() for
    the commit record.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:02 -05:00
Bill O'Donnell b50c4efedc xfs: only CIL pushes require a start record
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 6eaed95e21a0872692246e63cb45542d0f62c922
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:33:48 2022 +1000

    xfs: only CIL pushes require a start record

    So move the one-off start record writing in xlog_write() out into
    the static header that the CIL push builds to write into the log
    initially. This simplifies the xlog_write() logic a lot.

    pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
    32 bit aligned and packed for copying the log op and transaction
    headers directly into the log as a single log region copy.

    struct xlog_cil_trans_hdr {
            struct xlog_op_header      oph[2];               /*     0    24 */
            struct xfs_trans_header    thdr;                 /*    24    16 */
            struct xfs_log_iovec       lhdr[2];              /*    40    32 */

            /* size: 72, cachelines: 2, members: 3 */
            /* last cacheline: 8 bytes */
    };

    A wart is needed to handle the fact that the length of the region the
    opheader points to doesn't include the opheader length. Hence if
    we embed the opheader, we have to subtract the opheader length from
    the length written into the opheader by the generic copying code.
    This will eventually go away when everything is converted to
    embedded opheaders.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:01 -05:00
Bill O'Donnell e220984650 xfs: factor out the CIL transaction header building
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 735fbf67df56f402e9baa079a5560ebe8fa049c1
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Apr 21 10:33:23 2022 +1000

    xfs: factor out the CIL transaction header building

    It is static code deep in the middle of the CIL push logic. Factor
    it out into a helper so that it is clear and easy to modify
    separately.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:11:01 -05:00
Bill O'Donnell 5fe6076018 xfs: drop async cache flushes from CIL commits.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: minor line numbering changes from upstream commit

commit 919edbadebe17a67193533f531c2920c03e40fa4
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:22:02 2022 -0700

    xfs: drop async cache flushes from CIL commits.

    Jan Kara reported a performance regression in dbench that he
    bisected down to commit bad77c375e ("xfs: CIL checkpoint
    flushes caches unconditionally").

    Whilst developing the journal flush/fua optimisations this cache
    flush was part of, it appeared to make a significant difference to
    performance. However, now that this patchset has settled and all the
    correctness issues fixed, there does not appear to be any
    significant performance benefit to asynchronous cache flushes.

    In fact, the opposite is true on some storage types and workloads,
    where additional cache flushes that can occur from fsync heavy
    workloads have measurable and significant impact on overall
    throughput.

    Local dbench testing shows little difference on dbench runs with
    sync vs async cache flushes on either fast or slow SSD storage, and
    no difference in streaming concurrent async transaction workloads
    like fs-mark.

    Fast NVME storage.

    From `dbench -t 30`, CIL scale:

    clients         async                   sync
                    BW      Latency         BW      Latency
    1                935.18   0.855          915.64   0.903
    8               2404.51   6.873         2341.77   6.511
    16              3003.42   6.460         2931.57   6.529
    32              3697.23   7.939         3596.28   7.894
    128             7237.43  15.495         7217.74  11.588
    512             5079.24  90.587         5167.08  95.822

    fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize

            create          chown           unlink
    async   1m41s           1m16s           2m03s
    sync    1m40s           1m19s           1m54s

    Slower SATA SSD storage:

    From `dbench -t 30`, CIL scale:

    clients         async                   sync
                    BW      Latency         BW      Latency
    1                 78.59  15.792           83.78  10.729
    8                367.88  92.067          404.63  59.943
    16               564.51  72.524          602.71  76.089
    32               831.66 105.984          870.26 110.482
    128             1659.76 102.969         1624.73  91.356
    512             2135.91 223.054         2603.07 161.160

    fsmark, 16 threads, create w/32k logbsize

            create          unlink
    async   5m06s           4m15s
    sync    5m00s           4m22s

    And on Jan's test machine:

                       5.18-rc8-vanilla       5.18-rc8-patched
    Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
    Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
    Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
    Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
    Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
    Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
    Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*

    Performance and cache flush behaviour is restored to pre-regression
    levels.

    As such, we can now consider the async cache flush mechanism an
    unnecessary exercise in premature optimisation and hence we can
    now remove it and the infrastructure it requires completely.

    Fixes: bad77c375e ("xfs: CIL checkpoint flushes caches unconditionally")
    Reported-and-tested-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:54 -05:00
Bill O'Donnell 35f4bdef44 xfs: log shutdown triggers should only shut down the log
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit b5f17bec1213a3ed2f4d79ad4c566e00cabe2a9b
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Mar 29 18:22:01 2022 -0700

    xfs: log shutdown triggers should only shut down the log

    We've got a mess on our hands.

    1. xfs_trans_commit() cannot cancel transactions because the mount is
    shut down - that causes dirty, aborted, unlogged log items to sit
    unpinned in memory and potentially get written to disk before the
    log is shut down. Hence xfs_trans_commit() can only abort
    transactions when xlog_is_shutdown() is true.

    2. xfs_force_shutdown() is used in places to cause the current
    modification to be aborted via xfs_trans_commit() because it may be
    impractical or impossible to cancel the transaction directly, and
    hence xfs_trans_commit() must cancel transactions when
    xfs_is_shutdown() is true in this situation. But we can't do that
    because of #1.

    3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
    to shut down the mount and then the log from log IO completion.

    4. xfs_force_shutdown() can result in a log force being issued,
    which has to wait for log IO completion before it will mark the log
    as shut down. If #3 races with some other shutdown trigger that runs
    a log force, we rely on xfs_force_shutdown() silently ignoring #3
    and avoiding shutting down the log until the failed log force
    completes.

    5. To ensure #2 always works, we have to ensure that
    xfs_force_shutdown() does not return until the log is shut down.
    But in the case of #4, this will result in a deadlock because the
    log IO completion will block waiting for a log force to complete
    which is blocked waiting for log IO to complete....

    So the very first thing we have to do here to untangle this mess is
    dissociate log shutdown triggers from mount shutdowns. We already
    have xlog_forced_shutdown, which will atomically transistion to the
    log a shutdown state. Due to internal asserts it cannot be called
    multiple times, but was done simply because the only place that
    could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
    and that could only call it once and once only.  So the first thing
    we do is remove the asserts.

    We then convert all the internal log shutdown triggers to call
    xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
    allows the log shutdown triggers to shut down the log without
    needing to care about mount based shutdown constraints. This means
    we shut down the log independently of the mount and the mount may
    not notice this until its next attempt to read or modify metadata.
    At that point (e.g. xfs_trans_commit()) it will see that the log is
    shut down, error out and shut down the mount.

    To ensure that all the unmount behaviours and asserts track
    correctly as a result of a log shutdown, propagate the shutdown up
    to the mount if it is not already set. This keeps the mount and log
    state in sync, and saves a huge amount of hassle where code fails
    because of a log shutdown but only checks for mount shutdowns and
    hence ends up doing the wrong thing. Cleaning up that mess is
    an exercise for another day.

    This enables us to address the other problems noted above in
    followup patches.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:53 -05:00
Bill O'Donnell 3ebeea6d54 xfs: reduce kvmalloc overhead for CIL shadow buffers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers

    Oh, let me count the ways that the kvmalloc API sucks dog eggs.

    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:

         - 49.73% xlog_cil_commit
             - 31.62% kvmalloc_node
                - 29.96% __kmalloc_node
                   - 29.38% kmalloc_large_node
                      - 29.33% __alloc_pages
                         - 24.33% __alloc_pages_slowpath.constprop.0
                            - 18.35% __alloc_pages_direct_compact
                               - 17.39% try_to_compact_pages
                                  - compact_zone_order
                                     - 15.26% compact_zone
                                          5.29% __pageblock_pfn_to_page
                                          3.71% PageHuge
                                        - 1.44% isolate_migratepages_block
                                             0.71% set_pfnblock_flags_mask
                                       1.11% get_pfnblock_flags_mask
                               - 0.81% get_page_from_freelist
                                  - 0.59% _raw_spin_lock_irqsave
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
                            - 3.24% try_to_free_pages
                               - 3.14% shrink_node
                                  - 2.94% shrink_slab.constprop.0
                                     - 0.89% super_cache_count
                                        - 0.66% xfs_fs_nr_cached_objects
                                           - 0.65% xfs_reclaim_inodes_count
                                                0.55% xfs_perag_get_tag
                                       0.58% kfree_rcu_shrink_count
                            - 2.09% get_page_from_freelist
                               - 1.03% _raw_spin_lock_irqsave
                                  - do_raw_spin_lock
                                       __pv_queued_spin_lock_slowpath
                         - 4.88% get_page_from_freelist
                            - 3.66% _raw_spin_lock_irqsave
                               - do_raw_spin_lock
                                    __pv_queued_spin_lock_slowpath
                - 1.63% __vmalloc_node
                   - __vmalloc_node_range
                      - 1.10% __alloc_pages_bulk
                         - 0.93% __alloc_pages
                            - 0.92% get_page_from_freelist
                               - 0.89% rmqueue_bulk
                                  - 0.69% _raw_spin_lock
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
               13.73% memcpy_erms
             - 2.22% kvfree

    On this workload, that's almost a dozen CPUs all trying to compact
    and reclaim memory inside kvmalloc_node at the same time. Yet it is
    regularly falling back to vmalloc despite all that compaction, page
    and shrinker reclaim that direct reclaim is doing. Copying all the
    metadata is taking far less CPU time than allocating the storage!

    Direct reclaim should be considered extremely harmful.

    This is a high frequency, high throughput, CPU usage and latency
    sensitive allocation. We've got memory there, and we're using
    kvmalloc to allow memory allocation to avoid doing lots of work to
    try to do contiguous allocations.

    Except it still does *lots of costly work* that is unnecessary.

    Worse: the only way to avoid the slowpath page allocation trying to
    do compaction on costly allocations is to turn off direct reclaim
    (i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).

    Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
    GFP_KERNEL allocation context, so you only get kmalloc!". This
    cuts off the vmalloc fallback, and this leads to almost instant OOM
    problems which ends up in filesystems deadlocks, shutdowns and/or
    kernel crashes.

    I want some basic kvmalloc behaviour:

    - kmalloc for a contiguous range with fail fast semantics - no
      compaction direct reclaim if the allocation enters the slow path.
    - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

    The really, really stupid part about this is these kvmalloc() calls
    are run under memalloc_nofs task context, so all the allocations are
    always reduced to GFP_NOFS regardless of the fact that kvmalloc
    requires GFP_KERNEL to be passed in. IOWs, we're already telling
    kvmalloc to behave differently to the gfp flags we pass in, but it
    still won't allow vmalloc to be run with anything other than
    GFP_KERNEL.

    So, this patch open codes the kvmalloc() in the commit path to have
    the above described behaviour. The result is that we more than halve
    the CPU time spent doing kvmalloc() in this path, and the rate of
    transaction commits with 64kB objects in them more than doubles.
    i.e. we get ~5x
    reduction in CPU usage per costly-sized kvmalloc() invocation and
    the profile looks like this:

      - 37.60% xlog_cil_commit
            16.01% memcpy_erms
          - 8.45% __kmalloc
             - 8.04% kmalloc_order_trace
                - 8.03% kmalloc_order
                   - 7.93% alloc_pages
                      - 7.90% __alloc_pages
                         - 4.05% __alloc_pages_slowpath.constprop.0
                            - 2.18% get_page_from_freelist
                            - 1.77% wake_all_kswapds
    ....
                                        - __wake_up_common_lock
                                           - 0.94% _raw_spin_lock_irqsave
                         - 3.72% get_page_from_freelist
                            - 2.43% _raw_spin_lock_irqsave
          - 5.72% vmalloc
             - 5.72% __vmalloc_node_range
                - 4.81% __get_vm_area_node.constprop.0
                   - 3.26% alloc_vmap_area
                      - 2.52% _raw_spin_lock
                   - 1.46% _raw_spin_lock
                  0.56% __alloc_pages_bulk
          - 4.66% kvfree
             - 3.25% vfree
                - __vfree
                   - 3.23% __vunmap
                      - 1.95% remove_vm_area
                         - 1.06% free_vmap_area_noflush
                            - 0.82% _raw_spin_lock
                         - 0.68% _raw_spin_lock
                      - 0.92% _raw_spin_lock
             - 1.40% kfree
                - 1.36% __free_pages
                   - 1.35% __free_pages_ok
                      - 1.02% _raw_spin_lock_irqsave

    It's worth noting that over 50% of the CPU time spent allocating
    these shadow buffers is now spent on spinlocks. So the shadow buffer
    allocation overhead is greatly reduced by getting rid of direct
    reclaim from kmalloc, and could probably be made even less costly if
    vmalloc() didn't use global spinlocks to protect its structures.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:46 -05:00
Bill O'Donnell 9547cf1eec xfs: prevent UAF in xfs_log_item_in_current_chkpt
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2167832

Conflicts: out-of-order patch f24fb058d requires struct member naming change.
	   Remove unused variable 'ctx' in xfs_log_item_in_current_chkpt.

commit f8d92a66e810acbef6ddbc0bd0cbd9b117ce8acd
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Wed Dec 15 11:53:15 2021 -0800

    xfs: prevent UAF in xfs_log_item_in_current_chkpt

    While I was running with KASAN and lockdep enabled, I stumbled upon a
    KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
    comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
    that the original patch to xfs_defer_finish_noroll should have done
    something to lock the CIL to prevent it from switching the CIL contexts
    while the predicate runs.

    For upper level code that needs to know if a given log item is new
    enough not to need relogging, add a new wrapper that takes the CIL
    context lock long enough to sample the current CIL context.  This is
    kind of racy in that the CIL can switch the contexts immediately after
    sampling, but that's ok because the consequence is that the defer ops
    code is a little slow to relog items.
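
    A sketch of the wrapper as described (struct and field names are
    assumed from the surrounding CIL code; the backport note above
    already flags member naming differences):

        bool
        xfs_log_item_in_current_chkpt(struct xfs_log_item *lip)
        {
                struct xfs_cil  *cil = lip->li_mountp->m_log->l_cilp;
                bool            in_chkpt;

                if (list_empty(&lip->li_cil))
                        return false;

                /*
                 * Hold the CIL context lock so the push code cannot
                 * switch (and later free) the context while we sample
                 * its sequence number.
                 */
                down_read(&cil->xc_ctx_lock);
                in_chkpt = (lip->li_seq == cil->xc_ctx->sequence);
                up_read(&cil->xc_ctx_lock);

                return in_chkpt;
        }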

     ==================================================================
     BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
     Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999

     CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
     Call Trace:
      <TASK>
      dump_stack_lvl+0x45/0x59
      print_address_description.constprop.0+0x1f/0x140
      kasan_report.cold+0x83/0xdf
      xfs_log_item_in_current_chkpt+0x139/0x160
      xfs_defer_finish_noroll+0x3bb/0x1e30
      __xfs_trans_commit+0x6c8/0xcf0
      xfs_reflink_remap_extent+0x66f/0x10e0
      xfs_reflink_remap_blocks+0x2dd/0xa90
      xfs_file_remap_range+0x27b/0xc30
      vfs_dedupe_file_range_one+0x368/0x420
      vfs_dedupe_file_range+0x37c/0x5d0
      do_vfs_ioctl+0x308/0x1260
      __x64_sys_ioctl+0xa1/0x170
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae
     RIP: 0033:0x7f2c71a2950b
     Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
    ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
    f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
     RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
     RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
     RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
     RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
     R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
     R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
      </TASK>

     Allocated by task 464064:
      kasan_save_stack+0x1e/0x50
      __kasan_kmalloc+0x81/0xa0
      kmem_alloc+0xcd/0x2c0 [xfs]
      xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
      xlog_cil_push_work+0x141/0x13d0 [xfs]
      process_one_work+0x7f6/0x1380
      worker_thread+0x59d/0x1040
      kthread+0x3b0/0x490
      ret_from_fork+0x1f/0x30

     Freed by task 51:
      kasan_save_stack+0x1e/0x50
      kasan_set_track+0x21/0x30
      kasan_set_free_info+0x20/0x30
      __kasan_slab_free+0xed/0x130
      slab_free_freelist_hook+0x7f/0x160
      kfree+0xde/0x340
      xlog_cil_committed+0xbfd/0xfe0 [xfs]
      xlog_cil_process_committed+0x103/0x1c0 [xfs]
      xlog_state_do_callback+0x45d/0xbd0 [xfs]
      xlog_ioend_work+0x116/0x1c0 [xfs]
      process_one_work+0x7f6/0x1380
      worker_thread+0x59d/0x1040
      kthread+0x3b0/0x490
      ret_from_fork+0x1f/0x30

     Last potentially related work creation:
      kasan_save_stack+0x1e/0x50
      __kasan_record_aux_stack+0xb7/0xc0
      insert_work+0x48/0x2e0
      __queue_work+0x4e7/0xda0
      queue_work_on+0x69/0x80
      xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
      xlog_cil_force_seq+0x1b7/0x850 [xfs]
      xfs_log_force_seq+0x1c7/0x670 [xfs]
      xfs_file_fsync+0x7c1/0xa60 [xfs]
      __x64_sys_fsync+0x52/0x80
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae

     The buggy address belongs to the object at ffff88804ea5f600
      which belongs to the cache kmalloc-256 of size 256
     The buggy address is located 8 bytes inside of
      256-byte region [ffff88804ea5f600, ffff88804ea5f700)
     The buggy address belongs to the page:
     page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
     head:ffffea00013a9780 order:1 compound_mapcount:0
     flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
     raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
     raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
     page dumped because: kasan: bad access detected

     Memory state around the buggy address:
      ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
      ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ==================================================================

    Fixes: 4e919af782 ("xfs: periodically relog deferred intent items")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
2023-05-18 11:10:46 -05:00
Brian Foster f24fb058dd xfs: log items should have a xlog pointer, not a mount
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit d86142dd7c4e10e50bdb3679b405d748214b2c28
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:12 2022 -0700

    xfs: log items should have a xlog pointer, not a mount

    Log items belong to the log, not the xfs_mount. Convert the mount
    pointer in the log item to a xlog pointer in preparation for
    upcoming log centric changes to the log items.
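
    The structural change is a one-liner (sketch):

        struct xfs_log_item {
                ...
        -       struct xfs_mount        *li_mountp;     /* ptr to fs mount */
        +       struct xlog             *li_log;
                struct xfs_ail          *li_ailp;       /* ptr to AIL */
                ...
        };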

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster 83a6756e24 xfs: async CIL flushes need pending pushes to be made stable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 70447e0ad9781f84e60e0990888bd8c84987f44e
Author: Dave Chinner <dchinner@redhat.com>
Date:   Thu Mar 17 09:09:11 2022 -0700

    xfs: async CIL flushes need pending pushes to be made stable

    When the AIL tries to flush the CIL, it relies on the CIL push
    ending up on stable storage without having to wait for and
    manipulate iclog state directly. However, if there is already a
    pending CIL push when the AIL tries to flush the CIL, it won't set
    the cil->xc_push_commit_stable flag and so the CIL push will not
    actively flush the commit record iclog.

    generic/530 when run on a single CPU test VM can trigger this fairly
    reliably. This test exercises unlinked inode recovery, and can
    result in inodes being pinned in memory by ongoing modifications to
    the inode cluster buffer to record unlinked list modifications. As a
    result, the first inode unlinked in a buffer can pin the tail of the
    log whilst the inode cluster buffer is pinned by the current
    checkpoint that has been pushed but isn't on stable storage because
    the cil->xc_push_commit_stable flag was not set. This results in
    the log/AIL effectively deadlocking until something triggers the
    commit record iclog to be pushed to stable storage (i.e. the
    periodic log worker calling xfs_log_force()).

    The fix is two-fold - first we should always set the
    cil->xc_push_commit_stable when xlog_cil_flush() is called,
    regardless of whether there is already a pending push or not.

    Second, if the CIL is empty, we should trigger an iclog flush to
    ensure that the iclogs of the last checkpoint have actually been
    submitted to disk as that checkpoint may not have been run under
    stable completion constraints.
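
    The resulting flush path looks roughly like this (a sketch of the
    behaviour described above; details assumed):

        void
        xlog_cil_flush(struct xlog *log)
        {
                xfs_csn_t       seq = log->l_cilp->xc_current_sequence;

                trace_xfs_log_force(log->l_mp, seq, _RET_IP_);

                /*
                 * Always ask for a stable commit record, even if a push
                 * for this sequence is already in progress.
                 */
                xlog_cil_push_now(log, seq, true);

                /*
                 * If the CIL is empty, make sure that any previous
                 * checkpoint that may still be in an active iclog is
                 * pushed to stable storage.
                 */
                if (list_empty(&log->l_cilp->xc_cil))
                        xfs_log_force(log->l_mp, 0);
        }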

    Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
    Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:37 -04:00
Brian Foster 6def1029c3 xfs: convert mount flags to features
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git
Conflicts: Work around out of order backport in xfs_fs_fill_super().

commit 0560f31a09e523090d1ab2bfe21c69d028c2bdf2
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Aug 18 18:46:52 2021 -0700

    xfs: convert mount flags to features

    Replace m_flags feature checks with xfs_has_<feature>() calls and
    rework the setup code to set flags in m_features.
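
    The checks end up generated along these lines (a sketch of the
    helper pattern; the exact macro is assumed):

        #define __XFS_HAS_FEAT(name, NAME) \
        static inline bool xfs_has_ ## name (struct xfs_mount *mp) \
        { \
                return mp->m_features & XFS_FEAT_ ## NAME; \
        }

        __XFS_HAS_FEAT(realtime, REALTIME)
        __XFS_HAS_FEAT(reflink, REFLINK)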

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:34 -04:00
Brian Foster c6a693990a xfs: move the CIL workqueue to the CIL
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 33c0dd7898a11ef19169abe5c5049fa6aa099c64
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:45 2021 -0700

    xfs: move the CIL workqueue to the CIL

    We only use the CIL workqueue in the CIL, so it makes no sense to
    hang it off the xfs_mount and have to walk multiple pointers back up
    to the mount when we have the CIL structures right there.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:28 -04:00
Brian Foster 7dd12e5f12 xfs: CIL work is serialised, not pipelined
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 39823d0fac9416cb89c252d78e262ee8cd76a7d8
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:45 2021 -0700

    xfs: CIL work is serialised, not pipelined

    Because we use a single work structure attached to the CIL rather
    than the CIL context, we can only queue a single work item at a
    time. This results in the CIL being single threaded and limits
    performance when it becomes CPU bound.

    The design of the CIL is that it is pipelined and multiple commits
    can be running concurrently, but the way the work is currently
    implemented means that it is not pipelining as it was intended. The
    critical work to switch the CIL context can take a few milliseconds
    to run, but the rest of the CIL context flush can take hundreds of
    milliseconds to complete. The context switching is the serialisation
    point of the CIL; once the context has been switched, the rest of the
    context push can run asynchronously with all other context pushes.

    Hence we can move the work to the CIL context so that we can run
    multiple CIL pushes at the same time and spread the majority of
    the work out over multiple CPUs. We can keep the per-cpu CIL commit
    state on the CIL rather than the context, because the context is
    pinned to the CIL until the switch is done and we aggregate and
    drain the per-cpu state held on the CIL during the context switch.

    However, because we no longer serialise the CIL work, we can have
    effectively unlimited CIL pushes in progress. We don't want to do
    this - not only does it create contention on the iclogs and the
    state machine locks, we can run the log right out of space with
    outstanding pushes. Instead, limit the work concurrency to 4
    concurrent work items being processed at a time. This is enough
    concurrency to remove the CIL from being a CPU bound bottleneck but
    not enough to create new contention points or unbound concurrency
    issues.
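
    The shape of the change, sketched (names follow the surrounding
    series; the bound on concurrency is the important part):

        /* Each checkpoint context carries its own work item... */
        INIT_WORK(&ctx->push_work, xlog_cil_push_work);
        queue_work(cil->xc_push_wq, &ctx->push_work);

        /* ...and the CIL workqueue caps concurrent pushes at 4. */
        cil->xc_push_wq = alloc_workqueue("xfs-cil/%s",
                        XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
                        4, mp->m_super->s_id);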

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:28 -04:00
Brian Foster 2151f12a8d xfs: AIL needs asynchronous CIL forcing
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 0020a190cf3eac16995143db41b21b82bacdcbe3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:44 2021 -0700

    xfs: AIL needs asynchronous CIL forcing

    The AIL pushing is stalling on log forces when it comes across
    pinned items. This is happening on removal workloads where the AIL
    is dominated by stale items that are removed from AIL when the
    checkpoint that marks the items stale is committed to the journal.
    This results in relatively few items in the AIL, but those that
    remain are often pinned, as the directories the items are being
    removed from are still being logged.

    As a result, many push cycles through the CIL will first issue a
    blocking log force to unpin the items. This can take some time to
    complete, with tracing regularly showing push delays of half a
    second and sometimes up into the range of several seconds. Sequences
    like this aren't uncommon:

    ....
     399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 270ms delay>
     400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
     400.099623:  xfsaild: first lsn 0x11002f3600
     400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
    <wanted 50ms, got 500ms delay>
     400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
     400.589349:  xfsaild: first lsn 0x1100305000
     400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
    <wanted 50ms, got 460ms delay>
     400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
     400.950343:  xfsaild: first lsn 0x1100317c00
     400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 200ms delay>
     401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
     401.142334:  xfsaild: first lsn 0x110032e600
     401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
    <wanted 10ms, got 10ms delay>
     401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
     401.154328:  xfsaild: first lsn 0x1100353000
     401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 300ms delay>
     401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
     401.451526:  xfsaild: first lsn 0x1100353000
     401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
    <wanted 50ms, got 500ms delay>
     401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    ....

    In each of these cases, every AIL pass saw 101 log items stuck on
    the AIL (pinned) with very few other items being found. Each pass, a
    log force was issued, and the delay between last/first is the sleep time
    + the sync log force time.

    Some of these 101 items pinned the tail of the log. The tail of the
    log does slowly creep forward (first lsn), but the problem is that
    the log is actually out of reservation space because it's been
    running so many transactions that generate stale items that never
    reach the AIL but still consume log space. Hence we have a largely
    empty AIL, with
    long term pins on items that pin the tail of the log that don't get
    pushed frequently enough to keep log space available.

    The problem is the hundreds of milliseconds that we block in the log
    force pushing the CIL out to disk. The AIL should not be stalled
    like this - it needs to run and flush items that are at the tail of
    the log with minimal latency. What we really need to do is trigger a
    log flush, but then not wait for it at all - we've already done our
    waiting for stuff to complete when we backed off prior to the log
    force being issued.

    Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
    still do a blocking flush of the CIL and that is what is causing the
    issue. Hence we need a new interface for the CIL to trigger an
    immediate background push of the CIL to get it moving faster but not
    to wait on that to occur. While the CIL is pushing, the AIL can also
    be pushing.

    We already have an internal interface to do this -
    xlog_cil_push_now() - but we need a wrapper for it to be used
    externally. xlog_cil_force_seq() can easily be extended to do what
    we need as it already implements the synchronous CIL push via
    xlog_cil_push_now(). Add the necessary flags and "push current
    sequence" semantics to xlog_cil_force_seq() and convert the AIL
    pushing to use it.

    One of the complexities here is that the CIL push does not guarantee
    that the commit record for the CIL checkpoint is written to disk.
    The current log force ensures this by submitting the current ACTIVE
    iclog that the commit record was written to. We need the CIL to
    actually write this commit record to disk for an async push to
    ensure that the checkpoint actually makes it to disk and unpins the
    pinned items in the checkpoint on completion. Hence we need to pass
    down to the CIL push that we are doing an async flush so that it can
    switch out the commit_iclog if necessary to get written to disk when
    the commit iclog is finally released.
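
    With that wrapper in place, the AIL side becomes an async flush
    rather than a blocking force (a sketch of the assumed xfsaild call
    site):

        -       xfs_log_force(mp, XFS_LOG_SYNC);
        +       xlog_cil_flush(mp->m_log);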

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster 5f19bac7a4 xfs: order CIL checkpoint start records
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 68a74dcae6737c27b524b680e070fe41f0cad43a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:44 2021 -0700

    xfs: order CIL checkpoint start records

    Because log recovery depends on strictly ordered start records as
    well as strictly ordered commit records.

    This is a zero day bug in the way XFS writes pipelined transactions
    to the journal which is exposed by fixing the zero day bug that
    prevents the CIL from pipelining checkpoints. This re-introduces
    explicit concurrent commits back into the on-disk journal and hence
    out of order start records.

    The XFS journal commit code has never ordered start records and we
    have relied on strict commit record ordering for correct recovery
    ordering of concurrently written transactions. Unfortunately, root
    cause analysis uncovered the fact that log recovery uses the LSN of
    the start record for transaction commit processing. Hence, whilst
    the commits are processed in strict order by recovery, the LSNs
    associated with the commits can be out of order and so recovery may
    stamp incorrect LSNs into objects and/or misorder intents in the AIL
    for later processing. This can result in log recovery failures
    and/or on disk corruption, sometimes silent.

    Because this is a long standing log recovery issue, we can't just
    fix log recovery and call it good. This still leaves older kernels
    susceptible to recovery failures and corruption when replaying a log
    from a kernel that pipelines checkpoints. There is also the issue
    that in-memory ordering for AIL pushing and data integrity
    operations are based on checkpoint start LSNs, and if the start LSN
    is incorrect in the journal, it is also incorrect in memory.

    Hence there's really only one choice for fixing this zero-day bug:
    we need to strictly order checkpoint start records in ascending
    sequence order in the log, the same way we already strictly order
    commit records.
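
    The ordering rule, sketched (modelled on the existing commit
    record wait loop, with a new start record wait queue assumed):

        spin_lock(&cil->xc_push_lock);
        list_for_each_entry(ctx, &cil->xc_committing, committing) {
                /* Higher and equal sequences wait for us, not vice versa. */
                if (ctx->sequence >= sequence)
                        continue;
                /* Sleep until the earlier start record has an LSN. */
                if (!ctx->start_lsn) {
                        xlog_wait(&cil->xc_start_wait, &cil->xc_push_lock);
                        goto restart;
                }
        }
        spin_unlock(&cil->xc_push_lock);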

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster 386919d0f5 xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit caa80090d17c89d0caca1dcb4c8a9cdef5335e71
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:43 2021 -0700

    xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()

    Now that we have a mechanism to guarantee that the callbacks
    attached to an iclog are owned by the context that attaches them
    until they drop their reference to the iclog via
    xlog_state_release_iclog(), we can attach callbacks to the iclog at
    any time we have an active reference to the iclog.

    xlog_state_get_iclog_space() always guarantees that the commit
    record will fit in the iclog it returns, so we can move this IO
    callback setting to xlog_cil_set_ctx_write_state(), record the
    commit iclog in the context and remove the need for the commit iclog
    to be returned by xlog_write() altogether.

    This, in turn, allows us to move the wakeup for ordered commit
    record writes up into xlog_cil_set_ctx_write_state(), too, because
    we have been guaranteed that this commit record will be physically
    located in the iclog before any waiting commit record at a higher
    sequence number will be granted iclog space.

    This further cleans up the post commit record write processing in
    the CIL push code, especially as xlog_state_release_iclog() will now
    clean up the context when shutdown errors occur.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster 49e3964f18 xfs: factor out log write ordering from xlog_cil_push_work()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit bf034bc827807ac15affa051e6a94b03f93b1a03
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:43 2021 -0700

    xfs: factor out log write ordering from xlog_cil_push_work()

    So we can use it for start record ordering as well as commit record
    ordering in future.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster 7a8b0cc2c0 xfs: pass a CIL context to xlog_write()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit c45aba40cf5b2988c0bebee8c9b846c88aa651eb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:42 2021 -0700

    xfs: pass a CIL context to xlog_write()

    Pass the CIL context to xlog_write() rather than a pointer to a LSN
    variable. Only the CIL checkpoint calls to xlog_write() need to know
    about the start LSN of the writes, so rework xlog_write to directly
    write the LSNs into the CIL context structure.

    This removes the commit_lsn variable from xlog_cil_push_work(), so
    now we only have to issue the commit record ordering wakeup from
    there.
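
    Sketch of the interface change (argument order assumed):

        -       error = xlog_write(log, &lvhdr, tic, &commit_lsn, NULL,
        -                       XLOG_COMMIT_TRANS);
        +       error = xlog_write(log, ctx, &lvhdr, tic, XLOG_COMMIT_TRANS);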

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster f232f007de xfs: move xlog_commit_record to xfs_log_cil.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2ce82b722de980deef809438603b7e95156d3818
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:42 2021 -0700

    xfs: move xlog_commit_record to xfs_log_cil.c

    It is only used by the CIL checkpoints, and is the counterpart to
    start record formatting and writing that is already local to
    xfs_log_cil.c.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:27 -04:00
Brian Foster cafcd92fda xfs: don't run shutdown callbacks on active iclogs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 502a01fac0983406e7c312764d7a03e06b3d0748
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:41 2021 -0700

    xfs: don't run shutdown callbacks on active iclogs

    When the log is shutdown, it currently walks all the iclogs and runs
    callbacks that are attached to the iclogs, regardless of whether the
    iclog is queued for IO completion or not. This creates a problem for
    contexts attaching callbacks to iclogs in that a racing shutdown can
    run the callbacks even before the attaching context has finished
    processing the iclog and releasing it for IO submission.

    If the callback processing of the iclog frees the structure that is
    attached to the iclog, then this leads to an UAF scenario that can
    only be protected against by holding the icloglock from the point
    callbacks are attached through to the release of the iclog. While we
    currently do this, it is not practical or sustainable.

    Hence we need to make shutdown processing the responsibility of the
    context that holds active references to the iclog. We know that the
    contexts attaching callbacks to the iclog must have active
    references to the iclog, and that means they must be in either
    ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
    iclogs in these states -except- when the log is shut down.

    xlog_state_do_callback() checks the state of the iclogs while
    holding the icloglock, therefore the reference count/state change
    that occurs in xlog_state_release_iclog() after the callbacks are
    atomic w.r.t. shutdown processing.

    We can't push the responsibility of callback cleanup onto the CIL
    context because we can have ACTIVE iclogs that have callbacks
    attached that have already been released. Hence we really need to
    internalise the cleanup of callbacks into xlog_state_release_iclog()
    processing.

    Indeed, we already have that internalisation via:

    xlog_state_release_iclog
      drop last reference
        ->SYNCING
      xlog_sync
        xlog_write_iclog
          if (log_is_shutdown)
            xlog_state_done_syncing()
              xlog_state_do_callback()
                <process shutdown on iclog that is now in SYNCING state>

    The problem is that xlog_state_release_iclog() aborts before doing
    anything if the log is already shut down. It assumes that the
    callbacks have already been cleaned up, and it doesn't need to do
    any cleanup.

    Hence the fix is to remove the xlog_is_shutdown() check from
    xlog_state_release_iclog() so that reference counts are correctly
    released from the iclogs, and when the reference count is zero we
    always transition to SYNCING if the log is shut down. Hence we'll
    always enter the xlog_sync() path in a shutdown and eventually end
    up erroring out the iclog IO and running xlog_state_do_callback() to
    process the callbacks attached to the iclog.

    This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
    directly in the shutdown code, and in doing so gets rid of the UAF
    vector that currently exists. This then decouples the adding of
    callbacks to the iclogs from xlog_state_release_iclog() as we
    guarantee that xlog_state_release_iclog() will process the callbacks
    if the log has been shut down before xlog_state_release_iclog() has
    been called.
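
    Sketch of the fix (context abridged; the function's signature
    varies across versions):

        int
        xlog_state_release_iclog(...)
        {
                lockdep_assert_held(&log->l_icloglock);
        -       if (xlog_is_shutdown(log))
        -               return -EIO;
                if (!atomic_dec_and_test(&iclog->ic_refcnt))
                        return 0;
                /* last reference: move to SYNCING and call xlog_sync() */
                ...
        }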

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:26 -04:00
Brian Foster c3bd075896 xfs: XLOG_STATE_IOERROR must die
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 5112e2067bd94bd56aace4c7e4d45ff13b9152f8
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:01 2021 -0700

    xfs: XLOG_STATE_IOERROR must die

    We don't need an iclog state field to tell us the log has been shut
    down. We can just check xlog_is_shutdown() instead. This avoids
    the need to have shutdown overwrite the current iclog state while it
    is being actively used by the log code, and so having to ensure that
    every iclog state check handles XLOG_STATE_IOERROR appropriately.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 156272e64a xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit 2039a272300b949c05888428877317b834c0b1fb
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 17:59:01 2021 -0700

    xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown()

    Make it less shouty and a static inline before adding more calls
    through the log code.

    Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
    to use xlog_is_shutdown(log) as well.
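
    A minimal sketch of the helper (the exact flag and field are
    assumed from later kernels):

        static inline bool
        xlog_is_shutdown(struct xlog *log)
        {
                return test_bit(XLOG_IO_ERROR, &log->l_opstate);
        }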

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:25 -04:00
Brian Foster 8bf8cc906b xfs: replace kmem_alloc_large() with kvmalloc()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083143
Upstream Status: linux.git

commit d634525db63e9e946c3229fb93c8d9b763afbaf3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Aug 9 10:10:01 2021 -0700

    xfs: replace kmem_alloc_large() with kvmalloc()

    There is no reason for this wrapper existing anymore. All the places
    that use KM_NOFS allocation are within transaction contexts and
    hence covered by memalloc_nofs_save/restore contexts. Hence we don't
    need any special handling of vmalloc for large IOs anymore and
    so special casing this code isn't necessary.
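
    Callers change mechanically, along these lines (a sketch, not an
    actual hunk from the patch):

        -       lv = kmem_alloc_large(buf_size, KM_NOFS);
        +       lv = kvmalloc(buf_size, GFP_KERNEL);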

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Brian Foster <bfoster@redhat.com>
2022-08-25 08:11:24 -04:00
Ming Lei ca92d6e1cc block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917
Conflicts: drop change on ntfs3

commit 44abff2c0b970ae3d310b97617525dc01f248d7c
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Apr 15 06:52:57 2022 +0200

    block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD

    Secure erase is a very different operation from discard in that it is
    a data integrity operation vs hint.  Fully split the limits and helper
    infrastructure to make the separation more clear.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
    Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2]
    Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs]
    Acked-by: Coly Li <colyli@suse.de> [bcache]
    Acked-by: David Sterba <dsterba@suse.com> [btrfs]
    Acked-by: Chao Yu <chao@kernel.org>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:58:02 +08:00
Dave Chinner 0dc8f7f139 xfs: fix ordering violation between cache flushes and tail updates
There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop and eventually log recovery fails with a failures such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA @ write 124 and the FUA at write @
125, but log recovery @ 125 failed. The difference was the one log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
and so the inode create intent in (1,8) was not replayed and so the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.

What this meant was that the journal write that occurred at @ 125
did not ensure that metadata that completed before the iclog was written
was correctly on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:

CIL push work				AIL push work
-------------				-------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  ....					push inode create buffer
					<start IO>
					.....
xlog_write(commit record)
  ....					<IO completes>
  					log tail moves
  					  xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
on stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. This requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.
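
Sketch of the check in the release path (shape assumed):

    /*
     * old_tail_lsn is the tail sampled before the async cache flush
     * was started; if the tail has since moved, the commit iclog must
     * issue its own cache flush before it is written.
     */
    if (old_tail_lsn &&
        old_tail_lsn != xlog_assign_tail_lsn(log->l_mp))
            iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;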

Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner 1effb72a81 xfs: don't wait on future iclogs when pushing the CIL
The iclogbuf ring attached to the struct xlog is circular, hence the
first and last iclogs in the ring can only be determined by
comparing them against the log->l_iclog pointer.

In xfs_cil_push_work(), we want to wait on previous iclogs that were
issued so that we can flush them to stable storage with the commit
record write, and it simply waits on the previous iclog in the ring.
This, however, leads to CIL push hangs in generic/019 like so:

task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
Workqueue: xfs-cil/pmem1 xlog_cil_push_work
Call Trace:
 __schedule+0x30b/0x9f0
 schedule+0x68/0xe0
 xlog_wait_on_iclog+0x121/0x190
 ? wake_up_q+0xa0/0xa0
 xlog_cil_push_work+0x994/0xa10
 ? _raw_spin_lock+0x15/0x20
 ? xfs_swap_extents+0x920/0x920
 process_one_work+0x1ab/0x390
 worker_thread+0x56/0x3d0
 ? rescuer_thread+0x3c0/0x3c0
 kthread+0x14d/0x170
 ? __kthread_bind_mask+0x70/0x70
 ret_from_fork+0x1f/0x30

With other threads blocking in either xlog_state_get_iclog_space()
waiting for iclog space or xlog_grant_head_wait() waiting for log
reservation space.

The problem here is that the previous iclog on the ring might
actually be a future iclog. That is, if log->l_iclog points at
commit_iclog, commit_iclog is the first (oldest) iclog in the ring
and there are no previous iclogs pending as they have all completed
their IO and been activated again. IOWs, commit_iclog->ic_prev
points to an iclog that will be written in the future, not one that
has been written in the past.

Hence, in this case, waiting on the ->ic_prev iclog is incorrect
behaviour, and depending on the state of the future iclog, we can
end up with a circular ABA wait cycle and we hang.

The fix is made more complex by the fact that many iclogs states
cannot be used to determine if the iclog is a past or future iclog.
Hence we have to determine past iclogs by checking the LSN of the
iclog rather than their state. A past ACTIVE iclog will have a LSN
of zero, while a future ACTIVE iclog will have a LSN greater than
the current iclog. We don't wait on either of these cases.

Similarly, a future iclog that hasn't completed IO will have an LSN
greater than the current iclog and so we don't wait on them. A past
iclog that is still undergoing IO completion will have a LSN less
than the current iclog and those are the only iclogs that we need to
wait on.

Hence we can use the iclog LSN to determine what iclogs we need to
wait on here.
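
Sketch of the wait decision (assumed context: the commit record path
in xlog_cil_push_work(), comparing the previous iclog's LSN with this
checkpoint's commit LSN):

    plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
    if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
            /* A past iclog still under IO completion - wait for it. */
            xlog_wait_on_iclog(commit_iclog->ic_prev);
    }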

Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-25 14:02:02 -07:00
Dave Chinner a1bb8505e9 xfs: Fix a CIL UAF by getting rid of the iclog callback lock
The iclog callback chain has its own lock. That was added way back
in 2008 by myself to alleviate severe lock contention on the
icloglock in commit 114d23aae5 ("[XFS] Per iclog callback chain
lock"). This was long before delayed logging took the icloglock out
of the hot transaction commit path and removed all contention on it.
Hence the separate ic_callback_lock doesn't serve any scalability
purpose anymore, and hasn't for close on a decade.

Further, we only attach callbacks to iclogs in one place where we
are already taking the icloglock soon after attaching the callbacks.
We also have to drop the icloglock to run callbacks and grab it
immediately afterwards again. So given that the icloglock is no
longer hot, making it cover callbacks again doesn't really change
the locking patterns very much at all.

We also need to extend the icloglock to cover callback addition to
fix a zero-day UAF in the CIL push code. This occurs when shutdown
races with xlog_cil_push_work() and the shutdown runs the callbacks
before the push releases the iclog. This results in the CIL context
structure attached to the iclog being freed by the callback before
the CIL push has finished referencing it, leading to UAF bugs.

Hence, to avoid this UAF, we need the callback attachment to be
atomic with post processing of the commit iclog and references to
the structures being attached to the iclog. This requires holding
the icloglock as that's the only way to serialise iclog state
against a shutdown in progress.

The result is we need to be using the icloglock to protect the
callback list addition and removal and serialise them with shutdown.
That makes the ic_callback_lock redundant and so it can be removed.

Fixes: 71e330b593 ("xfs: Introduce delayed logging core code")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-25 11:21:39 -07:00
Dave Chinner 5f9b4b0de8 xfs: xfs_log_force_lsn isn't passed a LSN
In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This lead me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.
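
Sketch of the cleaned up API (signatures assumed):

    typedef uint64_t        xfs_csn_t;      /* CIL sequence number */

    int             xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq,
                            uint flags, int *log_flushed);
    xfs_lsn_t       xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);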

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21 10:12:33 -07:00