1449 Commits

Author SHA1 Message Date
Nigel Croxon bf2723f450 md: fix mddev uaf while iterating all_mddevs list
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 8542870237c3a48ff049b6c5df5f50c8728284fa
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Feb 20 20:43:48 2025 +0800

    md: fix mddev uaf while iterating all_mddevs list

    While iterating all_mddevs list from md_notify_reboot() and md_exit(),
    list_for_each_entry_safe is used, and this can race with deleting the
    next mddev, causing UAF:

    t1:
    spin_lock
    //list_for_each_entry_safe(mddev, n, ...)
     mddev_get(mddev1)
     // assume mddev2 is the next entry
     spin_unlock
                t2:
                //remove mddev2
                ...
                mddev_free
                spin_lock
                list_del
                spin_unlock
                kfree(mddev2)
     mddev_put(mddev1)
     spin_lock
     //continue dereference mddev2->all_mddevs

    The old helper for_each_mddev() actually grabs a reference to mddev2
    while holding the lock, to prevent it from being freed. This problem can
    be fixed the same way, however, the code would be complex.

    Hence switch to list_for_each_entry. In this case mddev_put() can free
    mddev1, which is not safe either. Following md_seq_show(), factor out a
    helper mddev_put_locked() to fix this problem.
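
The iteration pattern described above can be sketched in userspace as follows. This is a minimal single-threaded model, not the kernel code: the lock is just a flag, and struct dev, dev_put_locked() and visit_all() are illustrative names assumed for this example. It shows the two properties the fix relies on: the reference is dropped while the lock is held, and the next pointer is only read under the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mddev: a refcounted node on a list. */
struct dev {
    struct dev *next;   /* models the all_mddevs linkage */
    int active;         /* models the reference count */
    bool deleted;       /* set once the last reference is dropped */
    int visited;        /* bookkeeping for this example only */
};

static int lock_held;   /* models all_mddevs_lock as a simple flag */

static void list_lock(void)   { assert(!lock_held); lock_held = 1; }
static void list_unlock(void) { assert(lock_held);  lock_held = 0; }

/*
 * Drop a reference while the caller already holds the lock. The node is
 * only marked deleted here; unlinking and freeing are left to a later
 * step, so the caller can still read d->next before unlocking.
 */
static void dev_put_locked(struct dev *d)
{
    assert(lock_held);
    if (--d->active == 0)
        d->deleted = true;
}

/* Visit every live node; returns the number of nodes visited. */
static int visit_all(struct dev *head)
{
    int count = 0;
    struct dev *d;

    list_lock();
    for (d = head; d; d = d->next) {   /* d->next is read under the lock */
        if (d->deleted)
            continue;
        d->active++;                   /* pin the current node */
        list_unlock();

        d->visited++;                  /* work that must run unlocked */
        count++;

        list_lock();
        dev_put_locked(d);             /* unpin without dropping the lock */
    }
    list_unlock();
    return count;
}
```

Because the lock is never released between dev_put_locked() and the advance to d->next, there is no window in which another thread could free the next entry.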

    Cc: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com
    Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot")
    Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit")
    Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org>
    Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon b637f08a77 md: switch md-cluster to use md_submodle_head
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 87a86277c9f54953e184318bf71630388aeaf000
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:25 2025 +0800

    md: switch md-cluster to use md_submodle_head

    To make code cleaner, and prepare to add kconfig for bitmap.

    Also remove the unused global variables pers_lock, md_cluster_ops and
    md_cluster_mod, and exported symbols register_md_cluster_operations(),
    unregister_md_cluster_operations() and md_cluster_ops.

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Su Yue <glass.su@suse.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon f1ba1477bf md: switch personalities to use md_submodule_head
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 3d44e1d1575a877cf75a7776802506ce7ab8ecc4
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:22 2025 +0800

md: switch personalities to use md_submodule_head

Remove the global list 'pers_list', and switch to use md_submodule_head,
which is managed by xarray. Prepare to unify registration and unregistration
for all sub modules.
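
As a rough illustration of an ID-indexed registry replacing a global list, here is a minimal userspace sketch. A fixed array stands in for the xarray, and the enum values, struct layout and function names are assumptions made for this example rather than the kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical IDs; the real kernel enum may differ. */
enum submodule_id { ID_LINEAR, ID_RAID0, ID_RAID1, ID_CLUSTER, ID_MAX };

/* Hypothetical common header embedded in every sub module. */
struct submodule_head {
    enum submodule_id id;
    const char *name;
};

/* Fixed array standing in for the xarray: one slot per ID. */
static struct submodule_head *slots[ID_MAX];

/* Register a sub module; fails if the ID is invalid or already taken. */
static int register_submodule(struct submodule_head *msh)
{
    int id = (int)msh->id;

    if (id < 0 || id >= ID_MAX)
        return -EINVAL;
    if (slots[id])
        return -EBUSY;
    slots[id] = msh;
    return 0;
}

/* Remove a previously registered sub module. */
static void unregister_submodule(struct submodule_head *msh)
{
    assert(slots[msh->id] == msh);
    slots[msh->id] = NULL;
}

/* Look up a sub module by ID; returns NULL if none is registered. */
static struct submodule_head *find_submodule(enum submodule_id id)
{
    int i = (int)id;

    return (i >= 0 && i < ID_MAX) ? slots[i] : NULL;
}
```

With one registry keyed by ID, personalities and other sub modules can share the same register and unregister path instead of each keeping its own list and lock.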

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
(cherry picked from commit 3d44e1d1575a877cf75a7776802506ce7ab8ecc4)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 50bfca5808 md: don't export md_cluster_ops
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit c594de0455b3d65525bad2020f7f7e41af233045
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:24 2025 +0800

    md: don't export md_cluster_ops

    Add a new field 'cluster_ops' and initialize it in md_setup_cluster(), so
    that the global variable 'md_cluster_ops' doesn't need to be exported.
    Also prepare to switch md-cluster to use md_submodule_head.

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Su Yue <glass.su@suse.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 9545adda24 md: introduce struct md_submodule_head and APIs
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit d3beb7c9c61d239e73cb93481b27c7b94130dd03
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:21 2025 +0800

    md: introduce struct md_submodule_head and APIs

    Prepare to unify registration and unregistration of md personalities
    and md-cluster, and prepare to add kconfig for md-bitmap.

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 7029fe28ea md: merge common code into find_pers()
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 9faab548974e3eb858250fea1ab7e823a689b44b
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:19 2025 +0800

    md: merge common code into find_pers()

    - pers_lock is held and released by the caller
    - try_module_get() is called by the caller
    - the error message is printed by the caller

    Merge the above code into find_pers() and rename it to get_pers(); also
    add put_pers() as a wrapper around module_put().
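
The shape of such a merged helper can be sketched in userspace as below. This is only a model under stated assumptions: the table, the refs and unloading fields (standing in for the module reference count and a failing try_module_get()), and the demo_ function names are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct pers {
    const char *name;
    int level;
    int refs;        /* models the module reference count */
    int unloading;   /* nonzero models try_module_get() failing */
};

static struct pers pers_table[] = {
    { "raid0", 0, 0, 0 },
    { "raid1", 1, 0, 0 },
    { "raid5", 5, 0, 1 },   /* pretend this module is being unloaded */
};

/*
 * Lookup, reference grab and error reporting in one place, so callers
 * no longer repeat these steps. The kernel version would take its lock
 * around the lookup; this single-threaded model omits it.
 */
static struct pers *demo_get_pers(int level, const char *clevel)
{
    struct pers *ret = NULL;
    size_t i;

    for (i = 0; i < sizeof(pers_table) / sizeof(pers_table[0]); i++) {
        struct pers *p = &pers_table[i];

        if (p->level == level || strcmp(p->name, clevel) == 0) {
            if (!p->unloading) {
                p->refs++;
                ret = p;
            }
            break;
        }
    }

    if (!ret)
        fprintf(stderr, "personality for level %s is not loaded\n", clevel);
    return ret;
}

/* Counterpart of demo_get_pers(): drop the reference taken there. */
static void demo_put_pers(struct pers *p)
{
    assert(p->refs > 0);
    p->refs--;
}
```

Callers then pair every successful demo_get_pers() with one demo_put_pers(), and the failure message is printed from a single location.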

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Su Yue <glass.su@suse.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 86315723dd md: ensure resync is prioritized over recovery
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 4b10a3bc67c1232f76aa1e04778ca26d6c0ddf7f
Author: Li Nan <linan122@huawei.com>
Date:   Thu Feb 13 21:15:30 2025 +0800

    md: ensure resync is prioritized over recovery

    If a new disk is added during resync, the resync process is interrupted,
    and recovery is triggered, causing the previous resync to be lost. In
    reality, disk addition should not terminate resync, fix it.

    Steps to reproduce the issue:
      mdadm -CR /dev/md0 -l1 -n3 -x1 /dev/sd[abcd]
      mdadm --fail /dev/md0 /dev/sdc

    Fixes: 24dd469d72 ("[PATCH] md: allow a manual resync with md")
    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/linux-raid/20250213131530.3698600-1-linan666@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon bbf142254a md: reintroduce md-linear
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 127186cfb184eaccdfe948e6da66940cfa03efc5
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Jan 2 19:28:41 2025 +0800

md: reintroduce md-linear

md-linear was removed by commit 849d18e27be9 ("md: Remove deprecated
CONFIG_MD_LINEAR") because it had been marked as deprecated for a long
time.

However, md-linear is widely used for underlying disks of different sizes.
Sadly we didn't know this until now, and it's truly useful to create
partitions, assemble multiple raid arrays and then append one to the other.

People have to use dm-linear in this case now, however, they would prefer
to minimize the number of modules involved.

Fixes: 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250102112841.1227111-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 127186cfb184eaccdfe948e6da66940cfa03efc5)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 897ed4ed38 md: Remove deprecated CONFIG_MD_LINEAR
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 849d18e27be9a1253f2318cb4549cc857219d991
Author: Song Liu <song@kernel.org>
Date:   Thu Dec 14 14:21:05 2023 -0800

md: Remove deprecated CONFIG_MD_LINEAR

md-linear has been marked as deprecated for 2.5 years. Remove it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org
(cherry picked from commit 849d18e27be9a1253f2318cb4549cc857219d991)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 392de5dc0b md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
JIRA: https://issues.redhat.com/browse/RHEL-73514

commit 8d28d0ddb986f56920ac97ae704cc3340a699a30
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Fri Jan 24 17:20:55 2025 +0800

    md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime

    After commit ec6bb299c7c3 ("md/md-bitmap: add 'sync_size' into struct
    md_bitmap_stats"), the following panic is reported:

    Oops: general protection fault, probably for non-canonical address
    RIP: 0010:bitmap_get_stats+0x2b/0xa0
    Call Trace:
     <TASK>
     md_seq_show+0x2d2/0x5b0
     seq_read_iter+0x2b9/0x470
     seq_read+0x12f/0x180
     proc_reg_read+0x57/0xb0
     vfs_read+0xf6/0x380
     ksys_read+0x6c/0xf0
     do_syscall_64+0x82/0x170
     entry_SYSCALL_64_after_hwframe+0x76/0x7e

    The root cause is that bitmap_get_stats() can be called at any time while
    the mddev still exists, even if the bitmap is destroyed or not fully
    initialized. Dereferencing the bitmap in this case can crash the kernel.
    Meanwhile, the above commit started dereferencing bitmap->storage, making
    the problem easier to trigger.

    Fix the problem by protecting bitmap_get_stats() with bitmap_info.mutex.
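
The idea of tying a stats reader to the bitmap lifetime can be sketched in userspace as below. This is a hedged model, not the kernel implementation: a pthread mutex stands in for bitmap_info.mutex, and struct array, array_get_stats() and array_set_bitmap() are names assumed for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct bitmap { unsigned long pages; };
struct bitmap_stats { unsigned long pages; };

struct array {
    pthread_mutex_t bitmap_mutex;  /* stands in for bitmap_info.mutex */
    struct bitmap *bitmap;         /* NULL before creation and after destroy */
};

/*
 * Read stats only while holding the same mutex that creation and
 * destruction take, and report -ENOENT when no bitmap is attached.
 */
static int array_get_stats(struct array *a, struct bitmap_stats *out)
{
    int ret = -ENOENT;

    assert(out != NULL);
    pthread_mutex_lock(&a->bitmap_mutex);
    if (a->bitmap) {
        out->pages = a->bitmap->pages;
        ret = 0;
    }
    pthread_mutex_unlock(&a->bitmap_mutex);
    return ret;
}

/* Attach or detach (b == NULL) the bitmap under the same mutex. */
static void array_set_bitmap(struct array *a, struct bitmap *b)
{
    pthread_mutex_lock(&a->bitmap_mutex);
    a->bitmap = b;
    pthread_mutex_unlock(&a->bitmap_mutex);
}
```

Since the pointer only changes under the mutex, a reader either sees a fully attached bitmap or none at all.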

    Cc: stable@vger.kernel.org # v6.12+
    Fixes: 32a7627cf3 ("[PATCH] md: optimised resync using Bitmap based intent logging")
    Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
    Closes: https://lore.kernel.org/linux-raid/ca3a91a2-50ae-4f68-b317-abd9889f3907@oracle.com/T/#m6e5086c95201135e4941fe38f9efa76daf9666c5
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20250124092055.4050195-1-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-10 10:28:20 -04:00
Nigel Croxon ce51da6c92 md/md-bitmap: move bitmap_{start, end}write to md upper layer
JIRA: https://issues.redhat.com/browse/RHEL-73514

commit cd5fc653381811f1e0ba65f5d169918cab61476f
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Jan 9 09:51:45 2025 +0800

    md/md-bitmap: move bitmap_{start, end}write to md upper layer

    There are two BUG reports that raid5 will hang at
    bitmap_startwrite([1],[2]). The root cause is that bitmap start write and
    end write are unbalanced, but it's not quite clear where. While reviewing
    the raid5 code, it was found that bitmap operations can be optimized. For
    example, for a 4-disk raid5 with chunksize=8k, if the user issues an IO
    (0 + 48k) to the array:

    ┌────────────────────────────────────────────────────────────┐
    │chunk 0                                                     │
    │      ┌────────────┬─────────────┬─────────────┬────────────┼
    │  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
    │      ┼────────────┼─────────────┼─────────────┼────────────┼
    │  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
    ┼──────┴────────────┴─────────────┴─────────────┴────────────┼
    │chunk 1                                                     │
    │      ┌────────────┬─────────────┬─────────────┬────────────┤
    │  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
    │      ┼────────────┼─────────────┼─────────────┼────────────┼
    │  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
    └──────┴────────────┴─────────────┴─────────────┴────────────┘

    Before this patch, 4 stripe heads will be used, each sh will attach
    bios for 3 disks, and each attached bio will trigger
    bitmap_startwrite() once, which means 12 times in total.
     - 3 times (0 + 4k), for (A0, A1 and A2)
     - 3 times (4 + 4k), for (B0, B1 and B2)
     - 3 times (8 + 4k), for (C0, C1 and C3)
     - 3 times (12 + 4k), for (D0, D1 and D3)

    After this patch, the md upper layer will calculate that the IO range
    (0 + 48k) corresponds to the bitmap range (0 + 16k), and call
    bitmap_startwrite() just once.

    Note that this patch will align bitmap ranges to the chunks, for example,
    if the user issues an IO (0 + 4k) to the array:

    - Before this patch, 1 time (0 + 4k), for A0;
    - After this patch, 1 time (0 + 8k) for chunk 0;

    Usually, one bitmap bit represents more than one disk chunk, so this
    makes no difference. And even if the user really created an array
    in which one chunk contains multiple bits, the only overhead is that more
    data will be recovered after a power failure.
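
The arithmetic in the two examples above can be checked with a small sketch. It is a simplified model working in bytes (the kernel works in sectors), it assumes the range is rounded outward to whole stripes of data, and the helper names are hypothetical.

```c
#include <assert.h>

typedef unsigned long long u64;

struct range { u64 start; u64 len; };

/*
 * Map an array-level IO to the per-disk range it covers, rounded
 * outward so that it always consists of whole chunks.
 */
static struct range io_to_chunk_range(u64 start, u64 len, u64 chunk,
                                      unsigned int data_disks)
{
    struct range r;
    u64 stripe, first, last;

    assert(chunk > 0 && data_disks > 0);
    stripe = chunk * data_disks;                 /* data bytes per stripe */
    first = start / stripe;                      /* first stripe touched */
    last = (start + len + stripe - 1) / stripe;  /* one past the last */

    r.start = first * chunk;
    r.len = (last - first) * chunk;
    return r;
}

/*
 * Number of calls when every data block of the IO triggers its own
 * call, as in the "before" case with 4k blocks.
 */
static u64 per_block_calls(u64 start, u64 len, u64 block)
{
    assert(block > 0);
    return (start + len + block - 1) / block - start / block;
}
```

For the 4-disk array with 8k chunks (3 data disks), io_to_chunk_range(0, 48k, 8k, 3) yields the per-disk range (0 + 16k) covered by a single call, while per_block_calls(0, 48k, 4k) gives the 12 calls counted above. The small IO (0 + 4k) maps to (0 + 8k), matching the chunk alignment described in the text.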

    Also remove STRIPE_BITMAP_PENDING since it's not used anymore.

    [1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
    [2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-10 10:28:20 -04:00
Nigel Croxon 89963dc328 md: don't record new badblocks for faulty rdev
JIRA: https://issues.redhat.com/browse/RHEL-73514

commit 29967332ced51a15a22f11381eeebbc500ba1858
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Oct 31 11:31:10 2024 +0800

    md: don't record new badblocks for faulty rdev

    Faulty is checked before issuing IO to the rdev, however, the rdev can
    become faulty at any time, hence it's possible that rdev_set_badblocks()
    is called for a faulty rdev. In this case, mddev->sb_flags will be
    set and some other paths can be blocked by updating the super block.

    Since a faulty rdev will not be accessed anymore, there is no need to
    record new badblocks for it and force a super block update.

    Note that this is not a bugfix; it just prevents updating the superblock
    in some corner cases, and will help to solve a bug related to external
    metadata[1]. Testing also shows that devices are removed faster in the
    case of IO errors.

    [1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
    Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-10 10:28:17 -04:00
Nigel Croxon 3276d99483 md: don't wait faulty rdev in md_wait_for_blocked_rdev()
JIRA: https://issues.redhat.com/browse/RHEL-73514

commit 50e8274855e7ab5499ff8296e09802874a3f03b1
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Oct 31 11:31:09 2024 +0800

    md: don't wait faulty rdev in md_wait_for_blocked_rdev()

    md_wait_for_blocked_rdev() is called for write IO while the rdev is
    blocked, however, the rdev can become faulty after being chosen for the
    write, and a faulty rdev should never be accessed anymore, hence there is
    no point in waiting for a faulty rdev to be unblocked.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
    Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-10 10:28:17 -04:00
Nigel Croxon 66243da722 md: ensure child flush IO does not affect origin bio->bi_status
JIRA: https://issues.redhat.com/browse/RHEL-73514

commit 62ce0782bbacd32ec10292b9bdd127330e9b6968
Author: Li Nan <linan122@huawei.com>
Date:   Thu Sep 19 14:30:48 2024 +0800

    md: ensure child flush IO does not affect origin bio->bi_status

    When a flush is issued to an RAID array, a child flush IO is created and
    issued for each member disk in the RAID array. Since commit b75197e86e6d
    ("md: Remove flush handling"), each child flush IO has been chained with
    the original bio. As a result, the failure of any child IO could modify
    the bi_status of the original bio, potentially impacting the upper-layer
    filesystem.

    Fix the issue by preventing child flush IO from altering the original
    bio->bi_status as before. However, this design introduces a known
    issue: in the event of a power failure, if a flush IO on a member
    disk fails, the upper layers may not be informed. This issue is not easy
    to fix and will not be addressed for the time being in this issue.

    Fixes: b75197e86e6d ("md: Remove flush handling")
    Signed-off-by: Li Nan <linan122@huawei.com>
    Reviewed-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240919063048.2887579-1-linan666@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-10 10:28:17 -04:00
Rado Vrbovsky 00f182a2ef Merge branch 'centos-stream-9-rhel9.6-update-v6.11' 2024-10-18 15:08:53 +00:00
Nigel Croxon aa3aa30c35 md: Add new_level sysfs interface
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit d981ed8419303ed12351eea8541ad6cb76455fe3
Author: Xiao Ni <xni@redhat.com>
Date:   Thu Sep 5 07:54:53 2024 +0800

    md: Add new_level sysfs interface

    Now reshape supports two ways: with a backup file or without one.
    Without a backup file, reshape needs to change the data offset. It
    doesn't need the systemd service mdadm-grow-continue, so it can finish
    the reshape job within a single process. That process knows the new level
    from the mdadm --grow command and can change to the new level after
    reshape finishes.

    With a backup file, reshape needs the systemd service
    mdadm-grow-continue to monitor reshape progress, so there are two
    processes involved. One is the mdadm --grow command, which kicks off
    reshape and wakes up the mdadm-grow-continue service. The second process
    is the service, which doesn't know the new level from the first process.

    In kernel space mddev->new_level is used to record the new level when
    doing reshape. This patch adds a new interface to help mdadm update
    new_level and sync it to metadata. Then mdadm-grow-continue can read the
    right new_level.

    Commit log revised by Song Liu. Please refer to the link for more details.

    Signed-off-by: Xiao Ni <xni@redhat.com>
    Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 16:10:09 -04:00
Nigel Croxon 3fea7e753a md: Report failed arrays as broken in mdstat
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 2d2b3bc145b9d5b5c6f07d22291723ddb024ca76
Author: Mateusz Kusiak <mateusz.kusiak@intel.com>
Date:   Tue Sep 3 16:29:49 2024 +0200

    md: Report failed arrays as broken in mdstat

    Depending on whether an array has a personality, it is reported as either
    active or inactive. This patch adds a third status, "broken", for arrays
    with a personality that became inoperative. The reason is that end users
    tend to assume that "active" indicates the array is operational.

    Add "broken" state for inoperative arrays with personality and refactor
    the code.
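
The three-way status described above amounts to a small decision function. The sketch below is a userspace illustration only; struct array_state and mdstat_state() are hypothetical names, and the two booleans stand in for "has a personality" and "became inoperative".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of the two inputs the status depends on. */
struct array_state {
    bool has_personality;   /* a personality is attached to the array */
    bool broken;            /* the array became inoperative */
};

/* Return the status word that would be shown for this array. */
static const char *mdstat_state(const struct array_state *s)
{
    assert(s != NULL);
    if (!s->has_personality)
        return "inactive";
    return s->broken ? "broken" : "active";
}
```

An array without a personality stays "inactive" regardless of the broken flag, so "broken" only ever replaces what used to be reported as "active".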

    Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com>
    Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 16:10:02 -04:00
Nigel Croxon c644f50a7e md: Remove flush handling
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit b75197e86e6d3de4e611869ef30a27cf414a5f77
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Tue Aug 27 19:06:16 2024 +0800

md: Remove flush handling

For flush requests, md has special flush handling to merge concurrent
flush requests into a single one, however, the whole mechanism is based on
a disk level spin_lock 'mddev->lock'. And fsync can be called quite
often in some use cases; as a consequence, taking the spin lock in the IO
fast path can cause performance degradation.

Fortunately, the block layer already has flush handling to merge
concurrent flush request, and it only acquires hctx level spin lock. (see
details in blk-flush.c)

This patch removes the flush handling in md, and converts to use general
block layer flush handling in underlying disks.

Flush test for 4 nvme raid10:
start 128 threads to do fsync 100000 times, on arm64, see how long it
takes.

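Test script:
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Values taken from the test description above. */
#define THREADS     128
#define FSYNC_COUNT 100000

void* thread_func(void* arg) {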
    int fd = *(int*)arg;
    for (int i = 0; i < FSYNC_COUNT; i++) {
        fsync(fd);
    }
    return NULL;
}

int main() {
    int fd = open("/dev/md0", O_RDWR);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    pthread_t threads[THREADS];
    struct timeval start, end;

    gettimeofday(&start, NULL);

    for (int i = 0; i < THREADS; i++) {
        pthread_create(&threads[i], NULL, thread_func, &fd);
    }

    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    gettimeofday(&end, NULL);

    close(fd);

    long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
    printf("Elapsed time: %lld microseconds\n", elapsed);

    return 0;
}

Test result: about 10 times faster:
Before this patch: 50943374 microseconds
After this patch:  5096347  microseconds

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit b75197e86e6d3de4e611869ef30a27cf414a5f77)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:25:26 -04:00
Nigel Croxon 5dc63f5dff md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 49f5f5e309e6127957babed7834f5a0e1022f936
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:50 2024 +0800

    md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.
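
The general pattern behind this series, an operations table that hides the implementation behind function pointers, can be sketched as below. This is a userspace illustration under stated assumptions: struct demo_bitmap_ops, its three callbacks and the simple_ implementation are hypothetical and much smaller than the real bitmap_operations.

```c
#include <assert.h>
#include <stddef.h>

struct array;

/* Hypothetical operations table; callers only ever see these hooks. */
struct demo_bitmap_ops {
    int  (*create)(struct array *a);
    void (*destroy)(struct array *a);
    void (*flush)(struct array *a);
};

struct array {
    const struct demo_bitmap_ops *ops;
    int bits_dirty;
    int created;
};

/* One concrete implementation, private to its own file in practice. */
static int simple_create(struct array *a)
{
    a->created = 1;
    a->bits_dirty = 0;
    return 0;
}

static void simple_destroy(struct array *a) { a->created = 0; }
static void simple_flush(struct array *a)   { a->bits_dirty = 0; }

static const struct demo_bitmap_ops simple_ops = {
    .create  = simple_create,
    .destroy = simple_destroy,
    .flush   = simple_flush,
};

/* Select an implementation and create the bitmap through the table. */
static int array_attach_bitmap(struct array *a,
                               const struct demo_bitmap_ops *ops)
{
    assert(ops != NULL && ops->create != NULL);
    a->ops = ops;
    return a->ops->create(a);
}
```

Swapping in a different bitmap implementation then only requires providing another table; the callers that go through a->ops stay unchanged.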

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-41-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:14:53 -04:00
Nigel Croxon d8e67d6acc md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 18db2a9c60aefc61e796f6a384a952999d3b8885
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:43 2024 +0800

    md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-34-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:37 -04:00
Nigel Croxon b8644c047c md/md-bitmap: merge bitmap_unplug() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 3c9883e77a36ca76b8d92afa99599263ca587ae7
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:42 2024 +0800

    md/md-bitmap: merge bitmap_unplug() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-33-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:37 -04:00
Nigel Croxon 014f009476 md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 48eb95810a9241afd871de917d70712e2ddfda31
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:41 2024 +0800

    md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()

    Add a parameter 'bool sync' to distinguish them, and
    md_bitmap_unplug_async() won't be exported anymore, hence
    bitmap_operations only need one op to cover them.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:37 -04:00
Nigel Croxon fde03e7355 md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 2d3b130e177f14b461c47880b6e0b338fd6872f5
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:32 2024 +0800

    md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Also change the parameter from bitmap to mddev, to avoid accessing the
    bitmap outside md-bitmap.c as much as possible.

    And while we're here, also fix coding style for bitmap_store().

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:54 -04:00
Nigel Croxon 22d76f5f94 md/md-bitmap: merge bitmap_write_all() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit b26313cb96f1b3fd6f07d3243f6cd426c5cbaf39
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:31 2024 +0800

    md/md-bitmap: merge bitmap_write_all() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Also change the parameter from bitmap to mddev, to avoid accessing the
    bitmap outside md-bitmap.c as much as possible.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:54 -04:00
Nigel Croxon f9231aa350 md/md-bitmap: merge md_bitmap_status() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 696936838bc18a761ed778910975d51cf2c35e3a
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:29 2024 +0800

    md/md-bitmap: merge md_bitmap_status() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:53 -04:00
Nigel Croxon dc11d78127 md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit fe59b34676b4ec6b48a7b436d3422fc9317e047a
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:28 2024 +0800

    md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:53 -04:00
Nigel Croxon 92b172d83b md/md-bitmap: merge md_bitmap_flush() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit ca925302e841ff0a0598b283f87c472d92b389f3
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:26 2024 +0800

    md/md-bitmap: merge md_bitmap_flush() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:53 -04:00
Nigel Croxon 0c8e74bde2 md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit a2bd70319290d80127dc4257b8c17df3f027c15d
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:25 2024 +0800

    md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:11:53 -04:00
Nigel Croxon 6dcbf10abf md/md-bitmap: merge md_bitmap_load() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit e1e490805958617327be14eaf0ed31d71adc2c54
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:24 2024 +0800

    md/md-bitmap: merge md_bitmap_load() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:11:49 -04:00
Nigel Croxon e61d53e425 md/md-bitmap: merge md_bitmap_create() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 04c80e649512f2c24f99052440cc808163eff40c
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:23 2024 +0800

    md/md-bitmap: merge md_bitmap_create() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:11:44 -04:00
Nigel Croxon 5247f3b335 md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 7545d385ec7e4c0d5e86e7cde4fe3fb8f4555fb9
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:22 2024 +0800

    md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()

    Other than internal api get_bitmap_from_slot(), all other places will
    set returned bitmap to mddev->bitmap. So move the setting of
    mddev->bitmap into md_bitmap_create() to simplify code.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:11:39 -04:00
Nigel Croxon bc4b9bf09d md/md-bitmap: introduce struct bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 7add9db6ba3e9bd12d2be97abbc13f3881a515db
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:21 2024 +0800

md/md-bitmap: introduce struct bitmap_operations

The structure is empty for now, and will be used in later patches to
merge in bitmap operations, so that bitmap implementation won't be
exposed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 7add9db6ba3e9bd12d2be97abbc13f3881a515db)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:10:53 -04:00
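The operations-table idea behind the bitmap_operations commits above can be sketched in plain C. This is a simplified userspace illustration, not the kernel code: the `toy_` structures, the member names `create` and `load`, and the helper `toy_bitmap_setup()` are all hypothetical names chosen to mirror the commit titles.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the md structures; not the kernel definitions. */
struct toy_mddev;

/* Table of function pointers: callers never see the implementation. */
struct toy_bitmap_operations {
	int (*create)(struct toy_mddev *mddev);
	int (*load)(struct toy_mddev *mddev);
};

struct toy_mddev {
	const struct toy_bitmap_operations *bitmap_ops;
	int created;
	int loaded;
};

static int toy_default_create(struct toy_mddev *mddev)
{
	mddev->created = 1;
	return 0;
}

static int toy_default_load(struct toy_mddev *mddev)
{
	if (!mddev->created)
		return -1;	/* in this toy, load before create is an error */
	mddev->loaded = 1;
	return 0;
}

static const struct toy_bitmap_operations toy_default_ops = {
	.create = toy_default_create,
	.load = toy_default_load,
};

/* Callers go through the table, so another implementation can be swapped in. */
static int toy_bitmap_setup(struct toy_mddev *mddev)
{
	int err = mddev->bitmap_ops->create(mddev);

	if (err)
		return err;
	return mddev->bitmap_ops->load(mddev);
}
```

Replacing `toy_default_ops` with a different table changes the behavior without touching any caller, which is the property the commit messages describe as making it possible to invent a new bitmap.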
Nigel Croxon 6d4a54dff3 md/md-bitmap: add 'file_pages' into struct md_bitmap_stats
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 10bc2ac10597ebc0b25afbc72fa4284565548e36
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:17 2024 +0800

    md/md-bitmap: add 'file_pages' into struct md_bitmap_stats

    There are no functional changes; avoid dereferencing bitmap directly to
    prepare for inventing a new bitmap.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:50:00 -04:00
Nigel Croxon a8260f8cfd md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit d004442f46ccae9ea90fdda7a2b0516f1d42b88e
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:14 2024 +0800

    md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats

    Also add a new helper to get events_cleared to avoid dereferencing
    bitmap directly, to prepare for inventing a new bitmap.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:44 -04:00
Nigel Croxon 3c1d78320a md: use new helper md_bitmap_get_stats() in update_array_info()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 968153812215d68c27c0c9d90da6ec2f6d17a606
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:13 2024 +0800

    md: use new helper md_bitmap_get_stats() in update_array_info()

    There are no functional changes; avoid dereferencing bitmap directly to
    prepare for inventing a new bitmap.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:40 -04:00
Nigel Croxon b4fa7c04b4 md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 38f287d7e495ae00d4481702f44ff7ca79f5c9bc
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:12 2024 +0800

    md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()

    There are no functional changes, and the new helper will be used in
    multiple places in following patches to avoid dereferencing bitmap
    directly.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:37 -04:00
Nigel Croxon ce5d123fe1 md: Don't flush sync_work in md_write_start()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 86ad4cda79e0dade87d4bb0d32e1fe541d4a63e8
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Aug 1 20:47:46 2024 +0800

    md: Don't flush sync_work in md_write_start()

    Because flushing sync_work may trigger mddev_suspend() if there are spares,
    and this should never be done in the IO path because mddev_suspend() is used
    to wait for IO.

    This problem was found by code review.

    Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
    Cc: stable@vger.kernel.org
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:27 -04:00
Ming Lei 4c4a2238d3 md: set md-specific flags for all queue limits
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit 573d5abf3df00c879fbd25774e4cf3e22c9cabd0
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Jun 26 16:26:22 2024 +0200

    md: set md-specific flags for all queue limits

    The md driver wants to enforce a number of flags for all devices, even
    when not inheriting them from the underlying devices.  To make sure these
    flags survive the queue_limits_set calls that md uses to update the
    queue limits without deriving them from the previous limits, add a new
    md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
    these flags.

    Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:10 +08:00
Ming Lei 72cc0ad565 block: move the nowait flag to queue_limits
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit f76af42f8bf13d2620084f305f01691de9238fc7
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:46 2024 +0200

    block: move the nowait flag to queue_limits

    Move the nowait flag into the queue_limits feature field so that it can
    be set atomically with the queue frozen.

    Stacking drivers are simplified in that they now can simply set the
    flag, and blk_stack_limits will clear it when the feature is not
    supported by any of the underlying devices.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:08 +08:00
Ming Lei a8090e43fe block: move the io_stat flag setting to queue_limits
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit cdb2497918cc2929691408bac87b58433b45b6d3
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:43 2024 +0200

    block: move the io_stat flag setting to queue_limits

    Move the io_stat flag into the queue_limits feature field so that it can
    be set atomically with the queue frozen.

    Simplify md and dm to set the flag unconditionally instead of avoiding
    setting a simple flag for cases where it already is set by other means,
    which is a bit pointless.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:08 +08:00
Ming Lei 38f0dd51b5 block: move the nonrot flag to queue_limits
JIRA: https://issues.redhat.com/browse/RHEL-56837
Conflicts: drop change on ublk & bcache

commit bd4a633b6f7c3c6b6ebc1a07317643270e751a94
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:41 2024 +0200

    block: move the nonrot flag to queue_limits

    Move the nonrot flag into the queue_limits feature field so that it can
    be set atomically with the queue frozen.

    Use the chance to switch to defaulting to non-rotational and require
    the driver to opt into rotational, which matches the polarity of the
    sysfs interface.

    For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
    rotational flag is not set as they clearly are not rotational despite
    this being a behavior change.  There are some other drivers that
    unconditionally set the rotational flag to keep the existing behavior
    as they arguably can be used on rotational devices even if that is
    probably not their main use today (e.g. virtio_blk and drbd).

    The flag is automatically inherited in blk_stack_limits matching the
    existing behavior in dm and md.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:08 +08:00
Ming Lei c2430f6692 block: move cache control settings out of queue->flags
JIRA: https://issues.redhat.com/browse/RHEL-56837
Conflicts: drop change on bcache & ublk

commit 1122c0c1cc71f740fa4d5f14f239194e06a1d5e7
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:40 2024 +0200

    block: move cache control settings out of queue->flags

    Move the cache control settings into the queue_limits so that the flags
    can be set atomically with the device queue frozen.

    Add new features and flags field for the driver set flags, and internal
    (usually sysfs-controlled) flags in the block layer.  Note that we'll
    eventually remove enough fields from queue_limits to bring it back to the
    previous size.

    The disable flag is inverted compared to the previous meaning, which
    means it now survives a rescan, similar to the max_sectors and
    max_discard_sectors user limits.

    The FLUSH and FUA flags are now inherited by blk_stack_limits, which
    simplified the code in dm a lot, but also causes a slight behavior
    change in that dm-switch and dm-unstripe now advertise a write cache
    despite setting num_flush_bios to 0.  The I/O path will handle this
    gracefully, but as far as I can tell the lack of num_flush_bios
    and thus flush support is a pre-existing data integrity bug in those
    targets that really needs fixing, after which a non-zero num_flush_bios
    should be required in dm for targets that map to underlying devices.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:08 +08:00
Ming Lei 9fbe332e6d block: move integrity information into queue_limits
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit c6e56cf6b2e79a463af21286ba951714ed20828c
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 13 10:48:22 2024 +0200

    block: move integrity information into queue_limits

    Move the integrity information into the queue limits so that it can be
    set atomically with other queue limits, and that the sysfs changes to
    the read_verify and write_generate flags are properly synchronized.
    This also allows providing a more useful helper to stack the integrity
    fields, although it still is separate from the main stacking function
    as not all stackable devices want to inherit the integrity settings.
    Even with that it greatly simplifies the code in md and dm.

    Note that the integrity field is moved as-is into the queue limits.
    While there are good arguments for removing the separate blk_integrity
    structure, this would cause a lot of churn and might better be done at a
    later time if desired.  However the integrity field in the queue_limits
    structure is now unconditional so that various ifdefs can be avoided or
    replaced with IS_ENABLED().  Given the tiny size of it, that seems like
    a worthwhile trade off.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:06 +08:00
Ming Lei 7122d1fe5c md: remove mddev->queue
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit 396799eb5b6f87ec2d759e1a90e179f7058ab9e6
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Mar 3 07:01:49 2024 -0700

    md: remove mddev->queue

    Just use the request_queue from the gendisk pointer in the relatively
    few places that still need it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Song Liu <song@kernel.org>
    Tested-by: Song Liu <song@kernel.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:41 +08:00
Ming Lei f4f7a58f95 md: don't initialize queue limits
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit 81a16e19d545fd244ad176f7222d92b67215a33b
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Mar 3 07:01:48 2024 -0700

    md: don't initialize queue limits

    Initial queue limits are now set from ->run.  Remove the superfluous
    initialization in md_alloc and level_store.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Song Liu <song@kernel.org>
    Tested-by: Song Liu <song@kernel.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240303140150.5435-10-hch@lst.de

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:41 +08:00
Ming Lei 13a8d4641b md: add queue limit helpers
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit e305fce1883128a9468efe1876a057df48a261d6
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Mar 3 07:01:43 2024 -0700

    md: add queue limit helpers

    Add a few helpers that wrap the block queue limits API for use in MD.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Song Liu <song@kernel.org>
    Tested-by: Song Liu <song@kernel.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:40 +08:00
Ming Lei ba42c16212 block: pass a queue_limits argument to blk_alloc_disk
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit 74fa8f9c553f7b5ccab7d103acae63cc2e080465
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Feb 15 08:10:47 2024 +0100

    block: pass a queue_limits argument to blk_alloc_disk

    Pass a queue_limits to blk_alloc_disk and apply it if non-NULL.  This
    will allow allocating queues with valid queue limits instead of setting
    the values one at a time later.

    Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
    which can't distinguish errors.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:18:35 +08:00
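The ERR_PTR convention mentioned in the blk_alloc_disk commit above encodes a negative errno value in the returned pointer, so a caller can tell failures apart, which a bare NULL cannot do. The following is a simplified userspace re-implementation for illustration only: the `toy_` helpers and `toy_alloc_disk()` are hypothetical and are not the kernel's `ERR_PTR()`, `IS_ERR()`, `PTR_ERR()` or `blk_alloc_disk()`.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the top 4095 address values are reserved for error codes. */
#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the reserved error range. */
static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_disk {
	int capacity;
};

/* Hypothetical allocator: reports why it failed instead of returning NULL. */
static struct toy_disk *toy_alloc_disk(int capacity)
{
	struct toy_disk *disk;

	if (capacity <= 0)
		return toy_err_ptr(-EINVAL);

	disk = malloc(sizeof(*disk));
	if (!disk)
		return toy_err_ptr(-ENOMEM);

	disk->capacity = capacity;
	return disk;
}
```

A caller checks `toy_is_err()` on the result and, on failure, propagates `toy_ptr_err()` rather than guessing at a single generic error code.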
Rado Vrbovsky 78fa0da45f Merge: md: fix deadlock between mddev_suspend and flush bio
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5067

JIRA: https://issues.redhat.com/browse/RHEL-54757

CVE: CVE-2024-43855

Upstream Status: commit found in Linus's git tree

Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=63498077

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>

Approved-by: Jay Shin <jaeshin@redhat.com>
Approved-by: Heinz Mauelshagen <heinzm@redhat.com>
Approved-by: Xiao Ni <xni@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-09-23 08:11:01 +00:00
Nigel Croxon 37a7a4be7a md-cluster: fix no recovery job when adding/re-adding a disk
JIRA: https://issues.redhat.com/browse/RHEL-46615

commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1
Author: Heming Zhao <heming.zhao@suse.com>
Date:   Tue Jul 9 18:41:20 2024 +0800

md-cluster: fix no recovery job when adding/re-adding a disk

The commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a
clustered environment, this will cause another node to first
activate the spare disk and skip recovery. As a result, no
nodes will perform recovery when a disk is added or re-added.

Before db5e653d7c9f:

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + md_choose_sync_action           process_metadata_update
 |  remove_and_add_spares           //node1 has not finished adding
 + call mddev->sync_work            //the spare disk:do nothing

md_start_sync
 starts md_do_sync

md_do_sync
 + grabbed resync_lockres:DLM_LOCK_EX
 + do syncing job

md_check_recovery
 sendmsg: METADATA_UPDATED
                                 process_metadata_update
                                   //activate spare disk

                                 ... ...

                                 md_do_sync
                                  waiting to grab resync_lockres:EX
```

After db5e653d7c9f:

(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + calls mddev->sync_work         process_metadata_update
                                   //node1 has not finished adding
                                   //the spare disk:do nothing

md_start_sync
 + md_choose_sync_action
 |  remove_and_add_spares
 + calls md_do_sync

md_check_recovery
 md_update_sb
  sendmsg: METADATA_UPDATED
                                  process_metadata_update
                                    //activate spare disk

  ... ...                         ... ...

                                  md_do_sync
                                   + grabbed resync_lockres:EX
                                   + raid1_sync_request skip sync under
                                     conf->fullsync:0
md_do_sync
 1. waiting to grab resync_lockres:EX
 2. when node1 could grab EX lock,
    node1 will skip resync under recovery_offset:MaxSector
```

How to trigger:

```(commands @node1)
 # to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"

mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc

=== "cat /proc/mdstat" on both nodes: there is no recovery action. ===
```

How to fix:

Because the md layer logic is hard to restore in order to speed up the
sync job on the local node, add a new cluster message that makes the
other node hold off on activating the disk.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
(cherry picked from commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-08-28 13:32:32 -04:00
Nigel Croxon b059da8c12 md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl
JIRA: https://issues.redhat.com/browse/RHEL-46615

commit a1fd37f97808db4fa1bf55da0275790c42521e45
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Jun 27 19:23:21 2024 +0800

md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl

Commit 90f5f7ad4f ("md: Wait for md_check_recovery before attempting
device removal.") explained in the commit message that a failed device
must be removed from the personality first by md_check_recovery(),
before it can be removed from the array. That's the reason the commit
added the code to wait for MD_RECOVERY_NEEDED.

However, this is not the case now, because remove_and_add_spares() is
called directly from hot_remove_disk() in the ioctl path, hence a failed
device (marked faulty) can be removed from the personality by the ioctl.

On the other hand, the commit introduced a performance problem: if
MD_RECOVERY_NEEDED is set and the array is not running, the ioctl will
wait for 5s before it can return failure to the user.

Since the waiting is not needed now, fix the problem by removing the
waiting.

Fixes: 90f5f7ad4f ("md: Wait for md_check_recovery before attempting device removal.")
Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com>
Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
(cherry picked from commit a1fd37f97808db4fa1bf55da0275790c42521e45)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-08-28 13:32:26 -04:00