JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 8542870237c3a48ff049b6c5df5f50c8728284fa
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Feb 20 20:43:48 2025 +0800
md: fix mddev uaf while iterating all_mddevs list
While iterating the all_mddevs list from md_notify_reboot() and md_exit(),
list_for_each_entry_safe is used, and this can race with deleting the
next mddev, causing UAF:
t1:
spin_lock
//list_for_each_entry_safe(mddev, n, ...)
 mddev_get(mddev1)
 // assume mddev2 is the next entry
 spin_unlock
                                t2:
                                //remove mddev2
                                ...
                                mddev_free
                                spin_lock
                                list_del
                                spin_unlock
                                kfree(mddev2)
 mddev_put(mddev1)
 spin_lock
 //continue dereference mddev2->all_mddevs
The old helper for_each_mddev() actually grabs a reference to mddev2
while holding the lock, to prevent it from being freed. This problem can be
fixed the same way; however, the code would be complex.
Hence switch to list_for_each_entry. In this case mddev_put() can free
mddev1, which is not safe either, so follow md_seq_show() and factor
out a helper mddev_put_locked() to fix this problem.
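The idea behind the fix can be sketched in userspace C. This is a single-threaded analogy under stated assumptions, not the kernel code: struct dev, dev_put_locked(), the toy lock flag and the visitor are hypothetical stand-ins for struct mddev, mddev_put_locked() and all_mddevs_lock. The sketch reads the successor only after the lock has been re-taken, while a reference on the current entry is still held, so an entry removed during the unlocked window is never dereferenced.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mddev: a refcounted list node. */
struct dev {
    int id;
    int refs;              /* the list itself holds one reference */
    struct dev *next;
};

static struct dev *all_devs;  /* list head, protected by the toy lock */
static int list_locked;       /* toy "spinlock": 1 while held */

static void lock_list(void)   { assert(!list_locked); list_locked = 1; }
static void unlock_list(void) { assert(list_locked);  list_locked = 0; }

/* Drop a reference with the lock held; unlink and free on the last put. */
static void dev_put_locked(struct dev *d)
{
    assert(list_locked);
    if (--d->refs > 0)
        return;
    for (struct dev **pp = &all_devs; *pp; pp = &(*pp)->next) {
        if (*pp == d) {
            *pp = d->next;
            break;
        }
    }
    free(d);
}

void dev_add(int id)
{
    struct dev *d = malloc(sizeof(*d));

    assert(d);
    d->id = id;
    d->refs = 1;
    lock_list();
    d->next = all_devs;   /* prepend */
    all_devs = d;
    unlock_list();
}

/* Drop the list's reference; this sketch assumes each id is removed once. */
void dev_remove(int id)
{
    lock_list();
    for (struct dev *d = all_devs; d; d = d->next) {
        if (d->id == id) {
            dev_put_locked(d);
            break;        /* d may be freed now, do not touch it again */
        }
    }
    unlock_list();
}

int for_each_dev(void (*visit)(struct dev *))
{
    struct dev *d, *next;
    int visited = 0;

    lock_list();
    for (d = all_devs; d; d = next) {
        d->refs++;            /* pin the current entry */
        unlock_list();
        visit(d);             /* others may remove entries here */
        visited++;
        lock_list();
        next = d->next;       /* read the successor under the lock */
        dev_put_locked(d);    /* may free d, so nothing reads d after this */
    }
    unlock_list();
    return visited;
}

/* Example visitor playing the role of t2: removes dev 2 while visiting dev 1. */
static int seen[8], nseen;
void visit_and_race(struct dev *d)
{
    seen[nseen++] = d->id;
    if (d->id == 1)
        dev_remove(2);
}
```

With the list 1 -> 2 -> 3, for_each_dev(visit_and_race) visits entries 1 and 3 only. A loop that cached the successor before dropping the lock would have gone on to the freed entry 2.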
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com
Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot")
Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit")
Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 87a86277c9f54953e184318bf71630388aeaf000
Author: Yu Kuai <yukuai3@huawei.com>
Date: Sat Feb 15 17:22:25 2025 +0800
md: switch md-cluster to use md_submodle_head
To make the code cleaner, and to prepare to add kconfig for bitmap.
Also remove the unused global variables pers_lock, md_cluster_ops and
md_cluster_mod, and exported symbols register_md_cluster_operations(),
unregister_md_cluster_operations() and md_cluster_ops.
Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 3d44e1d1575a877cf75a7776802506ce7ab8ecc4
Author: Yu Kuai <yukuai3@huawei.com>
Date: Sat Feb 15 17:22:22 2025 +0800
md: switch personalities to use md_submodule_head
Remove the global list 'pers_list', and switch to use md_submodule_head,
which is managed by xarray. Prepare to unify registration and unregistration
for all sub modules.
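As a rough userspace analogy of id-keyed registration, the sketch below uses a fixed array in place of the xarray. The enum values, struct layout and function names are hypothetical simplifications, not the md headers. The -EBUSY result for an occupied slot mirrors what an insert-only store would typically report.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ids; the real enum lives in the md headers. */
enum submodule_id { ID_RAID0, ID_RAID1, ID_CLUSTER, ID_MAX };

/* Common head that every sub module would embed. */
struct submodule_head {
    enum submodule_id id;
    const char *name;
};

/* Fixed array standing in for the xarray: one slot per id. */
static struct submodule_head *submodules[ID_MAX];

int register_submodule(struct submodule_head *msh)
{
    assert(msh && msh->id < ID_MAX);
    if (submodules[msh->id])
        return -EBUSY;                 /* id already registered */
    submodules[msh->id] = msh;
    return 0;
}

void unregister_submodule(struct submodule_head *msh)
{
    assert(msh && msh->id < ID_MAX);
    if (submodules[msh->id] == msh)
        submodules[msh->id] = NULL;
}

struct submodule_head *find_submodule(enum submodule_id id)
{
    return id < ID_MAX ? submodules[id] : NULL;
}
```

Because every sub module shares one head and one table, personalities and md-cluster can go through the same register and unregister path instead of keeping separate lists.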
Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
(cherry picked from commit 3d44e1d1575a877cf75a7776802506ce7ab8ecc4)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit c594de0455b3d65525bad2020f7f7e41af233045
Author: Yu Kuai <yukuai3@huawei.com>
Date: Sat Feb 15 17:22:24 2025 +0800
md: don't export md_cluster_ops
Add a new field 'cluster_ops' and initialize it in md_setup_cluster(), so
that the global variable 'md_cluster_ops' doesn't need to be exported.
Also prepare to switch md-cluster to use md_submodule_head.
Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit d3beb7c9c61d239e73cb93481b27c7b94130dd03
Author: Yu Kuai <yukuai3@huawei.com>
Date: Sat Feb 15 17:22:21 2025 +0800
md: introduce struct md_submodule_head and APIs
Prepare to unify registration and unregistration of md personalities
and md-cluster, and also prepare to add kconfig for md-bitmap.
Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 9faab548974e3eb858250fea1ab7e823a689b44b
Author: Yu Kuai <yukuai3@huawei.com>
Date: Sat Feb 15 17:22:19 2025 +0800
md: merge common code into find_pers()
- pers_lock is held and released by the caller
- try_module_get() is called by the caller
- the error message is printed by the caller
Merge the above code into find_pers() and rename it to get_pers(); also
add put_pers() as a wrapper around module_put().
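The shape of the merged helper can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: the table, the module_refs counter standing in for the module reference, the unloading flag standing in for a failed try_module_get(), and the wording of the error message. Locking is only marked by comments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a personality descriptor. */
struct pers {
    int level;
    const char *name;
    int module_refs;   /* stand-in for the module reference count */
    int unloading;     /* when set, taking a module reference fails */
};

static struct pers pers_table[] = {
    { 0, "raid0", 0, 0 },
    { 1, "raid1", 0, 0 },
};

/* Lookup, reference grab and error report in one place. */
struct pers *get_pers(int level, const char *name)
{
    struct pers *found = NULL;
    size_t i;

    assert(name);
    /* the real code would take the lock here */
    for (i = 0; i < sizeof(pers_table) / sizeof(pers_table[0]); i++) {
        struct pers *p = &pers_table[i];

        if (p->level == level || strcmp(p->name, name) == 0) {
            if (!p->unloading) {
                p->module_refs++;
                found = p;
            }
            break;
        }
    }
    /* ... and release it here */
    if (!found)
        fprintf(stderr, "personality for level %d (%s) is not loaded\n",
                level, name);
    return found;
}

/* Counterpart that drops the reference taken by get_pers(). */
void put_pers(struct pers *p)
{
    assert(p && p->module_refs > 0);
    p->module_refs--;
}
```

Callers then pair get_pers() with put_pers() and no longer repeat the lock, reference and error-message steps themselves.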
Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 4b10a3bc67c1232f76aa1e04778ca26d6c0ddf7f
Author: Li Nan <linan122@huawei.com>
Date: Thu Feb 13 21:15:30 2025 +0800
md: ensure resync is prioritized over recovery
If a new disk is added during resync, the resync process is interrupted,
and recovery is triggered, causing the previous resync to be lost. In
reality, disk addition should not terminate resync, fix it.
Steps to reproduce the issue:
mdadm -CR /dev/md0 -l1 -n3 -x1 /dev/sd[abcd]
mdadm --fail /dev/md0 /dev/sdc
Fixes: 24dd469d72 ("[PATCH] md: allow a manual resync with md")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250213131530.3698600-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 127186cfb184eaccdfe948e6da66940cfa03efc5
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Jan 2 19:28:41 2025 +0800
md: reintroduce md-linear
The md-linear was removed by commit 849d18e27be9 ("md: Remove deprecated
CONFIG_MD_LINEAR") because it had been marked as deprecated for a long
time.
However, md-linear is widely used for underlying disks of different sizes;
sadly we didn't know this until now. It is truly useful to create
partitions, assemble multiple raids, and then append one to the other.
People have to use dm-linear in this case now; however, they would prefer
to minimize the number of modules involved.
Fixes: 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250102112841.1227111-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 127186cfb184eaccdfe948e6da66940cfa03efc5)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-83988
commit 849d18e27be9a1253f2318cb4549cc857219d991
Author: Song Liu <song@kernel.org>
Date: Thu Dec 14 14:21:05 2023 -0800
md: Remove deprecated CONFIG_MD_LINEAR
md-linear has been marked as deprecated for 2.5 years. Remove it.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Mateusz Grzonka <mateusz.grzonka@intel.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org
(cherry picked from commit 849d18e27be9a1253f2318cb4549cc857219d991)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-73514
commit 8d28d0ddb986f56920ac97ae704cc3340a699a30
Author: Yu Kuai <yukuai3@huawei.com>
Date: Fri Jan 24 17:20:55 2025 +0800
md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
After commit ec6bb299c7c3 ("md/md-bitmap: add 'sync_size' into struct
md_bitmap_stats"), the following panic is reported:
Oops: general protection fault, probably for non-canonical address
RIP: 0010:bitmap_get_stats+0x2b/0xa0
Call Trace:
<TASK>
md_seq_show+0x2d2/0x5b0
seq_read_iter+0x2b9/0x470
seq_read+0x12f/0x180
proc_reg_read+0x57/0xb0
vfs_read+0xf6/0x380
ksys_read+0x6c/0xf0
do_syscall_64+0x82/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The root cause is that bitmap_get_stats() can be called at any time while
the mddev still exists, even if the bitmap is destroyed or not fully
initialized. Dereferencing the bitmap in this case can crash the kernel.
Meanwhile, the above commit started dereferencing bitmap->storage, making
the problem easier to trigger.
Fix the problem by protecting bitmap_get_stats() with bitmap_info.mutex.
Cc: stable@vger.kernel.org # v6.12+
Fixes: 32a7627cf3 ("[PATCH] md: optimised resync using Bitmap based intent logging")
Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Closes: https://lore.kernel.org/linux-raid/ca3a91a2-50ae-4f68-b317-abd9889f3907@oracle.com/T/#m6e5086c95201135e4941fe38f9efa76daf9666c5
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250124092055.4050195-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-73514
commit cd5fc653381811f1e0ba65f5d169918cab61476f
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Jan 9 09:51:45 2025 +0800
md/md-bitmap: move bitmap_{start, end}write to md upper layer
There are two BUG reports that raid5 will hang at
bitmap_startwrite() ([1], [2]). The root cause is that bitmap start write
and end write are unbalanced; it's not quite clear where. While reviewing
the raid5 code, it was found that bitmap operations can be optimized. For
example, for a 4-disk raid5 with chunksize=8k, if the user issues an IO
(0 + 48k) to the array:
┌────────────────────────────────────────────────────────────┐
│chunk 0                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
│      ├────────────┼─────────────┼─────────────┼────────────┤
│  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
├──────┴────────────┴─────────────┴─────────────┴────────────┤
│chunk 1                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
│      ├────────────┼─────────────┼─────────────┼────────────┤
│  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘
Before this patch, 4 stripe heads will be used, each sh will attach
a bio for each of 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means 12 calls in total:
- 3 times (0 + 4k), for (A0, A1 and A2)
- 3 times (4k + 4k), for (B0, B1 and B2)
- 3 times (8k + 4k), for (C0, C1 and C3)
- 3 times (12k + 4k), for (D0, D1 and D3)
After this patch, the md upper layer will calculate that the IO range
(0 + 48k) corresponds to the bitmap range (0 + 16k), and call
bitmap_startwrite() just once.
Note that this patch aligns bitmap ranges to chunks. For example,
if the user issues an IO (0 + 4k) to the array:
- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k), for chunk 0;
Usually, one bitmap bit represents more than one disk chunk, so this
makes no difference. Even if the user really created an array
in which one chunk contains multiple bits, the only overhead is that more
data will be recovered after a power failure.
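The arithmetic in the two examples can be reproduced with a short sketch. The function name, the struct and the use of byte offsets are assumptions made for readability; the kernel works in sectors and its helper may be structured differently. The sketch rounds the array-level IO range out to whole chunk rows (chunk size times the number of data disks) and converts the result to a per-disk range.

```c
#include <assert.h>
#include <stdint.h>

/* Per-disk range covered by the bitmap, in bytes. */
struct range {
    uint64_t start;
    uint64_t len;
};

/*
 * Map an array-level IO range to a chunk-aligned per-disk range.
 * One "row" holds chunk * data_disks bytes of array data and covers
 * exactly one chunk on every member disk.
 */
struct range bitmap_range(uint64_t io_start, uint64_t io_len,
                          uint64_t chunk, unsigned int data_disks)
{
    assert(chunk > 0 && data_disks > 0);

    uint64_t row = chunk * data_disks;
    uint64_t first = io_start / row;                       /* first row touched */
    uint64_t end = (io_start + io_len + row - 1) / row;    /* one past the last row */
    struct range r = { first * chunk, (end - first) * chunk };

    return r;
}
```

For the 4-disk raid5 above (3 data disks, 8k chunks), an IO of (0 + 48k) spans two rows and maps to (0 + 16k), and an IO of (0 + 4k) maps to (0 + 8k), matching the figures in the text.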
Also remove STRIPE_BITMAP_PENDING since it's not used anymore.
[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-73514
commit 29967332ced51a15a22f11381eeebbc500ba1858
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Oct 31 11:31:10 2024 +0800
md: don't record new badblocks for faulty rdev
Faulty will be checked before issuing IO to the rdev; however, an rdev can
become faulty at any time, hence it's possible that rdev_set_badblocks()
will be called for a faulty rdev. In this case, mddev->sb_flags will be
set and some other paths can be blocked by updating the super block.
Since a faulty rdev will not be accessed anymore, there is no need to
record new badblocks for a faulty rdev and force a super block update.
Note that this is not a bugfix; it just prevents updating the superblock in
some corner cases, and will help with a bug related to external
metadata[1]. Testing also shows that devices are removed faster in the
case of an IO error.
[1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-73514
commit 50e8274855e7ab5499ff8296e09802874a3f03b1
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Oct 31 11:31:09 2024 +0800
md: don't wait faulty rdev in md_wait_for_blocked_rdev()
md_wait_for_blocked_rdev() is called for write IO while the rdev is
blocked. However, the rdev can become faulty after it is chosen for the
write, and a faulty rdev should never be accessed anymore, hence there is
no point in waiting for a faulty rdev to be unblocked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-73514
commit 62ce0782bbacd32ec10292b9bdd127330e9b6968
Author: Li Nan <linan122@huawei.com>
Date: Thu Sep 19 14:30:48 2024 +0800
md: ensure child flush IO does not affect origin bio->bi_status
When a flush is issued to a RAID array, a child flush IO is created and
issued for each member disk in the RAID array. Since commit b75197e86e6d
("md: Remove flush handling"), each child flush IO has been chained with
the original bio. As a result, the failure of any child IO could modify
the bi_status of the original bio, potentially impacting the upper-layer
filesystem.
Fix the issue by preventing child flush IO from altering the original
bio->bi_status, as before. However, this design introduces a known
issue: in the event of a power failure, if a flush IO on a member
disk fails, the upper layers may not be informed. This issue is not easy
to fix and will not be addressed for the time being.
Fixes: b75197e86e6d ("md: Remove flush handling")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240919063048.2887579-1-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit d981ed8419303ed12351eea8541ad6cb76455fe3
Author: Xiao Ni <xni@redhat.com>
Date: Thu Sep 5 07:54:53 2024 +0800
md: Add new_level sysfs interface
Reshape currently works in two ways: with or without a backup file.
Without a backup file, reshape needs to change the data offset.
It doesn't need the systemd service mdadm-grow-continue, so it can finish
the reshape job within one process. That process knows the new level
from the mdadm --grow command and can change to the new level after reshape
finishes.
With a backup file, reshape needs the systemd service
mdadm-grow-continue to monitor reshape progress, so two processes are
involved. One is the mdadm --grow command, which kicks off reshape and wakes
up the mdadm-grow-continue service. The second process is the service, which
doesn't know the new level from the first process.
In kernel space, mddev->new_level is used to record the new level when
doing reshape. This patch adds a new interface to help mdadm update
new_level and sync it to metadata. Then mdadm-grow-continue can read the
right new_level.
Commit log revised by Song Liu. Please refer to the link for more details.
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 2d2b3bc145b9d5b5c6f07d22291723ddb024ca76
Author: Mateusz Kusiak <mateusz.kusiak@intel.com>
Date: Tue Sep 3 16:29:49 2024 +0200
md: Report failed arrays as broken in mdstat
Depending on whether an array has a personality, it is reported as either
active or inactive. This patch adds a third status, "broken", for arrays
with a personality that have become inoperative. The reason is that end
users tend to assume that "active" indicates the array is operational.
Add the "broken" state for inoperative arrays with a personality and
refactor the code.
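The three-way decision described above can be written as a small sketch. The function names and boolean parameters are hypothetical; the kernel derives both conditions from the mddev itself, and the output format here only imitates the start of an mdstat line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Pick the state string: no personality wins over everything else. */
const char *array_state(bool has_personality, bool broken)
{
    if (!has_personality)
        return "inactive";
    return broken ? "broken" : "active";
}

/* Format the leading "name : state" part of a status line. */
int format_status(char *buf, size_t len, const char *name,
                  bool has_personality, bool broken)
{
    assert(buf && name && len > 0);
    return snprintf(buf, len, "%s : %s", name,
                    array_state(has_personality, broken));
}
```

An array without a personality stays "inactive" even if it is also marked broken, so the new state only changes what used to be reported as "active".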
Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com>
Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit b75197e86e6d3de4e611869ef30a27cf414a5f77
Author: Yu Kuai <yukuai3@huawei.com>
Date: Tue Aug 27 19:06:16 2024 +0800
md: Remove flush handling
For flush requests, md has special flush handling to merge concurrent
flush requests into a single one; however, the whole mechanism is based on
a disk level spin_lock 'mddev->lock'. fsync can be called quite
often in some use cases, and as a consequence, taking a spin lock in the
IO fast path can cause performance degradation.
Fortunately, the block layer already has flush handling to merge
concurrent flush requests, and it only acquires an hctx level spin lock (see
details in blk-flush.c).
This patch removes the flush handling in md, and converts to using the
general block layer flush handling in underlying disks.
Flush test for 4 nvme raid10:
start 128 threads to do fsync 100000 times, on arm64, see how long it
takes.
Test script:
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Assumed from the description above: 128 threads, 100000 fsync calls each. */
#define THREADS 128
#define FSYNC_COUNT 100000

/* Each thread issues FSYNC_COUNT fsync calls on the shared descriptor. */
void* thread_func(void* arg) {
    int fd = *(int*)arg;
    for (int i = 0; i < FSYNC_COUNT; i++) {
        fsync(fd);
    }
    return NULL;
}

int main() {
    int fd = open("/dev/md0", O_RDWR);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    pthread_t threads[THREADS];
    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < THREADS; i++) {
        pthread_create(&threads[i], NULL, thread_func, &fd);
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    gettimeofday(&end, NULL);
    close(fd);
    long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
    printf("Elapsed time: %lld microseconds\n", elapsed);
    return 0;
}
Test result: about 10 times faster:
Before this patch: 50943374 microseconds
After this patch: 5096347 microseconds
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit b75197e86e6d3de4e611869ef30a27cf414a5f77)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 49f5f5e309e6127957babed7834f5a0e1022f936
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:50 2024 +0800
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-41-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 18db2a9c60aefc61e796f6a384a952999d3b8885
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:43 2024 +0800
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-34-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 3c9883e77a36ca76b8d92afa99599263ca587ae7
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:42 2024 +0800
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-33-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 48eb95810a9241afd871de917d70712e2ddfda31
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:41 2024 +0800
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
Add a parameter 'bool sync' to distinguish them, and
md_bitmap_unplug_async() won't be exported anymore, hence
bitmap_operations only needs one op to cover them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 2d3b130e177f14b461c47880b6e0b338fd6872f5
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:32 2024 +0800
md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid accessing
bitmap outside md-bitmap.c as much as possible.
And while we're here, also fix coding style for bitmap_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit b26313cb96f1b3fd6f07d3243f6cd426c5cbaf39
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:31 2024 +0800
md/md-bitmap: merge bitmap_write_all() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid accessing
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 696936838bc18a761ed778910975d51cf2c35e3a
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:29 2024 +0800
md/md-bitmap: merge md_bitmap_status() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit fe59b34676b4ec6b48a7b436d3422fc9317e047a
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:28 2024 +0800
md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit ca925302e841ff0a0598b283f87c472d92b389f3
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:26 2024 +0800
md/md-bitmap: merge md_bitmap_flush() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit a2bd70319290d80127dc4257b8c17df3f027c15d
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:25 2024 +0800
md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit e1e490805958617327be14eaf0ed31d71adc2c54
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:24 2024 +0800
md/md-bitmap: merge md_bitmap_load() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 04c80e649512f2c24f99052440cc808163eff40c
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:23 2024 +0800
md/md-bitmap: merge md_bitmap_create() into bitmap_operations
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 7545d385ec7e4c0d5e86e7cde4fe3fb8f4555fb9
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:22 2024 +0800
md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()
Other than internal api get_bitmap_from_slot(), all other places will
set returned bitmap to mddev->bitmap. So move the setting of
mddev->bitmap into md_bitmap_create() to simplify code.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 7add9db6ba3e9bd12d2be97abbc13f3881a515db
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:21 2024 +0800
md/md-bitmap: introduce struct bitmap_operations
The structure is empty for now, and will be used in later patches to
merge in bitmap operations, so that bitmap implementation won't be
exposed.
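To illustrate why an operations table hides the implementation, here is a minimal sketch. The op names follow entries elsewhere in this log (create, unplug with a 'bool sync' parameter, destroy), but the signatures, the toy_bitmap type and the counters are simplified assumptions, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mddev;

/* Callers only see this table, never the bitmap layout. */
struct bitmap_operations {
    int  (*create)(struct mddev *mddev);
    void (*unplug)(struct mddev *mddev, bool sync);
    void (*destroy)(struct mddev *mddev);
};

struct mddev {
    void *bitmap;                               /* opaque to callers */
    const struct bitmap_operations *bitmap_ops;
};

/* Everything below is private to one (toy) bitmap implementation. */
struct toy_bitmap {
    int sync_unplugs;
    int async_unplugs;
};

static struct toy_bitmap toy_storage;

static int toy_create(struct mddev *mddev)
{
    assert(mddev);
    toy_storage = (struct toy_bitmap){ 0, 0 };
    mddev->bitmap = &toy_storage;
    return 0;
}

/* One op covers both variants; 'sync' selects the behaviour. */
static void toy_unplug(struct mddev *mddev, bool sync)
{
    struct toy_bitmap *b = mddev->bitmap;

    if (!b)
        return;
    if (sync)
        b->sync_unplugs++;
    else
        b->async_unplugs++;
}

static void toy_destroy(struct mddev *mddev)
{
    mddev->bitmap = NULL;
}

const struct bitmap_operations toy_bitmap_ops = {
    .create  = toy_create,
    .unplug  = toy_unplug,
    .destroy = toy_destroy,
};
```

Swapping in a different bitmap then means pointing bitmap_ops at another table; code that calls mddev->bitmap_ops->unplug(mddev, sync) does not change.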
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 7add9db6ba3e9bd12d2be97abbc13f3881a515db)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 10bc2ac10597ebc0b25afbc72fa4284565548e36
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:17 2024 +0800
md/md-bitmap: add 'file_pages' into struct md_bitmap_stats
There are no functional changes, avoid dereferencing bitmap directly to
prepare for inventing a new bitmap.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit d004442f46ccae9ea90fdda7a2b0516f1d42b88e
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:14 2024 +0800
md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats
Also add a new helper to get events_cleared to avoid dereferencing
bitmap directly, to prepare for inventing a new bitmap.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 968153812215d68c27c0c9d90da6ec2f6d17a606
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:13 2024 +0800
md: use new helper md_bitmap_get_stats() in update_array_info()
There are no functional changes, avoid dereferencing bitmap directly to
prepare for inventing a new bitmap.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 38f287d7e495ae00d4481702f44ff7ca79f5c9bc
Author: Yu Kuai <yukuai3@huawei.com>
Date: Mon Aug 26 15:44:12 2024 +0800
md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()
There are no functional changes, and the new helper will be used in
multiple places in following patches to avoid dereferencing bitmap
directly.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-61196
commit 86ad4cda79e0dade87d4bb0d32e1fe541d4a63e8
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Aug 1 20:47:46 2024 +0800
md: Don't flush sync_work in md_write_start()
Flushing sync_work may trigger mddev_suspend() if there are spares,
and this should never be done in the IO path because mddev_suspend() is used
to wait for IO.
This problem was found by code review.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit 573d5abf3df00c879fbd25774e4cf3e22c9cabd0
Author: Christoph Hellwig <hch@lst.de>
Date: Wed Jun 26 16:26:22 2024 +0200
md: set md-specific flags for all queue limits
The md driver wants to enforce a number of flags for all devices, even
when not inheriting them from the underlying devices. To make sure these
flags survive the queue_limits_set calls that md uses to update the
queue limits without deriving them from the previous limits, add a new
md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
these flags.
Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit f76af42f8bf13d2620084f305f01691de9238fc7
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jun 17 08:04:46 2024 +0200
block: move the nowait flag to queue_limits
Move the nowait flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.
Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the feature is not
supported by any of the underlying devices.
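A minimal user-space sketch of that stacking rule, assuming a single feature bit: struct limits_model, FEAT_NOWAIT_MODEL and stack_limits_model() are hypothetical names, not the kernel's.

```c
#include <stdint.h>

#define FEAT_NOWAIT_MODEL (1u << 0) /* hypothetical stand-in bit */

/* Hypothetical, reduced model of struct queue_limits. */
struct limits_model {
	uint32_t features;
};

/*
 * Models the stacking step for one underlying device: the stacking driver
 * sets NOWAIT up front, and the bit is cleared as soon as one bottom
 * device does not support it.
 */
void stack_limits_model(struct limits_model *top,
			const struct limits_model *bottom)
{
	if (!(bottom->features & FEAT_NOWAIT_MODEL))
		top->features &= ~FEAT_NOWAIT_MODEL;
}
```

With this shape the stacking driver no longer has to inspect every member device itself before deciding whether to set the flag.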
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit cdb2497918cc2929691408bac87b58433b45b6d3
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jun 17 08:04:43 2024 +0200
block: move the io_stat flag setting to queue_limits
Move the io_stat flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.
Simplify md and dm to set the flag unconditionally instead of avoiding
setting a simple flag for cases where it already is set by other means,
which is a bit pointless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
Conflicts: drop change on ublk & bcache
commit bd4a633b6f7c3c6b6ebc1a07317643270e751a94
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jun 17 08:04:41 2024 +0200
block: move the nonrot flag to queue_limits
Move the nonrot flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.
Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.
For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change. There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).
The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.
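The polarity change and the inheritance can be modelled in user space as follows. This is a sketch under assumptions: struct limits_model, FEAT_ROTATIONAL_MODEL and the function names are hypothetical, and only the rotational bit is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEAT_ROTATIONAL_MODEL (1u << 0) /* hypothetical stand-in bit */

/* Hypothetical, reduced model of struct queue_limits. */
struct limits_model {
	uint32_t features;
};

/* Zero-initialised limits now mean "non-rotational"; drivers opt in. */
bool is_rotational_model(const struct limits_model *lim)
{
	return (lim->features & FEAT_ROTATIONAL_MODEL) != 0;
}

/*
 * Models inheritance while stacking: the top device becomes rotational
 * as soon as one underlying device is rotational.
 */
void stack_limits_model(struct limits_model *top,
			const struct limits_model *bottom)
{
	top->features |= bottom->features & FEAT_ROTATIONAL_MODEL;
}
```

Unlike a capability that every member must support, this bit is sticky: one rotational member is enough to mark the whole stacked device rotational.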
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
Conflicts: drop change on bcache & ublk
commit 1122c0c1cc71f740fa4d5f14f239194e06a1d5e7
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jun 17 08:04:40 2024 +0200
block: move cache control settings out of queue->flags
Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.
Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer. Note that we'll
eventually remove enough fields from queue_limits to bring it back to the
previous size.
The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.
The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0. The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.
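Why the inverted disable flag survives a rescan can be seen in a small user-space model. All names here are hypothetical stand-ins, and the rescan is reduced to the driver overwriting the features field.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEAT_WRITE_CACHE_MODEL          (1u << 0) /* driver-set capability */
#define FLAG_WRITE_CACHE_DISABLED_MODEL (1u << 0) /* user-set override */

/* Hypothetical, reduced model of struct queue_limits. */
struct limits_model {
	uint32_t features; /* owned by the driver, rewritten on rescan */
	uint32_t flags;    /* owned by the block layer, e.g. set via sysfs */
};

/* The cache is used only if the driver reports it and the user did not disable it. */
bool write_cache_enabled_model(const struct limits_model *lim)
{
	return (lim->features & FEAT_WRITE_CACHE_MODEL) &&
	       !(lim->flags & FLAG_WRITE_CACHE_DISABLED_MODEL);
}

/* Models a rescan: the driver re-reports its features, flags are untouched. */
void rescan_model(struct limits_model *lim, uint32_t driver_features)
{
	lim->features = driver_features;
}
```

Because the user's choice is stored as a separate "disabled" bit rather than by clearing the driver's capability bit, the driver re-reporting the capability does not silently re-enable the cache.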
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit c6e56cf6b2e79a463af21286ba951714ed20828c
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Jun 13 10:48:22 2024 +0200
block: move integrity information into queue_limits
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows providing a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.
Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired. However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED(). Given the tiny size of it, that seems like
a worthwhile trade off.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit 396799eb5b6f87ec2d759e1a90e179f7058ab9e6
Author: Christoph Hellwig <hch@lst.de>
Date: Sun Mar 3 07:01:49 2024 -0700
md: remove mddev->queue
Just use the request_queue from the gendisk pointer in the relatively
few places that still need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit 81a16e19d545fd244ad176f7222d92b67215a33b
Author: Christoph Hellwig <hch@lst.de>
Date: Sun Mar 3 07:01:48 2024 -0700
md: don't initialize queue limits
Initial queue limits are now set from ->run. Remove the superfluous
initialization in md_alloc and level_store.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-10-hch@lst.de
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit e305fce1883128a9468efe1876a057df48a261d6
Author: Christoph Hellwig <hch@lst.de>
Date: Sun Mar 3 07:01:43 2024 -0700
md: add queue limit helpers
Add a few helpers that wrap the block queue limits API for use in MD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-56837
commit 74fa8f9c553f7b5ccab7d103acae63cc2e080465
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Feb 15 08:10:47 2024 +0100
block: pass a queue_limits argument to blk_alloc_disk
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.
Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.
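The ERR_PTR convention can be illustrated with a user-space model. This is a sketch, not the kernel implementation: the *_model helpers mimic ERR_PTR(), PTR_ERR() and IS_ERR(), while struct disk_model, alloc_disk_model(), the zero-check on max_sectors and the default of 255 are assumptions made for the example.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO_MODEL 4095

/* User-space imitations of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *err_ptr_model(long err)
{
	return (void *)(intptr_t)err;
}

static inline long ptr_err_model(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int is_err_model(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_MODEL;
}

/* Hypothetical, reduced models of queue_limits and gendisk. */
struct limits_model { unsigned int max_sectors; };
struct disk_model { unsigned int max_sectors; };

/*
 * Models an allocator that takes optional limits: NULL means defaults,
 * invalid limits and allocation failure yield distinct error codes.
 */
struct disk_model *alloc_disk_model(const struct limits_model *lim)
{
	struct disk_model *disk;

	if (lim && lim->max_sectors == 0)
		return err_ptr_model(-EINVAL); /* assumed validation rule */

	disk = malloc(sizeof(*disk));
	if (!disk)
		return err_ptr_model(-ENOMEM);

	disk->max_sectors = lim ? lim->max_sectors : 255; /* assumed default */
	return disk;
}
```

A caller then tests the result with is_err_model() instead of comparing against NULL, and can propagate ptr_err_model() so that invalid limits and allocation failure are reported differently.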
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-46615
commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1
Author: Heming Zhao <heming.zhao@suse.com>
Date: Tue Jul 9 18:41:20 2024 +0800
md-cluster: fix no recovery job when adding/re-adding a disk
The commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a
clustered environment, this will cause another node to first
activate the spare disk and skip recovery. As a result, no
nodes will perform recovery when a disk is added or re-added.
Before db5e653d7c9f:
```
node1                                   node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + md_choose_sync_action                process_metadata_update
 |  remove_and_add_spares                //node1 has not finished adding
 + call mddev->sync_work                 //the spare disk:do nothing
md_start_sync
 starts md_do_sync
md_do_sync
 + grabbed resync_lockres:DLM_LOCK_EX
 + do syncing job
md_check_recovery
 sendmsg: METADATA_UPDATED
                                        process_metadata_update
                                         //activate spare disk
... ...                                 md_do_sync
                                         waiting to grab resync_lockres:EX
```
After db5e653d7c9f:
(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)
```
node1                                   node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + calls mddev->sync_work               process_metadata_update
                                         //node1 has not finished adding
                                         //the spare disk:do nothing
md_start_sync
 + md_choose_sync_action
 |  remove_and_add_spares
 + calls md_do_sync
md_check_recovery
 md_update_sb
  sendmsg: METADATA_UPDATED
                                        process_metadata_update
                                         //activate spare disk
... ...                                 ... ...
                                        md_do_sync
                                         + grabbed resync_lockres:EX
                                         + raid1_sync_request skip sync under
                                           conf->fullsync:0
md_do_sync
 1. waiting to grab resync_lockres:EX
 2. when node1 could grab EX lock,
    node1 will skip resync under recovery_offset:MaxSector
```
How to trigger:
```(commands @node1)
# to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"
mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc
=== "cat /proc/mdstat" on both nodes shows no recovery action. ===
```
How to fix:
Because the md layer logic is hard to restore in a way that speeds up the
sync job on the local node, we add a new cluster message that makes the
other node hold off on activating the disk.
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
(cherry picked from commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-46615
commit a1fd37f97808db4fa1bf55da0275790c42521e45
Author: Yu Kuai <yukuai3@huawei.com>
Date: Thu Jun 27 19:23:21 2024 +0800
md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl
Commit 90f5f7ad4f ("md: Wait for md_check_recovery before attempting
device removal.") explained in its commit message that a failed device
must be removed from the personality first by md_check_recovery(),
before it can be removed from the array. That is the reason the commit
added the code to wait for MD_RECOVERY_NEEDED.
However, this is not the case now, because remove_and_add_spares() is
called directly from hot_remove_disk() in the ioctl path, hence a failed
device (marked faulty) can be removed from the personality by the ioctl.
On the other hand, the commit introduced a performance problem: if
MD_RECOVERY_NEEDED is set and the array is not running, the ioctl will
wait for 5s before it can return failure to the user.
Since the waiting is not needed now, fix the problem by removing the
waiting.
Fixes: 90f5f7ad4f ("md: Wait for md_check_recovery before attempting device removal.")
Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com>
Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
(cherry picked from commit a1fd37f97808db4fa1bf55da0275790c42521e45)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>