Commit Graph

135 Commits

Author SHA1 Message Date
Nigel Croxon b637f08a77 md: switch md-cluster to use md_submodle_head
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit 87a86277c9f54953e184318bf71630388aeaf000
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:25 2025 +0800

    md: switch md-cluster to use md_submodle_head

    To make code cleaner, and prepare to add kconfig for bitmap.

    Also remove the unsed global variables pers_lock, md_cluster_ops and
    md_cluster_mod, and exported symbols register_md_cluster_operations(),
    unregister_md_cluster_operations() and md_cluster_ops.

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Su Yue <glass.su@suse.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon 1ce5b00614 md/md-cluster: cleanup md_cluster_ops reference
JIRA: https://issues.redhat.com/browse/RHEL-83988

commit ff84e1b1d215d08651d3adee61d8b834c74ff223
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat Feb 15 17:22:23 2025 +0800

    md/md-cluster: cleanup md_cluster_ops reference

    md_cluster_ops->slot_number() is implemented inside md-cluster.c, just
    call it directly.

    Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-6-yukuai1@huaweicloud.com
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: Su Yue <glass.su@suse.com>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2025-03-21 14:55:02 -04:00
Nigel Croxon f4e0a58d95 md/md-bitmap: make in memory structure internal
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 59fdd43304f4cbb7ee8e281ca337afd803c309f6
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:52 2024 +0800

md/md-bitmap: make in memory structure internal

Now that struct bitmap_page and bitmap is not used externally anymore,
move them from md-bitmap.h to md-bitmap.c (expect that dm-raid is still
using define marco 'COUNTER_MAX').

Also fix some checkpatch warnings.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 59fdd43304f4cbb7ee8e281ca337afd803c309f6)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:16:05 -04:00
Nigel Croxon 15c2ef31c0 md/md-bitmap: merge md_bitmap_free() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit c65c20dc504b08204d6c3effffa9409a526822aa
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:49 2024 +0800

    md/md-bitmap: merge md_bitmap_free() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    o invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-40-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:14:50 -04:00
Nigel Croxon ba8f6a7d56 md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit ef1c400fafe27f06ba1e8050fc2f35662fe8e106
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:48 2024 +0800

    md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations

    o that the implementation won't be exposed, and it'll be possible
    o invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-39-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:14:48 -04:00
Nigel Croxon fa5c4084ff md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 3dd9141a154785f10770fb3e8349f6b133b08b8a
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:47 2024 +0800

    md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-38-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:14:45 -04:00
Nigel Croxon 40af80d24b md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 57d602414d2e327160f2e03c39eff3f2f895839c
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:46 2024 +0800

    md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-37-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:14:42 -04:00
Nigel Croxon 55c5b015d1 md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 77c09640eea56dbfed069ac67b1cd79397d41be8
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:45 2024 +0800

    md/md-bitmap: merge md_bitmap_resize() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-36-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:38 -04:00
Nigel Croxon 1210b6c5b5 md/md-bitmap: pass in mddev directly for md_bitmap_resize()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit e1791dae6cbd65e5102dca40b8adadef1d89c1b9
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:44 2024 +0800

    md/md-bitmap: pass in mddev directly for md_bitmap_resize()

    And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as
    well, on the one hand make code cleaner, on the other hand try not to
    access bitmap directly.

    Since we are here, also change the parameter 'init' from int to bool.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:38 -04:00
Nigel Croxon 8f9422e101 md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 4338b94271dd8cee8ce3b662b12528cd009325a3
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:40 2024 +0800

    md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-31-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:13:37 -04:00
Nigel Croxon f9231aa350 md/md-bitmap: merge md_bitmap_status() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 696936838bc18a761ed778910975d51cf2c35e3a
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:29 2024 +0800

    md/md-bitmap: merge md_bitmap_status() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:53 -04:00
Nigel Croxon dc11d78127 md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit fe59b34676b4ec6b48a7b436d3422fc9317e047a
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:28 2024 +0800

    md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations

    So that the implementation won't be exposed, and it'll be possible
    to invent a new bitmap by replacing bitmap_operations.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 15:12:53 -04:00
Nigel Croxon 2104e3e9eb md/md-bitmap: add a new helper md_bitmap_set_pages()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 27832ad3f7f0f5080d472fa8621ff92166ca9fac
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:20 2024 +0800

    md/md-bitmap: add a new helper md_bitmap_set_pages()

    Currently md-cluster will set bitmap->counts.pages directly, add a
    helper to do this to avoid dereferencing bitmap directly.

    Noted that after this patch bitmap is not dereferenced directly anymore
    and following patches will move the structure inside md-bitmap.c.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-11-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:50:11 -04:00
Nigel Croxon 3e20e4415b md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps()
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 9e4481ce0e55b4ef9795845d8b6770e3f6f4b24d
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:19 2024 +0800

    md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps()

    Use the existed helper instead of open coding it, avoid dereferencing
    bitmap directly to prepare inventing a new bitmap.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-10-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:50:08 -04:00
Nigel Croxon f526a35a99 md/md-bitmap: add 'sync_size' into struct md_bitmap_stats
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit ec6bb299c7c3dd4ca1724d13d5f5fae3ee54fc65
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:16 2024 +0800

    md/md-bitmap: add 'sync_size' into struct md_bitmap_stats

    To avoid dereferencing bitmap directly in md-cluster to prepare
    inventing a new bitmap.

    BTW, also fix following checkpatch warnings:

    WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
    WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:57 -04:00
Nigel Croxon bf8f362248 md/md-cluster: fix spares warnings for __le64
JIRA: https://issues.redhat.com/browse/RHEL-61196

commit 82697ccf7e495c1ba81e315c2886d6220ff84c2c
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Mon Aug 26 15:44:15 2024 +0800

    md/md-cluster: fix spares warnings for __le64

    drivers/md/md-cluster.c:1220:22: warning: incorrect type in assignment (different base types)
    drivers/md/md-cluster.c:1220:22:    expected unsigned long my_sync_size
    drivers/md/md-cluster.c:1220:22:    got restricted __le64 [usertype] sync_size
    drivers/md/md-cluster.c:1252:35: warning: incorrect type in assignment (different base types)
    drivers/md/md-cluster.c:1252:35:    expected unsigned long sync_size
    drivers/md/md-cluster.c:1252:35:    got restricted __le64 [usertype] sync_size
    drivers/md/md-cluster.c:1253:41: warning: restricted __le64 degrades to integer

    Fix the warnings by using le64_to_cpu() to convet __le64 to integer.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20240826074452.1490072-6-yukuai1@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-10-01 14:49:47 -04:00
Nigel Croxon 37a7a4be7a md-cluster: fix no recovery job when adding/re-adding a disk
JIRA: https://issues.redhat.com/browse/RHEL-46615

commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1
Author: Heming Zhao <heming.zhao@suse.com>
Date:   Tue Jul 9 18:41:20 2024 +0800

md-cluster: fix no recovery job when adding/re-adding a disk

The commit db5e653d7c9f ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a
clustered environment, this will cause another node to first
activate the spare disk and skip recovery. As a result, no
nodes will perform recovery when a disk is added or re-added.

Before db5e653d7c9f:

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + md_choose_sync_action           process_metadata_update
 |  remove_and_add_spares           //node1 has not finished adding
 + call mddev->sync_work            //the spare disk:do nothing

md_start_sync
 starts md_do_sync

md_do_sync
 + grabbed resync_lockres:DLM_LOCK_EX
 + do syncing job

md_check_recovery
 sendmsg: METADATA_UPDATED
                                 process_metadata_update
                                   //activate spare disk

                                 ... ...

                                 md_do_sync
                                  waiting to grab resync_lockres:EX
```

After db5e653d7c9f:

(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)

```
   node1                                node2
----------------------------------------------------------------
md_check_recovery
 + md_update_sb
 |  sendmsg: METADATA_UPDATED
 + calls mddev->sync_work         process_metadata_update
                                   //node1 has not finished adding
                                   //the spare disk:do nothing

md_start_sync
 + md_choose_sync_action
 |  remove_and_add_spares
 + calls md_do_sync

md_check_recovery
 md_update_sb
  sendmsg: METADATA_UPDATED
                                  process_metadata_update
                                    //activate spare disk

  ... ...                         ... ...

                                  md_do_sync
                                   + grabbed resync_lockres:EX
                                   + raid1_sync_request skip sync under
				     conf->fullsync:0
md_do_sync
 1. waiting to grab resync_lockres:EX
 2. when node1 could grab EX lock,
    node1 will skip resync under recovery_offset:MaxSector
```

How to trigger:

```(commands @node1)
 # to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"

mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc

=== "cat /proc/mdstat" on both node, there are no recovery action. ===
```

How to fix:

because md layer code logic is hard to restore for speeding up sync job
on local node, we add new cluster msg to pending the another node to
active disk.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
(cherry picked from commit 35a0a409fa269c287c4378f1aefe84ae8b5211a1)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-08-28 13:32:32 -04:00
Nigel Croxon 0b0f2b7fd9 md-cluster: fix hanging issue while a new disk adding
JIRA: https://issues.redhat.com/browse/RHEL-46615

commit fff42f213824fa434a4b6cf906b4331fe6e9302b
Author: Heming Zhao <heming.zhao@suse.com>
Date:   Tue Jul 9 18:41:19 2024 +0800

md-cluster: fix hanging issue while a new disk adding

The commit 1bbe254e4336 ("md-cluster: check for timeout while a
new disk adding") is correct in terms of code syntax but not
suite real clustered code logic.

When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node will be waiting
to grab EX lock. This will cause the cluster to hang indefinitely.

How to fix:

1. In dlm_lock_sync(), change the wait behaviour from forever to a
   timeout, This could avoid the hanging issue when another node
   fails to handle cluster msg. Another result of this change is
   that if another node receives an unknown msg (e.g. a new msg_type),
   the old code will hang, whereas the new code will timeout and fail.
   This could help cluster_md handle new msg_type from different
   nodes with different kernel/module versions (e.g. The user only
   updates one leg's kernel and monitors the stability of the new
   kernel).
2. The old code for __sendmsg() always returns 0 (success) under the
   design (must successfully unlock ->message_lockres). This commit
   makes this function return an error number when an error occurs.

Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
(cherry picked from commit fff42f213824fa434a4b6cf906b4331fe6e9302b)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-08-28 13:32:32 -04:00
Nigel Croxon 3704237e77 md-cluster: Constify struct md_cluster_operations
JIRA: https://issues.redhat.com/browse/RHEL-46615

commit 1f4a72ff00cafa74b43b0c8a37573c78f86ed1a8
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date:   Sun Jun 23 22:17:54 2024 +0200

md-cluster: Constify struct md_cluster_operations

'struct md_cluster_operations' is not modified in this driver.

Constifying this structure moves some data to a read-only section, so
increase overall security.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  51941	   1442	     80	  53463	   d0d7	drivers/md/md-cluster.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  52133	   1246	     80	  53459	   d0d3	drivers/md/md-cluster.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
(cherry picked from commit 1f4a72ff00cafa74b43b0c8a37573c78f86ed1a8)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-08-28 13:32:21 -04:00
Nigel Croxon 4bed3e124b md-cluster: check for timeout while a new disk adding
JIRA: https://issues.redhat.com/browse/RHEL-26279

commit 1bbe254e4336c0944dd4fb6f0b8c9665b81de50f
Author: Denis Plotnikov <den-plotnikov@yandex-team.ru>
Date:   Mon Sep 25 15:59:40 2023 +0300

    md-cluster: check for timeout while a new disk adding

    A new disk adding may end up with timeout and a new disk won't be added.
    Add returning the error in that case.

    Found by Linux Verification Center (linuxtesting.org) with SVACE

    Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2024-03-22 11:19:07 -04:00
Nigel Croxon 23ec5f4cd5 md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread
JIRA: https://issues.redhat.com/browse/RHEL-3359

commit 7eb8ff02c1df279bf7f7f29b866beb655a9eebe9
Author: Li Lingfeng <lilingfeng3@huawei.com>
Date:   Thu Aug 3 15:17:11 2023 +0800

    md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread

    Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
    action_store"") removed the scenario of calling md_unregister_thread()
    without holding mddev->reconfig_mutex, so add a lock holding check before
    acquiring mddev->sync_thread by passing mdev to md_unregister_thread().

    Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
    Reviewed-by: Yu Kuai <yukuai3@huawei.com>
    Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
    Signed-off-by: Song Liu <song@kernel.org>

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2023-09-26 14:40:03 -04:00
Ming Lei 9a0d69b4e0 md: protect md_thread with rcu
JIRA: https://issues.redhat.com/browse/RHEL-1516

commit 4469315439827290923fce4f3f672599cabeb366
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Tue May 23 10:10:17 2023 +0800

    md: protect md_thread with rcu

    Currently, there are many places that md_thread can be accessed without
    protection, following are known scenarios that can cause
    null-ptr-dereference or uaf:

    1) sync_thread that is allocated and started from md_start_sync()
    2) mddev->thread can be accessed directly from timeout_store() and
       md_bitmap_daemon_work()
    3) md_unregister_thread() from action_store().

    Currently, a global spinlock 'pers_lock' is borrowed to protect
    'mddev->thread' in some places, this problem can be fixed likewise,
    however, use a global lock for all the cases is not good.

    Fix this problem by protecting all md_thread with rcu.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-09-18 17:59:23 +08:00
Nigel Croxon 0813da1e3f md: Fix spelling mistake in comments
Bugzilla: https://bugzilla.redhat.com/2113822

commit 9e26728b5fa9c6ffdb3f5612279bd3b2f7ea8c3c
Author: Zhang Jiaming <jiaming@nfschina.com>
Date:   Sat Jul 2 09:54:11 2022 +0800

    md: Fix spelling mistake in comments

    There are 2 spelling mistakes in comments. Fix it.

    Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2022-10-27 14:42:00 -04:00
Nigel Croxon b916f25aa7 md: replace deprecated strlcpy & remove duplicated line
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2105293
Upstream Status: kernel.org Linus tree
Tested: ran QE MD Test suite and has passed the sanity test.

This commit includes two topics:
1> replace deprecated strlcpy

change strlcpy to strscpy for strlcpy is marked as deprecated in
Documentation/process/deprecated.rst

2> remove duplicated strlcpy line

in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the
history:

- commit cf921cc19c ("Add node recovery callbacks") introduced the first
  usage of strlcpy().

- commit b97e92574c ("Use separate bitmaps for each nodes in the cluster")
  introduced the second strlcpy(). this time, the two strlcpy() are same,
   we can remove anyone safely.

- commit d3b178adb3 ("md: Skip cluster setup for dm-raid") added dm-raid
  special handling. And the "nodes" value is the key of this patch. but
  from this patch, strlcpy() which was introduced by b97e92574c
  become necessary.

- commit 3c462c880b ("md: Increment version for clustered bitmaps") used
  clustered major version to only handle in clustered env. this patch
  could look a polishment for clustered code logic.

So cf921cc19c became useless after d3b178adb3, we could remove it
safely.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit 92d9aac92b7cc92c770e736c70c3acae7b803278)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2022-07-13 12:07:14 -04:00
Nigel Croxon 7de5bf94fa md: fix spelling of "its"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2042797
Brew: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43247624
Upstream Status: kernel.org Linus tree
Tested: ran QE MD Test suite and has passed the sanity test.

From: Randy Dunlap <rdunlap@infradead.org>
Date: Sat, 25 Dec 2021 18:24:11 -0800
Subject: md: fix spelling of "its"

Use the possessive "its" instead of the contraction "it's"
in printed messages.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
(cherry picked from commit dd3dc5f416b7247a4b5d7bac6698be623c180572)
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
2022-02-23 10:30:50 -05:00
Linus Torvalds 69f637c335 for-5.11/drivers-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
 NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
 PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
 tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
 Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
 0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
 SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
 E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
 PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
 7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
 brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
 URPeaBaQJw==
 =ayp2
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Nothing major in here:

   - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error message (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplication (Baolin Wang)

   - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

   - Move null_blk into its own directory (Damien Le Moal)

   - null_blk zone and discard improvements (Damien Le Moal)

   - bcache race fix (Dongsheng Yang)

   - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
     Lutz Pogrell, Md Haris Iqbal)

   - lightnvm NULL pointer deref fix (tangzhenhao)

   - sr in_interrupt() removal (Sebastian Andrzej Siewior)

   - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
     Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
     as it made it easier for them to funnel the feature through the
     block driver tree.

   - Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
  block: drop dead assignments in loop_init()
  sr: Remove in_interrupt() usage in sr_init_command().
  sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
  cdrom: Reset sector_size back it is not 2048.
  drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
  null_blk: Move driver into its own directory
  null_blk: Allow controlling max_hw_sectors limit
  null_blk: discard zones on reset
  null_blk: cleanup discard handling
  null_blk: Improve implicit zone close
  null_blk: improve zone locking
  block: Align max_hw_sectors to logical blocksize
  null_blk: Fail zone append to conventional zones
  null_blk: Fix zone size initialization
  bcache: fix race between setting bdev state to none and new write request direct to backing
  block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
  block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
  block/rnbd: call kobject_put in the failure path
  Documentation/ABI/rnbd-srv: add document for force_close
  block/rnbd-srv: close a mapped device from server side.
  ...
2020-12-16 13:09:32 -08:00
Zhao Heming bca5b06580 md/cluster: fix deadlock when node is doing resync job
md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
   This will block another node to send msg

b. B does sync jobs, which will send RESYNCING at intervals.
   This will be block for holding token_lockres:EX lock.

c. B do "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B recv METADATA_UPDATED msg, which send from A in step <a>.
   This will be blocked by step <c>: holding mddev lock, it makes
   wait_event can't hold mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".

For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.

Repro steps (I only triggered 3 times with hundreds tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

test script will hung when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

At last, thanks for Xiao's solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30 10:12:35 -08:00
Christoph Hellwig 94d91e7f8c md: remove a spurious call to revalidate_disk_size in update_size
None of the ->resize methods updates the disk size, so calling
revalidate_disk_size here won't do anything.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:15 -07:00
Christoph Hellwig 2c247c5169 md: use set_capacity_and_notify
Use set_capacity_and_notify to set the size of both the disk and block
device.  This also gets the uevent notifications for the resize for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:15 -07:00
Zhao Heming 1383b347a8 md/bitmap: fix memory leak of temporary bitmap
Callers of get_bitmap_from_slot() are responsible to free the bitmap.

Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:37:39 -07:00
Christoph Hellwig 659e56ba86 block: add a new revalidate_disk_size helper
revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:07 -06:00
Dan Carpenter e8abe1de43 md-cluster: Fix potential error pointer dereference in resize_bitmaps()
The error handling calls md_bitmap_free(bitmap) which checks for NULL
but will Oops if we pass an error pointer.  Let's set "bitmap" to NULL
on this error path.

Fixes: afd7562860 ("md-cluster/raid10: resize all the bitmaps before start reshape")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-05 17:07:48 -07:00
Zhao Heming 60f80d6f2d md-cluster: fix wild pointer of unlock_all_bitmaps()
reproduction steps:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb
node1 # mdadm -G /dev/md0 -b none
mdadm: failed to remove clustered bitmap.
node1 # mdadm -S --scan
^C  <==== mdadm hung & kernel crash
```

kernel stack:
```
[  335.230657] general protection fault: 0000 [#1] SMP NOPTI
[...]
[  335.230848] Call Trace:
[  335.230873]  ? unlock_all_bitmaps+0x5/0x70 [md_cluster]
[  335.230886]  unlock_all_bitmaps+0x3d/0x70 [md_cluster]
[  335.230899]  leave+0x10f/0x190 [md_cluster]
[  335.230932]  ? md_super_wait+0x93/0xa0 [md_mod]
[  335.230947]  ? leave+0x5/0x190 [md_cluster]
[  335.230973]  md_cluster_stop+0x1a/0x30 [md_mod]
[  335.230999]  md_bitmap_free+0x142/0x150 [md_mod]
[  335.231013]  ? _cond_resched+0x15/0x40
[  335.231025]  ? mutex_lock+0xe/0x30
[  335.231056]  __md_stop+0x1c/0xa0 [md_mod]
[  335.231083]  do_md_stop+0x160/0x580 [md_mod]
[  335.231119]  ? 0xffffffffc05fb078
[  335.231148]  md_ioctl+0xa04/0x1930 [md_mod]
[  335.231165]  ? filename_lookup+0xf2/0x190
[  335.231179]  blkdev_ioctl+0x93c/0xa10
[  335.231205]  ? _cond_resched+0x15/0x40
[  335.231214]  ? __check_object_size+0xd4/0x1a0
[  335.231224]  block_ioctl+0x39/0x40
[  335.231243]  do_vfs_ioctl+0xa0/0x680
[  335.231253]  ksys_ioctl+0x70/0x80
[  335.231261]  __x64_sys_ioctl+0x16/0x20
[  335.231271]  do_syscall_64+0x65/0x1f0
[  335.231278]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14 23:38:32 -07:00
Thomas Gleixner 8d7c56d08f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 11 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:12 +02:00
Guoqing Jiang ea89238c0a md-cluster: remove suspend_info
Previously, we allow multiple nodes can resync device, but we
had changed it to only support one node can do resync at one
time, but suspend_info is still used.

Now, let's remove the structure and use suspend_lo/hi to record
the range.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:41:25 -07:00
Guoqing Jiang cb9ee15431 md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
We need to continue the reshaping if it was interrupted in
original node. So original node should call resync_bitmap
in case reshaping is aborted.

Then BITMAP_NEEDS_SYNC message is broadcasted to other nodes,
node which continues the reshaping should restart reshape from
mddev->reshape_position instead of from the first beginning.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:48 -07:00
Guoqing Jiang cbce6863b6 md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
When reshape is happening in one node, other nodes could receive
lots of RESYNCING messages, so md_bitmap_sync_with_cluster is called.

Since the resyncing window is typically small in these RESYNCING
messages, so WARN is always triggered, so we should not call the
func when reshape is happening.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:01 -07:00
Guoqing Jiang 5ebaf80bc8 md-cluster: introduce resync_info_get interface for sanity check
Since the resync region from suspend_info means one node
is reshaping this area, so the position of reshape_progress
should be included in the area.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:36:35 -07:00
Guoqing Jiang afd7562860 md-cluster/raid10: resize all the bitmaps before start reshape
To support add disk under grow mode, we need to resize
all the bitmaps of each node before reshape, so that we
can ensure all nodes have the same view of the bitmap of
the clustered raid.

So after the master node resized the bitmap, it broadcast
a message to other slave nodes, and it checks the size of
each bitmap are same or not by compare pages. We can only
continue the reshaping after all nodes update the bitmap
to the same size (by checking the pages), otherwise revert
bitmap size to previous value.

The resize_bitmaps interface and BITMAP_RESIZE message are
introduced in md-cluster.c for the purpose.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:30:58 -07:00
Guoqing Jiang 41a9504112 md-cluster: release RESYNC lock after the last resync message
All the RESYNC messages are sent with resync lock held, the only
exception is resync_finish which releases resync_lockres before
send the last resync message, this should be changed as well.
Otherwise, we can see deadlock issue as follows:

clustermd2-gqjiang2:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdg[0] sdf[1]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd2-gqjiang2:~ # ps aux|grep md|grep D
root     20497  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
clustermd2-gqjiang2:~ # cat /proc/20497/stack
[<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster]
[<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster]
[<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster]
[<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod]
[<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod]
[<ffffffffc060c882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9a0a5b3f>] kthread+0xff/0x140
[<ffffffff9a800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

clustermd-gqjiang1:~ # ps aux|grep md|grep D
root     20531  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
root     20537  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_cluster_rec]
root     20676  0.0  0.0      0     0 ?        D    16:01   0:00 [md0_resync]
clustermd-gqjiang1:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdf[1] sdg[0]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd-gqjiang1:~ # cat /proc/20531/stack
[<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster]
[<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod]
[<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod]
[<ffffffffc0816882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20537/stack
[<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1]
[<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20676/stack
[<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster]
[<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster]
[<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster]
[<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster]
[<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1]
[<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:10 -07:00
Linus Torvalds 08b5fa8199 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:

 - a new driver for Rohm BU21029 touch controller

 - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free

 - updates to Atmel, eeti. pxrc and iforce drivers

 - assorted driver cleanups and fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
  MAINTAINERS: Add PhoenixRC Flight Controller Adapter
  Input: do not use WARN() in input_alloc_absinfo()
  Input: mark expected switch fall-throughs
  Input: raydium_i2c_ts - use true and false for boolean values
  Input: evdev - switch to bitmap API
  Input: gpio-keys - switch to bitmap_zalloc()
  Input: elan_i2c_smbus - cast sizeof to int for comparison
  bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
  md: Avoid namespace collision with bitmap API
  dm: Avoid namespace collision with bitmap API
  Input: pm8941-pwrkey - add resin entry
  Input: pm8941-pwrkey - abstract register offsets and event code
  Input: iforce - reorganize joystick configuration lists
  Input: atmel_mxt_ts - move completion to after config crc is updated
  Input: atmel_mxt_ts - don't report zero pressure from T9
  Input: atmel_mxt_ts - zero terminate config firmware file
  Input: atmel_mxt_ts - refactor config update code to add context struct
  Input: atmel_mxt_ts - config CRC may start at T71
  Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
  Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
  ...
2018-08-18 16:48:07 -07:00
Andy Shevchenko e64e4018d5 md: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:39 -07:00
Guoqing Jiang df8c676418 md-cluster: don't send msg if array is closing
If we close an array which resync thread is running,
then we don't need the node to send msg since another
node would launch the resync thread to continue the
rest works. Also send a message is time consuming,
we should avoid it.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:02 -07:00
Guoqing Jiang 0357ba27bd md-cluster: show array's status more accurate
When resync or recovery is happening in one node,
other nodes don't show the appropriate info now.

For example, when create an array in master node
without "--assume-clean", then assemble the array
in slave nodes, you can see "resync=PENDING" when
read /proc/mdstat in slave nodes. However, the info
is confusing since "PENDING" status is introduced
for start array in read-only mode.

We introduce RESYNCING_REMOTE flag to indicate that
resync thread is running in remote node. The flags
is set when node receive RESYNCING msg. And we clear
the REMOTE flag in following cases:

1. resync or recover is finished in master node,
   which means slaves receive msg with both lo
   and hi are set to 0.
2. node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by check REMOTE flags and with other conditions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:01 -07:00
Guoqing Jiang 010228e4a9 md-cluster: clear another node's suspend_area after the copy is finished
When one node leaves cluster or stops the resyncing
(resync or recovery) array, then other nodes need to
call recover_bitmaps to continue the unfinished task.

But we need to clear suspend_area later after other
nodes copy the resync information to their bitmap
(by call bitmap_copy_from_slot). Otherwise, all nodes
could write to the suspend_area even the suspend_area
is not handled by any node, because area_resyncing
returns 0 at the beginning of raid1_write_request.
Which means one node could write suspend_area while
another node is resyncing the same area, then data
could be inconsistent.

So let's clear suspend_area later to avoid above issue
with the protection of bm lock. Also it is straightforward
to clear suspend_area after nodes have copied the resync
info to bitmap.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:01 -07:00
Kees Cook 6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Guoqing Jiang f0e230ad87 md-cluster: update document for raid10
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01 21:32:25 -07:00
NeilBrown b03e0ccb5a md: remove special meaning of ->quiesce(.., 2)
The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-11-01 21:32:20 -07:00
Mike Snitzer 935fe0983e md: rename some drivers/md/ files to have an "md-" prefix
Motivated by the desire to illiminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list.  Which is born out of the fact that DM files also
reside in drivers/md/

Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.

Shaohua: don't change module name

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-16 19:06:36 -07:00
Colin Ian King 7a57157aeb md-cluster: make function cluster_check_sync_size static
The function cluster_check_sync_size is local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'cluster_check_sync_size' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-10-16 19:06:35 -07:00