250 Commits

Author SHA1 Message Date
Rafael Aquini de1b8ea5bf writeback: remove redundant checks for root memcg
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 9af7c7426c2e49bad77cf7494fea85a773d1ded6
Author: Jinliang Zheng <alexjlzheng@tencent.com>
Date:   Tue Aug 8 16:44:32 2023 +0800

    writeback: remove redundant checks for root memcg

    The check for root memcg will be done in wb_get_lookup(), so remove the
    redundant one to simplify the code.

    Link: https://lkml.kernel.org/r/20230808084431.1632934-1-alexjlzheng@tencent.com
    Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:21:33 -04:00
Rafael Aquini 5d187dd21a mm: remove redundant K() macro definition
JIRA: https://issues.redhat.com/browse/RHEL-27743

This patch is a backport of the following upstream commit:
commit 61f297380118060a70888e0c1f5c534b74ab78fe
Author: ZhangPeng <zhangpeng362@huawei.com>
Date:   Fri Aug 4 09:25:53 2023 +0800

    mm: remove redundant K() macro definition

    Patch series "cleanup with helper macro K()".

    Use helper macro K() to improve code readability.  No functional
    modification involved.  Remove redundant K() macro definition.

    This patch (of 7):

    Since commit eb8589b4f8c1 ("mm: move mem_init_print_info() to mm_init.c"),
    the K() macro definition has been moved to mm/internal.h.  Therefore, the
    definitions in mm/memcontrol.c, mm/backing-dev.c and mm/oom_kill.c are
    redundant.  Drop redundant definitions.

    [akpm@linux-foundation.org: oom_kill.c: remove "#undef K", per Kefeng]
    Link: https://lkml.kernel.org/r/20230804012559.2617515-1-zhangpeng362@huawei.com
    Link: https://lkml.kernel.org/r/20230804012559.2617515-2-zhangpeng362@huawei.com
    Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Nanyong Sun <sunnanyong@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-01 11:20:53 -04:00
Rafael Aquini b9f4a8b5c9 mm: backing-dev: make bdi_class a static const structure
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit b5665cf936bf3955fec18c09a6aa53c8a57ea8b7
Author: Ivan Orlov <ivan.orlov0322@gmail.com>
Date:   Tue Jun 20 20:33:15 2023 +0200

    mm: backing-dev: make bdi_class a static const structure

    Now that the driver core allows for struct class to be in read-only
    memory, move the bdi_class structure to be declared at build time placing
    it into read-only memory, instead of having to be dynamically allocated at
    load time.

    Link: https://lkml.kernel.org/r/20230620183314.682822-2-gregkh@linuxfoundation.org
    Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:25 -04:00
Rafael Aquini 59ad1c1668 mm: backing-dev: set variables dev_attr_min,max_bytes storage-class-specifier to static
JIRA: https://issues.redhat.com/browse/RHEL-48221

This patch is a backport of the following upstream commit:
commit f6365881bf797c734bf4cf1353bfa59ffd444f8f
Author: Tom Rix <trix@redhat.com>
Date:   Sat Apr 8 10:16:09 2023 -0400

    mm: backing-dev: set variables dev_attr_min,max_bytes storage-class-specifier to static

    smatch reports
    mm/backing-dev.c:266:1: warning: symbol
      'dev_attr_min_bytes' was not declared. Should it be static?
    mm/backing-dev.c:294:1: warning: symbol
      'dev_attr_max_bytes' was not declared. Should it be static?

    These variables are only used in one file so should be static.

    Link: https://lkml.kernel.org/r/20230408141609.2703647-1-trix@redhat.com
    Signed-off-by: Tom Rix <trix@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2024-07-16 09:30:09 -04:00
Nico Pache ed1887764c remove congestion tracking framework
commit a88f2096d5a2d91179db5dd9aa8f60dc3df9bb3e
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 22 14:39:19 2022 -0700

    remove congestion tracking framework

    This framework is no longer used - so discard it.

    Link: https://lkml.kernel.org/r/164549983747.9187.6171768583526866601.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Paolo Valente <paolo.valente@linaro.org>
    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:22 -06:00
Audra Mitchell 4f1f8b83fd mm: add /sys/class/bdi/<bdi>/min_ratio_fine knob
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit ad3e6dabf6f7d9ffd68eb711191ef16cdbdd25f0
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:52:14 2022 -0800

    mm: add /sys/class/bdi/<bdi>/min_ratio_fine knob

    This adds the min_ratio_fine knob. The knob specifies values in parts
    per million rather than in percent (parts per hundred).

    Link: https://lkml.kernel.org/r/20221119005215.3052436-20-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:43:00 -04:00
Audra Mitchell 4e2cb5edb4 mm: add /sys/class/bdi/<bdi>/max_ratio_fine knob
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit bca52dcbadc583f4db6435599c44a79f97293f06
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:52:11 2022 -0800

    mm: add /sys/class/bdi/<bdi>/max_ratio_fine knob

    This adds the max_ratio_fine knob. The knob specifies values in parts
    per million rather than in percent (parts per hundred).

    Link: https://lkml.kernel.org/r/20221119005215.3052436-17-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:59 -04:00
Audra Mitchell 10db014a12 mm: add /sys/class/bdi/<bdi>/min_bytes knob
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 9c84819bd64ec15cb15d041c45ebe4725e9d4f3b
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:52:08 2022 -0800

    mm: add /sys/class/bdi/<bdi>/min_bytes knob

    bdi has two existing knobs to limit the amount of dirty memory:
    min_ratio and max_ratio. However the granularity of the knobs is limited
    and often it is more convenient to specify limits in terms of bytes.
    This change adds the min_bytes knob.

    It does not store the min_bytes value; instead it converts the min_bytes
    value to a ratio. The value is therefore an approximation rather than an
    absolute value.
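
The approximation described above can be illustrated with a small sketch. It is not the kernel's implementation: the parts-per-million scale, the 8 GiB base, the 100 MiB request, and the helper names `bytes_to_ppm` and `ppm_to_bytes` are all assumptions chosen for illustration. It only shows why a byte value that is stored as an integer ratio does not round-trip exactly.

```python
PPM = 1_000_000  # assumed scale: parts per million


def bytes_to_ppm(limit_bytes: int, base_bytes: int) -> int:
    """Convert a byte limit to an integer ppm ratio of base_bytes (truncating)."""
    return limit_bytes * PPM // base_bytes


def ppm_to_bytes(ratio_ppm: int, base_bytes: int) -> int:
    """Convert an integer ppm ratio back to bytes (truncating)."""
    return ratio_ppm * base_bytes // PPM


base = 8 * 1024 ** 3         # hypothetical 8 GiB base the ratio is taken against
requested = 100 * 1024 ** 2  # hypothetical request of 100 MiB

ratio = bytes_to_ppm(requested, base)
effective = ppm_to_bytes(ratio, base)

# The effective limit is slightly below the requested one because the
# fractional part of the ratio is lost in the integer conversion.
print(f"requested={requested} ratio_ppm={ratio} effective={effective}")
```

Under these assumptions the effective limit differs from the request by less than one ppm step of the base, which is the sense in which the stored value is approximate.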

    It also maintains the sum over all the bdi min_ratio values stored in
    the variable bdi_min_ratio.

    Link: https://lkml.kernel.org/r/20221119005215.3052436-14-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:59 -04:00
Audra Mitchell 6714056fe0 mm: add knob /sys/class/bdi/<bdi>/max_bytes
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit c56e049a5e401a177c7c9b39a3bcc973ff5cec0b
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:52:03 2022 -0800

    mm: add knob /sys/class/bdi/<bdi>/max_bytes

    This adds the new knob max_bytes to specify a dirty memory limit for the
    corresponding bdi. The specified bytes value is converted to a ratio.

    Link: https://lkml.kernel.org/r/20221119005215.3052436-9-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:59 -04:00
Audra Mitchell 54d5ecadea mm: use part per 1000000 for bdi ratios
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit ae82291e9ca47c3d6da6b77a00f427754aca413e
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:51:59 2022 -0800

    mm: use part per 1000000 for bdi ratios

    To get finer granularity for ratio calculations use parts per million
    instead of percentages. This is especially important if we want to
    automatically convert byte values to ratios. Otherwise the values that
    are actually used can be quite different. This is also important for
    machines with more main memory (1% of 256GB is already 2.5GB).
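
The granularity argument can be checked with a few lines of arithmetic. This is a sketch that takes the commit's 256GB figure as the base (read here as 256 GiB, which is an assumption) and compares the size of one step of a percent-based knob with one step of a ppm-based knob. The helper name `step_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def step_bytes(total_bytes: int, parts: int) -> float:
    """Size of one step when total_bytes is divided into `parts` equal parts."""
    return total_bytes / parts


total = 256 * GIB
percent_step = step_bytes(total, 100)    # one step of a percent-based knob
ppm_step = step_bytes(total, 1_000_000)  # one step of a ppm-based knob

print(f"1% step:    {percent_step / GIB:.2f} GiB")
print(f"1 ppm step: {ppm_step / 1024:.1f} KiB")
```

With a percent-based knob the smallest adjustment on such a machine is measured in gigabytes, while a ppm-based knob moves in steps of a few hundred kilobytes.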

    Link: https://lkml.kernel.org/r/20221119005215.3052436-5-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:59 -04:00
Audra Mitchell e990aff96c mm: add knob /sys/class/bdi/<bdi>/strict_limit
JIRA: https://issues.redhat.com/browse/RHEL-27739

This patch is a backport of the following upstream commit:
commit 27bbe9d48d4e298864e18b39f091342c68b81637
Author: Stefan Roesch <shr@devkernel.io>
Date:   Fri Nov 18 16:51:57 2022 -0800

    mm: add knob /sys/class/bdi/<bdi>/strict_limit

    Add a new knob, /sys/class/bdi/<bdi>/strict_limit. This new knob
    allows setting or clearing the flag BDI_CAP_STRICTLIMIT in the bdi
    capabilities.

    Link: https://lkml.kernel.org/r/20221119005215.3052436-3-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Cc: Chris Mason <clm@meta.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Audra Mitchell <audra@redhat.com>
2024-04-09 09:42:58 -04:00
Ming Lei 953c1697f5 blk-wbt: Fix detection of dirty-throttled tasks
JIRA: https://issues.redhat.com/browse/RHEL-25988

commit f814bdda774c183b0cc15ec8f3b6e7c6f4527ba5
Author: Jan Kara <jack@suse.cz>
Date:   Tue Jan 23 18:58:26 2024 +0100

    blk-wbt: Fix detection of dirty-throttled tasks

    The detection of dirty-throttled tasks in blk-wbt has been subtly broken
    since its beginning in 2016. Namely if we are doing cgroup writeback and
    the throttled task is not in the root cgroup, balance_dirty_pages() will
    set dirty_sleep for the non-root bdi_writeback structure. However
    blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback
    structure. Thus detection of recently throttled tasks is not working in
    this case (we noticed this when we switched to cgroup v2 and suddenly
    writeback was slow).

    Since blk-wbt has no easy way to get to proper bdi_writeback and
    furthermore its intention has always been to work on the whole device
    rather than on individual cgroups, just move the dirty_sleep timestamp
    from bdi_writeback to backing_dev_info. That fixes the checking for
    recently throttled task and saves memory for everybody as a bonus.

    CC: stable@vger.kernel.org
    Fixes: b57d74aff9 ("writeback: track if we're sleeping on progress in balance_dirty_pages()")
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20240123175826.21452-1-jack@suse.cz
    [axboe: fixup indentation errors]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-03-07 13:20:01 +08:00
Jan Stancek 151f2a0ce1 Merge: update drivers/base to v6.4
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2940

JIRA: https://issues.redhat.com/browse/RHEL-1023

Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>

Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Jocelyn Falempe <jfalempe@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Vladis Dronov <vdronov@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Eric Chanudet <echanude@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-11-19 13:07:01 +01:00
Mark Langsdorf 094c7e2822 driver core: class: remove module * from class_create()
JIRA: https://issues.redhat.com/browse/RHEL-1023
Conflicts:
	drivers/dax/bus.c - had to remove an argument by hand
	drivers/char/tpm/tpm-interface.c - minor context differences

commit 1aaba11da9aa7d7d6b52a74d45b31cac118295a1
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Fri, 17 Mar 2023 15:16:33 +0000

The module pointer in class_create() never actually did anything, and it
shouldn't have been required to be set as a parameter even if it did
something.  So just remove it and fix up all callers of the function in
the kernel tree at the same time.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mark Langsdorf <mlangsdo@redhat.com>
2023-11-01 11:12:29 -05:00
Chris von Recklinghausen 097d3bf090 mm: backing-dev: Remove the unneeded result variable
JIRA: https://issues.redhat.com/browse/RHEL-1848

commit 3083da7bcf56a4922b996ea3551847488a43a8b6
Author: ye xingchen <ye.xingchen@zte.com.cn>
Date:   Fri Aug 26 07:19:06 2022 +0000

    mm: backing-dev: Remove the unneeded result variable

    Return the value of cgwb_bdi_init() directly instead of storing it in
    another redundant variable.

    Link: https://lkml.kernel.org/r/20220826071906.252419-1-ye.xingchen@zte.com.cn
    Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-10-20 06:13:36 -04:00
Nico Pache a39f5fc679 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
commit 1ba1199ec5747f475538c0d25a32804e5ba1dfde
Author: Baokun Li <libaokun1@huawei.com>
Date:   Mon Apr 10 21:08:26 2023 +0800

    writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs

    KASAN reports a null-ptr-deref:
    ==================================================================
    BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
    Write of size 8 at addr 0000000000000000 by task sync/943
    CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
    Call Trace:
     <TASK>
     dump_stack_lvl+0x7f/0xc0
     print_report+0x2ba/0x340
     kasan_report+0xc4/0x120
     kasan_check_range+0x1b7/0x2e0
     __kasan_check_write+0x24/0x40
     bdi_split_work_to_wbs+0x5c5/0x7b0
     sync_inodes_sb+0x195/0x630
     sync_inodes_one_sb+0x3a/0x50
     iterate_supers+0x106/0x1b0
     ksys_sync+0x98/0x160
    [...]
    ==================================================================

    The race that causes the above issue is as follows:

               cpu1                     cpu2
    -------------------------|-------------------------
    inode_switch_wbs
     INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
     queue_rcu_work(isw_wq, &isw->work)
     // queue_work async
      inode_switch_wbs_work_fn
       wb_put_many(old_wb, nr_switched)
        percpu_ref_put_many
         ref->data->release(ref)
         cgwb_release
          queue_work(cgwb_release_wq, &wb->release_work)
          // queue_work async
           &wb->release_work
           cgwb_release_workfn
                                ksys_sync
                                 iterate_supers
                                  sync_inodes_one_sb
                                   sync_inodes_sb
                                    bdi_split_work_to_wbs
                                     kmalloc(sizeof(*work), GFP_ATOMIC)
                                     // alloc memory failed
            percpu_ref_exit
             ref->data = NULL
             kfree(data)
                                     wb_get(wb)
                                      percpu_ref_get(&wb->refcnt)
                                       percpu_ref_get_many(ref, 1)
                                        atomic_long_add(nr, &ref->data->count)
                                         atomic64_add(i, v)
                                         // trigger null-ptr-deref

    bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
    wbs.  If the allocation of new work fails, the on-stack fallback will be
    used and the reference count of the current wb is increased afterwards.
    If cgroup writeback membership switches occur before getting the reference
    count and the current wb is released as old_wb, then calling wb_get() or
    wb_put() will trigger the null pointer dereference above.

    This issue was introduced in v4.3-rc7 (see fix tag1).  Both
    sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
    bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
    sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
    sync(2) against cgroup writeback membership switches") reduced the
    possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
    fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
    inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
    thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
    and the issue becomes easily reproducible again.

    To solve this problem, percpu_ref_exit() is called under RCU protection to
    avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
    Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
    and skip the current wb if wb_tryget() fails because the wb has already
    been shut down.
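
The difference between an unconditional get and a tryget can be sketched with a toy reference counter. This is a simplified model, not the kernel's percpu_ref: the `Ref` class, its `dead` flag, and the `split_work` helper are hypothetical names introduced only to show the skip-on-failure pattern described above.

```python
class Ref:
    """Toy reference counter; not the kernel's percpu_ref."""

    def __init__(self) -> None:
        self.count = 1     # initial reference held by the owner
        self.dead = False  # set once the object has been shut down

    def kill(self) -> None:
        """Mark the object as shut down and drop its references."""
        self.dead = True
        self.count = 0

    def get(self) -> None:
        # Unconditional get: only safe if the caller already knows
        # the object is still alive.
        if self.dead:
            raise RuntimeError("get() on a released object")
        self.count += 1

    def tryget(self) -> bool:
        # Conditional get: refuses instead of touching a released object.
        if self.dead:
            return False
        self.count += 1
        return True


def split_work(wbs: list[Ref]) -> list[Ref]:
    """Return the wbs that could be pinned, skipping ones already shut down."""
    pinned = []
    for wb in wbs:
        if not wb.tryget():
            continue  # this wb is going away; do not issue work to it
        pinned.append(wb)
    return pinned
```

In this model a caller that used `get()` on a released entry would fail, while the `tryget()` loop simply skips it and continues with the remaining entries.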

    Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
    Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
    Signed-off-by: Baokun Li <libaokun1@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Hou Tao <houtao1@huawei.com>
    Cc: yangerkun <yangerkun@huawei.com>
    Cc: Zhang Yi <yi.zhang@huawei.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:04 -06:00
Nico Pache 59af49451c writeback: avoid use-after-free after removing device
commit f87904c075515f3e1d8f4a7115869d3b914674fd
Author: Khazhismel Kumykov <khazhy@chromium.org>
Date:   Mon Aug 1 08:50:34 2022 -0700

    writeback: avoid use-after-free after removing device

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete.  However,
    wb_inode_writeback_end() may schedule bandwidth estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt, switch wb->work_lock to an irqsafe lock.

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Michael Stapelberg <stapelberg+linux@google.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache 95a1fb22a0 init: Initialize noop_backing_dev_info early
commit 4bca7e80b6455772b4bf3f536dcbc19aac424d6a
Author: Jan Kara <jack@suse.cz>
Date:   Wed Jun 15 15:22:29 2022 +0200

    init: Initialize noop_backing_dev_info early

    noop_backing_dev_info is used by superblocks of various
    pseudofilesystems such as kdevtmpfs. After commit 10e14073107d
    ("writeback: Fix inode->i_io_list not be protected by inode->i_lock
    error") this broke because __mark_inode_dirty() started to access more
    fields from noop_backing_dev_info and this led to crashes inside
    locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
    Fix the problem by initializing noop_backing_dev_info before the
    filesystems get mounted.

    Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
    Reported-and-tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
    Reported-and-tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
    Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:38 -07:00
Chris von Recklinghausen 53003b969c mm: bdi: initialize bdi_min_ratio when bdi is unregistered
Bugzilla: https://bugzilla.redhat.com/2120352

commit 3c376dfafbf7a8ea0dea212d095ddd83e93280bb
Author: Manjong Lee <mj0123.lee@samsung.com>
Date:   Fri Dec 10 14:47:11 2021 -0800

    mm: bdi: initialize bdi_min_ratio when bdi is unregistered

    Initialize min_ratio if it is set during bdi unregistration.  This can
    prevent problems that may occur when a bdi is removed without resetting
    min_ratio.

    For example:
    1) insert external sdcard
    2) set external sdcard's min_ratio to 70
    3) remove external sdcard without setting min_ratio to 0
    4) insert external sdcard
    5) set external sdcard's min_ratio to 70 << error occurs (can't set)

    Because when an sdcard is removed, the present bdi_min_ratio value will
    remain.  Currently, the only way to reset bdi_min_ratio is to reboot.
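
The five steps above can be replayed against a toy model of a global min_ratio sum. This is a sketch, not the kernel code: the `BdiRegistry` class, the rule that the sum must stay below 100, and the `replay` helper are assumptions made for illustration.

```python
class BdiRegistry:
    """Toy model of a global sum of per-device min_ratio values (percent)."""

    def __init__(self, reset_on_unregister: bool) -> None:
        self.reset_on_unregister = reset_on_unregister
        self.total_min_ratio = 0           # stands in for the global sum
        self.devices: dict[str, int] = {}  # per-device min_ratio

    def register(self, name: str) -> None:
        self.devices[name] = 0

    def set_min_ratio(self, name: str, ratio: int) -> bool:
        """Apply a new min_ratio; refuse if the global sum would reach 100."""
        delta = ratio - self.devices[name]
        if self.total_min_ratio + delta >= 100:
            return False
        self.total_min_ratio += delta
        self.devices[name] = ratio
        return True

    def unregister(self, name: str) -> None:
        ratio = self.devices.pop(name)
        if self.reset_on_unregister:
            # The fix: give the device's share back to the global sum.
            self.total_min_ratio -= ratio


def replay(reset_on_unregister: bool) -> bool:
    """Replay steps 1-5 and report whether the final set succeeds."""
    reg = BdiRegistry(reset_on_unregister)
    reg.register("sdcard")            # 1) insert
    reg.set_min_ratio("sdcard", 70)   # 2) set min_ratio to 70
    reg.unregister("sdcard")          # 3) remove without resetting
    reg.register("sdcard")            # 4) insert again
    return reg.set_min_ratio("sdcard", 70)  # 5) set min_ratio to 70 again
```

In this model `replay(False)` fails at step 5 because the stale 70 is still counted in the sum, while `replay(True)` succeeds.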

    [akpm@linux-foundation.org: tweak comment and coding style]

    Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com
    Signed-off-by: Manjong Lee <mj0123.lee@samsung.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Changheun Lee <nanich.lee@samsung.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: <seunghwan.hyun@samsung.com>
    Cc: <sookwan7.kim@samsung.com>
    Cc: <yt0928.kim@samsung.com>
    Cc: <junho89.kim@samsung.com>
    Cc: <jisoo2146.oh@samsung.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:33 -04:00
Chris von Recklinghausen 809d37d23f mm/vmscan: throttle reclaim until some writeback completes if congested
Conflicts:
	mm/filemap.c - We already have
		4268b48077e5 ("mm/filemap: Add folio_end_writeback()")
		so put the acct_reclaim_writeback call between the
		folio_wake call and the folio_put call and pass it a
		folio
	mm/internal.h - We already have
		646010009d35 ("mm: Add folio_raw_mapping()")
		so keep definition of folio_raw_mapping.
		Squash in changes from merge commit
		512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
		to be compatible with existing folio changes.
	mm/vmscan.c - Squash in changes from merge commit
		512b7931ad05 ("Merge branch 'akpm' (patches from Andrew)")
		to be compatible with existing folio changes.

Bugzilla: https://bugzilla.redhat.com/2120352

commit 8cd7c588decf470bf7e14f2be93b709f839a965e
Author: Mel Gorman <mgorman@techsingularity.net>
Date:   Fri Nov 5 13:42:25 2021 -0700

    mm/vmscan: throttle reclaim until some writeback completes if congested

    Patch series "Remove dependency on congestion_wait in mm/", v5.

    This series removes all calls to congestion_wait in mm/ and deletes
    wait_iff_congested.  It's not a clever implementation but
    congestion_wait has been broken for a long time [1].

    Even if congestion throttling worked, it was never a great idea.  While
    excessive dirty/writeback pages at the tail of the LRU is one
    possibility that reclaim may be slow, there is also the problem of too
    many pages being isolated and reclaim failing for other reasons
    (elevated references, too many pages isolated, excessive LRU contention
    etc).

    This series replaces the "congestion" throttling with 3 different types.

     - If there are too many dirty/writeback pages, sleep until a timeout or
       enough pages get cleaned

     - If too many pages are isolated, sleep until enough isolated pages are
       either reclaimed or put back on the LRU

     - If no progress is being made, direct reclaim tasks sleep until
       another task makes progress with acceptable efficiency.

    This was initially tested with a mix of workloads that used to trigger
    corner cases that no longer work.  A new test case was created called
    "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
    created XFS filesystem.  Note that it may be necessary to increase the
    timeout of ssh if executing remotely as ssh itself can get throttled and
    the connection may timeout.

    stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
    to check the impact as the number of direct reclaimers increases.  It has
    four types of worker.

     - One "anon latency" worker creates small mappings with mmap() and
       times how long it takes to fault the mapping reading it 4K at a time

     - X file writers, which is fio randomly writing X files where the total
       size of the files adds up to the allowed dirty_ratio. fio is allowed
       to run for a warmup period to allow some file-backed pages to
       accumulate. The duration of the warmup is based on the best-case
       linear write speed of the storage.

     - Y file readers which is fio randomly reading small files

     - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

     - Total estimated WSS = (100+dirty_ratio) percentage of memory

    X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

    The intent is to maximise the total WSS with a mix of file and anon
    memory where some anonymous memory must be swapped and there is a high
    likelihood of dirty/writeback pages reaching the end of the LRU.

    The test can be configured to have no background readers to stress
    dirty/writeback pages.  The results below are based on having zero
    readers.

    The short summary of the results is that the series works and stalls
    until some event occurs but the timeouts may need adjustment.

    The test results are not broken down by patch as the series should be
    treated as one block that replaces a broken throttling mechanism with a
    working one.

    Finally, three machines were tested but I'm reporting the worst set of
    results.  The other two machines had much better latencies for example.

    First, the results of the "anon latency" worker

      stutterp
                                    5.15.0-rc1             5.15.0-rc1
                                       vanilla mm-reclaimcongest-v5r4
      Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
      Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
      Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
      Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
      Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
      Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
      Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
      Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
      Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
      Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
      Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
      Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
      Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
      Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
      Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
      Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
      Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
      Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
      Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
      Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
      Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
      Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
      Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
      Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

    For most thread counts, the time to mmap() is unfortunately increased.
    In earlier versions of the series, this was lower but a large number of
    throttling events were reaching their timeout, increasing the amount of
    inefficient scanning of the LRU.  There is no prioritisation of reclaim
    tasks making progress based on each task's rate of page allocation versus
    progress of reclaim.  The variance is also impacted for high worker
    counts but in all cases, the differences in latency are not
    statistically significant due to very large maximum outliers.  Max-90
    shows that 90% of the stalls are comparable but the Max results show the
    massive outliers which are increased due to stalling.

    It is expected that this will be very machine dependent.  Due to the
    test design, reclaim is difficult so allocations stall and there are
    variances depending on whether THPs can be allocated or not.  The amount
    of memory will affect exactly how bad the corner cases are and how often
    they trigger.  The warmup period calculation is not ideal as it's based
    on linear writes whereas fio is randomly writing multiple files from
    multiple tasks so the start state of the test is variable.  For example,
    these are the latencies on a single-socket machine that had more memory

      Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
      Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
      Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
      Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
      Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

    The overall system CPU usage and elapsed time are as follows

                        5.15.0-rc3  5.15.0-rc3
                           vanilla mm-reclaimcongest-v5r4
      Duration User        6989.03      983.42
      Duration System      7308.12      799.68
      Duration Elapsed     2277.67     2092.98

    The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
    stalling.
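
    As a quick check of the 89% figure, the reductions can be recomputed
    from the Duration table above.  This is only a sketch; the helper name
    reduction_pct is made up for illustration.

```python
# Figures copied from the "Duration" table above (seconds).
vanilla = {"User": 6989.03, "System": 7308.12, "Elapsed": 2277.67}
patched = {"User": 983.42, "System": 799.68, "Elapsed": 2092.98}


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


for key in vanilla:
    print(f"{key:8s} {reduction_pct(vanilla[key], patched[key]):6.2f}% lower")
```

    The System row is the one quoted in the text; elapsed time changes far
    less than CPU time does.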

    The high-level /proc/vmstat counters show

                                           5.15.0-rc1     5.15.0-rc1
                                              vanilla mm-reclaimcongest-v5r2
      Ops Direct pages scanned          1056608451.00   503594991.00
      Ops Kswapd pages scanned           109795048.00   147289810.00
      Ops Kswapd pages reclaimed          63269243.00    31036005.00
      Ops Direct pages reclaimed          10803973.00     6328887.00
      Ops Kswapd efficiency %                   57.62          21.07
      Ops Kswapd velocity                    48204.98       57572.86
      Ops Direct efficiency %                    1.02           1.26
      Ops Direct velocity                   463898.83      196845.97

    Kswapd scanned fewer pages but the detailed pattern is different.  The
    vanilla kernel scans slowly over time whereas the patched kernel exhibits
    burst patterns of scan activity.  Direct reclaim scanning is reduced by
    52% due to stalling.
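
    The efficiency rows and the 52% figure can be reproduced from the raw
    counters.  The sketch below assumes that "efficiency" means pages
    reclaimed as a percentage of pages scanned; that definition is an
    assumption, chosen because it reproduces the table values.

```python
# Counters copied from the vmstat table above: (vanilla, patched).
kswapd_scanned = (109795048, 147289810)
kswapd_reclaimed = (63269243, 31036005)
direct_scanned = (1056608451, 503594991)
direct_reclaimed = (10803973, 6328887)


def efficiency_pct(reclaimed, scanned):
    """Pages reclaimed per 100 pages scanned (assumed definition)."""
    return reclaimed / scanned * 100.0


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


for name, rec, scan in (("kswapd", kswapd_reclaimed, kswapd_scanned),
                        ("direct", direct_reclaimed, direct_scanned)):
    print(name, [round(efficiency_pct(r, s), 2) for r, s in zip(rec, scan)])

print("direct scan reduction %", round(reduction_pct(*direct_scanned), 1))
```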

    The pattern for stealing pages is also slightly different.  Both kernels
    exhibit spikes but the vanilla kernel when reclaiming shows pages being
    reclaimed over a period of time whereas the patches tend to reclaim in
    spikes.  The difference is that vanilla is not throttling and instead
    scanning constantly, finding some pages over time, whereas the patched
    kernel throttles and reclaims in spikes.

      Ops Percentage direct scans               90.59          77.37

    In the vanilla kernel, direct reclaim accounted for 90.59% of pages
    scanned whereas with the patches, 77.37% of scanning was direct reclaim
    due to throttling.
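
    The percentage appears to be the direct reclaim share of all pages
    scanned (direct plus kswapd).  A minimal sketch under that assumption,
    using the counters from the vmstat table; the function name is
    illustrative only.

```python
def direct_scan_share_pct(direct_scanned, kswapd_scanned):
    """Share of all scanned pages that were scanned by direct reclaim."""
    return direct_scanned / (direct_scanned + kswapd_scanned) * 100.0


# (direct pages scanned, kswapd pages scanned) from the vmstat table.
vanilla_share = direct_scan_share_pct(1056608451, 109795048)
patched_share = direct_scan_share_pct(503594991, 147289810)
print(round(vanilla_share, 2), round(patched_share, 2))
```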

      Ops Page writes by reclaim           2613590.00     1687131.00

    Page writes from reclaim context are reduced.

      Ops Page writes anon                 2932752.00     1917048.00

    And there is less swapping.

      Ops Page reclaim immediate         996248528.00   107664764.00

    The number of pages encountered at the tail of the LRU tagged for
    immediate reclaim but still dirty/writeback is reduced by 89%.

      Ops Slabs scanned                     164284.00      153608.00

    Slab scan activity is similar.

    ftrace was used to gather stall activity

      Vanilla
      -------
          1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
          2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
          8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
         29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
      82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

    The vast majority of wait_iff_congested calls do not stall at all.  What
    is likely happening is that cond_resched() reschedules the task for a
    short period when the BDI is not registering congestion (which it never
    will in this test setup).

          1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
          2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
          4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
        380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
        778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

    congestion_wait, if called, always exceeds the timeout as there is no
    trigger to wake it up.
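
    The two vanilla histograms can be summarised mechanically.  The sketch
    below parses the "count event: key=value" lines exactly as listed above
    and counts, per event, how many calls did not sleep at all and how many
    slept longer than the requested timeout.  The regular expression and
    the summarise() helper are illustrative, not part of any tracing tool.

```python
import re
from collections import defaultdict

TRACE = """\
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
"""

LINE_RE = re.compile(r"\s*(\d+)\s+(\w+): usec_timeout=(\d+) usec_delayed=(\d+)")


def summarise(text):
    """Return {event: {"total", "no_delay", "over_timeout"}} call counts."""
    stats = defaultdict(lambda: {"total": 0, "no_delay": 0, "over_timeout": 0})
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        count, timeout, delayed = int(m.group(1)), int(m.group(3)), int(m.group(4))
        entry = stats[m.group(2)]
        entry["total"] += count
        if delayed == 0:
            entry["no_delay"] += count
        if delayed > timeout:
            entry["over_timeout"] += count
    return dict(stats)


summary = summarise(TRACE)
for event, s in summary.items():
    print(event, s)
```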

    Bottom line: Vanilla will throttle but it's not effective.

    Patch series
    ------------

    Kswapd throttle activity was always due to scanning pages tagged for
    immediate reclaim at the tail of the LRU

          1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority of events did not stall or stalled for a short period.
    Roughly 16% of stalls reached the timeout before expiry.  For direct
    reclaim, the number of times stalled for each reason was

       6624 reason=VMSCAN_THROTTLE_ISOLATED
      93246 reason=VMSCAN_THROTTLE_NOPROGRESS
      96934 reason=VMSCAN_THROTTLE_WRITEBACK

    The most common reason to stall was due to excessive pages tagged for
    immediate reclaim at the tail of the LRU followed by a failure to make
    forward progress.  A relatively small number were due to too many pages
    isolated from the LRU by parallel threads.

    For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

          9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
         12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
         83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
       6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

    Most did not stall at all.  A small number reached the timeout.

    For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
    the map

          1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
          1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
          2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
          3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
          4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
          5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
          6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
          7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
          8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
          9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
         10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
         11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
         12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
         13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
         14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
         16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
         17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
         18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
         20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
         21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
         23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
         25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
         26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
         27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
         28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
         29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
         30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
         31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
         32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
         33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
         35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
         36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
         37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
         38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
         40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
         43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
         55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
         56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
         58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
         59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
         61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
         71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
         79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
         82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
         85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
         88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
         90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
         94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
        118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
        119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
        126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
        146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
        148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
        159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
        178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
        183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
        237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
        266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
        313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
        347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
        470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
        559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
        964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
       2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
       7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
      22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
      51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

    The full timeout is often hit but a large number also do not stall at
    all.  The remainder slept a little allowing other reclaim tasks to make
    progress.

    While this timeout could be further increased, it could also negatively
    impact worst-case behaviour when there is no prioritisation of what task
    should make progress.

    For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

          1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
          2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
          3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
          5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
          6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
          7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
         11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
         12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
         16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
         24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
         28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
         30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
         32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
         42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
         77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
         99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
        137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
        190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
        339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
        518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
        852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
       3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
       7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
      83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

    The majority hit the timeout in direct reclaim context although a
    sizable number did not stall at all.  This is very different to kswapd
    where only a tiny percentage of stalls due to writeback reached the
    timeout.
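
    To put numbers on the direct reclaim breakdowns, the share of events
    that did not stall and the share that slept for the full timeout can be
    computed per reason.  The counts below are copied from the listings
    above (per-reason total, usect_delayed=0 count, count with
    usect_delayed equal to usec_timeout); the dictionary layout is only an
    illustration.

```python
# reason: (total events, events with no delay, events that hit the timeout)
direct_reclaim = {
    "VMSCAN_THROTTLE_ISOLATED": (6624, 6520, 83),
    "VMSCAN_THROTTLE_NOPROGRESS": (93246, 22727, 51305),
    "VMSCAN_THROTTLE_WRITEBACK": (96934, 7147, 83962),
}


def shares(total, no_delay, hit_timeout):
    """Return (% not stalled, % full timeout, % short sleep)."""
    none_pct = no_delay / total * 100.0
    full_pct = hit_timeout / total * 100.0
    return none_pct, full_pct, 100.0 - none_pct - full_pct


for reason, counts in direct_reclaim.items():
    none_pct, full_pct, partial_pct = shares(*counts)
    print(f"{reason:28s} none={none_pct:5.1f}% "
          f"timeout={full_pct:5.1f}% partial={partial_pct:5.1f}%")
```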

    Bottom line, the throttling appears to work and the wakeup events may
    limit worst case stalls.  There might be some grounds for adjusting
    timeouts but it's likely futile as the worst-case scenarios depend on
    the workload, memory size and the speed of the storage.  A better
    approach to improve the series further would be to prioritise tasks
    based on their rate of allocation with the caveat that it may be very
    expensive to track.

    This patch (of 5):

    Page reclaim throttles on wait_iff_congested under the following
    conditions:

     - kswapd is encountering pages under writeback and marked for immediate
       reclaim implying that pages are cycling through the LRU faster than
       pages can be cleaned.

     - Direct reclaim will stall if all dirty pages are backed by congested
       inodes.

    wait_iff_congested is almost completely broken with few exceptions.
    This patch adds a new node-based workqueue and tracks the number of
    throttled tasks and pages written back since throttling started.  If
    enough pages belonging to the node are written back then the throttled
    tasks will wake early.  If not, the throttled tasks sleep until the
    timeout expires.
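
    The wake-early-or-time-out behaviour described above can be modelled in
    user space.  The following is a toy sketch, not the kernel code: the
    NodeThrottle class, its field names and the wake_threshold parameter
    are all hypothetical, and a condition variable stands in for the
    per-node waitqueue.

```python
import threading


class NodeThrottle:
    """Toy model of one node: sleepers wake early once enough pages are written."""

    def __init__(self, wake_threshold):
        self.wake_threshold = wake_threshold  # pages needed for an early wake (assumed)
        self.cond = threading.Condition()
        self.nr_throttled = 0   # tasks currently sleeping
        self.nr_written = 0     # pages written back since throttling started
        self.generation = 0     # bumped on every early wakeup

    def throttle(self, timeout_s):
        """Sleep until woken by writeback progress; False means the timeout expired."""
        with self.cond:
            if self.nr_throttled == 0:
                self.nr_written = 0          # first sleeper restarts the count
            self.nr_throttled += 1
            gen = self.generation
            woken = self.cond.wait_for(lambda: self.generation != gen,
                                       timeout=timeout_s)
            self.nr_throttled -= 1
            return woken

    def account_writeback(self, nr_pages):
        """Called when writeback of nr_pages pages on this node completes."""
        with self.cond:
            if self.nr_throttled == 0:
                return                       # nobody is waiting, nothing to track
            self.nr_written += nr_pages
            if self.nr_written >= self.wake_threshold:
                self.generation += 1
                self.nr_written = 0
                self.cond.notify_all()


# With no writeback at all the sleeper simply times out.
print(NodeThrottle(wake_threshold=8).throttle(0.01))
```

    A second thread calling account_writeback() with enough pages while a
    task is inside throttle() makes that call return True before the
    timeout.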

    [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
    [hdanton@sina.com: Avoid race when reclaim starts]
    [vbabka@suse.cz: vmstat irq-safe api, clarifications]

    Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
    Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: NeilBrown <neilb@suse.de>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: "Darrick J . Wong" <djwong@kernel.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:29 -04:00
Chris von Recklinghausen 54034a00e2 mm: simplify bdi refcounting
Conflicts: mm/backing-dev.c - We already have
	397c9f46ee4d ("blk-cgroup: move blkcg_{pin,unpin}_online out of line")
	so don't add css_to_blkcg call

Bugzilla: https://bugzilla.redhat.com/2120352

commit efee17134ca464639a2f5b4d036ce40caf1b247a
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 5 13:37:04 2021 -0700

    mm: simplify bdi refcounting

    Move grabbing and releasing the bdi refcount out of the common
    wb_init/wb_exit helpers into code that is only used for the non-default
    memcg driven bdi_writeback structures.

    [hch@lst.de: add comment]
      Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
    [akpm@linux-foundation.org: fix typo]

    Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Miquel Raynal <miquel.raynal@bootlin.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Vignesh Raghavendra <vigneshr@ti.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen 156163abb8 mm: don't automatically unregister bdis
Bugzilla: https://bugzilla.redhat.com/2120352

commit 702f2d1e3b33617a8d9a9424f08a69b7c51642a7
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 5 13:37:01 2021 -0700

    mm: don't automatically unregister bdis

    All BDI users now unregister explicitly.

    Link: https://lkml.kernel.org/r/20211021124441.668816-5-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Miquel Raynal <miquel.raynal@bootlin.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Vignesh Raghavendra <vigneshr@ti.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:27 -04:00
Chris von Recklinghausen ac466013e4 mm: export bdi_unregister
Bugzilla: https://bugzilla.redhat.com/2120352

commit c6fd3ac0fc859da57404c3bad64696d48a6f425e
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 5 13:36:52 2021 -0700

    mm: export bdi_unregister

    Patch series "simplify bdi unregistation".

    This series simplifies the BDI code to get rid of the magic
    auto-unregister feature that hid a recent block layer refcounting bug.

    This patch (of 5):

    To wind down the magic auto-unregister semantics we'll need to push this
    into modular code.

    Link: https://lkml.kernel.org/r/20211021124441.668816-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20211021124441.668816-2-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Miquel Raynal <miquel.raynal@bootlin.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Vignesh Raghavendra <vigneshr@ti.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:26 -04:00
Ming Lei 1e80e9b8ac blk-cgroup: remove unneeded includes from <linux/blk-cgroup.h>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit c97ab271576dec2170e7b804cb05f7617b30fed9
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Apr 20 06:27:19 2022 +0200

    blk-cgroup: remove unneeded includes from <linux/blk-cgroup.h>

    Remove all the includes that aren't actually needed from
    <linux/blk-cgroup.h> and push them to the actual source files where
    needed.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220420042723.1010598-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:58:03 +08:00
Ming Lei 1b78ecea2f blk-cgroup: move struct blkcg to block/blk-cgroup.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit dec223c92a4688f6c9642d640cfe15a99d289dd4
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Apr 20 06:27:15 2022 +0200

    blk-cgroup: move struct blkcg to block/blk-cgroup.h

    There is no real need to expose the blkcg structure to the whole kernel.
    Move it to the private header and expose a helper to let the writeback
    code access the cgwb_list member.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220420042723.1010598-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:58:03 +08:00
Ming Lei 652cb3f2ad blk-cgroup: move blkcg_{pin,unpin}_online out of line
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit 397c9f46ee4d99024c64954b007c1b5762d01cb4
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Apr 20 06:27:14 2022 +0200

    blk-cgroup: move blkcg_{pin,unpin}_online out of line

    Move these two functions out of line as there is no good reason
    to inline them.  Also switch to passing a cgroup_subsys_state
    instead of doing the conversion in the caller to prepare for making
    the blkcg structure private to blk-cgroup.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220420042723.1010598-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:58:03 +08:00
Ming Lei d881c1c84e mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403
Conflicts: drop change on ntfs3 which isn't supported by cs9

commit ccdf774189b6466457ca9c7de1fe9ed18547d249
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Sep 20 14:33:14 2021 +0200

    mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>

    Move inode_to_bdi out of line to avoid having to include blkdev.h.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:42:47 +08:00
Ming Lei cdbba461a3 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403
Conflicts: drop change on drivers/md/dm-ima.c, since we don't support that
	driver on cs9 yet.

commit e41d12f539f7c8bac9b0897031fee6cc9158a6be
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Sep 20 14:33:13 2021 +0200

    mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>

    There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
    break the include chain.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:42:47 +08:00
Rafael Aquini 3a073ebbc6 writeback: fix bandwidth estimate for spiky workload
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 45a2966fd64147518dc5bca25f447bd0fb5359ac
Author: Jan Kara <jack@suse.cz>
Date:   Thu Sep 2 14:53:09 2021 -0700

    writeback: fix bandwidth estimate for spiky workload

    Michael Stapelberg has reported that for workloads with short big spikes
    of writes (the GCC linker seems to trigger this frequently) the write
    throughput is heavily underestimated and tends to steadily sink until it
    reaches zero.  This has a rather bad impact on writeback throttling
    (causing stalls).  The problem is that the writeback throughput estimate
    gets updated at most once per 200 ms.  One update happens early after we
    submit pages for writeback (at that point writeout of only a small
    fraction of pages is completed and thus the observed throughput is tiny).
    The next update happens only during the next write spike (updates happen
    only from inode writeback and dirty throttling code) and if that is more
    than 1s after the previous spike, we decide the system was idle and just
    ignore whatever was written until this moment.

    Fix the problem by making sure writeback throughput estimate is also
    updated shortly after writeback completes to get reasonable estimate of
    throughput for spiky workloads.
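
    The effect can be illustrated with a heavily simplified model of the
    update rule.  Only the 200 ms update interval and the 1 s idle cutoff
    come from the description above; the function, the page counts and the
    event times are invented for illustration, and the real estimator also
    smooths its samples.

```python
UPDATE_INTERVAL_MS = 200   # estimate updated at most once per 200 ms (from the text)
IDLE_CUTOFF_MS = 1000      # longer gaps are treated as idle time (from the text)


def bandwidth_samples(updates):
    """updates: list of (time_ms, pages_written_so_far) update attempts.

    Returns the pages/second samples a simplified estimator would record.
    """
    last_ms, last_written = updates[0]
    samples = []
    for now_ms, written in updates[1:]:
        elapsed = now_ms - last_ms
        if elapsed < UPDATE_INTERVAL_MS:
            continue                      # too soon, keep the old baseline
        if elapsed <= IDLE_CUTOFF_MS:
            samples.append((written - last_written) * 1000 / elapsed)
        # After a long gap the period is treated as idle: only the baseline
        # is refreshed, so the pages written in that period are ignored.
        last_ms, last_written = now_ms, written
    return samples


# Hypothetical spike: 10000 pages submitted at t=0 and all written by
# t=500 ms; the next spike starts at t=3000 ms.
before_fix = [(0, 0), (200, 400), (3000, 10000)]
after_fix = [(0, 0), (200, 400), (700, 10000), (3000, 10000)]

print(bandwidth_samples(before_fix))
print(bandwidth_samples(after_fix))
```

    In this model the only sample recorded without the extra update is the
    early, tiny one; adding an update shortly after writeback completes
    captures the bulk of the spike.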

    [jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()]

    Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
    Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
    Tested-by: Michael Stapelberg <stapelberg+linux@google.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:48 -05:00
Rafael Aquini ca7eeab6b1 writeback: track number of inodes under writeback
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 633a2abb9e1cd5c95f3b600f4b2c12cce22ae4a0
Author: Jan Kara <jack@suse.cz>
Date:   Thu Sep 2 14:53:04 2021 -0700

    writeback: track number of inodes under writeback

    Patch series "writeback: Fix bandwidth estimates", v4.

    Fix estimate of writeback throughput when device is not fully busy doing
    writeback.  Michael Stapelberg has reported that such workload (e.g.
    generated by linking) tends to push estimated throughput down to 0 and as
    a result writeback on the device is practically stalled.

    The first three patches fix the reported issue, the remaining two patches
    are unrelated cleanups of problems I've noticed when reading the code.

    This patch (of 4):

    Track number of inodes under writeback for each bdi_writeback structure.
    We will use this to decide whether wb does any IO and so we can estimate
    its writeback throughput.  In principle we could use the number of pages under
    writeback (WB_WRITEBACK counter) for this; however, normal percpu counter
    reads are too inaccurate for our purposes and summing the counter is too
    expensive.

    Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
    Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Michael Stapelberg <stapelberg+linux@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:46 -05:00
Rafael Aquini cde90ded14 mm: hide laptop_mode_wb_timer entirely behind the BDI API
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 5ed964f8e54eb3191b8b7b45aeb52672a0c995dc
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Aug 9 16:17:40 2021 +0200

    mm: hide laptop_mode_wb_timer entirely behind the BDI API

    Don't leak the details of the timer into the block layer, instead
    initialize the timer in bdi_alloc and delete it in bdi_unregister.
    Note that this means the timer is initialized (but not armed) for
    non-block queues as well now.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:24 -05:00
Roman Gushchin b43a9e76b4 writeback, cgroup: remove wb from offline list before releasing refcnt
Boyang reported that the commit c22d70a162 ("writeback, cgroup:
release dying cgwbs by switching attached inodes") causes the kernel to
crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.

  run fstests generic/256 at 2021-07-12 05:41:40
  EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none.
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
  Mem abort info:
     ESR = 0x96000005
     EC = 0x25: DABT (current EL), IL = 32 bits
     SET = 0, FnV = 0
     EA = 0, S1PTW = 0
     FSC = 0x05: level 1 translation fault
  Data abort info:
     ISV = 0, ISS = 0x00000005
     CM = 0, WnR = 0
  user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000
  [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
  Internal error: Oops: 96000005 [#1] SMP
  Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug]
  CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- ---  5.14.0-0.rc1.15.bx.el9.aarch64 #1
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: events_unbound cleanup_offline_cgwbs_workfn
  pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--)
  pc : cleanup_offline_cgwbs_workfn+0x320/0x394
  lr : cleanup_offline_cgwbs_workfn+0xe0/0x394
  sp : ffff80001554fd10
  x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001
  x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8
  x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730
  x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
  x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040
  x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60
  x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a
  x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000
  x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003
  Call trace:
     cleanup_offline_cgwbs_workfn+0x320/0x394
     process_one_work+0x1f4/0x4b0
     worker_thread+0x184/0x540
     kthread+0x114/0x120
     ret_from_fork+0x10/0x18
  Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
  ---[ end trace e250fe289272792a ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 0-2
  Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000
  PHYS_OFFSET: 0xfff0defca0000000
  CPU features: 0x00200251,23200840
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

The problem happens when cgwb_release_workfn() races with
cleanup_offline_cgwbs_workfn(): wb_tryget() in
cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() in
cgwb_release_workfn(), which is basically a use-after-free error.

Fix the problem by removing the writeback structure from the
offline list before releasing the percpu reference counter.  It will
guarantee that cleanup_offline_cgwbs_workfn() will not see and not
access writeback structures which are about to be released.
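
The ordering argument can be sketched with a small deterministic model.  This is
not kernel code: the Wb class, the ref_exited field and the helper names are
hypothetical, and the cleanup scan is simply run between the two release steps
to stand in for the race.

```python
class Wb:
    """Stand-in for a writeback structure with a percpu refcount."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ref_exited = False  # models percpu_ref_exit() having run


def scan_offline(offline: list) -> list:
    """Model of the cleanup worker walking the offline list.

    Returns the names of structures it touched after their refcount
    had already been torn down (each one is a use-after-free).
    """
    return [wb.name for wb in offline if wb.ref_exited]


def release(wb: Wb, offline: list, unlink_first: bool) -> list:
    """Run the two release steps with a cleanup scan in between."""
    steps = [
        lambda: offline.remove(wb),              # drop from the offline list
        lambda: setattr(wb, "ref_exited", True)  # tear down the refcount
    ]
    if not unlink_first:
        steps.reverse()
    steps[0]()
    bad = scan_offline(offline)  # cleanup worker runs in the window
    steps[1]()
    return bad


old = Wb("old-order")
print(release(old, [old], unlink_first=False))  # ['old-order']
new = Wb("new-order")
print(release(new, [new], unlink_first=True))   # []
```

With the list removal done first, the scan never sees a structure whose
refcount is already gone.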

Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com
Fixes: c22d70a162 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Boyang Xue <bxue@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Roman Gushchin c22d70a162 writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb.  It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data).  This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.

Reuse the existing mechanism of inode switching used for foreign inode
detection.  To speed things up, batch up to 115 inode switches in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes).  Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching.  Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight).  This
keeps umounting working (flush the switching queue) while
preventing cleanups from consuming the whole switching quota and effectively
blocking the frn switching.
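
The arithmetic behind the batch size can be sketched as follows.  The 1024-byte
budget is taken from the text above; the pointer size and especially the header
size are assumptions (the header value is simply the one that yields 115 with
8-byte pointers, not a figure taken from the structure definition).

```python
# Toy arithmetic only; the real structure layout is not given in this message.
SIZE_BUDGET = 1024   # bytes available for the whole context (from the text above)
POINTER_SIZE = 8     # assumption: 64-bit pointers
HEADER_SIZE = 104    # hypothetical: bytes used by the non-array fields


def max_inodes_per_batch(budget: int, header: int, pointer: int) -> int:
    """Number of inode pointers that fit after the fixed-size header."""
    return (budget - header) // pointer


batch = max_inodes_per_batch(SIZE_BUDGET, HEADER_SIZE, POINTER_SIZE)
print(batch)  # 115
```

Deriving the limit from the sizes instead of hard-coding it keeps the bound
correct if the header fields change.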

A cgwb cleanup operation can fail due to different reasons (e.g.  not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc).  In this case the next scheduled cleanup will make a
new attempt.  An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user).  In the future
an additional attempt scheduled by a timer can be implemented.

[guro@fb.com: replace open-coded "115" with arithmetic]
  Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
[guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
  Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
[willy@infradead.org: fix documentation]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f3b6a6df38 writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific
cgwb structure.  It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.

While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
cleaned up, it's removed from all io lists.  So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.

This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb.  Inodes
attached to the root bdi_writeback structures are never placed on such
list.  The following patch will use this list to try to release cgwb
structures more efficiently.

Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Daniel Vetter c1ca59a1f2 mm/backing-dev.c: use might_alloc()
Now that my little helper has landed, use it more.  On top of the existing
check this also uses lockdep through the fs_reclaim annotations.

[akpm@linux-foundation.org: include linux/sched/mm.h]

Link: https://lkml.kernel.org/r/20210113135009.3606813-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 09:41:01 -08:00
Baolin Wang 6986c3e2b1 mm: backing-dev: Remove duplicated macro definition
Move the K() macro a little forward to remove the same macro definition.

Link: https://lkml.kernel.org/r/d1ccdf2d3116dce9814f2bcc1f0415ecb4c76ea5.1612862230.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Joe Perches 5e4c0d86cf mm:backing-dev: use sysfs_emit in macro defining functions
The cocci script used in commit bdacbb8d04f ("mm: Use sysfs_emit for
struct kobject * uses") does not convert the name##_show macro because the
macro uses concatenation via ##.

Convert it by hand.

Link: https://lkml.kernel.org/r/45ec6cfc177d743f9c0ebaf35e43969dce43af42.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:47 -08:00
Christoph Hellwig f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.
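
The idea can be sketched outside the kernel as follows.  This is an
illustration only: the bit values are made up and the name of the positive flag
(WRITEBACK) is an assumption, since this message does not spell it out.

```python
from enum import IntFlag


class OldCap(IntFlag):
    """Two negative flags that were always set together (values made up)."""
    NO_ACCT_DIRTY = 0x1
    NO_WRITEBACK = 0x2


class NewCap(IntFlag):
    """Single positive capability flag (name is an assumption)."""
    WRITEBACK = 0x1


OLD_NO_WB = OldCap.NO_ACCT_DIRTY | OldCap.NO_WRITEBACK


def old_can_writeback(caps: OldCap) -> bool:
    # Old style: the capability is the absence of the negative flags.
    return not (caps & OLD_NO_WB)


def new_can_writeback(caps: NewCap) -> bool:
    # New style: test the positive flag directly, no wrapper needed.
    return bool(caps & NewCap.WRITEBACK)


def to_new(caps: OldCap) -> NewCap:
    """Translate an old capability mask into the new representation."""
    return NewCap(0) if caps & OLD_NO_WB else NewCap.WRITEBACK


for caps in (OldCap(0), OLD_NO_WB):
    print(old_can_writeback(caps), new_can_writeback(to_new(caps)))
```

Both representations answer the same question; the positive flag just makes
the check read the way it is meant.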

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 823423ef55 bdi: invert BDI_CAP_NO_ACCT_WB
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious.  Also remove the pointless
bdi_cap_account_writeback wrapper that just obfuscates the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 55b2598e84 bdi: initialize ->ra_pages and ->io_pages in bdi_init
Set up a readahead size by default, as very few users have a good
reason to change it.  This means coda, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig 8c911f3d4c writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Christoph Hellwig 492d76b215 writeback: remove {set,clear}_wb_congested
Just merge them into their only callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Christoph Hellwig 1cd925d583 bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback.  Use the
device name there, as it is more useful anyway for the unlikely case that the
warning triggers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig aef33c2ff8 bdi: simplify bdi_alloc
Merge the _node vs normal version and drop the superfluous gfp_t argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig 3c5d202b55 bdi: remove bdi_register_owner
Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig a5a6c66df6 bdi: unexport bdi_register_va
bdi_register_va is only used by super.c, which can't be modular.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig 6bd87eec23 bdi: add a ->dev_name field to struct backing_dev_info
Cache a copy of the name for the lifetime of the backing_dev_info
structure so that we can reference it even after unregistering.

Fixes: 68f23b8906 ("memcg: fix a crash in wb_workfn when a device disappears")
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:07:57 -06:00
Christoph Hellwig eb7ae5e06b bdi: move bdi_dev_name out of line
bdi_dev_name is not a fast path function, move it out of line.  This
prepares for using it from modular callers without having to export
an implementation detail like bdi_unknown_name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 08:45:47 -06:00
Tejun Heo d866dbf617 blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
don't get offlined while there are active cgwbs on them.  However, it
ends up making offlining unordered sometimes causing parents to be
offlined before children.

To fix it, we want child blkcgs to pin the parents' online states
turning the refcnt into a more generic online pinning mechanism.

In preparation,

* blkcg->cgwb_refcnt -> blkcg->online_pin
* blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
* Take them out of CONFIG_CGROUP_WRITEBACK

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01 14:56:42 -06:00