Commit Graph

423 Commits

Author SHA1 Message Date
Rafael Aquini 618376e57e writeback: move wb_over_bg_thresh() call outside lock section
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit 2816ea2abf5f54438517f64efd78a9984685d6db
Author: Yosry Ahmed <yosryahmed@google.com>
Date:   Fri Apr 21 17:40:16 2023 +0000

    writeback: move wb_over_bg_thresh() call outside lock section

    Patch series "cgroup: eliminate atomic rstat flushing", v5.

    A previous patch series [1] changed most atomic rstat flushing contexts to
    become non-atomic.  This was done to avoid an expensive operation that
    scales with # cgroups and # cpus to happen with irqs disabled and
    scheduling not permitted.  There were two remaining atomic flushing
    contexts after that series.  This series tries to eliminate them as well,
    eliminating atomic rstat flushing completely.

    The two remaining atomic flushing contexts are:
    (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
    (b) mem_cgroup_threshold()->mem_cgroup_usage()

    For (a), flushing needs to be atomic as wb_writeback() calls
    wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
    to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
    this series proposes a refactoring that moves the call outside the lock
    criticial section and makes the stats flushing in mem_cgroup_wb_stats()
    non-atomic.

    For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
    with irqs disabled.  We only flush the stats when calculating the root
    usage, as it is approximated as the sum of some memcg stats (file, anon,
    and optionally swap) instead of the conventional page counter.  This
    series proposes changing this calculation to use the global stats instead,
    eliminating the need for a memcg stat flush.

    After these 2 contexts are eliminated, we no longer need
    mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
    remove them and simplify the code.

    [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/

    This patch (of 5):

    wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
    flush, which can be expensive on large systems. Currently,
    wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
    have to do the rstat flush atomically. On systems with a lot of
    cpus and/or cgroups, this can cause us to disable irqs for a long time,
    potentially causing problems.

    Move the call to wb_over_bg_thresh() outside the lock section in
    preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
    The list_empty(&wb->work_list) check should be okay outside the lock
    section of wb->list_lock as it is protected by a separate lock
    (wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
    modifying any of wb->b_* lists the wb->list_lock is protecting.
    Also, the loop seems to be already releasing and reacquring the
    lock, so this refactoring looks safe.

    Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:35:09 -04:00
Aristeu Rozanski db76740534 mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 75376c6fb93b99e94192cfff48222d11819ee917
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:25:07 2023 +0000

    mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()

    Only one caller doesn't have a folio, so move the page_folio() call to
    that one caller from mem_cgroup_css_from_folio().

    Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Aristeu Rozanski 98ad16b46e mm/fs: convert inode_attach_wb() to take a folio
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit 9cfb816b1c6c99f4b3c1d4a0fb096162cd17ec71
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Jan 16 19:25:06 2023 +0000

    mm/fs: convert inode_attach_wb() to take a folio

    Patch series "Writeback folio conversions".

    Remove more calls to compound_head() by passing folios around instead of
    pages.

    This patch (of 2):

    The only caller of inode_attach_wb() which doesn't pass NULL already has a
    folio, so convert the whole call-chain to take folios.

    Link: https://lkml.kernel.org/r/20230116192507.2146150-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20230116192507.2146150-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:10 -04:00
Ming Lei 46e90a4fe5 super: make locking naming consistent
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit d8ce82efdece373b570f35acc8a29487b2087b84
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Aug 18 16:00:49 2023 +0200

    super: make locking naming consistent

    Make the naming consistent with the earlier introduced
    super_lock_{read,write}() helpers.

    Reviewed-by: Jan Kara <jack@suse.cz>
    Message-Id: <20230818-vfs-super-fixes-v3-v3-2-9f0b1876e46b@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-04-17 09:46:43 +08:00
Ming Lei ed38934e6c writeback: fix call of incorrect macro
JIRA: https://issues.redhat.com/browse/RHEL-1516

commit 3e46c89c74f2c38e5337d2cf44b0b551adff1cb4
Author: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Date:   Thu Jan 19 13:44:43 2023 +0300

    writeback: fix call of incorrect macro

     the variable 'history' is of type u16, it may be an error
     that the hweight32 macro was used for it
     I guess macro hweight16 should be used

    Found by Linux Verification Center (linuxtesting.org) with SVACE.

    Fixes: 2a81490811 ("writeback: implement foreign cgroup inode detection")
    Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20230119104443.3002-1-korotkov.maxim.s@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-09-18 15:57:45 +08:00
Carlos Maiolino 75be7fedb4 fs: do not update freeing inode i_io_list
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888
Tested: xfstests

After commit cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode
already has I_DIRTY_INODE") writeback_single_inode can push inode with
I_DIRTY_TIME set to b_dirty_time list. In case of freeing inode with
I_DIRTY_TIME set this can happen after deletion of inode from i_io_list
at evict. Stack trace is following.

evict
fat_evict_inode
fat_truncate_blocks
fat_flush_inodes
writeback_inode
sync_inode_metadata(inode, sync=0)
writeback_single_inode(inode, wbc) <- wbc->sync_mode == WB_SYNC_NONE

This will lead to use after free in flusher thread.

Similar issue can be triggered if writeback_single_inode in the
stack trace update inode->i_io_list. Add explicit check to avoid it.

Fixes: cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE")
Reported-by: syzbot+6ba92bd00d5093f7e371@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Svyatoslav Feldsherov <feldsherov@google.com>
Link: https://lore.kernel.org/r/20221115202001.324188-1-feldsherov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 4e3c51f4e805291b057d12f5dda5aeb50a538dc4)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2023-08-07 11:29:34 +02:00
Carlos Maiolino 0e8baee5cd fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888
Tested: xfstests

Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit cbfecb927f429a6fa613d74b998496bd71e4438a)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2023-08-04 14:19:30 +02:00
Carlos Maiolino 5561eb4d20 writeback: Avoid skipping inode writeback
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888
Tested: xfstests

We have run into an issue that a task gets stuck in
balance_dirty_pages_ratelimited() when perform I/O stress testing.
The reason we observed is that an I_DIRTY_PAGES inode with lots
of dirty pages is in b_dirty_time list and standard background
writeback cannot writeback the inode.
After studing the relevant code, the following scenario may lead
to the issue:

task1                                   task2
-----                                   -----
fuse_flush
 write_inode_now //in b_dirty_time
  writeback_single_inode
   __writeback_single_inode
                                 fuse_write_end
                                  filemap_dirty_folio
                                   __xa_set_mark:PAGECACHE_TAG_DIRTY
    lock inode->i_lock
    if mapping tagged PAGECACHE_TAG_DIRTY
    inode->i_state |= I_DIRTY_PAGES
    unlock inode->i_lock
                                   __mark_inode_dirty:I_DIRTY_PAGES
                                      lock inode->i_lock
                                      -was dirty,inode stays in
                                      -b_dirty_time
                                      unlock inode->i_lock

   if(!(inode->i_state & I_DIRTY_All))
      -not true,so nothing done

This patch moves the dirty inode to b_dirty list when the inode
currently is not queued in b_io or b_more_io list at the end of
writeback_single_inode.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220510023514.27399-1-jing.xia@unisoc.com
(cherry picked from commit 846a3351ddfe4a86eede4bb26a205c3f38ef84d3)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2023-08-04 14:19:29 +02:00
Nico Pache a39f5fc679 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
commit 1ba1199ec5747f475538c0d25a32804e5ba1dfde
Author: Baokun Li <libaokun1@huawei.com>
Date:   Mon Apr 10 21:08:26 2023 +0800

    writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs

    KASAN report null-ptr-deref:
    ==================================================================
    BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
    Write of size 8 at addr 0000000000000000 by task sync/943
    CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
    Call Trace:
     <TASK>
     dump_stack_lvl+0x7f/0xc0
     print_report+0x2ba/0x340
     kasan_report+0xc4/0x120
     kasan_check_range+0x1b7/0x2e0
     __kasan_check_write+0x24/0x40
     bdi_split_work_to_wbs+0x5c5/0x7b0
     sync_inodes_sb+0x195/0x630
     sync_inodes_one_sb+0x3a/0x50
     iterate_supers+0x106/0x1b0
     ksys_sync+0x98/0x160
    [...]
    ==================================================================

    The race that causes the above issue is as follows:

               cpu1                     cpu2
    -------------------------|-------------------------
    inode_switch_wbs
     INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
     queue_rcu_work(isw_wq, &isw->work)
     // queue_work async
      inode_switch_wbs_work_fn
       wb_put_many(old_wb, nr_switched)
        percpu_ref_put_many
         ref->data->release(ref)
         cgwb_release
          queue_work(cgwb_release_wq, &wb->release_work)
          // queue_work async
           &wb->release_work
           cgwb_release_workfn
                                ksys_sync
                                 iterate_supers
                                  sync_inodes_one_sb
                                   sync_inodes_sb
                                    bdi_split_work_to_wbs
                                     kmalloc(sizeof(*work), GFP_ATOMIC)
                                     // alloc memory failed
            percpu_ref_exit
             ref->data = NULL
             kfree(data)
                                     wb_get(wb)
                                      percpu_ref_get(&wb->refcnt)
                                       percpu_ref_get_many(ref, 1)
                                        atomic_long_add(nr, &ref->data->count)
                                         atomic64_add(i, v)
                                         // trigger null-ptr-deref

    bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
    wbs.  If the allocation of new work fails, the on-stack fallback will be
    used and the reference count of the current wb is increased afterwards.
    If cgroup writeback membership switches occur before getting the reference
    count and the current wb is released as old_wd, then calling wb_get() or
    wb_put() will trigger the null pointer dereference above.

    This issue was introduced in v4.3-rc7 (see fix tag1).  Both
    sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
    bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
    sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
    sync(2) against cgroup writeback membership switches") reduced the
    possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
    fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
    inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
    thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
    and the issue becomes easily reproducible again.

    To solve this problem, percpu_ref_exit() is called under RCU protection to
    avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
    Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
    and skip the current wb if wb_tryget() fails because the wb has already
    been shutdown.

    Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
    Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
    Signed-off-by: Baokun Li <libaokun1@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Hou Tao <houtao1@huawei.com>
    Cc: yangerkun <yangerkun@huawei.com>
    Cc: Zhang Yi <yi.zhang@huawei.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372
Signed-off-by: Nico Pache <npache@redhat.com>
2023-06-14 15:11:04 -06:00
Ming Lei 7f3c706b1b writeback: remove obsolete macro EXPIRE_DIRTY_ATIME
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175212

commit 23e188a16423a6e65290abf39dd427ff047e6843
Author: Miaohe Lin <linmiaohe@huawei.com>
Date:   Sat Dec 10 18:10:42 2022 +0800

    writeback: remove obsolete macro EXPIRE_DIRTY_ATIME

    EXPIRE_DIRTY_ATIME is not used anymore. Remove it.

    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20221210101042.2012931-1-linmiaohe@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-03-11 23:27:34 +08:00
Ming Lei 2551de4e98 writeback: Add asserts for adding freed inode to lists
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175212

commit a9438b44bc7015b18931e312bbd249a25bb59a65
Author: Jan Kara <jack@suse.cz>
Date:   Mon Dec 12 12:36:33 2022 +0100

    writeback: Add asserts for adding freed inode to lists

    In the past we had several use-after-free issues with inodes getting
    added to writeback lists after evict() removed them. These are painful
    to debug so add some asserts to catch the problem earlier. The only
    non-obvious change in the commit is that we need to tweak
    redirty_tail_locked() to avoid triggering assertion in
    inode_io_list_move_locked().

    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20221212113633.29181-1-jack@suse.cz
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2023-03-11 23:27:34 +08:00
Nico Pache 59af49451c writeback: avoid use-after-free after removing device
commit f87904c075515f3e1d8f4a7115869d3b914674fd
Author: Khazhismel Kumykov <khazhy@chromium.org>
Date:   Mon Aug 1 08:50:34 2022 -0700

    writeback: avoid use-after-free after removing device

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete.  However,
    wb_inode_writeback_end() may schedule bandwidth estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt, switch wb->work_lock to an irqsafe lock.

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Michael Stapelberg <stapelberg+linux@google.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:41 -07:00
Nico Pache d0afd73f0f writeback: Fix inode->i_io_list not be protected by inode->i_lock error
commit 10e14073107dd0b6d97d9516a02845a8e501c2c9
Author: Jchao Sun <sunjunchao2870@gmail.com>
Date:   Tue May 24 08:05:40 2022 -0700

    writeback: Fix inode->i_io_list not be protected by inode->i_lock error

    Commit b35250c081 ("writeback: Protect inode->i_io_list with
    inode->i_lock") made inode->i_io_list not only protected by
    wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
    was missed. Add lock there and also update comment describing
    things protected by inode->i_lock. This also fixes a race where
    __mark_inode_dirty() could move inode under flush worker's hands
    and thus sync(2) could miss writing some inodes.

    Fixes: b35250c081 ("writeback: Protect inode->i_io_list with inode->i_lock")
    Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
    CC: stable@vger.kernel.org
    Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498
Signed-off-by: Nico Pache <npache@redhat.com>
2022-11-08 10:11:38 -07:00
Frantisek Hrbata e9e9bc8da2 Merge: mm changes through v5.18 for 9.2
Merge conflicts:
-----------------
Conflicts with !1142(merged) "io_uring: update to v5.15"

fs/io-wq.c
        - static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
          !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
          along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142)
        - static int io_wqe_worker(void *data)
          !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers")
          Resolved in favor of HEAD(!1142)
        - static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
          HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread")
          Resolved in favor of !1370
        - static void create_worker_cont(struct callback_head *cb)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static void io_workqueue_create(struct work_struct *work)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
          !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()")
          Resolved in favor of HEAD(!1142)
        - static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
          !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          Resolved in favor of HEAD(!1142)
        - static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
          !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure")
          removed wrongly merged run_cancel label
          Resolved in favor of HEAD(!1142)
        - static bool io_task_work_match(struct callback_head *cb, void *data)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - static void io_wq_exit_workers(struct io_wq *wq)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
          Resolved in favor of HEAD(!1142)
        - int io_wq_max_workers(struct io_wq *wq, int *new_count)
          !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()")
fs/io_uring.c
        - static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
          !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
          Resolved in favor of HEAD(!1142)
include/uapi/linux/io_uring.h
        - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items")
          just a comment conflict
          Resolved in favor of HEAD(!1142)
kernel/exit.c
        - void __noreturn do_exit(long code)
        - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions")
          Resolved in favor of !1370

Conflicts with !1357(merged) "NFS refresh for RHEL-9.2"

fs/nfs/callback.c
        - nfs4_callback_svc(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed
          Resolved in favor of HEAD(!1357)
fs/nfs/file.c
          !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
          Resolved in favor of HEAD(!1370)
fs/nfsd/nfssvc.c
        - nfsd(void *vrqstp)
          !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module")
          Resolved in favor of HEAD(!1357)
-----------------

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370

Bugzilla: https://bugzilla.redhat.com/2120352

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722

Patches 1-9 are changes to selftests
Patches 10-31 are reverts of RHEL-only patches to address COR CVE
Patches 32-320 are the machine dependent mm changes ported by Rafael
Patch 321 reverts the backport of 6692c98c7df5. See below.
Patches 322-981 are the machine independent mm changes
Patches 982-1016 are David Hildebrand's upstream changes to address the COR CVE

RHEL commit b23c298982 fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5
is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added
after 40966e316f86.

Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC")
        to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716

Omitted: fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bc369921d670 io-wq: max_worker fixes
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create()
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774

Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions
	fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743

Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
        fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656

Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die()
	unsupported arch

Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die()
	unsupported arch

Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c
        unsupported arch

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 105d2d4832 Merge DRM changes from upstream v5.16..v5.17

Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot
	fixed in RHEL commit 99fc716fc4 Merge DRM changes from upstream v5.17..v5.18

Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter
        reverted later

Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter"
        revert of above omitted fix

Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak
	unsupported fs

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Lyude Paul <lyude@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Phil Auld <pauld@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-23 19:49:41 +02:00
Chris von Recklinghausen cd6f34840e mm/fs: delete PF_SWAPWRITE
Bugzilla: https://bugzilla.redhat.com/2120352

commit b698f0a1773f7df73f2bb4bfe0e597ea1bb3881f
Author: Hugh Dickins <hughd@google.com>
Date:   Tue Mar 22 14:45:38 2022 -0700

    mm/fs: delete PF_SWAPWRITE

    PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8e ("mm:
    vmscan: do not writeback filesystem pages in direct reclaim").

    Coincidentally, NeilBrown's current patch "remove inode_congested()"
    deletes may_write_to_inode(), which appeared to be the one function which
    took notice of PF_SWAPWRITE.  But if you study the old logic, and the
    conditions under which may_write_to_inode() was called, you discover that
    flag and function have been pointless for a decade.

    Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Jan Kara <jack@suse.de>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:53 -04:00
Chris von Recklinghausen ad6766c2ff remove inode_congested()
Conflicts:
	include/linux/backing-dev.h - We already have
		dec223c92a46 ("blk-cgroup: move struct blkcg to block/blk-cgroup
.h")
		so keep current declaration of wb_blkcg_offline
	mm/vmscan.c - We already have
		c79b7b96db8b ("mm/vmscan: Account large folios correctly")
		which increments stat->nr_congested by pages

Bugzilla: https://bugzilla.redhat.com/2120352

commit fe55d563d4174f13839a9b7ef7309da5031b5d93
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 22 14:39:07 2022 -0700

    remove inode_congested()

    inode_congested() reports if the backing-device for the inode is
    congested.  No bdi reports congestion any more, so this always returns
    'false'.

    So remove inode_congested() and related functions, and remove the call
    sites, assuming that inode_congested() always returns 'false'.

    Link: https://lkml.kernel.org/r/164549983741.9187.2174285592262191311.stgit@
noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Paolo Valente <paolo.valente@linaro.org>    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:49 -04:00
Ming Lei d3f8e91051 fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511

commit 68f4c6eba70df70a720188bce95c85570ddfcc87
Author: Zhihao Cheng <chengzhihao1@huawei.com>
Date:   Tue May 10 21:38:05 2022 +0800

    fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages

    Commit 505a666ee3 ("writeback: plug writeback in wb_writeback() and
    writeback_inodes_wb()") has us holding a plug during wb_writeback, which
    may cause a potential ABBA dead lock:

        wb_writeback                fat_file_fsync
    blk_start_plug(&plug)
    for (;;) {
      iter i-1: some reqs have been added into plug->mq_list  // LOCK A
      iter i:
        progress = __writeback_inodes_wb(wb, work)
        . writeback_sb_inodes // fat's bdev
        .   __writeback_single_inode
        .   . generic_writepages
        .   .   __block_write_full_page
        .   .   . .             __generic_file_fsync
        .   .   . .               sync_inode_metadata
        .   .   . .                 writeback_single_inode
        .   .   . .                   __writeback_single_inode
        .   .   . .                     fat_write_inode
        .   .   . .                       __fat_write_inode
        .   .   . .                         sync_dirty_buffer       // fat's bdev
        .   .   . .                           lock_buffer(bh)       // LOCK B
        .   .   . .                             submit_bh
        .   .   . .                               blk_mq_get_tag    // LOCK A
        .   .   . trylock_buffer(bh)  // LOCK B
        .   .   .   redirty_page_for_writepage
        .   .   .     wbc->pages_skipped++
        .   .   --wbc->nr_to_write
        .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
        .   requeue_inode
        .     redirty_tail_locked
        if (progress)    // progress > 0
          continue;
      iter i+1:
          queue_io
          // similar process with iter i, infinite for-loop !
    }
    blk_finish_plug(&plug)   // flush plug won't be called

    Above process triggers a hungtask like:
    [  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
    [  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
    [  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:
    2426 flags:0x00004000
    [  399.051556] Call Trace:
    [  399.051570]  __schedule+0x480/0x1050
    [  399.051592]  schedule+0x92/0x1a0
    [  399.051602]  io_schedule+0x22/0x50
    [  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
    [  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
    [  399.051657]  blk_mq_submit_bio+0x68d/0xca0
    [  399.051674]  __submit_bio+0x1b5/0x2d0
    [  399.051708]  submit_bio_noacct+0x34e/0x720
    [  399.051718]  submit_bio+0x3b/0x150
    [  399.051725]  submit_bh_wbc+0x161/0x230
    [  399.051734]  __sync_dirty_buffer+0xd1/0x420
    [  399.051744]  sync_dirty_buffer+0x17/0x20
    [  399.051750]  __fat_write_inode+0x289/0x310
    [  399.051766]  fat_write_inode+0x2a/0xa0
    [  399.051783]  __writeback_single_inode+0x53c/0x6f0
    [  399.051795]  writeback_single_inode+0x145/0x200
    [  399.051803]  sync_inode_metadata+0x45/0x70
    [  399.051856]  __generic_file_fsync+0xa3/0x150
    [  399.051880]  fat_file_fsync+0x1d/0x80
    [  399.051895]  vfs_fsync_range+0x40/0xb0
    [  399.051929]  __x64_sys_fsync+0x18/0x30

    In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
    unplug before cond_resched in writeback_sb_inodes") in function
    'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
    from write_cache_pages().

    Fix it by correcting wrote number according number of skipped pages
    in writeback_sb_inodes().

    Goto Link to find a reproducer.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
    Cc: stable@vger.kernel.org # v4.3
    Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-10-12 09:20:11 +08:00
Jeffrey Layton 777e571cdd vfs, fscache: Implement pinning of cache usage for writeback
Bugzilla: http://bugzilla.redhat.com/1229736

commit 08276bdae68b022a7726edf7416b6748e3df5395
Author: David Howells <dhowells@redhat.com>
Date:   Wed Oct 20 23:50:01 2021 +0100

    vfs, fscache: Implement pinning of cache usage for writeback

    Cachefiles has a problem in that it needs to keep the backing file for a
    cookie open whilst there are local modifications pending that need to be
    written to it.  However, we don't want to keep the file open indefinitely,
    as that causes EMFILE/ENFILE/ENOMEM problems.

    Reopening the cache file, however, is a problem if this is being done due
    to writeback triggered by exit().  Some filesystems will oops if we try to
    open a file in that context because they want to access current->fs or
    other resources that have already been dismantled.

    To get around this, I added the following:

     (1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
         inode to indicate that we have a usage count on the cookie caching
         that inode.

     (2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
         when __writeback_single_inode() clears the last dirty page from
         i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
         flag.

         This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
         done atomically with the check of PAGECACHE_TAG_DIRTY that clears
         I_DIRTY_PAGES.

     (3) A function, fscache_set_page_dirty(), which if it is not set, sets
         I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
         resources.

     (4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
         to unuse the cookie.

     (5) A function, fscache_clear_inode_writeback(), to be called when the
         inode is evicted, before clear_inode() is called.  This cleans up any
         lingering I_PINNING_FSCACHE_WB.

    The network filesystem can then use these tools to make sure that
    fscache_write_to_cache() can write locally modified data to the cache as
    well as to the server.

    For the future, I'm working on write helpers for netfs lib that should
    allow this facility to be removed by keeping track of the dirty regions
    separately - but that's incomplete at the moment and is also going to be
    affected by folios, one way or another, since it deals with pages

    Signed-off-by: David Howells <dhowells@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    cc: linux-cachefs@redhat.com
    Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
    Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
    Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
    Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4

Signed-off-by: Jeffrey Layton <jlayton@redhat.com>
2022-08-22 12:31:34 -04:00
Aristeu Rozanski 80a6b2e8b6 fs/writeback: Convert inode_switch_wbs_work_fn to folios
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861
Tested: by me with multiple test suites

commit 22b3c8d6612e09f5fcecba1009d399aaf7f934f6
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Mar 19 08:58:36 2021 -0400

    fs/writeback: Convert inode_switch_wbs_work_fn to folios

    This gets the statistics correct for large folios by modifying the
    counters by the number of pages in the folio instead of by 1.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: William Kucharski <william.kucharski@oracle.com>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2022-07-10 10:44:07 -04:00
Ming Lei 4415be8560 block: check that there is a plug in blk_flush_plug
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit aa8dcccaf32bfdc09f2aff089d5d60c37da5b7b5
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jan 27 08:05:49 2022 +0100

    block: check that there is a plug in blk_flush_plug

    Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
    the NULL check instead of open coding that check everywhere.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:56:20 +08:00
Ming Lei d5d4963cf5 block: remove blk_needs_flush_plug
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917

commit b1f866b013e6e5583f2f0bf4a61d13eddb9a1799
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jan 27 08:05:48 2022 +0100

    block: remove blk_needs_flush_plug

    blk_needs_flush_plug fails to account for the cb_list, which needs
    flushing as well.  Remove it and just check if there is a plug instead
    of poking into the internals of the plug structure.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2022-06-22 08:56:19 +08:00
Ming Lei fe4ec6f985 block: cleanup the flush plug helpers
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403

commit 008f75a20e7072d0840ec323c39b42206f3fa8a0
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed Oct 20 16:41:19 2021 +0200

    block: cleanup the flush plug helpers

    Consolidate the various helpers into a single blk_flush_plug helper that
    takes a plk_plug and the from_scheduler bool and switch all callsites to
    call it directly.  Checks that the plug is non-NULL must be performed by
    the caller, something that most already do anyway.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2021-12-06 16:44:50 +08:00
Rafael Aquini aec73574ca writeback: memcg: simplify cgroup_writeback_by_id
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 7490a2d248145d8694e1e9828801b496250fd697
Author: Shakeel Butt <shakeelb@google.com>
Date:   Thu Sep 2 14:53:27 2021 -0700

    writeback: memcg: simplify cgroup_writeback_by_id

    Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
    pages for a memcg.  However mem_cgroup_wb_stats() does a lot more than
    just get the number of dirty pages.  Just directly get the number of dirty
    pages instead of calling mem_cgroup_wb_stats().  Also
    cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
    remove the unused 'nr' parameter and no need to explicitly flush memcg
    stats.

    Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:50 -05:00
Rafael Aquini 6d11f2d123 writeback: reliably update bandwidth estimation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit fee468fdf41cdf36ba6b5a780e2474d0a3e066ac
Author: Jan Kara <jack@suse.cz>
Date:   Thu Sep 2 14:53:06 2021 -0700

    writeback: reliably update bandwidth estimation

    Currently we trigger writeback bandwidth estimation from
    balance_dirty_pages() and from wb_writeback().  However neither of these
    need to trigger when the system is relatively idle and writeback is
    triggered e.g.  from fsync(2).  Make sure writeback estimates happen
    reliably by triggering them from do_writepages().

    Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Cc: Michael Stapelberg <stapelberg+linux@google.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:47 -05:00
Rafael Aquini ca7eeab6b1 writeback: track number of inodes under writeback
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 633a2abb9e1cd5c95f3b600f4b2c12cce22ae4a0
Author: Jan Kara <jack@suse.cz>
Date:   Thu Sep 2 14:53:04 2021 -0700

    writeback: track number of inodes under writeback

    Patch series "writeback: Fix bandwidth estimates", v4.

    Fix estimate of writeback throughput when device is not fully busy doing
    writeback.  Michael Stapelberg has reported that such workload (e.g.
    generated by linking) tends to push estimated throughput down to 0 and as
    a result writeback on the device is practically stalled.

    The first three patches fix the reported issue, the remaining two patches
    are unrelated cleanups of problems I've noticed when reading the code.

    This patch (of 4):

    Track number of inodes under writeback for each bdi_writeback structure.
    We will use this to decide whether wb does any IO and so we can estimate
    its writeback throughput.  In principle we could use number of pages under
    writeback (WB_WRITEBACK counter) for this however normal percpu counter
    reads are too inaccurate for our purposes and summing the counter is too
    expensive.

    Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
    Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
    Signed-off-by: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Michael Stapelberg <stapelberg+linux@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:46 -05:00
Roman Gushchin 593311e85b writeback, cgroup: do not reparent dax inodes
The inode switching code is not suited for dax inodes.  An attempt to
switch a dax inode to a parent writeback structure (as a part of a
writeback cleanup procedure) results in a panic like this:

  run fstests generic/270 at 2021-07-15 05:54:02
  XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
  XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
  XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
  XFS (pmem0p2): Mounting V5 Filesystem
  XFS (pmem0p2): Ending clean mount
  XFS (pmem0p2): Quotacheck needed: Please wait.
  XFS (pmem0p2): Quotacheck: Done.
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  BUG: unable to handle page fault for address: 0000000005b0f669
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7442e+ #8
  Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
  Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Call Trace:
   inode_switch_wbs_work_fn+0xb6/0x2a0
   process_one_work+0x1e6/0x380
   worker_thread+0x53/0x3d0
   kthread+0x10f/0x130
   ret_from_fork+0x22/0x30
  Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
  CR2: 0000000005b0f669
  ---[ end trace ed2105faff8384f3 ]---
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  ---[ end Kernel panic - not syncing: Fatal exception ]---

The crash happens on an attempt to iterate over attached pagecache pages
and check the dirty flag: a dax inode's xarray contains pfn's instead of
generic struct page pointers.

This happens for DAX and not for other kinds of non-page entries in the
inodes because it's a tagged iteration, and shadow/swap entries are
never tagged; only DAX entries get tagged.

Fix the problem by bailing out (with the false return value) of
inode_prepare_sbs_switch() if a dax inode is passed.

[willy@infradead.org: changelog addition]

Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
Fixes: c22d70a162 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Linus Torvalds 911a2997a5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k
 QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2
 Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ
 P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru
 62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq
 SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu
 +u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw==
 =tn0h
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs updates from Jan Kara:
 "The new quotactl_fd() syscall (remake of quotactl_path() syscall that
  got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
  isofs, and writeback fixes and cleanups"

* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: fix obtain a reference to a freeing memcg css
  quota: remove unnecessary oom message
  isofs: remove redundant continue statement
  quota: Wire up quotactl_fd syscall
  quota: Change quotactl_path() systcall to an fd-based one
  reiserfs: Remove unneed check in reiserfs_write_full_page()
  udf: Fix NULL pointer dereference in udf_symlink function
  reiserfs: add check for invalid 1st journal block
2021-07-01 12:06:39 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Roman Gushchin c22d70a162 writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb.  It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data).  This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.

Reuse the existing mechanism of inode switching used for foreign inode
detection.  To speed things up batch up to 115 inode switching in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes).  Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching.  Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight).  This
allows to keep umounting working (flush the switching queue), however
prevents cleanups from consuming the whole switching quota and effectively
blocking the frn switching.

A cgwb cleanup operation can fail due to different reasons (e.g.  not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc).  In this case the next scheduled cleanup will make a
new attempt.  An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user).  In the future
an additional attempt scheduled by a timer can be implemented.

[guro@fb.com: replace open-coded "115" with arithmetic]
  Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
[guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
  Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
[willy@infradead.org: fix documentation]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f5fbe6b7ad writeback, cgroup: support switching multiple inodes at once
Currently only a single inode can be switched to another writeback
structure at once.  That means to switch an inode a separate
inode_switch_wbs_context structure must be allocated, and a separate rcu
callback and work must be scheduled.

It's fine for the existing ad-hoc switching, which is not happening that
often, but sub-optimal for massive switching required in order to release
a writeback structure.  To prepare for it, let's add a support for
switching multiple inodes at once.

Instead of containing a single inode pointer, inode_switch_wbs_context
will contain a NULL-terminated array of inode pointers.
inode_do_switch_wbs() will be called for each inode.

To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's
list_locks will be acquired and released only once altogether for all
inodes.  wb_wakeup() will be also be called only once.  Instead of calling
wb_put(old_wb) after each successful switch, wb_put_many() is introduced
and used.

Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 72d4512e9c writeback, cgroup: split out the functional part of inode_switch_wbs_work_fn()
Split out the functional part of the inode_switch_wbs_work_fn() function
as inode_do switch_wbs() to reuse it later for switching inodes attached
to dying cgwbs.

This commit doesn't bring any functional changes.

Link: https://lkml.kernel.org/r/20210608230225.2078447-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f3b6a6df38 writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific
cgwb structure.  It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.

While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
cleaned up, it's removed from all io lists.  So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.

This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb.  Inodes
attached to the root bdi_writeback structures are never placed on such
list.  The following patch will use this list to try to release cgwbs
structures more efficiently.

Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 29264d92a0 writeback, cgroup: switch to rcu_work API in inode_switch_wbs()
Inode's wb switching requires two steps divided by an RCU grace period.
It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(),
which schedules inode_switch_wbs_work_fn() as a work.

Switching to the rcu_work API allows to do the same in a cleaner and
slightly shorter form.

Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 8826ee4fe7 writeback, cgroup: increment isw_nr_in_flight before grabbing an inode
isw_nr_in_flight is used to determine whether the inode switch queue
should be flushed from the umount path.  Currently it's increased after
grabbing an inode and even scheduling the switch work.  It means the
umount path can walk past cleanup_offline_cgwb() with active inode
references, which can result in a "Busy inodes after unmount." message and
use-after-free issues (with inode->i_sb which gets freed).

Fix it by incrementing isw_nr_in_flight before doing anything with the
inode and decrementing in the case when switching wasn't scheduled.

The problem hasn't yet been seen in the real life and was discovered by
Jan Kara by looking into the code.

Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00
Roman Gushchin 592fa00218 writeback, cgroup: add smp_mb() to cgroup_writeback_umount()
A full memory barrier is required between clearing SB_ACTIVE flag in
generic_shutdown_super() and checking isw_nr_in_flight in
cgroup_writeback_umount(), otherwise a new switch operation might be
scheduled after atomic_read(&isw_nr_in_flight) returned 0.  This would
result in a non-flushed isw_wq, and a potential crash.

The problem hasn't yet been seen in the real life and was discovered by
Jan Kara by looking into the code.

Link: https://lkml.kernel.org/r/20210608230225.2078447-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00
Roman Gushchin 4ade5867b4 writeback, cgroup: do not switch inodes with I_WILL_FREE flag
Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9.

When an inode is getting dirty for the first time it's associated with a
wb structure (see __inode_attach_wb()).  It can later be switched to
another wb (if e.g.  some other cgroup is writing a lot of data to the
same inode), but otherwise stays attached to the original wb until being
reclaimed.

The problem is that the wb structure holds a reference to the original
memory and blkcg cgroups.  So if an inode has been dirty once and later is
actively used in read-only mode, it has a good chance to pin down the
original memory and blkcg cgroups forever.  This is often the case with
services bringing data for other services, e.g.  updating some rpm
packages.

In the real life it becomes a problem due to a large size of the memcg
structure, which can easily be 1000x larger than an inode.  Also a really
large number of dying cgroups can raise different scalability issues, e.g.
making the memory reclaim costly and less effective.

To solve the problem inodes should be eventually detached from the
corresponding writeback structure.  It's inefficient to do it after every
writeback completion.  Instead it can be done whenever the original memory
cgroup is offlined and writeback structure is getting killed.  Scanning
over a (potentially long) list of inodes and detach them from the
writeback structure can take quite some time.  To avoid scanning all
inodes, attached inodes are kept on a new list (b_attached).  To make it
less noticeable to a user, the scanning and switching is performed from a
work context.

Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their
ideas and contribution to this patchset.

This patch (of 8):

If an inode's state has I_WILL_FREE flag set, the inode will be freed
soon, so there is no point in trying to switch the inode to a different
cgwb.

I_WILL_FREE was ignored since the introduction of the inode switching, so
it looks like it doesn't lead to any noticeable issues for a user.  This
is why the patch is not intended for a stable backport.

Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com
Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00
Muchun Song 8b0ed8443a writeback: fix obtain a reference to a freeing memcg css
The caller of wb_get_create() should pin the memcg, because
wb_get_create() relies on this guarantee. The rcu read lock
only can guarantee that the memcg css returned by css_from_id()
cannot be released, but the reference of the memcg can be zero.

  rcu_read_lock()
  memcg_css = css_from_id()
  wb_get_create(memcg_css)
      cgwb_create(memcg_css)
          // css_get can change the ref counter from 0 back to 1
          css_get(memcg_css)
  rcu_read_unlock()

Fix it by holding a reference to the css before calling
wb_get_create(). This is not a problem I encountered in the
real world. Just the result of a code review.

Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Link: https://lore.kernel.org/r/20210402091145.80635-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-28 18:35:02 +02:00
zhangyi (F) 12e0613715 block_dump: remove block_dump feature in mark_inode_dirty()
block_dump is an old debugging interface, one of it's functions is used
to print the information about who write which file on disk. If we
enable block_dump through /proc/sys/vm/block_dump and turn on debug log
level, we can gather information about write process name, target file
name and disk from kernel message. This feature is realized in
block_dump___mark_inode_dirty(), it print above information into kernel
message directly when marking inode dirty, so it is noisy and can easily
trigger log storm. At the same time, get the dentry refcount is also not
safe, we found it will lead to deadlock on ext4 file system with
data=journal mode.

After tracepoints has been introduced into the kernel, we got a
tracepoint in __mark_inode_dirty(), which is a better replacement of
block_dump___mark_inode_dirty(). The only downside is that it only trace
the inode number and not a file name, but it probably doesn't matter
because the original printed file name in block_dump is not accurate in
some cases, and we can still find it through the inode number and device
id. So this patch delete the dirting inode part of block_dump feature.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-2-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:21 -06:00
Eric Biggers da0c4c60d8 fs: improve comments for writeback_single_inode()
Some comments for writeback_single_inode() and
__writeback_single_inode() are outdated or not very helpful, especially
with regards to writeback list handling.  Update them.

Link: https://lore.kernel.org/r/20210112190253.64307-10-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:50 +01:00
Eric Biggers 83dc881d67 fs: drop redundant check from __writeback_single_inode()
wbc->for_sync implies wbc->sync_mode == WB_SYNC_ALL, so there's no need
to check for both.  Just check for WB_SYNC_ALL.

Link: https://lore.kernel.org/r/20210112190253.64307-9-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:45 +01:00
Eric Biggers 35d14f278e fs: clean up __mark_inode_dirty() a bit
Improve some comments, and don't bother checking for the I_DIRTY_TIME
flag in the case where we just cleared it.

Also, warn if I_DIRTY_TIME and I_DIRTY_PAGES are passed to
__mark_inode_dirty() at the same time, as this case isn't handled.

Link: https://lore.kernel.org/r/20210112190253.64307-8-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:42 +01:00
Eric Biggers a38ed483a7 fs: pass only I_DIRTY_INODE flags to ->dirty_inode
->dirty_inode is now only called when I_DIRTY_INODE (I_DIRTY_SYNC and/or
I_DIRTY_DATASYNC) is set.  However it may still be passed other dirty
flags at the same time, provided that these other flags happened to be
passed to __mark_inode_dirty() at the same time as I_DIRTY_INODE.

This doesn't make sense because there is no reason for filesystems to
care about these extra flags.  Nor are filesystems notified about all
updates to these other flags.

Therefore, mask the flags before passing them to ->dirty_inode.

Also properly document ->dirty_inode in vfs.rst.

Link: https://lore.kernel.org/r/20210112190253.64307-7-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:35 +01:00
Eric Biggers e2728c5621 fs: don't call ->dirty_inode for lazytime timestamp updates
There is no need to call ->dirty_inode for lazytime timestamp updates
(i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of
lazytime, filesystems must ignore these updates.  Filesystems only need
to care about the updated timestamps when they expire.

Therefore, only call ->dirty_inode when I_DIRTY_INODE is set.

Based on a patch from Christoph Hellwig:
https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de

Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:33 +01:00
Eric Biggers 1e249cb5b7 fs: fix lazytime expiration handling in __writeback_single_inode()
When lazytime is enabled and an inode is being written due to its
in-memory updated timestamps having expired, either due to a sync() or
syncfs() system call or due to dirtytime_expire_interval having elapsed,
the VFS needs to inform the filesystem so that the filesystem can copy
the inode's timestamps out to the on-disk data structures.

This is done by __writeback_single_inode() calling
mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).

However, this occurs after __writeback_single_inode() has already
cleared the dirty flags from ->i_state.  This causes two bugs:

- mark_inode_dirty_sync() redirties the inode, causing it to remain
  dirty.  This wastefully causes the inode to be written twice.  But
  more importantly, it breaks cases where sync_filesystem() is expected
  to clean dirty inodes.  This includes the FS_IOC_REMOVE_ENCRYPTION_KEY
  ioctl (as reported at
  https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well
  as possibly filesystem freezing (freeze_super()).

- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is
  called from __writeback_single_inode() for lazytime expiration,
  xfs_fs_dirty_inode() ignores the notification.  (XFS only cares about
  lazytime expirations, and it assumes that i_state will contain
  I_DIRTY_TIME during those.)  Therefore, lazy timestamps aren't
  persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.

Fix this by moving the call to mark_inode_dirty_sync() to earlier in
__writeback_single_inode(), before the dirty flags are cleared from
i_state.  This makes filesystems be properly notified of the timestamp
expiration, and it avoids incorrectly redirtying the inode.

This fixes xfstest generic/580 (which tests
FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime
enabled.  It also fixes the new lazytime xfstest I've proposed, which
reproduces the above-mentioned XFS bug
(https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).

Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly.  But
due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the
right thing to do because mark_inode_dirty_sync() now knows not to move
the inode to a writeback list if it is currently queued for sync.

Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Depends-on: 5afced3bf2 ("writeback: Avoid skipping inode writeback")
Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:21 +01:00
Christoph Hellwig f738717033 writeback: don't warn on an unregistered BDI in __mark_inode_dirty
BDIs get unregistered during device removal, and this WARN can be
trivially triggered by hot-removing a NVMe device while running fsx
It is otherwise harmless as we still hold a BDI reference, and the
writeback has been shut down already.

Link: https://lore.kernel.org/r/20200928122613.434820-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-16 11:56:02 +01:00
Linus Torvalds 3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Christoph Hellwig f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Tobias Klauser 9ca48e20ec fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer.  Adjust the
definition of dirtytime_interval_handler to match its prototype in
linux/writeback.h which fixes the following sparse error/warning:

fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces)
fs/fs-writeback.c:2189:50:    expected void *
fs/fs-writeback.c:2189:50:    got void [noderef] __user *buffer
fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)):
fs/fs-writeback.c:2184:5:    int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
fs/fs-writeback.c: note: in included file:
./include/linux/writeback.h:374:5: note: previously declared as:
./include/linux/writeback.h:374:5:    int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19 13:13:39 -07:00
Jan Kara 5fcd57505c writeback: Drop I_DIRTY_TIME_EXPIRE
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15 09:18:46 +02:00
Jan Kara f9cae926f3 writeback: Fix sync livelock due to b_dirty_time processing
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's steady stream of inodes added to
b_dirty_time list as writeback rechecks dirty lists after each writeback
round whether there's more work to be done. Fix the problem by using
sync(2) start time is inode expiry value when processing b_dirty_time
list similarly as for ordinarily dirtied inodes. This requires some
refactoring of older_than_this handling which simplifies the code
noticeably as a bonus.

Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15 09:18:45 +02:00