Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
Rafael Aquini	618376e57e	writeback: move wb_over_bg_thresh() call outside lock section JIRA: https://issues.redhat.com/browse/RHEL-27742 This patch is a backport of the following upstream commit: commit 2816ea2abf5f54438517f64efd78a9984685d6db Author: Yosry Ahmed <yosryahmed@google.com> Date: Fri Apr 21 17:40:16 2023 +0000 writeback: move wb_over_bg_thresh() call outside lock section Patch series "cgroup: eliminate atomic rstat flushing", v5. A previous patch series [1] changed most atomic rstat flushing contexts to become non-atomic. This was done to avoid an expensive operation that scales with # cgroups and # cpus to happen with irqs disabled and scheduling not permitted. There were two remaining atomic flushing contexts after that series. This series tries to eliminate them as well, eliminating atomic rstat flushing completely. The two remaining atomic flushing contexts are: (a) wb_over_bg_thresh()->mem_cgroup_wb_stats() (b) mem_cgroup_threshold()->mem_cgroup_usage() For (a), flushing needs to be atomic as wb_writeback() calls wb_over_bg_thresh() with a spinlock held. However, it seems like the call to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so this series proposes a refactoring that moves the call outside the lock criticial section and makes the stats flushing in mem_cgroup_wb_stats() non-atomic. For (b), flushing needs to be atomic as mem_cgroup_threshold() is called with irqs disabled. We only flush the stats when calculating the root usage, as it is approximated as the sum of some memcg stats (file, anon, and optionally swap) instead of the conventional page counter. This series proposes changing this calculation to use the global stats instead, eliminating the need for a memcg stat flush. After these 2 contexts are eliminated, we no longer need mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can remove them and simplify the code. [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/ This patch (of 5): wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat flush, which can be expensive on large systems. Currently, wb_writeback() calls wb_over_bg_thresh() within a lock section, so we have to do the rstat flush atomically. On systems with a lot of cpus and/or cgroups, this can cause us to disable irqs for a long time, potentially causing problems. Move the call to wb_over_bg_thresh() outside the lock section in preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic. The list_empty(&wb->work_list) check should be okay outside the lock section of wb->list_lock as it is protected by a separate lock (wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is modifying any of wb->b_* lists the wb->list_lock is protecting. Also, the loop seems to be already releasing and reacquring the lock, so this refactoring looks safe. Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <raquini@redhat.com>	2024-09-05 20:35:09 -04:00
Aristeu Rozanski	db76740534	mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit 75376c6fb93b99e94192cfff48222d11819ee917 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Jan 16 19:25:07 2023 +0000 mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() Only one caller doesn't have a folio, so move the page_folio() call to that one caller from mem_cgroup_css_from_folio(). Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:10 -04:00
Aristeu Rozanski	98ad16b46e	mm/fs: convert inode_attach_wb() to take a folio JIRA: https://issues.redhat.com/browse/RHEL-27740 Tested: by me commit 9cfb816b1c6c99f4b3c1d4a0fb096162cd17ec71 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Jan 16 19:25:06 2023 +0000 mm/fs: convert inode_attach_wb() to take a folio Patch series "Writeback folio conversions". Remove more calls to compound_head() by passing folios around instead of pages. This patch (of 2): The only caller of inode_attach_wb() which doesn't pass NULL already has a folio, so convert the whole call-chain to take folios. Link: https://lkml.kernel.org/r/20230116192507.2146150-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230116192507.2146150-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2024-04-29 14:33:10 -04:00
Ming Lei	46e90a4fe5	super: make locking naming consistent JIRA: https://issues.redhat.com/browse/RHEL-29564 commit d8ce82efdece373b570f35acc8a29487b2087b84 Author: Christian Brauner <brauner@kernel.org> Date: Fri Aug 18 16:00:49 2023 +0200 super: make locking naming consistent Make the naming consistent with the earlier introduced super_lock_{read,write}() helpers. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-2-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2024-04-17 09:46:43 +08:00
Ming Lei	ed38934e6c	writeback: fix call of incorrect macro JIRA: https://issues.redhat.com/browse/RHEL-1516 commit 3e46c89c74f2c38e5337d2cf44b0b551adff1cb4 Author: Maxim Korotkov <korotkov.maxim.s@gmail.com> Date: Thu Jan 19 13:44:43 2023 +0300 writeback: fix call of incorrect macro the variable 'history' is of type u16, it may be an error that the hweight32 macro was used for it I guess macro hweight16 should be used Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `2a81490811` ("writeback: implement foreign cgroup inode detection") Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230119104443.3002-1-korotkov.maxim.s@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2023-09-18 15:57:45 +08:00
Carlos Maiolino	75be7fedb4	fs: do not update freeing inode i_io_list Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888 Tested: xfstests After commit cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE") writeback_single_inode can push inode with I_DIRTY_TIME set to b_dirty_time list. In case of freeing inode with I_DIRTY_TIME set this can happen after deletion of inode from i_io_list at evict. Stack trace is following. evict fat_evict_inode fat_truncate_blocks fat_flush_inodes writeback_inode sync_inode_metadata(inode, sync=0) writeback_single_inode(inode, wbc) <- wbc->sync_mode == WB_SYNC_NONE This will lead to use after free in flusher thread. Similar issue can be triggered if writeback_single_inode in the stack trace update inode->i_io_list. Add explicit check to avoid it. Fixes: cbfecb927f42 ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE") Reported-by: syzbot+6ba92bd00d5093f7e371@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Svyatoslav Feldsherov <feldsherov@google.com> Link: https://lore.kernel.org/r/20221115202001.324188-1-feldsherov@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> (cherry picked from commit 4e3c51f4e805291b057d12f5dda5aeb50a538dc4) Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>	2023-08-07 11:29:34 +02:00
Carlos Maiolino	0e8baee5cd	fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888 Tested: xfstests Currently the I_DIRTY_TIME will never get set if the inode already has I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's true, however ext4 will only update the on-disk inode in ->dirty_inode(), not on actual writeback. As a result if the inode already has I_DIRTY_INODE state by the time we get to __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled into on-disk inode and will not get updated until the next I_DIRTY_INODE update, which might never come if we crash or get a power failure. The problem can be reproduced on ext4 by running xfstest generic/622 with -o iversion mount option. Fix it by allowing I_DIRTY_TIME to be set even if the inode already has I_DIRTY_INODE. Also make sure that the case is properly handled in writeback_single_inode() as well. Additionally changes in xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag. Thanks Jan Kara for suggestions on how to make this work properly. Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> (cherry picked from commit cbfecb927f429a6fa613d74b998496bd71e4438a) Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>	2023-08-04 14:19:30 +02:00
Carlos Maiolino	5561eb4d20	writeback: Avoid skipping inode writeback Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2228888 Tested: xfstests We have run into an issue that a task gets stuck in balance_dirty_pages_ratelimited() when perform I/O stress testing. The reason we observed is that an I_DIRTY_PAGES inode with lots of dirty pages is in b_dirty_time list and standard background writeback cannot writeback the inode. After studing the relevant code, the following scenario may lead to the issue: task1 task2 ----- ----- fuse_flush write_inode_now //in b_dirty_time writeback_single_inode __writeback_single_inode fuse_write_end filemap_dirty_folio __xa_set_mark:PAGECACHE_TAG_DIRTY lock inode->i_lock if mapping tagged PAGECACHE_TAG_DIRTY inode->i_state \|= I_DIRTY_PAGES unlock inode->i_lock __mark_inode_dirty:I_DIRTY_PAGES lock inode->i_lock -was dirty,inode stays in -b_dirty_time unlock inode->i_lock if(!(inode->i_state & I_DIRTY_All)) -not true,so nothing done This patch moves the dirty inode to b_dirty list when the inode currently is not queued in b_io or b_more_io list at the end of writeback_single_inode. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org Fixes: `0ae45f63d4` ("vfs: add support for a lazytime mount option") Signed-off-by: Jing Xia <jing.xia@unisoc.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220510023514.27399-1-jing.xia@unisoc.com (cherry picked from commit 846a3351ddfe4a86eede4bb26a205c3f38ef84d3) Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>	2023-08-04 14:19:29 +02:00
Nico Pache	a39f5fc679	writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs commit 1ba1199ec5747f475538c0d25a32804e5ba1dfde Author: Baokun Li <libaokun1@huawei.com> Date: Mon Apr 10 21:08:26 2023 +0800 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs KASAN report null-ptr-deref: ================================================================== BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 Write of size 8 at addr 0000000000000000 by task sync/943 CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 Call Trace: <TASK> dump_stack_lvl+0x7f/0xc0 print_report+0x2ba/0x340 kasan_report+0xc4/0x120 kasan_check_range+0x1b7/0x2e0 __kasan_check_write+0x24/0x40 bdi_split_work_to_wbs+0x5c5/0x7b0 sync_inodes_sb+0x195/0x630 sync_inodes_one_sb+0x3a/0x50 iterate_supers+0x106/0x1b0 ksys_sync+0x98/0x160 [...] ================================================================== The race that causes the above issue is as follows: cpu1 cpu2 -------------------------\|------------------------- inode_switch_wbs INIT_WORK(&isw->work, inode_switch_wbs_work_fn) queue_rcu_work(isw_wq, &isw->work) // queue_work async inode_switch_wbs_work_fn wb_put_many(old_wb, nr_switched) percpu_ref_put_many ref->data->release(ref) cgwb_release queue_work(cgwb_release_wq, &wb->release_work) // queue_work async &wb->release_work cgwb_release_workfn ksys_sync iterate_supers sync_inodes_one_sb sync_inodes_sb bdi_split_work_to_wbs kmalloc(sizeof(*work), GFP_ATOMIC) // alloc memory failed percpu_ref_exit ref->data = NULL kfree(data) wb_get(wb) percpu_ref_get(&wb->refcnt) percpu_ref_get_many(ref, 1) atomic_long_add(nr, &ref->data->count) atomic64_add(i, v) // trigger null-ptr-deref bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wd, then calling wb_get() or wb_put() will trigger the null pointer dereference above. This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), originally commit `7fc5854f8c` ("writeback: synchronize sync(2) against cgroup writeback membership switches") reduced the possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again. To solve this problem, percpu_ref_exit() is called under RCU protection to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shutdown. Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com Fixes: `b817525a4a` ("writeback: bdi_writeback iteration must not skip dying ones") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hou Tao <houtao1@huawei.com> Cc: yangerkun <yangerkun@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168372 Signed-off-by: Nico Pache <npache@redhat.com>	2023-06-14 15:11:04 -06:00
Ming Lei	7f3c706b1b	writeback: remove obsolete macro EXPIRE_DIRTY_ATIME Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175212 commit 23e188a16423a6e65290abf39dd427ff047e6843 Author: Miaohe Lin <linmiaohe@huawei.com> Date: Sat Dec 10 18:10:42 2022 +0800 writeback: remove obsolete macro EXPIRE_DIRTY_ATIME EXPIRE_DIRTY_ATIME is not used anymore. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221210101042.2012931-1-linmiaohe@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2023-03-11 23:27:34 +08:00
Ming Lei	2551de4e98	writeback: Add asserts for adding freed inode to lists Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2175212 commit a9438b44bc7015b18931e312bbd249a25bb59a65 Author: Jan Kara <jack@suse.cz> Date: Mon Dec 12 12:36:33 2022 +0100 writeback: Add asserts for adding freed inode to lists In the past we had several use-after-free issues with inodes getting added to writeback lists after evict() removed them. These are painful to debug so add some asserts to catch the problem earlier. The only non-obvious change in the commit is that we need to tweak redirty_tail_locked() to avoid triggering assertion in inode_io_list_move_locked(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221212113633.29181-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2023-03-11 23:27:34 +08:00
Nico Pache	59af49451c	writeback: avoid use-after-free after removing device commit f87904c075515f3e1d8f4a7115869d3b914674fd Author: Khazhismel Kumykov <khazhy@chromium.org> Date: Mon Aug 1 08:50:34 2022 -0700 writeback: avoid use-after-free after removing device When a disk is removed, bdi_unregister gets called to stop further writeback and wait for associated delayed work to complete. However, wb_inode_writeback_end() may schedule bandwidth estimation dwork after this has completed, which can result in the timer attempting to access the just freed bdi_writeback. Fix this by checking if the bdi_writeback is alive, similar to when scheduling writeback work. Since this requires wb->work_lock, and wb_inode_writeback_end() may get called from interrupt, switch wb->work_lock to an irqsafe lock. Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498 Signed-off-by: Nico Pache <npache@redhat.com>	2022-11-08 10:11:41 -07:00
Nico Pache	d0afd73f0f	writeback: Fix inode->i_io_list not be protected by inode->i_lock error commit 10e14073107dd0b6d97d9516a02845a8e501c2c9 Author: Jchao Sun <sunjunchao2870@gmail.com> Date: Tue May 24 08:05:40 2022 -0700 writeback: Fix inode->i_io_list not be protected by inode->i_lock error Commit `b35250c081` ("writeback: Protect inode->i_io_list with inode->i_lock") made inode->i_io_list not only protected by wb->list_lock but also inode->i_lock, but inode_io_list_move_locked() was missed. Add lock there and also update comment describing things protected by inode->i_lock. This also fixes a race where __mark_inode_dirty() could move inode under flush worker's hands and thus sync(2) could miss writing some inodes. Fixes: `b35250c081` ("writeback: Protect inode->i_io_list with inode->i_lock") Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com CC: stable@vger.kernel.org Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089498 Signed-off-by: Nico Pache <npache@redhat.com>	2022-11-08 10:11:38 -07:00
Frantisek Hrbata	e9e9bc8da2	Merge: mm changes through v5.18 for 9.2 Merge conflicts: ----------------- Conflicts with !1142(merged) "io_uring: update to v5.15" fs/io-wq.c - static bool io_wqe_create_worker(struct io_wqe wqe, struct io_wqe_acct acct) !1142 already contains backport of 3146cba99aa2 ("io-wq: make worker creation resilient against signals") along with other commits which are not present in !1370. Resolved in favor of HEAD(!1142) - static int io_wqe_worker(void data) !1370 does not contain 767a65e9f317 ("io-wq: fix potential race of acct->nr_workers") Resolved in favor of HEAD(!1142) - static void io_init_new_worker(struct io_wqe wqe, struct io_worker worker, HEAD(!1142) does not contain e32cf5dfbe22 ("kthread: Generalize pf_io_worker so it can point to struct kthread") Resolved in favor of !1370 - static void create_worker_cont(struct callback_head cb) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static void io_workqueue_create(struct work_struct work) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static bool create_io_worker(struct io_wq wq, struct io_wqe wqe, int index) !1370 does not contain 66e70be72288 ("io-wq: fix memory leak in create_io_worker()") Resolved in favor of HEAD(!1142) - static bool io_wq_work_match_item(struct io_wq_work work, void data) !1370 does not contain 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure") Resolved in favor of HEAD(!1142) - static void io_wqe_enqueue(struct io_wqe wqe, struct io_wq_work work) !1370 is missing 713b9825a4c4 ("io-wq: fix cancellation on create-worker failure") removed wrongly merged run_cancel label Resolved in favor of HEAD(!1142) - static bool io_task_work_match(struct callback_head cb, void data) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") Resolved in favor of HEAD(!1142) - static void io_wq_exit_workers(struct io_wq wq) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") Resolved in favor of HEAD(!1142) - int io_wq_max_workers(struct io_wq wq, int new_count) !1370 is missing 3b33e3f4a6c0 ("io-wq: fix silly logic error in io_task_work_match()") fs/io_uring.c - static int io_register_iowq_max_workers(struct io_ring_ctx ctx, !1370 is missing bunch of commits after 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Resolved in favor of HEAD(!1142) include/uapi/linux/io_uring.h - !1370 is missing dd47c104533d ("io-wq: provide IO_WQ_ constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items") just a comment conflict Resolved in favor of HEAD(!1142) kernel/exit.c - void __noreturn do_exit(long code) - !1370 contains bunch of commits after f552a27afe67 ("io_uring: remove files pointer in cancellation functions") Resolved in favor of !1370 Conflicts with !1357(merged) "NFS refresh for RHEL-9.2" fs/nfs/callback.c - nfs4_callback_svc(void vrqstp) !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") where the module_put_and_kthread_exit() was removed Resolved in favor of HEAD(!1357) fs/nfs/file.c !1357 is missing 187c82cb0380 ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio") Resolved in favor of HEAD(!1370) fs/nfsd/nfssvc.c - nfsd(void vrqstp) !1370 is missing f49169c97fce ("NFSD: Remove svc_serv_ops::svo_module") Resolved in favor of HEAD(!1357) ----------------- MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1370 Bugzilla: https://bugzilla.redhat.com/2120352 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2099722 Patches 1-9 are changes to selftests Patches 10-31 are reverts of RHEL-only patches to address COR CVE Patches 32-320 are the machine dependent mm changes ported by Rafael Patch 321 reverts the backport of 6692c98c7df5. See below. Patches 322-981 are the machine independent mm changes Patches 982-1016 are David Hildebrand's upstream changes to address the COR CVE RHEL commit `b23c298982` fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA which is a backport of upstream 6692c98c7df5 and is reverted early in this series. 6692c98c7df5 is a fix for upstream 40966e316f86 which was not in RHEL until this series. 6692c98c7df5 is re-added after 40966e316f86. Omitted-fix: 310d1344e3c5 ("Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 465d0eb0dc31 ("Docs/admin-guide/mm/damon/usage: fix the example code snip") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 317314527d17 ("mm/hugetlb: correct demote page offset logic") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 37dcc673d065 ("frontswap: don't call ->init if no ops are registered") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted-fix: 30c19366636f ("mm: fix BUG splat with kvmalloc + GFP_ATOMIC") to be fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2131716 Omitted: fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 009ad9f0c6ee io_uring: drop ctx->uring_lock before acquiring sqd->lock fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: bc369921d670 io-wq: max_worker fixes fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: e139a1ec92f8 io_uring: apply max_workers limit to all future users fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 71c9ce27bb57 io-wq: fix max-workers not correctly set on multi-node system fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 41d3a6bd1d37 io_uring: pin SQPOLL data before unlocking ring lock fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: bad119b9a000 io_uring: honour zeroes as io-wq worker limits fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: 08bdbd39b584 io-wq: ensure that hash wait lock is IRQ disabling fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 713b9825a4c4 io-wq: fix cancellation on create-worker failure fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 3b33e3f4a6c0 io-wq: fix silly logic error in io_task_work_match() fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 71e1cef2d794 io-wq: Remove duplicate code in io_workqueue_create() fixed under https://bugzilla.redhat.com/show_bug.cgi?id=210774 Omitted-fix: a226abcd5d42 io-wq: don't retry task_work creation failure on fatal conditions fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107743 Omitted-fix: fa84693b3c89 io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: dd47c104533d io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items fixed under https://bugzilla.redhat.com/show_bug.cgi?id=2107656 Omitted-fix: 4f0712ccec09 hexagon: Fix function name in die() unsupported arch Omitted-fix: 751971af2e36 csky: Fix function name in csky_alignment() and die() unsupported arch Omitted-fix: dcbc65aac283 ptrace: Remove duplicated include in ptrace.c unsupported arch Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `105d2d4832` Merge DRM changes from upstream v5.16..v5.17 Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `99fc716fc4` Merge DRM changes from upstream v5.17..v5.18 Omitted-fix: eb48d4219879 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `105d2d4832` Merge DRM changes from upstream v5.16..v5.17 Omitted-fix: 751a9d69b197 drm/i915: Fix oops due to missing stack depot fixed in RHEL commit `99fc716fc4` Merge DRM changes from upstream v5.17..v5.18 Omitted-fix: b95dc06af3e6 drm/amdgpu: disable runpm if we are the primary adapter reverted later Omitted-fix: 5a90c24ad028 Revert "drm/amdgpu: disable runpm if we are the primary adapter" revert of above omitted fix Omitted-fix: 724bbe49c5e4 fs/ntfs3: provide block_invalidate_folio to fix memory leak unsupported fs Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com> Approved-by: John W. Linville <linville@redhat.com> Approved-by: Jiri Benc <jbenc@redhat.com> Approved-by: Jarod Wilson <jarod@redhat.com> Approved-by: Prarit Bhargava <prarit@redhat.com> Approved-by: Lyude Paul <lyude@redhat.com> Approved-by: Donald Dutile <ddutile@redhat.com> Approved-by: Rafael Aquini <aquini@redhat.com> Approved-by: Phil Auld <pauld@redhat.com> Approved-by: Waiman Long <longman@redhat.com> Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>	2022-10-23 19:49:41 +02:00
Chris von Recklinghausen	cd6f34840e	mm/fs: delete PF_SWAPWRITE Bugzilla: https://bugzilla.redhat.com/2120352 commit b698f0a1773f7df73f2bb4bfe0e597ea1bb3881f Author: Hugh Dickins <hughd@google.com> Date: Tue Mar 22 14:45:38 2022 -0700 mm/fs: delete PF_SWAPWRITE PF_SWAPWRITE has been redundant since v3.2 commit `ee72886d8e` ("mm: vmscan: do not writeback filesystem pages in direct reclaim"). Coincidentally, NeilBrown's current patch "remove inode_congested()" deletes may_write_to_inode(), which appeared to be the one function which took notice of PF_SWAPWRITE. But if you study the old logic, and the conditions under which may_write_to_inode() was called, you discover that flag and function have been pointless for a decade. Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: NeilBrown <neilb@suse.de> Cc: Jan Kara <jack@suse.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:53 -04:00
Chris von Recklinghausen	ad6766c2ff	remove inode_congested() Conflicts: include/linux/backing-dev.h - We already have dec223c92a46 ("blk-cgroup: move struct blkcg to block/blk-cgroup .h") so keep current declaration of wb_blkcg_offline mm/vmscan.c - We already have c79b7b96db8b ("mm/vmscan: Account large folios correctly") which increments stat->nr_congested by pages Bugzilla: https://bugzilla.redhat.com/2120352 commit fe55d563d4174f13839a9b7ef7309da5031b5d93 Author: NeilBrown <neilb@suse.de> Date: Tue Mar 22 14:39:07 2022 -0700 remove inode_congested() inode_congested() reports if the backing-device for the inode is congested. No bdi reports congestion any more, so this always returns 'false'. So remove inode_congested() and related functions, and remove the call sites, assuming that inode_congested() always returns 'false'. Link: https://lkml.kernel.org/r/164549983741.9187.2174285592262191311.stgit@ noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>	2022-10-12 07:27:49 -04:00
Ming Lei	d3f8e91051	fs-writeback: writeback_sb_inodes：Recalculate 'wrote' according skipped pages Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2118511 commit 68f4c6eba70df70a720188bce95c85570ddfcc87 Author: Zhihao Cheng <chengzhihao1@huawei.com> Date: Tue May 10 21:38:05 2022 +0800 fs-writeback: writeback_sb_inodes：Recalculate 'wrote' according skipped pages Commit `505a666ee3` ("writeback: plug writeback in wb_writeback() and writeback_inodes_wb()") has us holding a plug during wb_writeback, which may cause a potential ABBA dead lock: wb_writeback fat_file_fsync blk_start_plug(&plug) for (;;) { iter i-1: some reqs have been added into plug->mq_list // LOCK A iter i: progress = __writeback_inodes_wb(wb, work) . writeback_sb_inodes // fat's bdev . __writeback_single_inode . . generic_writepages . . __block_write_full_page . . . . __generic_file_fsync . . . . sync_inode_metadata . . . . writeback_single_inode . . . . __writeback_single_inode . . . . fat_write_inode . . . . __fat_write_inode . . . . sync_dirty_buffer // fat's bdev . . . . lock_buffer(bh) // LOCK B . . . . submit_bh . . . . blk_mq_get_tag // LOCK A . . . trylock_buffer(bh) // LOCK B . . . redirty_page_for_writepage . . . wbc->pages_skipped++ . . --wbc->nr_to_write . wrote += write_chunk - wbc.nr_to_write // wrote > 0 . requeue_inode . redirty_tail_locked if (progress) // progress > 0 continue; iter i+1: queue_io // similar process with iter i, infinite for-loop ! } blk_finish_plug(&plug) // flush plug won't be called Above process triggers a hungtask like: [ 399.044861] INFO: task bb:2607 blocked for more than 30 seconds. [ 399.046824] Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty [ 399.051539] task:bb state:D stack: 0 pid: 2607 ppid: 2426 flags:0x00004000 [ 399.051556] Call Trace: [ 399.051570] __schedule+0x480/0x1050 [ 399.051592] schedule+0x92/0x1a0 [ 399.051602] io_schedule+0x22/0x50 [ 399.051613] blk_mq_get_tag+0x1d3/0x3c0 [ 399.051640] __blk_mq_alloc_requests+0x21d/0x3f0 [ 399.051657] blk_mq_submit_bio+0x68d/0xca0 [ 399.051674] __submit_bio+0x1b5/0x2d0 [ 399.051708] submit_bio_noacct+0x34e/0x720 [ 399.051718] submit_bio+0x3b/0x150 [ 399.051725] submit_bh_wbc+0x161/0x230 [ 399.051734] __sync_dirty_buffer+0xd1/0x420 [ 399.051744] sync_dirty_buffer+0x17/0x20 [ 399.051750] __fat_write_inode+0x289/0x310 [ 399.051766] fat_write_inode+0x2a/0xa0 [ 399.051783] __writeback_single_inode+0x53c/0x6f0 [ 399.051795] writeback_single_inode+0x145/0x200 [ 399.051803] sync_inode_metadata+0x45/0x70 [ 399.051856] __generic_file_fsync+0xa3/0x150 [ 399.051880] fat_file_fsync+0x1d/0x80 [ 399.051895] vfs_fsync_range+0x40/0xb0 [ 399.051929] __x64_sys_fsync+0x18/0x30 In my test, 'need_resched()' (which is imported by `590dca3a71` "fs-writeback: unplug before cond_resched in writeback_sb_inodes") in function 'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted from write_cache_pages(). Fix it by correcting wrote number according number of skipped pages in writeback_sb_inodes(). Goto Link to find a reproducer. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837 Cc: stable@vger.kernel.org # v4.3 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2022-10-12 09:20:11 +08:00
Jeffrey Layton	777e571cdd	vfs, fscache: Implement pinning of cache usage for writeback Bugzilla: http://bugzilla.redhat.com/1229736 commit 08276bdae68b022a7726edf7416b6748e3df5395 Author: David Howells <dhowells@redhat.com> Date: Wed Oct 20 23:50:01 2021 +0100 vfs, fscache: Implement pinning of cache usage for writeback Cachefiles has a problem in that it needs to keep the backing file for a cookie open whilst there are local modifications pending that need to be written to it. However, we don't want to keep the file open indefinitely, as that causes EMFILE/ENFILE/ENOMEM problems. Reopening the cache file, however, is a problem if this is being done due to writeback triggered by exit(). Some filesystems will oops if we try to open a file in that context because they want to access current->fs or other resources that have already been dismantled. To get around this, I added the following: (1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem inode to indicate that we have a usage count on the cookie caching that inode. (2) A flag in struct writeback_control, unpinned_fscache_wb, that is set when __writeback_single_inode() clears the last dirty page from i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this flag. This has to be done here so that clearing I_PINNING_FSCACHE_WB can be done atomically with the check of PAGECACHE_TAG_DIRTY that clears I_DIRTY_PAGES. (3) A function, fscache_set_page_dirty(), which if it is not set, sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache resources. (4) A function, fscache_unpin_writeback(), to be called by ->write_inode() to unuse the cookie. (5) A function, fscache_clear_inode_writeback(), to be called when the inode is evicted, before clear_inode() is called. This cleans up any lingering I_PINNING_FSCACHE_WB. The network filesystem can then use these tools to make sure that fscache_write_to_cache() can write locally modified data to the cache as well as to the server. For the future, I'm working on write helpers for netfs lib that should allow this facility to be removed by keeping track of the dirty regions separately - but that's incomplete at the moment and is also going to be affected by folios, one way or another, since it deals with pages Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4 Signed-off-by: Jeffrey Layton <jlayton@redhat.com>	2022-08-22 12:31:34 -04:00
Aristeu Rozanski	80a6b2e8b6	fs/writeback: Convert inode_switch_wbs_work_fn to folios Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083861 Tested: by me with multiple test suites commit 22b3c8d6612e09f5fcecba1009d399aaf7f934f6 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Fri Mar 19 08:58:36 2021 -0400 fs/writeback: Convert inode_switch_wbs_work_fn to folios This gets the statistics correct for large folios by modifying the counters by the number of pages in the folio instead of by 1. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>	2022-07-10 10:44:07 -04:00
Ming Lei	4415be8560	block: check that there is a plug in blk_flush_plug Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917 commit aa8dcccaf32bfdc09f2aff089d5d60c37da5b7b5 Author: Christoph Hellwig <hch@lst.de> Date: Thu Jan 27 08:05:49 2022 +0100 block: check that there is a plug in blk_flush_plug Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes the NULL check instead of open coding that check everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2022-06-22 08:56:20 +08:00
Ming Lei	d5d4963cf5	block: remove blk_needs_flush_plug Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083917 commit b1f866b013e6e5583f2f0bf4a61d13eddb9a1799 Author: Christoph Hellwig <hch@lst.de> Date: Thu Jan 27 08:05:48 2022 +0100 block: remove blk_needs_flush_plug blk_needs_flush_plug fails to account for the cb_list, which needs flushing as well. Remove it and just check if there is a plug instead of poking into the internals of the plug structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2022-06-22 08:56:19 +08:00
Ming Lei	fe4ec6f985	block: cleanup the flush plug helpers Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018403 commit 008f75a20e7072d0840ec323c39b42206f3fa8a0 Author: Christoph Hellwig <hch@lst.de> Date: Wed Oct 20 16:41:19 2021 +0200 block: cleanup the flush plug helpers Consolidate the various helpers into a single blk_flush_plug helper that takes a plk_plug and the from_scheduler bool and switch all callsites to call it directly. Checks that the plug is non-NULL must be performed by the caller, something that most already do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>	2021-12-06 16:44:50 +08:00
Rafael Aquini	aec73574ca	writeback: memcg: simplify cgroup_writeback_by_id Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396 This patch is a backport of the following upstream commit: commit 7490a2d248145d8694e1e9828801b496250fd697 Author: Shakeel Butt <shakeelb@google.com> Date: Thu Sep 2 14:53:27 2021 -0700 writeback: memcg: simplify cgroup_writeback_by_id Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty pages for a memcg. However mem_cgroup_wb_stats() does a lot more than just get the number of dirty pages. Just directly get the number of dirty pages instead of calling mem_cgroup_wb_stats(). Also cgroup_writeback_by_id() is only called for best-effort dirty flushing, so remove the unused 'nr' parameter and no need to explicitly flush memcg stats. Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com>	2021-11-29 11:40:50 -05:00
Rafael Aquini	6d11f2d123	writeback: reliably update bandwidth estimation Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396 This patch is a backport of the following upstream commit: commit fee468fdf41cdf36ba6b5a780e2474d0a3e066ac Author: Jan Kara <jack@suse.cz> Date: Thu Sep 2 14:53:06 2021 -0700 writeback: reliably update bandwidth estimation Currently we trigger writeback bandwidth estimation from balance_dirty_pages() and from wb_writeback(). However neither of these need to trigger when the system is relatively idle and writeback is triggered e.g. from fsync(2). Make sure writeback estimates happen reliably by triggering them from do_writepages(). Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com>	2021-11-29 11:40:47 -05:00
Rafael Aquini	ca7eeab6b1	writeback: track number of inodes under writeback Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396 This patch is a backport of the following upstream commit: commit 633a2abb9e1cd5c95f3b600f4b2c12cce22ae4a0 Author: Jan Kara <jack@suse.cz> Date: Thu Sep 2 14:53:04 2021 -0700 writeback: track number of inodes under writeback Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Michael Stapelberg <stapelberg+linux@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com>	2021-11-29 11:40:46 -05:00
Roman Gushchin	593311e85b	writeback, cgroup: do not reparent dax inodes The inode switching code is not suited for dax inodes. An attempt to switch a dax inode to a parent writeback structure (as a part of a writeback cleanup procedure) results in a panic like this: run fstests generic/270 at 2021-07-15 05:54:02 XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use. Use at your own risk! XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk! XFS (pmem0p2): Mounting V5 Filesystem XFS (pmem0p2): Ending clean mount XFS (pmem0p2): Quotacheck needed: Please wait. XFS (pmem0p2): Quotacheck: Done. XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks) XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks) XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks) BUG: unable to handle page fault for address: 0000000005b0f669 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7442e+ #8 Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016 Workqueue: inode_switch_wbs inode_switch_wbs_work_fn RIP: 0010:inode_do_switch_wbs+0xaf/0x470 Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85 RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002 RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0 RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228 R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130 R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0 FS: 0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0 Call Trace: inode_switch_wbs_work_fn+0xb6/0x2a0 process_one_work+0x1e6/0x380 worker_thread+0x53/0x3d0 kthread+0x10f/0x130 ret_from_fork+0x22/0x30 Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod CR2: 0000000005b0f669 ---[ end trace ed2105faff8384f3 ]--- RIP: 0010:inode_do_switch_wbs+0xaf/0x470 Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85 RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002 RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0 RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228 R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130 R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0 FS: 0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- The crash happens on an attempt to iterate over attached pagecache pages and check the dirty flag: a dax inode's xarray contains pfn's instead of generic struct page pointers. This happens for DAX and not for other kinds of non-page entries in the inodes because it's a tagged iteration, and shadow/swap entries are never tagged; only DAX entries get tagged. Fix the problem by bailing out (with the false return value) of inode_prepare_sbs_switch() if a dax inode is passed. [willy@infradead.org: changelog addition] Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com Fixes: `c22d70a162` ("writeback, cgroup: release dying cgwbs by switching attached inodes") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Reported-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Murphy Zhou <jencce.kernel@gmail.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-23 17:43:28 -07:00
Linus Torvalds	911a2997a5	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2 Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru 62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu +u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw== =tn0h -----END PGP SIGNATURE----- Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc fs updates from Jan Kara: "The new quotactl_fd() syscall (remake of quotactl_path() syscall that got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs, isofs, and writeback fixes and cleanups" * tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: fix obtain a reference to a freeing memcg css quota: remove unnecessary oom message isofs: remove redundant continue statement quota: Wire up quotactl_fd syscall quota: Change quotactl_path() systcall to an fd-based one reiserfs: Remove unneed check in reiserfs_write_full_page() udf: Fix NULL pointer dereference in udf_symlink function reiserfs: add check for invalid 1st journal block	2021-07-01 12:06:39 -07:00
Linus Torvalds	df668a5fe4	for-5.14/block-2021-06-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+ as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B t8D2XcgyOA== =lhon -----END PGP SIGNATURE----- Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - disk events cleanup (Christoph) - gendisk and request queue allocation simplifications (Christoph) - bdev_disk_changed cleanups (Christoph) - IO priority improvements (Bart) - Chained bio completion trace fix (Edward) - blk-wbt fixes (Jan) - blk-wbt enable/disable fix (Zhang) - Scheduler dispatch improvements (Jan, Ming) - Shared tagset scheduler improvements (John) - BFQ updates (Paolo, Luca, Pietro) - BFQ lock inversion fix (Jan) - Documentation improvements (Kir) - CLONE_IO block cgroup fix (Tejun) - Remove of ancient and deprecated block dump feature (zhangyi) - Discard merge fix (Ming) - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas, Yang) * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits) block: fix discard request merge block/mq-deadline: Remove a WARN_ON_ONCE() call blk-mq: update hctx->dispatch_busy in case of real scheduler blk: Fix lock inversion between ioc lock and bfqd lock bfq: Remove merged request already in bfq_requests_merged() block: pass a gendisk to bdev_disk_changed block: move bdev_disk_changed block: add the events* attributes to disk_attrs block: move the disk events code to a separate file block: fix trace completion for chained bio block/partitions/msdos: Fix typo inidicator -> indicator block, bfq: reset waker pointer with shared queues block, bfq: check waker only for queues with no in-flight I/O block, bfq: avoid delayed merge of async queues block, bfq: boost throughput by extending queue-merging times block, bfq: consider also creation time in delayed stable merge block, bfq: fix delayed stable merge check block, bfq: let also stably merged queues enjoy weight raising blk-wbt: make sure throttle is enabled properly blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() ...	2021-06-30 12:12:56 -07:00
Roman Gushchin	c22d70a162	writeback, cgroup: release dying cgwbs by switching attached inodes Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of per-cgroup writeback structures themselves and of pinned memory and block cgroups, which are significantly larger structures (mostly due to large per-cpu statistics data). This prevents memory waste and helps to avoid different scalability problems caused by large piles of dying cgroups. Reuse the existing mechanism of inode switching used for foreign inode detection. To speed things up batch up to 115 inode switching in a single operation (the maximum number is selected so that the resulting struct inode_switch_wbs_context can fit into 1024 bytes). Because every switching consists of two steps divided by an RCU grace period, it would be too slow without batching. Please note that the whole batch counts as a single operation (when increasing/decreasing isw_nr_in_flight). This allows to keep umounting working (flush the switching queue), however prevents cleanups from consuming the whole switching quota and effectively blocking the frn switching. A cgwb cleanup operation can fail due to different reasons (e.g. not enough memory, the cgwb has an in-flight/pending io, an attached inode in a wrong state, etc). In this case the next scheduled cleanup will make a new attempt. An attempt is made each time a new cgwb is offlined (in other words a memcg and/or a blkcg is deleted by a user). In the future an additional attempt scheduled by a timer can be implemented. [guro@fb.com: replace open-coded "115" with arithmetic] Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com [guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()] Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com [willy@infradead.org: fix documentation] Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:48 -07:00
Roman Gushchin	f5fbe6b7ad	writeback, cgroup: support switching multiple inodes at once Currently only a single inode can be switched to another writeback structure at once. That means to switch an inode a separate inode_switch_wbs_context structure must be allocated, and a separate rcu callback and work must be scheduled. It's fine for the existing ad-hoc switching, which is not happening that often, but sub-optimal for massive switching required in order to release a writeback structure. To prepare for it, let's add a support for switching multiple inodes at once. Instead of containing a single inode pointer, inode_switch_wbs_context will contain a NULL-terminated array of inode pointers. inode_do_switch_wbs() will be called for each inode. To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's list_locks will be acquired and released only once altogether for all inodes. wb_wakeup() will be also be called only once. Instead of calling wb_put(old_wb) after each successful switch, wb_put_many() is introduced and used. Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:48 -07:00
Roman Gushchin	72d4512e9c	writeback, cgroup: split out the functional part of inode_switch_wbs_work_fn() Split out the functional part of the inode_switch_wbs_work_fn() function as inode_do switch_wbs() to reuse it later for switching inodes attached to dying cgwbs. This commit doesn't bring any functional changes. Link: https://lkml.kernel.org/r/20210608230225.2078447-7-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:48 -07:00
Roman Gushchin	f3b6a6df38	writeback, cgroup: keep list of inodes attached to bdi_writeback Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive. While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure. This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently. Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:48 -07:00
Roman Gushchin	29264d92a0	writeback, cgroup: switch to rcu_work API in inode_switch_wbs() Inode's wb switching requires two steps divided by an RCU grace period. It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(), which schedules inode_switch_wbs_work_fn() as a work. Switching to the rcu_work API allows to do the same in a cleaner and slightly shorter form. Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:48 -07:00
Roman Gushchin	8826ee4fe7	writeback, cgroup: increment isw_nr_in_flight before grabbing an inode isw_nr_in_flight is used to determine whether the inode switch queue should be flushed from the umount path. Currently it's increased after grabbing an inode and even scheduling the switch work. It means the umount path can walk past cleanup_offline_cgwb() with active inode references, which can result in a "Busy inodes after unmount." message and use-after-free issues (with inode->i_sb which gets freed). Fix it by incrementing isw_nr_in_flight before doing anything with the inode and decrementing in the case when switching wasn't scheduled. The problem hasn't yet been seen in the real life and was discovered by Jan Kara by looking into the code. Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:47 -07:00
Roman Gushchin	592fa00218	writeback, cgroup: add smp_mb() to cgroup_writeback_umount() A full memory barrier is required between clearing SB_ACTIVE flag in generic_shutdown_super() and checking isw_nr_in_flight in cgroup_writeback_umount(), otherwise a new switch operation might be scheduled after atomic_read(&isw_nr_in_flight) returned 0. This would result in a non-flushed isw_wq, and a potential crash. The problem hasn't yet been seen in the real life and was discovered by Jan Kara by looking into the code. Link: https://lkml.kernel.org/r/20210608230225.2078447-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:47 -07:00
Roman Gushchin	4ade5867b4	writeback, cgroup: do not switch inodes with I_WILL_FREE flag Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9. When an inode is getting dirty for the first time it's associated with a wb structure (see __inode_attach_wb()). It can later be switched to another wb (if e.g. some other cgroup is writing a lot of data to the same inode), but otherwise stays attached to the original wb until being reclaimed. The problem is that the wb structure holds a reference to the original memory and blkcg cgroups. So if an inode has been dirty once and later is actively used in read-only mode, it has a good chance to pin down the original memory and blkcg cgroups forever. This is often the case with services bringing data for other services, e.g. updating some rpm packages. In the real life it becomes a problem due to a large size of the memcg structure, which can easily be 1000x larger than an inode. Also a really large number of dying cgroups can raise different scalability issues, e.g. making the memory reclaim costly and less effective. To solve the problem inodes should be eventually detached from the corresponding writeback structure. It's inefficient to do it after every writeback completion. Instead it can be done whenever the original memory cgroup is offlined and writeback structure is getting killed. Scanning over a (potentially long) list of inodes and detach them from the writeback structure can take quite some time. To avoid scanning all inodes, attached inodes are kept on a new list (b_attached). To make it less noticeable to a user, the scanning and switching is performed from a work context. Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their ideas and contribution to this patchset. This patch (of 8): If an inode's state has I_WILL_FREE flag set, the inode will be freed soon, so there is no point in trying to switch the inode to a different cgwb. I_WILL_FREE was ignored since the introduction of the inode switching, so it looks like it doesn't lead to any noticeable issues for a user. This is why the patch is not intended for a stable backport. Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:47 -07:00
Muchun Song	8b0ed8443a	writeback: fix obtain a reference to a freeing memcg css The caller of wb_get_create() should pin the memcg, because wb_get_create() relies on this guarantee. The rcu read lock only can guarantee that the memcg css returned by css_from_id() cannot be released, but the reference of the memcg can be zero. rcu_read_lock() memcg_css = css_from_id() wb_get_create(memcg_css) cgwb_create(memcg_css) // css_get can change the ref counter from 0 back to 1 css_get(memcg_css) rcu_read_unlock() Fix it by holding a reference to the css before calling wb_get_create(). This is not a problem I encountered in the real world. Just the result of a code review. Fixes: `682aa8e1a6` ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Link: https://lore.kernel.org/r/20210402091145.80635-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>	2021-06-28 18:35:02 +02:00
zhangyi (F)	12e0613715	block_dump: remove block_dump feature in mark_inode_dirty() block_dump is an old debugging interface, one of it's functions is used to print the information about who write which file on disk. If we enable block_dump through /proc/sys/vm/block_dump and turn on debug log level, we can gather information about write process name, target file name and disk from kernel message. This feature is realized in block_dump___mark_inode_dirty(), it print above information into kernel message directly when marking inode dirty, so it is noisy and can easily trigger log storm. At the same time, get the dentry refcount is also not safe, we found it will lead to deadlock on ext4 file system with data=journal mode. After tracepoints has been introduced into the kernel, we got a tracepoint in __mark_inode_dirty(), which is a better replacement of block_dump___mark_inode_dirty(). The only downside is that it only trace the inode number and not a file name, but it probably doesn't matter because the original printed file name in block_dump is not accurate in some cases, and we can still find it through the inode number and device id. So this patch delete the dirting inode part of block_dump feature. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210313030146.2882027-2-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-05-24 06:47:21 -06:00
Eric Biggers	da0c4c60d8	fs: improve comments for writeback_single_inode() Some comments for writeback_single_inode() and __writeback_single_inode() are outdated or not very helpful, especially with regards to writeback list handling. Update them. Link: https://lore.kernel.org/r/20210112190253.64307-10-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:50 +01:00
Eric Biggers	83dc881d67	fs: drop redundant check from __writeback_single_inode() wbc->for_sync implies wbc->sync_mode == WB_SYNC_ALL, so there's no need to check for both. Just check for WB_SYNC_ALL. Link: https://lore.kernel.org/r/20210112190253.64307-9-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:45 +01:00
Eric Biggers	35d14f278e	fs: clean up __mark_inode_dirty() a bit Improve some comments, and don't bother checking for the I_DIRTY_TIME flag in the case where we just cleared it. Also, warn if I_DIRTY_TIME and I_DIRTY_PAGES are passed to __mark_inode_dirty() at the same time, as this case isn't handled. Link: https://lore.kernel.org/r/20210112190253.64307-8-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:42 +01:00
Eric Biggers	a38ed483a7	fs: pass only I_DIRTY_INODE flags to ->dirty_inode ->dirty_inode is now only called when I_DIRTY_INODE (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) is set. However it may still be passed other dirty flags at the same time, provided that these other flags happened to be passed to __mark_inode_dirty() at the same time as I_DIRTY_INODE. This doesn't make sense because there is no reason for filesystems to care about these extra flags. Nor are filesystems notified about all updates to these other flags. Therefore, mask the flags before passing them to ->dirty_inode. Also properly document ->dirty_inode in vfs.rst. Link: https://lore.kernel.org/r/20210112190253.64307-7-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:35 +01:00
Eric Biggers	e2728c5621	fs: don't call ->dirty_inode for lazytime timestamp updates There is no need to call ->dirty_inode for lazytime timestamp updates (i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of lazytime, filesystems must ignore these updates. Filesystems only need to care about the updated timestamps when they expire. Therefore, only call ->dirty_inode when I_DIRTY_INODE is set. Based on a patch from Christoph Hellwig: https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:33 +01:00
Eric Biggers	1e249cb5b7	fs: fix lazytime expiration handling in __writeback_single_inode() When lazytime is enabled and an inode is being written due to its in-memory updated timestamps having expired, either due to a sync() or syncfs() system call or due to dirtytime_expire_interval having elapsed, the VFS needs to inform the filesystem so that the filesystem can copy the inode's timestamps out to the on-disk data structures. This is done by __writeback_single_inode() calling mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC). However, this occurs after __writeback_single_inode() has already cleared the dirty flags from ->i_state. This causes two bugs: - mark_inode_dirty_sync() redirties the inode, causing it to remain dirty. This wastefully causes the inode to be written twice. But more importantly, it breaks cases where sync_filesystem() is expected to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (as reported at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well as possibly filesystem freezing (freeze_super()). - Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is called from __writeback_single_inode() for lazytime expiration, xfs_fs_dirty_inode() ignores the notification. (XFS only cares about lazytime expirations, and it assumes that i_state will contain I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS. Fix this by moving the call to mark_inode_dirty_sync() to earlier in __writeback_single_inode(), before the dirty flags are cleared from i_state. This makes filesystems be properly notified of the timestamp expiration, and it avoids incorrectly redirtying the inode. This fixes xfstest generic/580 (which tests FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime enabled. It also fixes the new lazytime xfstest I've proposed, which reproduces the above-mentioned XFS bug (https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org). Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the right thing to do because mark_inode_dirty_sync() now knows not to move the inode to a writeback list if it is currently queued for sync. Fixes: `0ae45f63d4` ("vfs: add support for a lazytime mount option") Cc: stable@vger.kernel.org Depends-on: `5afced3bf2` ("writeback: Avoid skipping inode writeback") Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-01-13 17:26:21 +01:00
Christoph Hellwig	f738717033	writeback: don't warn on an unregistered BDI in __mark_inode_dirty BDIs get unregistered during device removal, and this WARN can be trivially triggered by hot-removing a NVMe device while running fsx It is otherwise harmless as we still hold a BDI reference, and the writeback has been shut down already. Link: https://lore.kernel.org/r/20200928122613.434820-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-12-16 11:56:02 +01:00
Linus Torvalds	3ad11d7ac8	block-5.10-2020-10-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm ZwfpDh2+Tg== =LzyE -----END PGP SIGNATURE----- Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...	2020-10-13 12:12:44 -07:00
Christoph Hellwig	f56753ac2a	bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-24 13:43:39 -06:00
Tobias Klauser	9ca48e20ec	fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype Commit `32927393dc` ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the definition of dirtytime_interval_handler to match its prototype in linux/writeback.h which fixes the following sparse error/warning: fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces) fs/fs-writeback.c:2189:50: expected void * fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)): fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) fs/fs-writeback.c: note: in included file: ./include/linux/writeback.h:374:5: note: previously declared as: ./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) Fixes: `32927393dc` ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-09-19 13:13:39 -07:00
Jan Kara	5fcd57505c	writeback: Drop I_DIRTY_TIME_EXPIRE The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:46 +02:00
Jan Kara	f9cae926f3	writeback: Fix sync livelock due to b_dirty_time processing When we are processing writeback for sync(2), move_expired_inodes() didn't set any inode expiry value (older_than_this). This can result in writeback never completing if there's steady stream of inodes added to b_dirty_time list as writeback rechecks dirty lists after each writeback round whether there's more work to be done. Fix the problem by using sync(2) start time is inode expiry value when processing b_dirty_time list similarly as for ordinarily dirtied inodes. This requires some refactoring of older_than_this handling which simplifies the code noticeably as a bonus. Fixes: `0ae45f63d4` ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:45 +02:00

1 2 3 4 5 ...

423 Commits