linux-kernelorg-stable

Commit Graph

Author	SHA1	Message	Date
Ming Lei	046be7e596	blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion blk_execute_rq() with polling is used in kernel code paths such as NVMe controller connect. The aggressive spinning in blk_hctx_poll() can prevent the completion task from getting a chance to run, causing a lockup. The spinning with cpu_relax() doesn't yield CPU, so need_resched() only becomes true on timer tick. This causes unnecessary spinning while the completion task is already waiting to run. Before commit `f22ecf9c14`, the loop would exit early because task_is_running() was always true. After that commit removed the check, the loop now spins until need_resched(). Fix this by using BLK_POLL_ONESHOT in blk_rq_poll_completion(). This causes blk_hctx_poll() to poll once and return immediately, letting the outer loop's cond_resched() yield CPU so the completion task can run. Fixes: `f22ecf9c14` ("blk-mq: delete task running check in blk_hctx_poll()") Cc: Diangang Li <lidiangang@bytedance.com> Cc: Fengnan Chang <changfengnan@bytedance.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 07:24:17 -07:00
Bart Van Assche	07a1bc5c14	block: Fix an error path in disk_update_zone_resources() Any queue_limits_start_update() call must be followed either by a queue_limits_commit_update() call or by a queue_limits_cancel_update() call. Make sure that the error path near the start of disk_update_zone_resources() follows this requirement. Remove the "goto unfreeze" statement from that error path to make the code easier to verify. This was detected by annotating the queue_limits_*() calls with Clang thread-safety attributes and by building the kernel with thread-safety checking enabled. Without this patch and with thread-safety checking enabled, the following error is reported: block/blk-zoned.c:2020:1: error: mutex 'disk->queue->limits_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis] 2020 \| } \| ^ block/blk-zoned.c:1959:8: note: mutex acquired here 1959 \| lim = queue_limits_start_update(q); \| ^ Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: `bba4322e3f` ("block: freeze queue when updating zone resources") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260114192803.4171847-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-19 11:04:06 -07:00
Caleb Sander Mateos	ca22c566b8	block: zero non-PI portion of auto integrity buffer The auto-generated integrity buffer for writes needs to be fully initialized before being passed to the underlying block device, otherwise the uninitialized memory can be read back by userspace or anyone with physical access to the storage device. If protection information is generated, that portion of the integrity buffer is already initialized. The integrity data is also zeroed if PI generation is disabled via sysfs or the PI tuple size is 0. However, this misses the case where PI is generated and the PI tuple size is nonzero, but the metadata size is larger than the PI tuple. In this case, the remainder ("opaque") of the metadata is left uninitialized. Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the metadata is larger than just the PI tuple. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `c546d6f438` ("block: only zero non-PI metadata tuples in bio_integrity_prep") Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-10 10:22:07 -07:00
Mikulas Patocka	9670db22e7	blk-mq: avoid stall during boot due to synchronize_rcu_expedited On the kernel 6.19-rc, I am experiencing 15-second boot stall in a virtual machine when probing a virtio-scsi disk: [ 1.011641] SCSI subsystem initialized [ 1.013972] virtio_scsi virtio6: 16/0/0 default/read/poll queues [ 1.015983] scsi host0: Virtio SCSI HBA [ 1.019578] ACPI: \_SB_.GSIA: Enabled at IRQ 16 [ 1.020225] ahci 0000:00:1f.2: AHCI vers 0001.0000, 32 command slots, 1.5 Gbps, SATA mode [ 1.020228] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f) [ 1.020230] ahci 0000:00:1f.2: flags: 64bit ncq only [ 1.024688] scsi host1: ahci [ 1.025432] scsi host2: ahci [ 1.025966] scsi host3: ahci [ 1.026511] scsi host4: ahci [ 1.028371] scsi host5: ahci [ 1.028918] scsi host6: ahci [ 1.029266] ata1: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23100 irq 16 lpm-pol 1 [ 1.029305] ata2: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23180 irq 16 lpm-pol 1 [ 1.029316] ata3: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23200 irq 16 lpm-pol 1 [ 1.029327] ata4: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23280 irq 16 lpm-pol 1 [ 1.029341] ata5: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23300 irq 16 lpm-pol 1 [ 1.029356] ata6: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23380 irq 16 lpm-pol 1 [ 1.118111] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 1.348916] ata1: SATA link down (SStatus 0 SControl 300) [ 1.350713] ata2: SATA link down (SStatus 0 SControl 300) [ 1.351025] ata6: SATA link down (SStatus 0 SControl 300) [ 1.351160] ata5: SATA link down (SStatus 0 SControl 300) [ 1.351326] ata3: SATA link down (SStatus 0 SControl 300) [ 1.351536] ata4: SATA link down (SStatus 0 SControl 300) [ 1.449153] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2 [ 16.483477] sd 0:0:0:0: Power-on or device reset occurred [ 16.483691] sd 0:0:0:0: [sda] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB) [ 16.483762] sd 0:0:0:0: [sda] Write Protect is off [ 16.483877] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.569225] sd 0:0:0:0: [sda] Attached SCSI disk I bisected it and it is caused by the commit `89e1fb7cef` which introduces calls to synchronize_rcu_expedited. This commit replaces synchronize_rcu_expedited and kfree with a call to kfree_rcu_mightsleep, avoiding the 15-second delay. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `89e1fb7cef` ("blk-mq: fix potential uaf for 'queue_hw_ctx'") Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-07 08:10:42 -07:00
Caleb Sander Mateos	6acd4ac5f8	block: don't merge bios with different app_tags nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload of the struct request's first bio. This assumes all the request's bios have the same app_tag. However, it is possible for bios with different app_tag values to be merged into a single request. Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of bios/requests with different app_tag values if BIP_CHECK_APPTAG is set. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `3d8b5a22d4` ("block: add support to pass user meta buffer") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-06 19:10:08 -07:00
Breno Leitao	7d121d701d	blk-rq-qos: Remove unlikely() hints from QoS checks The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are counterproductive. Writeback throttling (WBT) might be enabled by default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'. Branch profiling on Meta servers, which have WBT enabled, confirms 100% misprediction rates on these checks. Remove the unlikely() annotations to let the CPU's branch predictor learn the actual behavior, potentially improving I/O path performance. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-06 19:08:23 -07:00
Julia Lawall	69153e8b97	block, bfq: update outdated comment The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle() in commit `277a4a9b56` ("block, bfq: give a better name to bfq_bfqq_may_idle"). Update the comment accordingly. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-01 08:57:37 -07:00
Cong Zhang	10845a105b	blk-mq: skip CPU offline notify on unmapped hctx If an hctx has no software ctx mapped, blk_mq_map_swqueue() never allocates tags and leaves hctx->tags NULL. The CPU hotplug offline notifier can still run for that hctx, return early since hctx cannot hold any requests. Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Fixes: `bf0beec060` ("blk-mq: drain I/O when all CPUs in a hctx are offline") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-30 09:02:22 -07:00
shechenglong	04bdb1a04d	block,bfq: fix aux stat accumulation destination Route bfqg_stats_add_aux() time accumulation into the destination stats object instead of the source, aligning with other stat fields. Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: shechenglong <shechenglong@xfusion.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-28 09:03:51 -07:00
Caleb Sander Mateos	af65faf34f	block: validate interval_exp integrity limit Various code assumes that the integrity interval is at least 1 sector and evenly divides the logical block size. Add these checks to blk_validate_integrity_limits(). This guards against block drivers that report invalid interval_exp values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-18 09:51:49 -07:00
Caleb Sander Mateos	ccb8a3c08a	block: validate pi_offset integrity limit The PI tuple must be contained within the metadata value, so validate that pi_offset + pi_tuple_size <= metadata_size. This guards against block drivers that report invalid pi_offset values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-18 09:51:49 -07:00
Deepanshu Kartikey	a58383fa45	block: add allocation size check in blkdev_pr_read_keys() blkdev_pr_read_keys() takes num_keys from userspace and uses it to calculate the allocation size for keys_info via struct_size(). While there is a check for SIZE_MAX (integer overflow), there is no upper bound validation on the allocation size itself. A malicious or buggy userspace can pass a large num_keys value that doesn't trigger overflow but still results in an excessive allocation attempt, causing a warning in the page allocator when the order exceeds MAX_PAGE_ORDER. Fix this by introducing PR_KEYS_MAX to limit the number of keys to a sane value. This makes the SIZE_MAX check redundant, so remove it. Also switch to kvzalloc/kvfree to handle larger allocations gracefully. Fixes: `22a1ffea5f` ("block: add IOC_PR_READ_KEYS ioctl") Tested-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Reported-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=660d079d90f8a1baf54d Link: https://lore.kernel.org/all/20251212013510.3576091-1-kartikey406@gmail.com/T/ [v1] Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-17 07:35:22 -07:00
Ming Lei	9869d3a6fe	block: fix race between wbt_enable_default and IO submission When wbt_enable_default() is moved out of queue freezing in elevator_change(), it can cause the wbt inflight counter to become negative (-1), leading to hung tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter is in an inconsistent state. The issue occurs because wbt_enable_default() could race with IO submission, allowing the counter to be decremented before proper initialization. This manifests as: rq_wait[0]: inflight: -1 has_waiters: True rwb_enabled() checks the state, which can be updated exactly between wbt_wait() (rq_qos_throttle()) and wbt_track()(rq_qos_track()), then the inflight counter will become negative. And results in hung task warnings like: task:kworker/u24:39 state:D stack:0 pid:14767 Call Trace: rq_qos_wait+0xb4/0x150 wbt_wait+0xa9/0x100 __rq_qos_throttle+0x24/0x40 blk_mq_submit_bio+0x672/0x7b0 ... Fix this by: 1. Splitting wbt_enable_default() into: - __wbt_enable_default(): Returns true if wbt_init() should be called - wbt_enable_default(): Wrapper for existing callers (no init) - wbt_init_enable_default(): New function that checks and inits WBT 2. Using wbt_init_enable_default() in blk_register_queue() to ensure proper initialization during queue registration 3. Move wbt_init() out of wbt_enable_default() which is only for enabling disabled wbt from bfq and iocost, and wbt_init() isn't needed. Then the original lock warning can be avoided. 4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling code since it's no longer needed This ensures WBT is properly initialized before any IO can be submitted, preventing the counter from going negative. Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai@fnnas.com> Cc: Guangwu Zhang <guazhang@redhat.com> Fixes: `78c271344b` ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-12 12:51:11 -07:00
Keith Busch	a0750fae73	blk-mq-dma: always initialize dma state Ensure the dma state is initialized when we're not using the contiguous iova, otherwise the caller may be using a stale state from a previous request that could use the coalesed iova allocation. Fixes: `2f6b2565d4` ("block: accumulate memory segment gaps per bio") Reported-by: Sebastian Ott <sebott@redhat.com> Tested-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-10 13:41:11 -07:00
Fengnan Chang	f22ecf9c14	blk-mq: delete task running check in blk_hctx_poll() blk_hctx_poll() always checks if the task is running or not, and returns 1 if the task is running. This is a leftover from when polled IO was purely for synchronous IO, and doesn't make sense anymore when polled IO is purely asynchronous. Similarly, marking the task as TASK_RUNNING is also superflous, as the very much has to be running to enter the function in the first place. It looks like there has been this judgment for historical reasons, and in very early versions of this function the user would set the process state to TASK_UNINTERRUPTIBLE. Signed-off-by: Diangang Li <lidiangang@bytedance.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> [axboe: kill all remnants of task running, pointless now. massage message] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-10 02:40:52 -07:00
Johannes Thumshirn	2c38ec934d	block: fix cached zone reports on devices with native zone append When mounting a btrfs file system on virtio-blk which supports native Zone Append there has been a WARN triggering in btrfs' space management code. Further looking into btrfs' zoned statistics uncovered the filesystem expecting the zones to be used, but the write pointers being 0: # cat /sys/fs/btrfs/8eabd2e7-3294-4f9e-9b58-7e64135c8bf4/zoned_stats active block-groups: 4 reclaimable: 0 unused: 0 need reclaim: false data relocation block-group: 1342177280 active zones: start: 1073741824, wp: 0 used: 0, reserved: 0, unusable: 0 start: 1342177280, wp: 0 used: 0, reserved: 0, unusable: 0 start: 1610612736, wp: 0 used: 16384, reserved: 0, unusable: 18446744073709535232 start: 1879048192, wp: 0 used: 131072, reserved: 0, unusable: 18446744073709420544 Looking at the blkzone report output for the zone in question (1610612736) the write pointer on the device moved, but the filesystem did not see a change on the write pointer: # blkzone report -c 1 -o 0x300000 /dev/vda start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000040 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)] The zone write pointer is 0, because btrfs is using the cached version of blkdev_report_zones() and as virtio-blk is supporting native zone append, but blkdev_revalidate_zones() does not initialize the zone write plugs in this case. Not skipping the revalidate of sequential zones in blkdev_revalidate_zones() callchain fixes this issue. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Fixes: `a6aa36e957` ("block: Remove zone write plugs when handling native zone append writes") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-09 22:35:54 -07:00
Mohamed Khalfella	59e25ef2b4	block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock blk_mq_{add,del}_queue_tag_set() functions add and remove queues from tagset, the functions make sure that tagset and queues are marked as shared when two or more queues are attached to the same tagset. Initially a tagset starts as unshared and when the number of added queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along with all the queues attached to it. When the number of attached queues drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and the remaining queues as unshared. Both functions need to freeze current queues in tagset before setting on unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions hold set->tag_list_lock mutex, which makes sense as we do not want queues to be added or deleted in the process. This used to work fine until commit `98d81f0df7` ("nvme: use blk_mq_[un]quiesce_tagset") made the nvme driver quiesce tagset instead of quiscing individual queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in set->tag_list while holding set->tag_list_lock also. This results in deadlock between two threads with these stacktraces: __schedule+0x47c/0xbb0 ? timerqueue_add+0x66/0xb0 schedule+0x1c/0xa0 schedule_preempt_disabled+0xa/0x10 __mutex_lock.constprop.0+0x271/0x600 blk_mq_quiesce_tagset+0x25/0xc0 nvme_dev_disable+0x9c/0x250 nvme_timeout+0x1fc/0x520 blk_mq_handle_expired+0x5c/0x90 bt_iter+0x7e/0x90 blk_mq_queue_tag_busy_iter+0x27e/0x550 ? __blk_mq_complete_request_remote+0x10/0x10 ? __blk_mq_complete_request_remote+0x10/0x10 ? __call_rcu_common.constprop.0+0x1c0/0x210 blk_mq_timeout_work+0x12d/0x170 process_one_work+0x12e/0x2d0 worker_thread+0x288/0x3a0 ? rescuer_thread+0x480/0x480 kthread+0xb8/0xe0 ? kthread_park+0x80/0x80 ret_from_fork+0x2d/0x50 ? kthread_park+0x80/0x80 ret_from_fork_asm+0x11/0x20 __schedule+0x47c/0xbb0 ? xas_find+0x161/0x1a0 schedule+0x1c/0xa0 blk_mq_freeze_queue_wait+0x3d/0x70 ? destroy_sched_domains_rcu+0x30/0x30 blk_mq_update_tag_set_shared+0x44/0x80 blk_mq_exit_queue+0x141/0x150 del_gendisk+0x25a/0x2d0 nvme_ns_remove+0xc9/0x170 nvme_remove_namespaces+0xc7/0x100 nvme_remove+0x62/0x150 pci_device_remove+0x23/0x60 device_release_driver_internal+0x159/0x200 unbind_store+0x99/0xa0 kernfs_fop_write_iter+0x112/0x1e0 vfs_write+0x2b1/0x3d0 ksys_write+0x4e/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 The top stacktrace is showing nvme_timeout() called to handle nvme command timeout. timeout handler is trying to disable the controller and as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not to call queue callback handlers. The thread is stuck waiting for set->tag_list_lock as it tries to walk the queues in set->tag_list. The lock is held by the second thread in the bottom stack which is waiting for one of queues to be frozen. The queue usage counter will drop to zero after nvme_timeout() finishes, and this will not happen because the thread will wait for this mutex forever. Given that [un]quiescing queue is an operation that does not need to sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking set->tag_list_lock, update blk_mq_{add,del}_queue_tag_set() to use RCU safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list) in blk_mq_del_queue_tag_set() because we can not re-initialize it while the list is being traversed under RCU. The deleted queue will not be added/deleted to/from a tagset and it will be freed in blk_free_queue() after the end of RCU grace period. Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Fixes: `98d81f0df7` ("nvme: use blk_mq_[un]quiesce_tagset") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-09 10:24:55 -07:00
Shida Zhang	cfdeb588ae	block: prohibit calls to bio_chain_endio Now that all potential callers of bio_chain_endio have been eliminated, completely prohibit any future calls to this function. Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Andreas Gruenbacher <agruenba@redhat.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-09 10:20:03 -07:00
Shaurya Rane	f7e3f852a4	block: fix memory leak in __blkdev_issue_zero_pages Move the fatal signal check before bio_alloc() to prevent a memory leak when BLKDEV_ZERO_KILLABLE is set and a fatal signal is pending. Previously, the bio was allocated before checking for a fatal signal. If a signal was pending, the code would break out of the loop without freeing or chaining the just-allocated bio, causing a memory leak. This matches the pattern already used in __blkdev_issue_write_zeroes() where the signal check precedes the allocation. Fixes: `bf86bcdb40` ("blk-lib: check for kill signal in ioctl BLKZEROOUT") Reported-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=527a7e48a3d3d315d862 Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: syzbot+527a7e48a3d3d315d862@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:43:28 -07:00
Damien Le Moal	552c1149af	block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs Commit `fe0418eb9b` ("block: Prevent potential deadlocks in zone write plug error recovery") added a WARN check in disk_put_zone_wplug() to verify that when the last reference to a zone write plug is dropped, this zone write plug does not have the BLK_ZONE_WPLUG_PLUGGED flag set, that is, that it is not plugged. However, the function disk_zone_wplug_abort(), which is called for zone reset and zone finish operations, does not clear this flag after emptying a zone write plug BIO list. This can result in the disk_put_zone_wplug() warning to trigger if the user (erroneously as that is bad pratcice) issues zone reset or zone finish operations while the target zone still has plugged BIOs. Modify disk_put_zone_wplug() to clear the BLK_ZONE_WPLUG_PLUGGED flag. And while at it, also add a lockdep annotation to ensure that this function is called with the zone write plug spinlock held. Fixes: `fe0418eb9b` ("block: Prevent potential deadlocks in zone write plug error recovery") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:43:28 -07:00
Cong Zhang	c196bf43d7	blk-mq: Abort suspend when wakeup events are pending During system suspend, wakeup capable IRQs for block device can be delayed, which can cause blk_mq_hctx_notify_offline() to hang indefinitely while waiting for pending request to complete. Skip the request waiting loop and abort suspend when wakeup events are pending to prevent the deadlock. Fixes: `bf0beec060` ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:19:26 -07:00
Stefan Hajnoczi	3e2cb9ee76	block: add IOC_PR_READ_RESERVATION ioctl Add a Persistent Reservations ioctl to read the current reservation. This calls the pr_ops->read_reservation() function that was previously added in commit `c787f1baa5` ("block: Add PR callouts for read keys and reservation") but was only used by the in-kernel SCSI target so far. The IOC_PR_READ_RESERVATION ioctl is necessary so that userspace applications that rely on Persistent Reservations ioctls have a way of inspecting the current state. Cluster managers and validation tests need this functionality. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:19:26 -07:00
Stefan Hajnoczi	22a1ffea5f	block: add IOC_PR_READ_KEYS ioctl Add a Persistent Reservations ioctl to read the list of currently registered reservation keys. This calls the pr_ops->read_keys() function that was previously added in commit `c787f1baa5` ("block: Add PR callouts for read keys and reservation") but was only used by the in-kernel SCSI target so far. The IOC_PR_READ_KEYS ioctl is necessary so that userspace applications that rely on Persistent Reservations ioctls have a way of inspecting the current state. Cluster managers and validation tests need this functionality. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:19:26 -07:00
Fengnan Chang	48f22f8093	block: enable per-cpu bio cache by default Since after commit `12e4e8c7ab` ("io_uring/rw: enable bio caches for IRQ rw"), bio_put is safe for task and irq context, bio_alloc_bioset is safe for task context and no one calls in irq context, so we can enable per cpu bio cache by default. Benchmarked with t/io_uring and ext4+nvme: taskset -c 6 /root/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 /mnt/testfile base IOPS is 562K, patch IOPS is 574K. The CPU usage of bio_alloc_bioset decrease from 1.42% to 1.22%. The worst case is allocate bio in CPU A but free in CPU B, still use t/io_uring and ext4+nvme: base IOPS is 648K, patch IOPS is 647K. Also use fio test ext4/xfs with libaio/sync/io_uring on null_blk and nvme, no obvious performance regression. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:19:24 -07:00
Fengnan Chang	05ce4c584c	block: use bio_alloc_bioset for passthru IO by default Use bio_alloc_bioset for passthru IO by default, so that we can enable bio cache for irq and polled passthru IO in later. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:18:54 -07:00
Linus Torvalds	cc25df3e2e	for-6.19/block-20251201 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsoMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuiUD/92eivL+HmOh10o8trvxajB0yuyqfSjHHrL g+xUbF4s9bgAg/v+Upx7sTY8jdrTcMjKov+G9T6uPvBMqVmeVdZckA1PSAKQaIX1 Zb7nS2LnO7F6JKbwpwVrrIaqVbcz8MfGIIMbN4yNNEOMCwdIVMp4fo7trPBknJNx WddNSGUFlIF3NqSI8AflSS/pYnGm+McfBHXBpJAKipI3iquKKubHv+FX9kLp7Tn4 x27ZoCWOHglIBTJXU0mmXCVsLF8b5BA8DQcGtT62azb8+l0cRTkaHY0DFAv5BvhG TqcjrKdmR0cGSNt+nEmFrujE3atBRl0G0kiHA80YgA1MTtYzdPaUVOUtM9k/rEem gpiGMDpBypdxyJAyijPSaVJdfcg0psOlYbhIR4N2wbj/dq8268h+cWzXlF1spgVt /7ygoaCmfMNbTy9rKThTjH+es787AVXUAXXaPHhIFsnCKUj8xQl4pT7XltmgYeWx 1/XD1NEJeLHHog5upAVlGX3H5tbvP1nIICxbZa9mDOJX1rwxxI7/s/RucPjbNXuY AiaKPTfxtB9+Ihd2HrJ/76RVMkckcOBc4GIKoFfwuKDbcdLXQ5FcZCmVRoI1V9SV KsH7JBgihLwR9XWKE1vp9+CBNe1Qlu3K4IjG/E7CNLeuDntIBu73ihqGP/DqV6Bq RX1Dc0OyAQ== =m22w -----END PGP SIGNATURE----- Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...	2025-12-03 19:26:18 -08:00
Linus Torvalds	0abcfd8983	for-6.19/io_uring-20251201 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsm0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiLvD/0dptgeJyLHKchOtRHzi/UvtM/EuNFKJrvI LBWCyIMjygxsVfPR41Lave9SE3UpcavF8Mg/EddasTci8VlMcDF8zPxWLb289Lz2 tkp/wOVuyYmDhNXKmKNW59NOPTd0NosEJFTZI4VhMudwx+UtAHELJGfBWW5hRyQB Md+UwZ2+J9HbYd19mToaDFxz7jpIPLEE4BYUGtljveRUdpnxhyFGGUS2+CQXZt/5 lnRvJmmEv4nSGH9ZRksix1xnV6KvJM0UwYQhrWvXhgwyiKu47zG7ONpd39KqoaRw Fw+6zZd0t7nyyuZkk15cKNnBLnjilnsCzmdcPq0Cuvkmbf6y1hlhEQQTGWXTKfJx zCZxEZcnCC4wL0CBQjZjS38AEMfH2p76M/36+NTWtlYCibY7qUtd9ndpUr49BYGo o4qfT0HMpI1PHuUvpZwpMcf4OX5qvtLmavT9vt78uqmtM+Aryzzuy3bI3S2SGjNe if/cNHnZc8Z06hUqdEit5NW+lYzj642AoF/j7qH9ADDH+VXRWaCdK/iI8tPaEpDV Rw6j442eVugS5tDPoTjdO8jsJ9+OCNNV1t/Jxy+Or+zrGdq7lfg4mnzEia1/izy5 8MnSubRy6LEd+I5PnK/9y9mPIwFMIFgULi+mUjucAhJjRj5beiG74eR6+jBAdyp1 GhFvN6fwdw== =4g/f -----END PGP SIGNATURE----- Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Unify how task_work cancelations are detected, placing it in the task_work running state rather than needing to check the task state - Series cleaning up and moving the cancelation code to where it belongs, in cancel.c - Cleanup of waitid and futex argument handling - Add support for mixed sized SQEs. 6.18 added support for mixed sized CQEs, improving flexibility and efficiency of workloads that need big CQEs. This adds similar support for SQEs, where the occasional need for a 128b SQE doesn't necessitate having all SQEs be 128b in size - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx features are available. And both return the ring size information to help with allocation size calculation for user provided rings like IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER - Zcrx updates for 6.19. It includes a bunch of small patches, IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing zcrx b/w multiple io_uring instances - Series cleaning up ring initializations, notable deduplicating ring size and offset calculations. It also moves most of the checking before doing any allocations, making the code simpler - Add support for getsockname and getpeername, which is mostly a trivial hookup after a bit of refactoring on the networking side - Various fixes and cleanups * tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring: Introduce getsockname io_uring cmd socket: Split out a getsockname helper for io_uring socket: Unify getsockname and getpeername implementation io_uring/query: drop unused io_handle_query_entry() ctx arg io_uring/kbuf: remove obsolete buf_nr_pages and update comments io_uring/register: use correct location for io_rings_layout io_uring/zcrx: share an ifq between rings io_uring/zcrx: add io_fill_zcrx_offsets() io_uring/zcrx: export zcrx via a file io_uring/zcrx: move io_zcrx_scrub() and dependencies up io_uring/zcrx: count zcrx users io_uring/zcrx: add sync refill queue flushing io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL io_uring/zcrx: elide passing msg flags io_uring/zcrx: use folio_nr_pages() instead of shift operation io_uring/zcrx: convert to use netmem_desc io_uring/query: introduce rings info query io_uring/query: introduce zcrx query io_uring: move cq/sq user offset init around io_uring: pre-calculate scq layout ...	2025-12-03 18:58:57 -08:00
Linus Torvalds	9368f0f941	vfs-6.19-rc1.inode -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc omMSAP9GLhavxyWQ24Q+49CNWWRQWDY1wTOiUK2BwtIvZ0YEcAD8D1dAiMckL5pC RwEAVA5p+y+qi+bZP0KXCBxQddoTIQM= =zo/J -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "Features: - Hide inode->i_state behind accessors. Open-coded accesses prevent asserting they are done correctly. One obvious aspect is locking, but significantly more can be checked. For example it can be detected when the code is clearing flags which are already missing, or is setting flags when it is illegal (e.g., I_FREEING when ->i_count > 0) - Provide accessors for ->i_state, converts all filesystems using coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2, overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to compile - Rework I_NEW handling to operate without fences, simplifying the code after the accessor infrastructure is in place Cleanups: - Move wait_on_inode() from writeback.h to fs.h - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb for clarity - Cosmetic fixes to LRU handling - Push list presence check into inode_io_list_del() - Touch up predicts in __d_lookup_rcu() - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage - Assert on ->i_count in iput_final() - Assert ->i_lock held in __iget() Fixes: - Add missing fences to I_NEW handling" * tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) dcache: touch up predicts in __d_lookup_rcu() fs: push list presence check into inode_io_list_del() fs: cosmetic fixes to lru handling fs: rework I_NEW handling to operate without fences fs: make plain ->i_state access fail to compile xfs: use the new ->i_state accessors nilfs2: use the new ->i_state accessors overlayfs: use the new ->i_state accessors gfs2: use the new ->i_state accessors f2fs: use the new ->i_state accessors smb: use the new ->i_state accessors ceph: use the new ->i_state accessors btrfs: use the new ->i_state accessors Manual conversion to use ->i_state accessors of all places not covered by coccinelle Coccinelle-based conversion to use ->i_state accessors fs: provide accessors for ->i_state fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb fs: move wait_on_inode() from writeback.h to fs.h fs: add missing fences to I_NEW handling ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage ...	2025-12-01 09:02:34 -08:00
Linus Torvalds	b04b2e7a61	vfs-6.19-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc onGCAQDEHKNEuZMhkyd3K5YsJtMzZlW/uXp4+Wddeob+5yQp0wEA09xN4CJNMwhP J6Kjaa80hWfrFacqSvyMUwQHHw6mngs= =5Mom -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...	2025-12-01 08:44:26 -08:00
Linus Torvalds	1885cdbfbb	vfs-6.19-rc1.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk= =jInB -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: "FUSE iomap Support for Buffered Reads: This adds iomap support for FUSE buffered reads and readahead. This enables granular uptodate tracking with large folios so only non-uptodate portions need to be read. Also fixes a race condition with large folios + writeback cache that could cause data corruption on partial writes followed by reads. - Refactored iomap read/readahead bio logic into helpers - Added caller-provided callbacks for read operations - Moved buffered IO bio logic into new file - FUSE now uses iomap for read_folio and readahead Zero Range Folio Batch Support: Add folio batch support for iomap_zero_range() to handle dirty folios over unwritten mappings. Fix raciness issues where dirty data could be lost during zero range operations. - filemap_get_folios_tag_range() helper for dirty folio lookup - Optional zero range dirty folio processing - XFS fills dirty folios on zero range of unwritten mappings - Removed old partial EOF zeroing optimization DIO Write Completions from Interrupt Context: Restore pre-iomap behavior where pure overwrite completions run inline rather than being deferred to workqueue. Reduces context switches for high-performance workloads like ScyllaDB. - Removed unused IOCB_DIO_CALLER_COMP code - Error completions always run in user context (fixes zonefs) - Reworked REQ_FUA selection logic - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP Buffered IO Cleanups: Some performance and code clarity improvements: - Replace manual bitmap scanning with find_next_bit() - Simplify read skip logic for writes - Optimize pending async writeback accounting - Better variable naming - Documentation for iomap_finish_folio_write() requirements Misaligned Vectors for Zoned XFS: Enables sub-block aligned vectors in XFS always-COW mode for zoned devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag. Bug Fixes: - Allocate s_dio_done_wq for async reads (fixes syzbot report after error completion changes) - Fix iomap_read_end() for already uptodate folios (regression fix)" * tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits) iomap: allocate s_dio_done_wq for async reads as well iomap: fix iomap_read_end() for already uptodate folios iomap: invert the polarity of IOMAP_DIO_INLINE_COMP iomap: support write completions from interrupt context iomap: rework REQ_FUA selection iomap: always run error completions in user context fs, iomap: remove IOCB_DIO_CALLER_COMP iomap: use find_next_bit() for uptodate bitmap scanning iomap: use find_next_bit() for dirty bitmap scanning iomap: simplify when reads can be skipped for writes iomap: simplify ->read_folio_range() error handling for reads iomap: optimize pending async writeback accounting docs: document iomap writeback's iomap_finish_folio_write() requirement iomap: account for unaligned end offsets when truncating read range iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted xfs: support sub-block aligned vectors in always COW mode iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag xfs: error tag to force zeroing on debug kernels iomap: remove old partial eof zeroing optimization xfs: fill dirty folios on zero range of unwritten mappings ...	2025-12-01 08:14:00 -08:00
Fengnan Chang	4d0e1f2139	blk-mq: use queue_hctx in blk_mq_map_queue_type Some caller of blk_mq_map_queue_type now didn't grab 'q_usage_counter', such as blk_mq_cpu_mapped_to_hctx, so we need protect 'queue_hw_ctx' through rcu. Also checked all other functions, no more missed cases. Fixes: `89e1fb7cef` ("blk-mq: fix potential uaf for 'queue_hw_ctx'") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-01 07:18:31 -07:00
Fengnan Chang	89e1fb7cef	blk-mq: fix potential uaf for 'queue_hw_ctx' This is just apply Kuai's patch in [1] with mirror changes. blk_mq_realloc_hw_ctxs() will free the 'queue_hw_ctx'(e.g. undate submit_queues through configfs for null_blk), while it might still be used from other context(e.g. switch elevator to none): t1 t2 elevator_switch blk_mq_unquiesce_queue blk_mq_run_hw_queues queue_for_each_hw_ctx // assembly code for hctx = (q)->queue_hw_ctx[i] mov 0x48(%rbp),%rdx -> read old queue_hw_ctx __blk_mq_update_nr_hw_queues blk_mq_realloc_hw_ctxs hctxs = q->queue_hw_ctx q->queue_hw_ctx = new_hctxs kfree(hctxs) movslq %ebx,%rax mov (%rdx,%rax,8),%rdi ->uaf This problem was found by code review, and I comfirmed that the concurrent scenario do exist(specifically 'q->queue_hw_ctx' can be changed during blk_mq_run_hw_queues()), however, the uaf problem hasn't been repoduced yet without hacking the kernel. Sicne the queue is freezed in __blk_mq_update_nr_hw_queues(), fix the problem by protecting 'queue_hw_ctx' through rcu where it can be accessed without grabbing 'q_usage_counter'. [1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:09:19 -07:00
Fengnan Chang	d0c98769ee	blk-mq: use array manage hctx map instead of xarray After commit `4e5cc99e1e` ("blk-mq: manage hctx map via xarray"), we use an xarray instead of array to store hctx, but in poll mode, each time in blk_mq_poll, we need use xa_load to find corresponding hctx, this introduce some costs. In my test, xa_load may cost 3.8% cpu. This patch revert previous change, eliminates the overhead of xa_load and can result in a 3% performance improvement. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-28 09:09:19 -07:00
Chaitanya Kulkarni	7d09a8e251	block: ignore __blkdev_issue_discard() return value __blkdev_issue_discard() always returns 0, making the error check in blkdev_issue_discard() dead code. In function blkdev_issue_discard() initialize ret = 0, remove ret assignment from __blkdev_issue_discard(), rely on bio == NULL check to call submit_bio_wait(), preserve submit_bio_wait() error handling, and preserve -EOPNOTSUPP to 0 mapping. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 12:19:39 -07:00
shechenglong	3a64c46c40	block: fix typos in comments and strings in blk-core This patch fixes multiple spelling mistakes in comments and documentation in the file block/blk-core.c. No functional changes intended. Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: shechenglong <shechenglong@xfusion.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 10:39:49 -07:00
John Garry	a74de0c366	block: Remove references to __device_add_disk() Since commit `d1254a8749` ("block: remove support for delayed queue registrations"), function __device_add_disk() has been replaced with device_add_disk(), so fix up comments. Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 10:36:37 -07:00
David Laight	9420e720ad	block: use min() instead of min_t() min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'. Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long' and so cannot discard significant bits. In this case the 'unsigned long' value is small enough that the result is ok. (Similarly for max_t() and clamp_t().) Detected by an extra check added to min_t(). Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 07:44:29 -07:00
Chengkaitao	8e1d91c258	block: remove the declaration of elevator_init_mq function In commit `1e44bedbc9` ("block: unifying elevator change"), the elevator_init_mq function was deleted, but its declaration in elevator.h was overlooked. This patch fixes it. Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 16:27:19 -07:00
Jens Axboe	caebce24f6	Revert "block: consider discard merge last" This reverts commit `2516c246d0`. Suspected issues with discard merging post this patch, hence revert it for now. Link: https://lore.kernel.org/linux-block/26acdfdf-de13-430b-8c73-f890c7689a84@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 15:00:12 -07:00
Bart Van Assche	935a20d1be	block: Remove queue freezing from several sysfs store callbacks Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously. Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need these callbacks. Add the __data_racy annotation to request_queue.rq_timeout to suppress KCSAN data race reports about the rq_timeout reads. This patch may cause a small delay in applying the new settings. For all the attributes affected by this patch, I/O will complete correctly whether the old or the new value of the attribute is used. This patch affects the following sysfs attributes: * io_poll_delay * io_timeout * nomerges * read_ahead_kb * rq_affinity Here is an example of a deadlock triggered by running test srp/002 if this patch is not applied: task:multipathd Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 schedule_preempt_disabled+0x1c/0x30 __mutex_lock+0xb89/0x1650 mutex_lock_nested+0x1f/0x30 dm_table_set_restrictions+0x823/0xdf0 __bind+0x166/0x590 dm_swap_table+0x2a7/0x490 do_resume+0x1b1/0x610 dev_suspend+0x55/0x1a0 ctl_ioctl+0x3a5/0x7e0 dm_ctl_ioctl+0x12/0x20 __x64_sys_ioctl+0x127/0x1a0 x64_sys_call+0xe2b/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> task:(udev-worker) Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 blk_mq_freeze_queue_wait+0xf2/0x140 blk_mq_freeze_queue_nomemsave+0x23/0x30 queue_ra_store+0x14e/0x290 queue_attr_store+0x23e/0x2c0 sysfs_kf_write+0xde/0x140 kernfs_fop_write_iter+0x3b2/0x630 vfs_write+0x4fd/0x1390 ksys_write+0xfd/0x230 __x64_sys_write+0x76/0xc0 x64_sys_call+0x276/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Martin Wilck <mwilck@suse.com> Cc: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org Fixes: `af28141498` ("block: freeze the queue in queue_attr_store") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 15:00:11 -07:00
Xue He	152c331bcd	block: plug attempts to batch allocate tags multiple times This patch aims to enable batch allocation of sufficient tags after batch IO submission with plug mechanism, thereby avoiding the need for frequent individual requests when the initial allocation is insufficient. ----------------------------------------------------------- HW: 16 CPUs/16 poll queues Disk: Samsung PM9A3 Gen4 3.84T CMD: [global] ioengine=io_uring group_reporting=1 time_based=1 runtime=1m refill_buffers=1 norandommap=1 randrepeat=0 fixedbufs=1 registerfiles=1 rw=randread iodepth=128 iodepth_batch_submit=32 iodepth_batch_complete_min=32 iodepth_batch_complete_max=128 iodepth_low=32 bs=4k numjobs=1 direct=1 hipri=1 [job1] filename=/dev/nvme0n1 name=batch_test ------------------------------------------------------------ Perf: base code: __blk_mq_alloc_requests() 1.47% patch: __blk_mq_alloc_requests() 0.75% ------------------------------------------------------------ Signed-off-by: hexue <xue01.he@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 14:59:41 -07:00
Li Chen	3179a5f7f8	block: rate-limit capacity change info log loop devices under heavy stress-ng loop streessor can trigger many capacity change events in a short time. Each event prints an info message from set_capacity_and_notify(), flooding the console and contributing to soft lockups on slow consoles. Switch the printk in set_capacity_and_notify() to pr_info_ratelimited() so frequent capacity changes do not spam the log while still reporting occasional changes. Cc: stable@vger.kernel.org Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 11:30:24 -07:00
Linus Torvalds	e7c375b181	vfs-6.18-rc7.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaRtBJwAKCRCRxhvAZXjc ou5CAQCJb5y2ULKklblICU1wR7Nr15WvTW7VVOcv44RJ22S3NgEAy4DLDBFBw8zC 8e7Hp8gxbjsq8ZJmU088aobFcqbZOwk= =TAnu -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix unitialized variable in statmount_string() - Fix hostfs mounting when passing host root during boot - Fix dynamic lookup to fail on cell lookup failure - Fix missing file type when reading bfs inodes from disk - Enforce checking of sb_min_blocksize() calls and update all callers accordingly - Restore write access before closing files opened by open_exec() in binfmt_misc - Always freeze efivarfs during suspend/hibernate cycles - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptor in addition to mount namespace ids - Fix tmpfs remount when noswap is specified - Switch Landlock to iput_not_last() to remove false-positives from might_sleep() annotations in iput() - Remove dead node_to_mnt_ns() code - Ensure that per-queue kobjects are successfully created * tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: landlock: fix splats from iput() after it started calling might_sleep() fs: add iput_not_last() shmem: fix tmpfs reconfiguration (remount) when noswap is set fs/namespace: correctly handle errors returned by grab_requested_mnt_ns power: always freeze efivarfs binfmt_misc: restore write access before closing files opened by open_exec() block: add __must_check attribute to sb_min_blocksize() virtio-fs: fix incorrect check for fsvq->kobj xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super isofs: check the return value of sb_min_blocksize() in isofs_fill_super exfat: check return value of sb_min_blocksize in exfat_read_boot_sector vfat: fix missing sb_min_blocksize() return value checks mnt: Remove dead code which might prevent from building bfs: Reconstruct file type when loading from disk afs: Fix dynamic lookup to fail on cell lookup failure hostfs: Fix only passing host root in boot stage with new mount fs: Fix uninitialized 'offp' in statmount_string()	2025-11-17 09:11:27 -08:00
Guenter Roeck	6483faa393	block/blk-throttle: Remove throtl_slice from struct throtl_data throtl_slice is now a constant. Remove the variable and use the constant directly where needed. Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Guenter Roeck	20d0b359c7	block/blk-throttle: drop unneeded blk_stat_enable_accounting After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer necessary to enable block accounting, so remove the call to blk_stat_enable_accounting(). With that, the track_bio_latency variable is no longer used and can be deleted from struct throtl_data. Also, including blk-stat.h is no longer necessary. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Guenter Roeck	f76581f9f1	block/blk-throttle: Fix throttle slice time for SSDs Commit `d61fcfa4bb` ("blk-throttle: choose a small throtl_slice for SSD") introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW was enabled. Commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW, but left the device type specific throttle slices in place. This effectively changed throttling behavior on systems with SSD which now use a different and non-configurable slice time compared to non-SSD devices. Practical impact is that throughput tests with low configured throttle values (65536 bps) experience less than expected throughput on SSDs, presumably due to rounding errors associated with the small throttle slice time used for those devices. The same tests pass when setting the throttle values to 65536 * 4 = 262144 bps. The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to the original code and re-introduce DFL_THROTL_SLICE to replace both DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts commit `d61fcfa4bb` ("blk-throttle: choose a small throtl_slice for SSD"). While at it, also remove MAX_THROTL_SLICE since it is not used anymore. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:48 -07:00
Keith Busch	2516c246d0	block: consider discard merge last If the next discard range is contiguous with the current range being considered, it's cheaper to expand the current range than to append an additional bio. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-17 09:39:36 -07:00
Leon Romanovsky	37f0c7a8df	block-dma: properly take MMIO path In commit `eadaa8b255` ("dma-mapping: introduce new DMA attribute to indicate MMIO memory"), DMA_ATTR_MMIO attribute was added to describe MMIO addresses, which require to avoid any memory cache flushing, as an outcome of the discussion pointed in Link tag below. In case of PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, blk-mq-dm logic treated this as regular page and relied on "struct page" DMA flow. That flow performs CPU cache flushing, which shouldn't be done here, and doesn't set IOMMU_MMIO flag in DMA-IOMMU case. As a solution, let's encode peer-to-peer transaction type in NVMe IOD flags variable and provide it to blk-mq-dma API. Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-14 05:09:56 -07:00
Leon Romanovsky	61d43b1731	nvme-pci: migrate to dma_map_phys instead of map_page After introduction of dma_map_phys(), there is no need to convert from physical address to struct page in order to map page. So let's use it directly. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-14 05:09:56 -07:00
Nilay Shroff	d4c3ef56a1	block: define alloc_sched_data and free_sched_data methods for kyber Currently, the Kyber elevator allocates its private data dynamically in ->init_sched and frees it in ->exit_sched. However, since ->init_sched is invoked during elevator switch after acquiring both ->freeze_lock and ->elevator_lock, it may trigger the lockdep splat [1] due to dependency on pcpu_alloc_mutex. To resolve this, move the elevator data allocation and deallocation logic from ->init_sched and ->exit_sched into the newly introduced ->alloc_sched_data and ->free_sched_data methods. These callbacks are invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring that memory allocation happens safely without introducing additional locking dependencies. This change breaks the dependency chain involving pcpu_alloc_mutex and prevents the reported lockdep warning. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Reported-by: Changhui Zhong <czhong@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 09:27:49 -07:00

1 2 3 4 5 ...

8087 Commits