Commit 163f153

jankara authored and gregkh committed
writeback: Avoid excessively long inode switching times
[ Upstream commit 9a6ebbd ]

With lazytime mount option enabled we can be switching many dirty inodes
on cgroup exit to the parent cgroup. The numbers observed in practice
when systemd slice of a large cron job exits can easily reach hundreds
of thousands or millions. The logic in inode_do_switch_wbs() which sorts
the inode into appropriate place in b_dirty list of the target wb
however has linear complexity in the number of dirty inodes thus overall
time complexity of switching all the inodes is quadratic leading to
workers being pegged for hours consuming 100% of the CPU and switching
inodes to the parent wb.

Simple reproducer of the issue:

    FILES=10000
    # Filesystem mounted with lazytime mount option
    MNT=/mnt/
    echo "Creating files and switching timestamps"
    for (( j = 0; j < 50; j ++ )); do
        mkdir $MNT/dir$j
        for (( i = 0; i < $FILES; i++ )); do
            echo "foo" >$MNT/dir$j/file$i
        done
        touch -a -t 202501010000 $MNT/dir$j/file*
    done
    wait
    echo "Syncing and flushing"
    sync
    echo 3 >/proc/sys/vm/drop_caches
    echo "Reading all files from a cgroup"
    mkdir /sys/fs/cgroup/unified/mycg1 || exit
    echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
    for (( j = 0; j < 50; j ++ )); do
        cat /mnt/dir$j/file* >/dev/null &
    done
    wait
    echo "Switching wbs"
    # Now rmdir the cgroup after the script exits

We need to maintain b_dirty list ordering to keep writeback happy so
instead of sorting inode into appropriate place just append it at the
end of the list and clobber dirtied_time_when. This may result in inode
writeback starting later after cgroup switch however cgroup switches
are rare so it shouldn't matter much. Since the cgroup had write access
to the inode, there are no practical concerns of the possible DoS
issues.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
1 parent 7594dae commit 163f153

1 file changed

Lines changed: 11 additions & 10 deletions

File tree

fs/fs-writeback.c

@@ -422,22 +422,23 @@ static bool inode_do_switch_wbs(struct inode *inode,
 	 * Transfer to @new_wb's IO list if necessary. If the @inode is dirty,
 	 * the specific list @inode was on is ignored and the @inode is put on
 	 * ->b_dirty which is always correct including from ->b_dirty_time.
-	 * The transfer preserves @inode->dirtied_when ordering. If the @inode
-	 * was clean, it means it was on the b_attached list, so move it onto
-	 * the b_attached list of @new_wb.
+	 * If the @inode was clean, it means it was on the b_attached list, so
+	 * move it onto the b_attached list of @new_wb.
 	 */
 	if (!list_empty(&inode->i_io_list)) {
 		inode->i_wb = new_wb;
 
 		if (inode->i_state & I_DIRTY_ALL) {
-			struct inode *pos;
-
-			list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
-				if (time_after_eq(inode->dirtied_when,
-						  pos->dirtied_when))
-					break;
+			/*
+			 * We need to keep b_dirty list sorted by
+			 * dirtied_time_when. However properly sorting the
+			 * inode in the list gets too expensive when switching
+			 * many inodes. So just attach inode at the end of the
+			 * dirty list and clobber the dirtied_time_when.
+			 */
+			inode->dirtied_time_when = jiffies;
 			inode_io_list_move_locked(inode, new_wb,
-						  pos->i_io_list.prev);
+						  &new_wb->b_dirty);
 		} else {
 			inode_cgwb_move_to_attached(inode, new_wb);
 		}
